
Graphics Stack in Embedded Linux

Goal: Understand when to use raw framebuffer, DRM/KMS, or a full graphics stack, and why embedded systems often avoid a window manager.

Related Tutorials

For hands-on practice, see: Framebuffer Basics | DRM/KMS Test | Single-App UI | Display Apps


You need a reliable product UI on a small LCD:

  • single fullscreen app
  • fast boot
  • no desktop environment

Choosing a graphics stack too heavy for the requirement causes slow startup and fragile deployments.


1. Three Levels of Graphics

Linux offers three fundamentally different approaches to putting pixels on a screen. Each trades complexity for capability. Understanding what each level gives you — and what it costs — is the key to choosing correctly for your product.

A) Raw Framebuffer (fbdev)

The framebuffer is the simplest graphics interface in Linux. You open /dev/fb0, write pixel data into it, and those pixels appear on screen. There is no window manager, no compositor, and no GPU involvement — just a direct memory-mapped path from your application to the display hardware. This makes it the fastest way to get something visible during early prototyping, and the easiest to understand when learning how Linux graphics work.
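
As a concrete illustration, here is a minimal Python sketch of that path: build a solid-color RGB565 frame in memory, then write it straight to the device. The 320x480 geometry, the RGB565 format, and the `show()` helper are assumptions for this sketch; on real hardware, query the actual geometry and format with the FBIOGET_VSCREENINFO ioctl before writing.

```python
# Minimal fbdev sketch: build one full RGB565 frame and write it to /dev/fb0.
# Geometry and pixel format are assumptions; query them on real hardware.
import struct

WIDTH, HEIGHT = 320, 480          # assumed panel geometry

def rgb565(r: int, g: int, b: int) -> bytes:
    """Pack 8-bit RGB into a little-endian RGB565 pixel."""
    value = ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)
    return struct.pack("<H", value)

def solid_frame(r: int, g: int, b: int) -> bytes:
    """One full frame of a single color, ready to write to the framebuffer."""
    return rgb565(r, g, b) * (WIDTH * HEIGHT)

def show(frame: bytes) -> None:
    """On the target: write the frame directly to the device (no VSync)."""
    with open("/dev/fb0", "wb") as fb:
        fb.write(frame)
```

Calling `show(solid_frame(255, 0, 0))` on a device with a matching panel fills the screen red. Note there is no synchronization here; tearing is covered in section 7.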

The trade-off is that fbdev is a legacy interface. The Linux kernel community has deprecated it in favor of DRM/KMS, which means new drivers and features go to DRM first. If you build a product on fbdev today, you may find that future kernel updates drop support for your display controller's fbdev driver.

Pros: minimal dependencies, easy to understand
Cons: legacy, limited acceleration, less future-proof


B) DRM/KMS (Kernel Mode Setting)

DRM/KMS is the modern replacement for fbdev. Instead of a simple memory buffer, it gives you explicit control over display modes, timing, and hardware planes. You can request page flips synchronized to the display's vertical blanking interval, which eliminates tearing — something fbdev cannot do reliably. For a single-app embedded product, DRM/KMS with a "dumb buffer" (no GPU acceleration) provides a clean, tear-free display path with no window manager overhead.

The cost is API complexity. Where fbdev is "open file, write pixels," DRM requires you to enumerate connectors, configure CRTCs (display controllers), and manage buffer objects. This learning curve pays off in reliability and future-proofing — every active display driver in the kernel supports DRM, and new features (atomic commits, multi-plane composition) are DRM-only.

Pros: current standard, hardware aware, no window manager required
Cons: more complex API, requires DRM knowledge


C) Full Graphics Stack (Wayland/X11 + Compositor)

When your product needs multiple windows, a cursor, drag-and-drop, or integration with UI toolkits like Qt or GTK, you need a compositor. Wayland (the modern protocol) or X11 (the legacy protocol) sits between your application and the display hardware, managing window placement, input routing, and GPU-accelerated compositing. This is what desktop Linux uses, and it is powerful — but that power comes with weight.

A compositor adds seconds to boot time, consumes tens to hundreds of megabytes of RAM, and introduces an extra layer of buffering between your application's render and the actual pixel output. For an embedded appliance that runs a single fullscreen application, this overhead buys you nothing. The compositor is waiting to manage windows that will never appear. This is why most embedded products avoid it.

Pros: rich UI, multi-window, hardware acceleration
Cons: heavy, more latency, more points of failure


2. Embedded Trade-offs

On a desktop, the graphics stack is chosen for you — your distribution ships with a compositor and you never think about it. In embedded, you choose explicitly, because each level carries a different cost in boot time, RAM, CPU overhead, and long-term maintenance burden. A compositor that adds 5 seconds to boot and 100 MB of RAM usage might be invisible on a desktop, but on an appliance with a 10-second boot budget and 256 MB of RAM, it consumes half your resources before your application even starts.
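
The arithmetic behind that claim is worth making explicit. A quick budget check, using the illustrative figures from the paragraph above:

```python
# Budget check: how much of an appliance's boot-time and RAM budgets a
# compositor consumes. Figures are the illustrative ones from the text.

def budget_fraction(used: float, budget: float) -> float:
    """Fraction of a resource budget consumed (0.0 to 1.0)."""
    return used / budget

boot_share = budget_fraction(5.0, 10.0)     # 5 s compositor start, 10 s boot budget
ram_share = budget_fraction(100.0, 256.0)   # 100 MB compositor, 256 MB total RAM

print(f"boot budget consumed: {boot_share:.0%}")   # 50%
print(f"RAM budget consumed:  {ram_share:.0%}")    # 39%
```

Half the boot budget and more than a third of RAM are gone before the application starts; the same compositor on a desktop with 16 GB and no boot deadline would be invisible.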

The decision comes down to four factors:

  • Boot time — heavier stacks take longer to initialize
  • Reliability — more components mean more potential failure points
  • Maintenance cost — a compositor requires its own configuration, updates, and debugging
  • UI complexity — only use a heavy stack if you actually need its features

If you only need a single full-screen UI, DRM or framebuffer is often enough. Adding a compositor "just in case" is a common mistake that costs boot time and introduces failure modes you did not need.

Concrete Comparison

| Approach | Boot impact | Memory usage | CPU overhead | Complexity |
|---|---|---|---|---|
| Raw framebuffer (/dev/fb0) | None | ~1 MB | Minimal | Low |
| DRM/KMS (dumb buffer) | None | ~2-4 MB | Low | Medium |
| Wayland + Weston | +2-5 s boot | ~50-100 MB | Medium | High |
| X11 + Desktop | +5-15 s boot | ~200+ MB | High | Very High |

3. "No GUI" Still Needs Graphics

A common misconception is that removing the desktop environment means giving up on graphics entirely. In practice, most embedded Linux products with displays run without a desktop environment — but they still draw to the screen. The key distinction is between a desktop GUI (window manager, taskbar, multiple applications) and a display output (your application rendering directly to hardware).

Running without a desktop environment does not mean "no graphics." It means your application owns the display exclusively:

  • Render directly to framebuffer
  • Use DRM for a single fullscreen app
  • Avoid X11/Wayland entirely

This is the standard pattern in appliances, kiosks, and industrial panels. A factory HMI showing temperature graphs, a point-of-sale terminal showing a payment screen, and a digital sign showing a schedule — all of these render graphics without any compositor or window manager.

Common "no GUI" patterns in products:

  • PIL/Pillow → fbi → framebuffer — generate status images in Python, push to display. Used in: industrial panels, point-of-sale displays.
  • OpenCV → framebuffer — render camera feeds or data visualizations directly. Used in: machine vision HMIs.
  • DRM dumb buffer — write pixel data directly via DRM API. Used in: kiosk systems, digital signage.
  • Custom framebuffer driver — kernel driver for non-standard displays (LED matrices, e-ink). Used in: transportation displays, embedded instruments.

4. Decision Matrix (Practical)

When deciding which graphics level to use, start from the simplest option that meets your requirements and move up only when you hit a concrete limitation. Teams often over-specify the graphics stack early in a project — choosing Wayland because "we might need multiple windows later" — and then spend months debugging compositor issues they never needed. Start minimal. You can always add complexity later; removing it is much harder.

Use framebuffer when:

  • the fastest prototype path matters
  • hardware acceleration is not required

Use DRM/KMS when:

  • you need long-term maintainability
  • you want the modern kernel graphics path
  • a single fullscreen app is enough

Use the Wayland/X11 stack when:

  • multiple apps/windows are required
  • advanced UI framework dependencies demand it

Choosing Your Graphics Approach

graph TD
    A[Need display output?] -->|No| B[No graphics needed]
    A -->|Yes| C{Multiple windows?}
    C -->|Yes| D[Wayland/X11 + Compositor]
    C -->|No| E{Need hardware acceleration?}
    E -->|Yes| F[DRM/KMS with GPU]
    E -->|No| G{Quick prototype?}
    G -->|Yes| H[Raw Framebuffer — fbdev]
    G -->|No| I[DRM/KMS — future-proof]

UI Toolkit: Qt + EGLFS vs SDL2

The flowchart above helps you pick a kernel-level display path (fbdev, DRM/KMS, or compositor). But once you've landed on DRM/KMS for a single-app fullscreen product, there is still a separate decision: which application-level UI toolkit draws your pixels? The two realistic options for embedded HMIs are Qt with EGLFS and SDL2 with a KMS/DRM backend.

Qt + EGLFS is a full UI framework that renders directly via EGL on a KMS display — no X11 or Wayland compositor needed. Qt's EGLFS platform plugin opens the DRM device, sets the mode, and gives your application a GPU-accelerated OpenGL (or Vulkan) surface. This is the standard choice for dashboard UIs, settings menus, and touch-driven interfaces. Qt's widget and QML systems provide layout engines, animations, and input handling out of the box. If the product may later be ported to other embedded SoCs (STM32MP1, i.MX8, TI AM62x), Qt's cross-platform abstraction makes that migration straightforward.

SDL2 + KMS/DRM takes a different approach: it provides a minimal render loop (window, surface, input events) and lets you draw everything yourself. SDL2's KMS/DRM backend gives you a tear-free fullscreen surface without a compositor, and its footprint is a fraction of Qt's. This is the right choice when you are building custom-drawn gauges, game-loop-style rendering, real-time data visualizations, or any UI where you want full control over every pixel and every frame. The trade-off is that SDL2 provides no layout engine, no widget library, and no built-in UI components — you build those yourself or use a lightweight library like Dear ImGui.

| Criterion | Qt + EGLFS | SDL2 + KMS/DRM |
|---|---|---|
| UI complexity | High — widgets, QML, animations, touch | Low — render loop only, draw-it-yourself |
| Footprint (runtime) | ~30-80 MB | ~2-5 MB |
| GPU required? | Yes (EGL/OpenGL) | Optional (can use software renderer) |
| Learning curve | Steeper (Qt APIs, QML, build system) | Shallow (simple C API) |
| Best for | Menus, dashboards, settings, HMI products | Gauges, data viz, games, custom rendering |
| Cross-platform portability | Excellent (same code on multiple SoCs) | Good (SDL2 runs everywhere, but UI is custom) |
| Buildroot/Yocto integration | Mature but heavy (many packages) | Lightweight (single package) |

Hybrid pattern: Many production HMIs combine both approaches. Qt Quick (QML) handles the UI chrome — menus, status bars, touch navigation — while a custom OpenGL or Vulkan render area inside the QML scene draws real-time gauges, waveforms, or 3D views. This "best of both" pattern gives you Qt's layout and input handling where it saves time, and full GPU control where you need it.

Tip

Course lab recommendation: Start with SDL2 + KMS/DRM for the display tutorials — it has the smallest footprint and teaches you exactly what the hardware is doing. Move to Qt + EGLFS when you build a multi-screen dashboard in the project phase, where the layout engine and touch handling pay for themselves. See Qt Quick for Embedded Linux for a deeper reference on QML, property bindings, EGLFS configuration, and the Qt development workflow.


5. Display Interfaces — SPI, DSI, and HDMI

Everything above assumes HDMI output — the GPU renders into a buffer, the display controller scans it out via HDMI TMDS encoding, and the monitor shows pixels. But embedded products frequently use other display interfaces, each with fundamentally different bandwidth, driver architecture, and GPU involvement. Understanding these differences is essential for choosing the right display for your product.

SPI Displays

SPI TFT displays are the cheapest and simplest option for adding a screen to an embedded Linux device. A typical 3.5" HAT (320x480, ILI9486 controller) plugs into the GPIO header and uses SPI0 for pixel data. The kernel's fbtft framework exposes the panel as /dev/fb1.

The critical constraint is bandwidth. SPI uses three signals for display data: MOSI (data), SCLK (clock), and CS (chip select), plus a DC (data/command) GPIO to distinguish pixel data from controller commands.

Bandwidth calculation for a 320x480 RGB565 display at 30 FPS:

| Parameter | Value |
|---|---|
| Resolution | 320 x 480 = 153,600 pixels |
| Color depth | 16 bits (RGB565) |
| Bits per frame | 153,600 x 16 = 2,457,600 bits |
| Target frame rate | 30 FPS |
| Required bandwidth | 2,457,600 x 30 = 73.7 Mbit/s |
| Available SPI bandwidth | 32 MHz clock = 32 Mbit/s |
| Achievable full-frame FPS | 32,000,000 / 2,457,600 = 13 FPS |

The SPI bus cannot deliver 30 FPS of full-frame updates. This is a hard physical limit — no software optimization changes it. Partial updates (redrawing only changed regions) can improve perceived responsiveness, but the maximum full-frame rate is fixed by the SPI clock.
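
The arithmetic from the table generalizes to any interface and mode. A small helper, reproducing both the SPI figures above and the DSI figures in the next section:

```python
# Display-bandwidth arithmetic: required bit rate for a given mode, and the
# maximum full-frame rate a bus clock can sustain.

def required_mbit_s(width: int, height: int, bpp: int, fps: int) -> float:
    """Bandwidth needed to refresh the full frame at the target rate, in Mbit/s."""
    return width * height * bpp * fps / 1e6

def max_full_frame_fps(bus_bit_s: float, width: int, height: int, bpp: int) -> float:
    """Full-frame refresh rate a given bus bit rate can sustain."""
    return bus_bit_s / (width * height * bpp)

# SPI TFT: 320x480 RGB565 at 30 FPS over a 32 MHz SPI clock
print(required_mbit_s(320, 480, 16, 30))        # 73.728 Mbit/s needed
print(max_full_frame_fps(32e6, 320, 480, 16))   # ~13 FPS achievable

# DSI: 800x480 RGB888 at 60 FPS over 2 lanes x ~1 Gbit/s
print(required_mbit_s(800, 480, 24, 60))        # ~553 Mbit/s, well under ~2 Gbit/s
```
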

SPI signal flow:

CPU writes pixel data
┌──────────┐   MOSI  ┌──────────────┐
│  SPI     │────────►│  ILI9486     │──► LCD Panel
│  Master  │   SCLK  │  Controller  │
│  (CPU)   │────────►│              │
│          │   CS    │              │
│          │────────►│              │
└──────────┘         └──────────────┘
    │  DC GPIO
    └──────────────────►  (0=command, 1=data)

Common SPI display controller ICs: ILI9486 (320x480), ILI9341 (240x320), ST7789 (240x240 or 240x320). All follow the same MIPI DBI (Display Bus Interface) command protocol over SPI.

DSI (MIPI Display Serial Interface)

DSI is the standard panel interface in phones, tablets, and commercial embedded products. Instead of repurposing a general-purpose bus like SPI, DSI uses dedicated high-speed differential lanes optimized for display data. The Raspberry Pi has 2 DSI lanes, each running at approximately 1 Gbit/s.

Bandwidth calculation for an 800x480 RGB888 display at 60 FPS:

| Parameter | Value |
|---|---|
| Resolution | 800 x 480 = 384,000 pixels |
| Color depth | 24 bits (RGB888) |
| Bits per frame | 384,000 x 24 = 9,216,000 bits |
| Target frame rate | 60 FPS |
| Required bandwidth | 9,216,000 x 60 = 553 Mbit/s |
| Available DSI bandwidth | 2 lanes x ~1 Gbit/s = ~2 Gbit/s |

DSI has bandwidth to spare. The display controller manages its own scan-out timing, and the GPU pushes frames through the DRM/KMS path — the same path used for HDMI. Auto-detection works via device tree: the kernel probes the DSI bus at boot, finds the panel, and configures it automatically.

HDMI

HDMI is covered in the sections above. For embedded products, HDMI is the easiest display to develop with (standard monitors, no special connectors) but is rarely used in deployed products — the connector is bulky, the cable is heavy, and the display is external.

GPU vs CPU Rendering Paths

The three interfaces fall into two fundamentally different rendering architectures:

SPI:  App → CPU render → RAM → fbtft → SPI DMA → Panel      (no GPU scan-out)
DSI:  App → GPU render → DRM buffer → Display ctrl → DSI → Panel  (full GPU)
HDMI: App → GPU render → DRM buffer → Display ctrl → HDMI → Panel (full GPU)

Why can't the GPU scan out over SPI? The GPU's display controller is hardwired to its output ports (HDMI, DSI). It reads framebuffer memory and feeds pixels to these ports at precise timing intervals synchronized with the panel's refresh rate. SPI is a general-purpose peripheral controlled by the CPU — the GPU has no connection to it. Even if the GPU renders a frame into RAM, the CPU must explicitly read that RAM and push bytes through the SPI peripheral. This is why SPI displays always consume CPU time for display updates, regardless of whether you use OpenGL for rendering.

DSI and HDMI share the entire GPU pipeline. The GPU renders into a DRM buffer, the display controller scans it out during VBlank, and the only difference is the final physical encoding — DSI differential signaling vs HDMI TMDS. VSync, page flipping, hardware compositing, and OpenGL ES all work identically on both.

CPU-Only Solutions for SPI Displays

Since SPI displays cannot use the GPU scan-out path, rendering is done entirely in software:

  • PIL/Pillow — generate images in Python, convert to RGB565, write to /dev/fb1
  • OpenCV — render camera feeds or data visualizations, push to framebuffer
  • LVGL — lightweight graphics library designed for MCU and SPI displays
  • Partial updates (dirty rectangles) — instead of sending the full 307 KB frame, track which screen regions changed and send only those rows. This can reduce SPI transfer from 307 KB/frame to 10-50 KB/frame for typical status UIs, dramatically improving perceived responsiveness
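
A sketch of the dirty-row idea from the last bullet, assuming raw RGB565 frames and a 320x480 panel: compare the new frame against the last one row by row, and send only the rows that changed.

```python
# Dirty-row tracking for an SPI display: diff the new frame against the old
# one per row and transfer only changed rows. Geometry is assumed.

WIDTH, HEIGHT = 320, 480
ROW_BYTES = WIDTH * 2                     # RGB565 = 2 bytes/pixel

def dirty_rows(old: bytes, new: bytes) -> list:
    """Indices of rows whose pixel data changed between two frames."""
    return [
        y for y in range(HEIGHT)
        if old[y * ROW_BYTES:(y + 1) * ROW_BYTES]
        != new[y * ROW_BYTES:(y + 1) * ROW_BYTES]
    ]

# Example: change only a 20-row status bar and see the transfer shrink
old = bytes(WIDTH * HEIGHT * 2)
new = bytearray(old)
new[0:20 * ROW_BYTES] = b"\xff" * (20 * ROW_BYTES)

changed = dirty_rows(old, bytes(new))
print(f"rows to send:  {len(changed)} of {HEIGHT}")                    # 20 of 480
print(f"bytes to send: {len(changed) * ROW_BYTES} of {WIDTH * HEIGHT * 2}")
```

For this status-bar example, the transfer drops from 307,200 bytes to 12,800, which is why partial updates make SPI panels feel far more responsive than the 13 FPS full-frame limit suggests.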

Linux Driver Architecture

| Layer | SPI Display | DSI Display | HDMI |
|---|---|---|---|
| Legacy | fbtft (/dev/fb1) | N/A | fbdev (/dev/fb0) |
| Modern | DRM tiny / panel-mipi-dbi | DRM panel driver | DRM/KMS |
| Touch | XPT2046 (SPI, resistive) | Goodix GT911 (I2C, capacitive) | N/A |
| GPU acceleration | None | Full (EGL, OpenGL ES) | Full (EGL, OpenGL ES) |
| VSync / page flip | None | DRM page flip | DRM page flip |

Decision Table: Choosing a Display Interface

| Criterion | SPI Display | DSI Display | HDMI |
|---|---|---|---|
| Bandwidth | ~32 Mbit/s | ~2 Gbit/s | ~18 Gbit/s (HDMI 2.0) |
| GPU support | None | Full | Full |
| Max resolution (practical) | 480x320 | 1280x800 | 4K |
| Max FPS (full frame) | ~13 | 60+ | 60+ |
| Cost (panel) | $5-15 | $20-50 | External monitor |
| Power consumption | Low (~100 mW) | Medium (~300 mW) | High (monitor dependent) |
| Connector | GPIO header (HAT) | Ribbon cable (DSI port) | HDMI socket |
| Touch | Optional (SPI resistive) | Common (I2C capacitive) | External USB |
| Typical use case | Status panels, simple HMI | Product UI, dashboard | Development, kiosk |
| Setup complexity | Overlay + fbtft | Overlay or auto-detect | Plug and play |

Hands-On

See these tutorials for practical experience with each interface:

  • SPI TFT Display — connect a 3.5" SPI panel, measure bandwidth limits
  • DSI Display — connect a 5" DSI panel, compare GPU vs CPU paths

6. Common Pitfalls

Each of these mistakes has cost real product teams weeks of debugging. They are easy to avoid if you know what to look for:

  • Mixing fbdev assumptions on DRM-only systems — Modern kernels may expose /dev/fb0 as a compatibility shim over DRM, but it does not behave identically. Page flipping, mode setting, and buffer management work differently. If your kernel uses a DRM driver, use the DRM API directly.
  • Ignoring pixel format/stride details — The display hardware may expect RGB565 while your renderer outputs ARGB8888, or the stride (bytes per row) may include padding. A mismatch produces garbled or shifted images. Always query the actual format and stride from the driver. The most common pixel formats in embedded Linux:

| Format | Bytes/Pixel | Channel Layout | Typical Use |
|---|---|---|---|
| RGB565 | 2 | 5 bits R, 6 bits G, 5 bits B | Low-memory displays, MCU LCDs |
| RGB888 | 3 | 8 bits each, no alpha | Framebuffer default on some SoCs |
| ARGB8888 | 4 | 8-bit alpha + 8 bits each RGB | DRM default, compositing |
| BGR888 | 3 | 8 bits each, blue first | Some camera/display pipelines |

Stride is the number of bytes per row, which may be larger than width x bytes_per_pixel due to alignment padding. For example, an 800-pixel-wide RGB888 framebuffer has 800 x 3 = 2400 bytes of pixel data per row, but the driver may round the stride up to 2432 (64-byte aligned). If you assume stride equals width x bpp, every row after the first will be shifted by the padding amount — producing a diagonal smear.

  • Adding a compositor when not needed — Every additional layer is a potential failure point. If your product runs one fullscreen app, a compositor adds boot time and complexity with no benefit.
  • No fallback path if display init fails — If the display is unplugged or the driver probe fails, your application should still run (logging data, serving network requests). Design the display as an optional output, not a hard dependency.
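
The stride example above can be made concrete with a small sketch. It uses the same numbers (800-pixel RGB888 rows, 2400 bytes of pixels, stride rounded up to 2432) and shows the one line where the bug lives:

```python
# Stride-aware row copy: place a tightly packed width*bpp image into a
# framebuffer whose rows are padded to an aligned stride. Numbers match the
# 800-pixel RGB888 example in the text.

WIDTH, HEIGHT, BPP = 800, 480, 3
ROW_BYTES = WIDTH * BPP                   # 2400 bytes of pixel data per row
STRIDE = (ROW_BYTES + 63) & ~63           # round up to 64-byte alignment -> 2432

def blit(image: bytes, fb: bytearray) -> None:
    """Copy a tightly packed image into a padded framebuffer, row by row."""
    for y in range(HEIGHT):
        src = y * ROW_BYTES
        dst = y * STRIDE                  # NOT y * ROW_BYTES: that causes the smear
        fb[dst:dst + ROW_BYTES] = image[src:src + ROW_BYTES]

fb = bytearray(STRIDE * HEIGHT)
blit(b"\xab" * (ROW_BYTES * HEIGHT), fb)
print(STRIDE)                             # 2432
```

In practice the stride comes from the driver (fbdev's fb_fix_screeninfo.line_length, or the pitch returned when creating a DRM dumb buffer), not from a local alignment guess like this one.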


7. Display Scan-Out and Tearing

Block 1 covered which graphics path to use. This section covers what happens when your application writes pixels while the display is reading them — the tearing problem — and the hardware mechanism that solves it.

How the Display Reads Memory

A typical LCD panel refreshes at 60 Hz, meaning the display controller scans out the entire framebuffer every 16.7 ms. For a 480-row display, that is roughly 29 rows per millisecond. The scan starts at the top-left pixel, reads left-to-right, top-to-bottom, and after the last row there is a short VBlank (vertical blanking) interval before the next frame begins.

Scan-out timeline (60 Hz, 480 rows):

  0 ms   ┌─ Row 0 ──────────────────────┐
  ...     │  ~29 rows/ms                 │
  8 ms    │  Row ~232 (midscreen)        │   ← active scan
  ...     │                              │
 16.0 ms  └─ Row 479 ───────────────────┘
 16.0–16.7 ms  ── VBlank interval ──         ← safe to swap buffers
 16.7 ms  ┌─ Row 0 (next frame) ────────┐

During the active scan, the display controller is reading from memory. If your application writes to the same buffer simultaneously, the display shows part of the old frame and part of the new frame.

Display Timing Mathematics

The scan-out timing above is governed by the pixel clock, which determines how fast pixels are pushed to the display. The pixel clock frequency is:

\[f_{pixel} = (H_{active} + H_{blank}) \times (V_{active} + V_{blank}) \times f_{refresh}\]

where \(H_{blank}\) and \(V_{blank}\) include front porch, sync pulse, and back porch intervals.

Example: An 800x480 display with typical blanking (256 horizontal, 45 vertical) at 60 Hz:

\[f_{pixel} = (800 + 256) \times (480 + 45) \times 60 = 1056 \times 525 \times 60 = 33.26\text{ MHz}\]

From the pixel clock, you can derive the key timing budgets:

| Quantity | Formula | Value (800x480 @ 60 Hz) |
|---|---|---|
| Frame time | \(t_{frame} = 1 / f_{refresh}\) | 16.67 ms |
| VBlank duration | \(t_{vblank} = \frac{V_{blank}}{V_{total}} \times t_{frame}\) | \(\frac{45}{525} \times 16.67 \approx 1.43\text{ ms}\) |
| Active scan time | \(t_{active} = t_{frame} - t_{vblank}\) | 15.24 ms |
| Render budget | \(\approx 0.7 \times t_{frame}\) | 11.67 ms |

The render budget (70% of frame time) leaves margin for OS scheduling jitter and VSync synchronization overhead. If your render consistently takes more than 70% of the frame time, you risk occasional frame drops.
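
The whole derivation fits in a few lines of arithmetic. This reproduces the pixel clock, frame time, VBlank duration, and render budget for the 800x480 example above:

```python
# Display timing arithmetic for 800x480 with typical blanking at 60 Hz,
# reproducing the derivation in the text.

H_ACTIVE, H_BLANK = 800, 256
V_ACTIVE, V_BLANK = 480, 45
REFRESH = 60                                            # Hz

pixel_clock = (H_ACTIVE + H_BLANK) * (V_ACTIVE + V_BLANK) * REFRESH
frame_ms = 1000 / REFRESH
vblank_ms = V_BLANK / (V_ACTIVE + V_BLANK) * frame_ms   # VBlank share of the frame
render_budget_ms = 0.7 * frame_ms                       # 70% rule of thumb

print(f"pixel clock:   {pixel_clock / 1e6:.2f} MHz")    # 33.26 MHz
print(f"frame time:    {frame_ms:.2f} ms")              # 16.67 ms
print(f"VBlank:        {vblank_ms:.2f} ms")             # 1.43 ms
print(f"render budget: {render_budget_ms:.2f} ms")      # 11.67 ms
```
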

Frame Drops and the Nyquist Limit for Motion

If the render time \(T_r\) occasionally exceeds \(t_{frame}\), the frame is dropped and the previous frame is displayed again. For smooth animation, you need \(P(T_r > t_{frame})\) to be negligible.

There is also a Nyquist constraint on displayed motion: to faithfully display motion at frequency \(f\) Hz, you need \(f_{refresh} \geq 2f\). At 60 Hz, the display can represent motion up to 30 Hz — a ball bouncing faster than 30 cycles per second would appear to stutter or alias. This is rarely a concern for embedded HMIs but matters for game-loop-style rendering.

What Tearing Looks Like

When your application writes faster than the display scans, the display shows a tear line — a horizontal discontinuity where the old and new content meet:

┌──────────────────────────┐
│  NEW FRAME (green)       │  ← your app already wrote here
│                          │
├─ ─ ─ TEAR LINE ─ ─ ─ ─ ─┤  ← display scan position at the moment of write
│  OLD FRAME (red)         │  ← display is still reading the old data
│                          │
└──────────────────────────┘

The tear line moves depending on the timing relationship between your writes and the scan. If you write continuously, you may see multiple tear lines.

VSync and Page Flipping

The solution is double buffering with VSync:

  1. Allocate two framebuffers: a front buffer (currently displayed) and a back buffer (where your app draws)
  2. Your application renders the next frame into the back buffer
  3. At VBlank (when the display finishes scanning the last row), the hardware atomically swaps front and back — this is a page flip
  4. The display now scans the new front buffer while your app draws into the old one

DRM/KMS provides this mechanism via drmModePageFlip() with the DRM_MODE_PAGE_FLIP_EVENT flag. The kernel schedules the swap to coincide with VBlank, guaranteeing tear-free output. Framebuffer (fbdev) has no reliable VSync mechanism — this is one of the key reasons DRM is preferred.

The framebuffer_flood Experiment

The slides demonstrate tearing with a Python script that writes solid-color frames (red, green, blue) to /dev/fb0 as fast as possible, without any synchronization:

# Conceptual: flood framebuffer with alternating solid colors
colors = [RED, GREEN, BLUE]
while True:
    for color in colors:
        write_entire_framebuffer(color)  # no VSync, no page flip

What to observe: Moving horizontal tear lines where two colors meet. The tear position shifts each frame because the write speed and scan speed are not synchronized. This experiment proves why raw framebuffer writes without VSync produce visible artifacts — and why DRM page flipping exists.

"sleep() Is Not VSync"

A common attempt to avoid tearing is time.sleep(0.016) (sleep for one frame period). This does not work as VSync, for five reasons:

  1. CPU throttling — the governor may slow the clock, stretching your 16 ms to 20+ ms
  2. Sleep precision — time.sleep() guarantees at least the requested delay, not exactly that delay; the actual sleep may be 1-3 ms longer
  3. OS scheduling — another process may preempt yours after the sleep returns, adding milliseconds before your write
  4. Refresh rate mismatch — the display may run at 59.94 Hz (NTSC legacy) rather than exactly 60 Hz, causing gradual drift
  5. Cumulative drift — even small errors accumulate over hundreds of frames, eventually placing your write mid-scan

The only correct synchronization is hardware VBlank signaling via DRM page flip. Software timing approximations always produce occasional tearing under real-world conditions.
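
Reasons 2 and 5 are easy to demonstrate. This sketch paces a loop with time.sleep() and measures how far it drifts from the ideal schedule; the drift is always non-negative and grows with the frame count:

```python
# Demonstrate sleep()-based frame pacing drift: time.sleep() guarantees
# *at least* the requested delay, so per-frame overshoot accumulates.
import time

FRAME_S = 1 / 60        # nominal 60 Hz frame period
FRAMES = 30

start = time.monotonic()
for _ in range(FRAMES):
    time.sleep(FRAME_S)
elapsed = time.monotonic() - start
drift_ms = (elapsed - FRAMES * FRAME_S) * 1000

# The error never resets; a hardware VBlank event (DRM page flip) would
# re-synchronize every frame instead of letting it accumulate.
print(f"accumulated drift after {FRAMES} frames: {drift_ms:.2f} ms")
```

On a loaded system the drift grows faster, which is exactly the "occasional tearing under real-world conditions" described above.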


Quick Checks (In Practice)

  • Is the display path deterministic at boot?
  • Do you control mode setting explicitly?
  • Can the app recover if the display disconnects/reconnects?
  • Does startup still meet the boot budget?

Mini Exercise

Given "single fullscreen UI + < 10 s boot + remote updates", select a stack and justify your choice in 5 lines.

Additional constraints to consider:

  • The device has 256 MB of RAM total
  • The product must operate reliably for 5+ years without on-site maintenance
  • The display cable may be disconnected during field installation — the application must continue logging data even if the display is absent

Key Takeaways

  • Framebuffer is simple but legacy.
  • DRM/KMS is the modern low-level choice.
  • Full stacks are powerful but heavy.
  • Embedded systems often use single-app fullscreen pipelines.

Hands-On

Try these in practice: