
Real-Time Graphics and Display Pipelines

Goal: Understand how display pipelines affect real-time embedded applications, where latency hides, and how to architect a system that renders smoothly under load.

For Hands-On Practice

Related tutorials: DRM/KMS Test | Display Applications | PREEMPT_RT Latency


Your level display application works on the bench: the IMU reads tilt and a bubble moves on screen. But under CPU load, the bubble stutters and the image tears. The sensor loop still runs at the correct rate -- the problem is somewhere between "data ready" and "pixels visible."

This is a display pipeline problem, not a sensor problem.


1. VSync and Page Flipping

A display panel scans out pixels line by line at a fixed rate (typically 60 Hz = every 16.7 ms). The VBlank interval is the short gap between the last line of one frame and the first line of the next. Writing to the visible buffer during scan-out causes tearing -- the top half shows the old frame and the bottom half shows the new one.

Double buffering solves this:

  1. Render to the back buffer (invisible)
  2. Wait for VBlank
  3. Flip -- swap back and front buffer pointers
  4. Display scans out the new front buffer

```mermaid
graph LR
    A[Render Loop] -->|Draw| B[Back Buffer]
    B -->|Wait VBlank| C[Page Flip]
    C -->|Swap pointers| D[Front Buffer]
    D -->|Scan-out| E[Display Panel]

    style B fill:#FF9800,color:#fff
    style D fill:#4CAF50,color:#fff
    style E fill:#2196F3,color:#fff
```

The flip is atomic from the display's perspective: the panel always reads a complete, consistent frame. No tearing.
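The pointer discipline behind the flip can be sketched in a few lines of plain Python. This is a conceptual model only, not a driver API: a real display controller swaps physical scan-out addresses at VBlank, but the invariant is the same — rendering only ever touches the buffer the display is not reading.

```python
# Conceptual model of double buffering: two buffers and a "front" index.
# Only the index changes at flip time, so the display side always sees
# one complete, consistent frame.

class DoubleBuffer:
    def __init__(self, size):
        self.buffers = [bytearray(size), bytearray(size)]
        self.front = 0  # index the "display" scans out from

    @property
    def back(self):
        # Render target: never visible while it is being drawn into.
        return self.buffers[1 - self.front]

    def flip(self):
        # In a real pipeline this happens at VBlank; the swap itself is
        # just an index change, which is what makes it tear-free.
        self.front = 1 - self.front

db = DoubleBuffer(4)
db.back[:] = b"new!"   # draw into the invisible back buffer
db.flip()              # "wait for VBlank", then swap pointers
assert bytes(db.buffers[db.front]) == b"new!"
```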


2. fbdev vs DRM/KMS for Real-Time

The legacy framebuffer interface (/dev/fb0) exposes a single memory-mapped buffer. There is no built-in page flip, no VSync notification, and no atomic mode setting. Writing to it during scan-out tears.

DRM/KMS provides:

  • Double/triple buffering with explicit page flip requests
  • VSync events delivered via file descriptor (pollable)
  • Atomic commits -- set mode, buffer, and plane in one operation
  • Hardware-scheduled flips -- the kernel waits for VBlank internally
| Feature | fbdev | DRM/KMS |
|---|---|---|
| Page flip | Manual (copy) | Hardware-assisted |
| VSync notification | None standard | drmWaitVBlank / page flip event |
| Atomic commit | No | Yes (drmModeAtomicCommit) |
| Tear-free guarantee | No | Yes (with page flip) |
| Multi-plane support | No | Yes (overlay, cursor, primary) |
| Typical flip latency | N/A (no flip) | < 1 ms (hardware-scheduled) |
| Future support | Deprecated | Active development |

Rule: For any application where visual smoothness matters, use DRM/KMS. Reserve fbdev for quick prototypes where tearing is acceptable.


3. Sensor-to-Pixel Pipeline

In a real-time display application (e.g., an IMU-driven level indicator), data flows through multiple stages. Each stage adds latency:

```mermaid
graph LR
    A[IMU Read<br>~1 ms] -->|I2C/SPI| B[Filter<br>~0.1 ms]
    B --> C[Shared State<br>Update angle]
    C --> D[Render<br>~2-5 ms]
    D --> E[Page Flip<br>Wait VBlank]
    E -->|~0-16.7 ms| F[Display<br>Scan-out]

    style A fill:#9C27B0,color:#fff
    style D fill:#FF9800,color:#fff
    style F fill:#2196F3,color:#fff
```

Total input-to-display latency = sensor read + filter + render + VSync wait + scan-out.

Worst case at 60 Hz: 1 + 0.1 + 5 + 16.7 + 8 = ~31 ms (nearly two frames).

Best case: 1 + 0.1 + 2 + 0 + 0 = ~3 ms (render finishes just before VBlank).

The VSync wait is the largest variable. Triple buffering reduces stalls but adds one frame of latency. The engineering trade-off is smoothness vs responsiveness.
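The budget above can be tallied in a few lines. The per-stage numbers are the text's illustrative estimates, not measurements from real hardware:

```python
# Latency budget for the 60 Hz pipeline described above (values in ms).
FRAME_MS = 1000 / 60  # ~16.7 ms per refresh

stages_worst = {
    "sensor_read": 1.0,
    "filter": 0.1,
    "render": 5.0,
    "vsync_wait": FRAME_MS,    # render just missed the flip deadline
    "scanout_to_center": 8.0,  # half a frame to reach mid-screen
}
stages_best = {
    "sensor_read": 1.0,
    "filter": 0.1,
    "render": 2.0,
    "vsync_wait": 0.0,         # render finished right before VBlank
    "scanout_to_center": 0.0,
}

worst = sum(stages_worst.values())
best = sum(stages_best.values())
print(f"worst ~{worst:.1f} ms, best ~{best:.1f} ms")
# → worst ~30.8 ms, best ~3.1 ms
```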


3A. Signal Processing for Sensor Fusion

The filter stage in the pipeline above is not arbitrary — it is grounded in signal processing theory. This section derives the mathematics behind the filters used in the IMU controller, ball balancing, and plate balancing tutorials.

Sampling Theorem (Nyquist)

A signal sampled at rate \(f_s\) can only faithfully represent frequencies up to \(f_s / 2\) (the Nyquist frequency):

\[f_s \geq 2 \cdot f_{max}\]

Any signal component above \(f_s / 2\) aliases — it appears as a lower-frequency ghost that cannot be distinguished from real data.

Example: An IMU sampled at 100 Hz captures signals up to 50 Hz. Motor vibrations at 80 Hz alias to \(|80 - 100| = 20\) Hz — a phantom low-frequency oscillation that no software filter can remove after sampling. The fix is either to sample faster or to add an analog anti-aliasing filter before the ADC.
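The 80 Hz example can be checked numerically: after sampling at 100 Hz, the vibration produces exactly the same sample values as a -20 Hz tone, so no downstream filter can separate them.

```python
import numpy as np

# Demonstrate aliasing: an 80 Hz sine sampled at 100 Hz lands on the
# same sample values as a -20 Hz sine, because 80 - 100 = -20 Hz.
fs = 100.0                 # sample rate, Hz
t = np.arange(50) / fs     # half a second of samples

x_real = np.sin(2 * np.pi * 80 * t)    # true 80 Hz vibration
x_ghost = np.sin(2 * np.pi * -20 * t)  # its alias below Nyquist

# Sample-for-sample identical: the information was lost at sampling time.
assert np.allclose(x_real, x_ghost, atol=1e-9)
```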

Digital Low-Pass Filter (First-Order IIR)

The simplest digital filter is the first-order IIR (Infinite Impulse Response):

\[y[n] = \alpha \cdot x[n] + (1 - \alpha) \cdot y[n-1]\]

where \(\alpha \in (0, 1]\) controls the cutoff frequency. Small \(\alpha\) means heavy smoothing (low cutoff); \(\alpha = 1\) means no filtering.

The approximate cutoff frequency is:

\[f_c = \frac{f_s}{2\pi} \cdot \frac{\alpha}{1 - \alpha}\]

| \(\alpha\) | \(f_c\) at \(f_s = 100\) Hz | Effect |
|---|---|---|
| 0.01 | 0.16 Hz | Very smooth, high lag |
| 0.05 | 0.84 Hz | Moderate smoothing |
| 0.2 | 3.98 Hz | Light smoothing |
| 1.0 | — (no filter) | Raw signal |
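A minimal implementation of this filter (an illustrative sketch, not the tutorials' exact code):

```python
# First-order IIR low-pass: y[n] = alpha*x[n] + (1 - alpha)*y[n-1].
def lowpass(samples, alpha, y0=0.0):
    y = y0
    out = []
    for x in samples:
        y = alpha * x + (1 - alpha) * y
        out.append(y)
    return out

# DC gain is 1: a constant input converges to that constant. With
# alpha = 0.05 the filter settles within a fraction of a percent
# after a couple hundred samples (~2 s at 100 Hz).
settled = lowpass([10.0] * 200, alpha=0.05)[-1]
assert abs(settled - 10.0) < 0.01

# alpha = 1.0 passes the raw signal through unchanged.
assert lowpass([3.0, 7.0], alpha=1.0) == [3.0, 7.0]
```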

Complementary Filter Derivation

The IMU tutorials use a complementary filter to fuse gyroscope and accelerometer data. The derivation starts from the observation that the two sensors have complementary noise characteristics:

  • Gyroscope: Accurate for fast changes (high-frequency), but drifts over time (low-frequency error)
  • Accelerometer: Accurate for static tilt (low-frequency), but noisy and sensitive to vibration (high-frequency error)

The complementary filter applies a high-pass filter to the gyroscope and a low-pass filter to the accelerometer, then sums them:

\[\theta[n] = \alpha \cdot \bigl(\theta[n-1] + \omega[n] \cdot dt\bigr) + (1 - \alpha) \cdot \theta_{accel}[n]\]

where:

  • \(\theta[n-1] + \omega[n] \cdot dt\) is the gyroscope integration (short-term accurate)
  • \(\theta_{accel}[n]\) is the tilt angle from the accelerometer (long-term accurate)
  • \(\alpha\) is the filter coefficient

The coefficient \(\alpha\) is determined by the time constant \(\tau\) (the crossover between trusting the gyro and trusting the accelerometer):

\[\alpha = \frac{\tau}{\tau + dt}\]

Deriving \(\tau\) from \(\alpha\): For \(\alpha = 0.98\) and \(dt = 0.02\text{ s}\) (50 Hz loop):

\[\tau = \frac{\alpha \cdot dt}{1 - \alpha} = \frac{0.98 \times 0.02}{0.02} = 0.98\text{ s}\]

The crossover frequency is:

\[f_c = \frac{1}{2\pi\tau} = \frac{1}{2\pi \times 0.98} \approx 0.16\text{ Hz}\]

Below 0.16 Hz (slow drift), the accelerometer dominates. Above 0.16 Hz (quick movements), the gyroscope dominates. This is why \(\alpha = 0.98\) works well — it trusts the gyroscope for any motion faster than about 6 seconds per cycle, while the accelerometer corrects drift on a ~1-second timescale.
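The update rule translates directly into code. This is a sketch; the tutorials' actual implementations may differ in naming and units:

```python
# One complementary-filter step, matching the equation above:
# theta[n] = alpha*(theta[n-1] + omega[n]*dt) + (1 - alpha)*theta_accel[n]
def complementary_step(theta_prev, gyro_rate, theta_accel, dt, alpha=0.98):
    gyro_estimate = theta_prev + gyro_rate * dt   # short-term accurate
    return alpha * gyro_estimate + (1 - alpha) * theta_accel

# With the gyro silent, the estimate decays toward the accelerometer
# angle on the ~1 s time constant derived above.
theta = 10.0  # start 10 degrees away from the accel reading of 0
for _ in range(50):  # 1 s of updates at 50 Hz (dt = 0.02 s)
    theta = complementary_step(theta, gyro_rate=0.0, theta_accel=0.0, dt=0.02)
print(round(theta, 2))  # → 3.64 (10 * 0.98^50: roughly one time constant)
```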

Brief Note: Kalman Filter

The complementary filter is actually a special case of the steady-state Kalman filter with fixed gain. A full Kalman filter dynamically adjusts its gain based on a state-space model and noise covariance matrices:

  1. Predict: Project state forward using the system model
  2. Update: Correct the prediction using the measurement, weighted by the Kalman gain

For IMU fusion, the Kalman filter can model gyro bias as a state variable and estimate it online — something the complementary filter cannot do. However, for most embedded applications where the gyro bias is approximately constant, the complementary filter's simplicity and fixed computational cost make it the better choice.

Info

Kalman filtering is a deep topic beyond this course's scope. For further reading, see Phil Kim, Kalman Filter for Beginners or the Wikipedia article on Kalman filter.

Image Moments for Centroid Detection

The plate-balancing tutorial uses OpenCV's cv2.moments() to find the ball's center position. Image moments are weighted sums over pixel coordinates:

\[M_{pq} = \sum_x \sum_y x^p \cdot y^q \cdot I(x, y)\]

where \(I(x,y)\) is the pixel intensity (or 1 for binary images). The centroid (center of mass) is:

\[\bar{x} = \frac{M_{10}}{M_{00}}, \quad \bar{y} = \frac{M_{01}}{M_{00}}\]
  • \(M_{00}\) is the total area (number of white pixels in a binary image)
  • \(M_{10}\) is the sum of x-coordinates, \(M_{01}\) is the sum of y-coordinates

In the tutorial code, this appears as:

```python
M = cv2.moments(contour)
if M["m00"] > 0:  # guard: an empty contour has zero area
    cx = int(M["m10"] / M["m00"])
    cy = int(M["m01"] / M["m00"])
```

The moment calculation is \(O(n)\) in the number of contour pixels — fast enough for real-time use at 30+ fps on a Raspberry Pi.
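The same centroid can be verified from the raw definition with NumPy, independent of OpenCV (the test image here is an arbitrary small example):

```python
import numpy as np

# Check the moment formulas directly on a small binary image:
# a 3x3 white square whose x-range is 2..4 and y-range is 1..3.
img = np.zeros((6, 8), dtype=np.float64)
img[1:4, 2:5] = 1.0

ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
m00 = img.sum()         # M00: total area (9 white pixels)
m10 = (xs * img).sum()  # M10: sum of x-coordinates of white pixels
m01 = (ys * img).sum()  # M01: sum of y-coordinates of white pixels

cx, cy = m10 / m00, m01 / m00
print(cx, cy)  # → 3.0 2.0, the geometric center of the square
```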


4. RT Kernel Impact on Graphics

When you enable PREEMPT_RT to improve sensor loop determinism, you are also changing how the display pipeline behaves — and the effects are not always positive. PREEMPT_RT makes interrupt handlers preemptible, which is exactly what you want for sensor IRQs (your high-priority sensor thread can preempt them), but it also means GPU and DRM interrupts can be preempted by your sensor thread. This can introduce micro-delays in the display path that would not exist on a standard kernel. Understanding these interactions is essential for systems that need both deterministic sensing and smooth rendering.

PREEMPT_RT changes the kernel's behavior in ways that affect both sensor loops and display paths:

| Change | Effect on Sensors | Effect on Display |
|---|---|---|
| Threaded interrupts | Sensor IRQ has schedulable priority | GPU IRQ becomes schedulable too |
| Sleeping spinlocks | Sensor driver is preemptible | GPU driver may see micro-delays |
| Priority inheritance | Prevents priority inversion in sensor mutex | Prevents inversion in display mutex |
| Deterministic scheduler | Sensor loop jitter drops to < 50 µs | Render loop gets consistent scheduling |

PREEMPT_RT helps the sensor side significantly. On the display side, it can introduce micro-latency in the GPU/DRM path because interrupts are now threaded and preemptible.

Architecture recommendation: Do not run the sensor loop and the render loop on the same core. Isolate them:

  • Sensor thread: high RT priority, pinned to isolated core
  • Render thread: normal priority, runs on non-isolated core with display IRQs
  • Shared state: lock-free or minimal-lock data exchange (e.g., atomic variable for angle)
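One minimal-lock exchange pattern is a sequence counter (a "seqlock"): the single writer never blocks, and readers retry if a write overlapped their read. A Python sketch of the idea (names are illustrative; in CPython the GIL already serializes these operations, so the pattern really matters when porting to C or C++):

```python
# Single-writer sequence-counter pattern for sensor -> render handoff.
class AngleSeqlock:
    def __init__(self):
        self.seq = 0      # even = data stable, odd = write in progress
        self.angle = 0.0

    def write(self, angle):
        # Called only by the sensor thread; never blocks.
        self.seq += 1     # now odd: readers that see this will retry
        self.angle = angle
        self.seq += 1     # even again: data is consistent

    def read(self):
        # Called by the render thread; retries instead of locking.
        while True:
            s1 = self.seq
            value = self.angle
            if s1 % 2 == 0 and self.seq == s1:
                return value  # no write overlapped this read

state = AngleSeqlock()
state.write(12.5)
assert state.read() == 12.5
```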

5. CPU Core Partitioning

The fundamental tension in a real-time display system is that the sensor loop needs determinism (guaranteed scheduling within microseconds) while the display loop needs throughput (rendering a full frame within 16 ms). These are conflicting goals on a shared CPU core: the scheduler cannot simultaneously give one thread guaranteed low-latency wakeups and another thread large uninterrupted blocks of CPU time. The solution is to not share: dedicate specific cores to specific tasks, so they never compete for the same resource.

On a multi-core system (e.g., Raspberry Pi 4 with 4 cores), you can dedicate cores to specific tasks using kernel boot parameters:

| Parameter | Purpose |
|---|---|
| isolcpus=1-3 | Remove cores 1-3 from the general scheduler |
| nohz_full=1-3 | Disable timer ticks on isolated cores |
| rcu_nocbs=1-3 | Offload RCU callbacks from isolated cores |
| IRQ affinity (/proc/irq/*/smp_affinity) | Pin hardware interrupts to specific cores |

```mermaid
graph TD
    subgraph "Core 0 — General"
        DISP[Display + Render]
        IRQ[Hardware IRQs]
        SYS[systemd + services]
    end
    subgraph "Core 1 — Isolated RT"
        SENS[Sensor Read<br>RT priority 80]
        FILT[Filter + Control<br>RT priority 70]
    end
    subgraph "Core 2-3 — Isolated RT"
        LOG[Data Logger<br>RT priority 50]
        SPARE[Available for<br>additional RT tasks]
    end

    SENS -->|Shared state| DISP
    SENS --> FILT
    FILT -->|Shared state| DISP

    style SENS fill:#E91E63,color:#fff
    style FILT fill:#E91E63,color:#fff
    style DISP fill:#2196F3,color:#fff
```

Core 0 handles all non-RT work: display rendering, IRQs, system services. Cores 1-3 are isolated for deterministic sensor and control tasks. The shared state between sensor and display uses lock-free techniques (atomic writes, sequence counters).
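From inside the application, the pinning itself can be done with the standard library (a Linux-only sketch; the boot-time isolcpus/nohz_full parameters still have to come from the kernel command line). The core number below is a stand-in, not a recommendation:

```python
import os

# Pin the current process/thread to a single core. sched_setaffinity
# works unprivileged as long as the target core is already in the
# process's allowed set.
allowed = os.sched_getaffinity(0)
target = {min(allowed)}  # stand-in for "the isolated RT core"
os.sched_setaffinity(0, target)
assert os.sched_getaffinity(0) == target

# Elevating to a real-time priority additionally needs root or
# CAP_SYS_NICE, so it is shown here but not executed:
# os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(80))
```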


6. DMA for Peripheral Access

When sensors are read at high rates or displays require large buffer transfers, the CPU access method matters:

| Method | CPU Load (1 kHz IMU) | Latency | Jitter |
|---|---|---|---|
| Polling (busy-wait) | ~30-50% of one core | Lowest | Lowest |
| Interrupt-driven | ~5-10% | Low | Low |
| DMA transfer | ~1-2% | Medium (setup overhead) | Medium |

When DMA matters:

  • High sample rates (> 1 kHz) where interrupt overhead accumulates
  • Large block transfers (display framebuffers over SPI, audio buffers)
  • Bulk sensor reads (accelerometer FIFO burst reads)

When DMA is overkill:

  • Low-rate sensors (temperature every second)
  • Small transfers (single register reads)
  • One-shot configuration writes

For a typical IMU at 100-500 Hz, interrupt-driven I2C/SPI is sufficient. For an SPI display refreshing a full frame at 30 fps, DMA frees the CPU for other work.


Quick Checks

  • Can you identify whether your display application uses single buffering, double buffering, or triple buffering?
  • What is the worst-case input-to-display latency in your sensor-to-pixel pipeline?
  • Are your sensor and render threads running on the same core or different cores?
  • Does your SPI display transfer use polling, interrupts, or DMA?

Mini Exercise

You have a Raspberry Pi 4 running a PREEMPT_RT kernel. Your application reads an IMU at 200 Hz over SPI, filters the data, and renders a level indicator on an HDMI display at 60 fps using DRM/KMS with double buffering.

Estimate the total input-to-display latency for best case and worst case:

| Stage | Best Case | Worst Case |
|---|---|---|
| SPI read (IMU) | ? ms | ? ms |
| Filter computation | ? ms | ? ms |
| Render frame | ? ms | ? ms |
| VSync wait | ? ms | ? ms |
| Display scan-out to center | ? ms | ? ms |
| Total | ? ms | ? ms |

Assume: SPI clock 1 MHz, 14 bytes per read, filter is 3-tap FIR, render is SDL2 software, display is 1920x1080 @ 60 Hz.


Key Takeaways

  • Tearing is a display pipeline problem solved by VSync-aligned page flipping, not by faster rendering.
  • Total input-to-display latency is the sum of every pipeline stage; the VSync wait often dominates.
  • Core partitioning (isolcpus) separates deterministic sensor work from best-effort display work, letting both perform well without interfering.

Hands-On

Try these in practice: