
Lesson 6: Graphics Applications & Profiling

Óbuda University -- Linux in Embedded Systems

"The display works but stutters under load. Why?"


Problem First

Your level display application works on the bench: the IMU reads tilt and a bubble moves on screen.

But under CPU load, the bubble stutters and the image tears. The sensor loop still runs at the correct rate.

The problem is somewhere between "data ready" and "pixels visible."

This is a display pipeline problem, not a sensor problem.


Today's Map

  • Block 1 (45 min): Display pipeline: how displays work, VSync/page flip, double/triple buffering, fbdev vs DRM/KMS, sensor-to-pixel pipeline, latency breakdown.
  • Block 2 (45 min): RT kernel plus display: PREEMPT_RT and graphics, CPU core partitioning, isolcpus and IRQ affinity, DMA for peripherals, cyclictest under load, architecture patterns.

How a Display Works

A display panel scans out pixels line by line at a fixed rate.

At 60 Hz, one full frame is scanned every 16.7 ms.

  Line 0   ████████████████████████████████  <-- scan starts here
  Line 1   ████████████████████████████████
  ...
  Line 539 ████████████████████████████████
            --------------------------------
            VBlank interval (~1-2 ms)         <-- gap between frames
  Line 0   ████████████████████████████████  <-- next frame starts

The VBlank interval is the short gap between the last line of one frame and the first line of the next.


What Is Tearing?

If you write new pixel data while the display is scanning, the top half shows the old frame and the bottom half shows the new one.

  +--------------------------------+
  |                                |
  |  OLD FRAME (already scanned)   |
  |                                |
  |================================| <-- scan position when buffer changed
  |                                |
  |  NEW FRAME (written too early) |
  |                                |
  +--------------------------------+

This visible seam is called tearing. It happens because the write and the scan-out are not synchronized.


VSync — The Synchronization Point

VSync = synchronize buffer updates to the VBlank interval.

The rule: never change the displayed buffer while the panel is scanning it.

Wait for VBlank, then swap. The panel always reads a complete, consistent frame.

  Time -->
  |  Render  |  Wait  |  Flip  |  Render  |  Wait  |  Flip  |
  |  frame   | VBlank |  ptrs  |  frame   | VBlank |  ptrs  |
  |__________|________|________|__________|________|________|
             ^                            ^
          VBlank                       VBlank

Double Buffering

Double buffering is the mechanism that makes VSync possible.

Two buffers exist in memory:

Buffer        Role             CPU Access  Display Access
------------  ---------------  ----------  ---------------
Back buffer   Being rendered   Write       None
Front buffer  Being displayed  None        Read (scan-out)

The render loop writes to the back buffer. At VBlank, the pointers swap. No data is copied -- only the pointer changes.
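
The swap can be sketched in a few lines of C (an illustrative sketch; the buffer names and sizes are made up, and a real driver performs the swap in the VBlank handler):

```c
#include <stdint.h>

// Two framebuffers; sizes are illustrative (640x480, 32 bpp).
#define FB_PIXELS (640 * 480)
uint32_t buf_a[FB_PIXELS];
uint32_t buf_b[FB_PIXELS];

uint32_t *front = buf_a;  // panel scans out from here
uint32_t *back  = buf_b;  // CPU renders into here

// Called at VBlank: exchange the roles. No pixel data is copied,
// which is why the flip is cheap and atomic for the display.
void page_flip(void)
{
    uint32_t *tmp = front;
    front = back;
    back  = tmp;
}
```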


Double Buffering — Step by Step

  Step 1: Render to back buffer        Step 2: Wait for VBlank
  +----------+    +----------+         +----------+    +----------+
  | Back     |    | Front    |         | Back     |    | Front    |
  | [drawing]|    | [display]|  --->   | [done]   |    | [display]|
  +----------+    +----------+         +----------+    +----------+
   CPU writes      Panel reads          CPU idle        Panel reads

  Step 3: Flip (swap pointers)         Step 4: Render next frame
  +----------+    +----------+         +----------+    +----------+
  | NEW Front|    | NEW Back |         | Front    |    | Back     |
  | [display]|    | [free]   |  --->   | [display]|    | [drawing]|
  +----------+    +----------+         +----------+    +----------+
   Panel reads     CPU can write        Panel reads     CPU writes

Atomic from the display's perspective: zero tearing.


Triple Buffering — The Trade-Off

Triple buffering adds a third buffer so the CPU never stalls waiting for VBlank.

  +----------+    +----------+    +----------+
  | Buffer A |    | Buffer B |    | Buffer C |
  | [display]|    | [ready]  |    | [drawing]|
  +----------+    +----------+    +----------+
   Panel reads     Queued next     CPU writes

Property            Double Buffering   Triple Buffering
------------------  -----------------  ------------------
Max latency         1 frame (16.7 ms)  2 frames (33.4 ms)
CPU stall on VSync  Yes (waits)        No (extra buffer)
Memory usage        2x framebuffer     3x framebuffer
Use case            Low latency        Smooth throughput

Trade-off: smoothness vs responsiveness.
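
The rotation of roles can be sketched as index juggling (an illustrative sketch; the struct and field names are made up):

```c
// Triple buffering as index rotation: one buffer is displayed, the
// other two alternate between "being drawn" and "queued for flip".
typedef struct {
    int display;  // buffer the panel is scanning out
    int ready;    // completed frame queued for the next VBlank
    int draw;     // buffer the CPU is rendering into
} TripleBuf;

// Renderer finished a frame: it becomes the queued one, and the
// previously queued buffer is recycled for drawing. CPU never waits.
void frame_done(TripleBuf *t)
{
    int tmp = t->ready;
    t->ready = t->draw;
    t->draw = tmp;
}

// VBlank: display the queued frame; the old display buffer frees up.
void vblank_flip(TripleBuf *t)
{
    int tmp = t->display;
    t->display = t->ready;
    t->ready = tmp;
}
```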


fbdev — The Legacy Interface

The framebuffer device (/dev/fb0) exposes a single memory-mapped buffer.

// fbdev memory-mapped framebuffer access (error handling omitted;
// screen_size comes from ioctl(fd, FBIOGET_VSCREENINFO, ...) in real code)
int fd = open("/dev/fb0", O_RDWR);
char *fb = mmap(NULL, screen_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
memcpy(fb, pixels, screen_size);  // Write = instant display (may tear)

Property            fbdev
------------------  ------------------
Page flip           None (manual copy)
VSync notification  No standard API
Atomic commit       No
Tear-free           No
Status              Deprecated

Writing to the mapped buffer during scan-out will tear. No way around it with fbdev alone.


DRM/KMS — The Modern Interface

DRM = Direct Rendering Manager. KMS = Kernel Mode Setting.

DRM/KMS provides hardware-assisted display pipeline control.

  Userspace                    Kernel (DRM/KMS)          Hardware
  +----------------+          +------------------+       +--------+
  | Render to      |  ioctl   | Schedule flip    | IRQ   | Panel  |
  | dumb buffer    |--------->| at next VBlank   |------>| scans  |
  |                |  event   | Notify complete  |       | out    |
  |                |<---------|                  |       |        |
  +----------------+          +------------------+       +--------+

The kernel handles the timing. You just say "flip when ready."


fbdev vs DRM/KMS Comparison

Feature              fbdev          DRM/KMS
-------------------  -------------  ---------------------
Page flip            Manual (copy)  Hardware-assisted
VSync notification   None           drmWaitVBlank / event
Atomic commit        No             Yes
Tear-free guarantee  No             Yes (with page flip)
Multi-plane support  No             Yes

Rule: For any application where visual smoothness matters, use DRM/KMS.

Reserve fbdev for quick prototypes where tearing is acceptable.


The Sensor-to-Pixel Pipeline

In a real-time display application (e.g., IMU-driven level indicator), data flows through multiple stages. Each stage adds latency.

  +----------+     +--------+     +--------+     +---------+     +---------+
  |   IMU    |---->| Filter |---->| Shared |---->| Render  |---->|  Page   |
  |   Read   |     |        |     | State  |     |  Frame  |     |  Flip   |
  | ~1 ms    |     | ~0.1ms |     |        |     | ~2-5 ms |     | 0-17 ms |
  +----------+     +--------+     +--------+     +---------+     +---------+
       |                                                               |
       |  I2C/SPI                                              Display scan-out
       |                                                         ~8 ms to center

Total input-to-display latency = sum of all stages.


Latency Breakdown — Best vs Worst Case

Stage               Best Case  Worst Case  Notes
------------------  ---------  ----------  -------------------
IMU read (SPI)      1 ms       1 ms        Fixed by clock rate
Filter              0.1 ms     0.1 ms      Deterministic
Render              2 ms       5 ms        Depends on scene
VSync wait          0 ms       16.7 ms     Largest variable
Scan-out to center  0 ms       8 ms        Half frame time
Total               ~3 ms      ~31 ms      Nearly 2 frames

The VSync wait dominates. If you just missed VBlank, you wait a full frame period.

Best case: render finishes just before VBlank. Worst case: render finishes just after VBlank.
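
The table's totals are a plain sum of the stages; as a quick check (values taken from the table above, in milliseconds):

```c
// Input-to-display latency = sum of all pipeline stages (ms).
double total_latency_ms(double imu, double filter, double render,
                        double vsync_wait, double scanout)
{
    return imu + filter + render + vsync_wait + scanout;
}
// Best case:  total_latency_ms(1, 0.1, 2, 0, 0)    -> ~3 ms
// Worst case: total_latency_ms(1, 0.1, 5, 16.7, 8) -> ~31 ms
```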


Why VSync Wait Dominates

  VBlank        VBlank        VBlank        VBlank
    |             |             |             |
    v             v             v             v
  --|-------------|-------------|-------------|---> time
       16.7 ms       16.7 ms       16.7 ms

  Case A (best):  Render done here |         Flip!
                                   ^--- 0 ms wait

  Case B (worst): Render done here  |                    Flip!
                                    ^--- 16.7 ms wait

You cannot control when the render finishes relative to VBlank.

This is why the VSync wait is 0 to 16.7 ms -- it is purely a timing alignment issue.
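
The 0 to 16.7 ms spread is pure modular arithmetic: the wait is the distance from render-finish to the next VBlank boundary. A sketch in integer microseconds (the helper name is illustrative):

```c
// Wait from render completion until the next VBlank boundary.
// finish_us: time the frame finished; period_us: 16700 at ~60 Hz.
long vsync_wait_us(long finish_us, long period_us)
{
    long phase = finish_us % period_us;   // position within the frame
    return (phase == 0) ? 0 : period_us - phase;
}
// Case A (just before VBlank): vsync_wait_us(16600, 16700) -> 100
// Case B (just after VBlank):  vsync_wait_us(16800, 16700) -> 16600
```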


Block 1 Summary

  • Tearing is caused by writing to the display buffer during scan-out
  • VSync + double buffering = tear-free display
  • DRM/KMS provides hardware-assisted page flipping; fbdev cannot
  • Sensor-to-pixel latency is the sum of all pipeline stages
  • VSync wait is the largest variable: 0 to 16.7 ms at 60 Hz
  • Triple buffering trades latency for throughput

Block 2 — RT Kernel and Display

"Which change helps more -- RT kernel or CPU isolation?"


PREEMPT_RT and Graphics — Not All Good News

PREEMPT_RT improves sensor loop determinism. But it also changes how the display pipeline behaves.

Key insight: PREEMPT_RT makes interrupt handlers preemptible -- including GPU and DRM interrupts.

Your high-priority sensor thread can now preempt the display interrupt handler.

This can introduce micro-delays in the display path that would not exist on a standard kernel.


RT Kernel Effects — Both Sides

Change                   Sensor Effect                        Display Effect
-----------------------  -----------------------------------  --------------------------------
Threaded IRQs            Sensor IRQ has schedulable priority  GPU IRQ preemptible
Sleeping spinlocks       Sensor driver is preemptible         GPU driver sees micro-delays
Priority inheritance     Prevents sensor mutex inversion      Prevents display mutex inversion
Deterministic scheduler  Jitter < 50 us                       Consistent render scheduling

PREEMPT_RT helps sensors significantly. On the display side, it can introduce micro-latency because GPU/DRM interrupts are now preemptible.


The Fundamental Conflict

The sensor loop needs determinism: guaranteed scheduling within microseconds.

The display loop needs throughput: render a full frame within 16.7 ms.

  Sensor thread (RT prio 80):
  |--run--|--sleep--|--run--|--sleep--|--run--|

  Render thread (normal prio):
  |=======render========|flip|=======render========|flip|

  What happens when they share a core:
  |==render==|PREEMPT|==render===|PREEMPT|==render==|flip|
              sensor              sensor
              runs                runs
              here                here

On a shared core, the sensor preempts the renderer. The frame may not finish before VBlank.


The Solution: CPU Core Partitioning

Do not share. Dedicate specific cores to specific tasks.

  +---------------------------------------------------+
  |  Raspberry Pi 4 — 4 Cores                         |
  |                                                    |
  |  Core 0 (General)     Core 1 (Isolated RT)        |
  |  +-----------------+  +---------------------+     |
  |  | Display + Render|  | Sensor Read (p=80)  |     |
  |  | Hardware IRQs   |  | Filter + Control    |     |
  |  | systemd, logs   |  | (p=70)              |     |
  |  +-----------------+  +---------------------+     |
  |                                                    |
  |  Core 2 (Isolated)    Core 3 (Isolated)            |
  |  +-----------------+  +---------------------+     |
  |  | Data Logger     |  | Spare / additional  |     |
  |  | (p=50)          |  | RT tasks            |     |
  |  +-----------------+  +---------------------+     |
  +---------------------------------------------------+

isolcpus — Removing Cores from the Scheduler

This builds on the CPU isolation concepts from Lesson 7 (Real-Time Systems). Review those slides if you need a refresher on why isolation matters for determinism.

Kernel boot parameters control core partitioning:

Parameter      Purpose
-------------  -----------------------------------------
isolcpus=1-3   Remove cores 1-3 from general scheduler
nohz_full=1-3  Disable timer ticks on isolated cores
rcu_nocbs=1-3  Offload RCU callbacks from isolated cores

# In /boot/cmdline.txt (Raspberry Pi):
... isolcpus=1-3 nohz_full=1-3 rcu_nocbs=1-3

After boot, the kernel will only schedule tasks on Core 0 unless you explicitly pin a task to an isolated core.


Pinning Tasks to Isolated Cores

Use taskset to pin RT threads to isolated cores:

# Pin sensor thread to Core 1, RT priority 80
sudo taskset -c 1 chrt -f 80 ./sensor_loop

# Pin data logger to Core 2, RT priority 50
sudo taskset -c 2 chrt -f 50 ./data_logger

# Render thread stays on Core 0 (default scheduler)
./render_app

Or from C code:

// Requires: #define _GNU_SOURCE before #include <pthread.h> / <sched.h>
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(1, &cpuset);  // Pin to Core 1
pthread_setaffinity_np(thread, sizeof(cpuset), &cpuset);
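
A self-contained sketch that pins the calling thread and checks where it landed (pinning to Core 0 here so it also works on systems without isolated cores; `pin_self_to_core` is an illustrative helper name):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to the given core; returns 0 on success.
int pin_self_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Report which core the calling thread is currently running on.
int current_core(void)
{
    return sched_getcpu();
}
```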

IRQ Affinity — Controlling Interrupts

Pin hardware interrupts to specific cores so they do not disturb isolated RT tasks:

# Find the IRQ number for SPI (sensor)
cat /proc/interrupts | grep spi

# Pin SPI interrupt to Core 1 (where sensor runs)
echo 0x2 > /proc/irq/42/smp_affinity  # 0x2 = binary 0010 = Core 1

# Pin GPU/display interrupt to Core 0
echo 0x1 > /proc/irq/56/smp_affinity  # 0x1 = binary 0001 = Core 0

IRQ            Pin To             Reason
-------------  -----------------  -----------------------
SPI (sensor)   Core 1 (isolated)  Low-latency wakeup
GPU / DRM      Core 0 (general)   Keep with render thread
USB, Ethernet  Core 0 (general)   Non-RT traffic
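
The smp_affinity value is just a hex bitmask with bit N set for core N. A quick sketch of the computation (helper names are illustrative):

```c
#include <stdio.h>

// smp_affinity bitmask: bit N selects core N.
unsigned int core_mask(int core)
{
    return 1u << core;
}

// Print the value you would echo into /proc/irq/<n>/smp_affinity.
void print_mask(int core)
{
    printf("core %d -> 0x%x\n", core, core_mask(core));
}
```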

Shared State Between Cores

Sensor (Core 1) and Render (Core 0) must exchange data. Use lock-free techniques:

  Core 1 (Sensor)             Core 0 (Render)
  +------------------+        +------------------+
  | Read IMU         |        | Read shared_angle|
  | Filter           |        | Render frame     |
  | atomic_store(    | -----> | using angle      |
  |   shared_angle)  |        | Page flip        |
  +------------------+        +------------------+

// Shared variable (C11 <stdatomic.h>):
_Atomic float shared_angle;

// Sensor side (Core 1):
atomic_store(&shared_angle, filtered_angle);

// Render side (Core 0):
float angle = atomic_load(&shared_angle);

No mutex. No lock. No priority inversion. The render thread never blocks the sensor thread.


DMA for Peripheral Access

When sensors are read at high rates or displays need large transfers, the CPU access method matters:

Method               CPU Load (1 kHz IMU)  Latency  Jitter
-------------------  --------------------  -------  ------
Polling (busy-wait)  ~30-50% of core       Lowest   Lowest
Interrupt-driven     ~5-10%                Low      Low
DMA transfer         ~1-2%                 Medium   Medium

DMA = Direct Memory Access. The hardware moves data without CPU involvement.


When DMA Matters vs Overkill

Use DMA when:

  • High sample rates (> 1 kHz) where interrupt overhead accumulates
  • Large block transfers (display framebuffers over SPI)
  • Bulk sensor reads (accelerometer FIFO burst reads)

Skip DMA when:

  • Low-rate sensors (temperature every second)
  • Small transfers (single register reads)
  • One-shot configuration writes

Rule of thumb: For IMU at 100-500 Hz, interrupt-driven SPI is sufficient. For SPI display at 30 fps, DMA frees the CPU for other work.
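
The rule of thumb is easy to sanity-check with throughput arithmetic (the 320x240, 16-bit panel geometry below is an assumption, not from the lesson hardware):

```c
// Bytes per second a display needs over SPI at a given frame rate.
// 320x240, 2 bytes/pixel, 30 fps -> ~4.6 MB/s: clearly DMA territory.
long display_bytes_per_sec(int width, int height, int bytes_per_px, int fps)
{
    return (long)width * height * bytes_per_px * fps;
}

// Microseconds of SPI wire time per sensor sample (bits / clock rate).
// 14 bytes at 1 MHz -> ~112 us: interrupt-driven is fine.
double spi_sample_us(int bytes, double clock_hz)
{
    return bytes * 8 / clock_hz * 1e6;
}
```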


DMA Transfer — How It Works

  Without DMA:                    With DMA:
  +-----+     +-----+            +-----+     +-----+
  | CPU |<--->| SPI |            | CPU |     | SPI |
  |     |     | dev |            |     |     | dev |
  +-----+     +-----+            +--+--+     +--+--+
  CPU reads each byte                |           |
  one at a time.                     |  +-----+  |
  CPU is 100% busy                   |  | DMA |  |
  during transfer.                   |  | eng.|  |
                                     |  +--+--+  |
                                     |     |     |
                                  1.Setup  2.Transfer
                                  (CPU)    (no CPU)
                                           3.IRQ done

CPU sets up the transfer, then is free to do other work. DMA engine handles the byte-by-byte movement.


cyclictest — Measuring Scheduling Latency

cyclictest measures the time between when a thread should wake and when it actually wakes.

# Measure scheduling latency on Core 3:
sudo cyclictest -t1 -p99 -a3 -i1000 -l100000

Flag      Meaning
--------  ------------------------
-t1       One thread
-p99      RT priority 99 (highest)
-a3       Pin to CPU 3
-i1000    1000 us interval (1 kHz)
-l100000  100,000 loops

Output: min, avg, max latency. The max is what matters.


Running cyclictest Under Load

Always test under stress. Idle system latency is meaningless.

# Terminal 1: Generate CPU + I/O stress
stress-ng --cpu 4 --io 2 --vm 1 \
    --vm-bytes 128M --timeout 120s &

# Terminal 2: Measure while stressed
sudo cyclictest -t1 -p99 -a3 -i1000 \
    -l100000 -h400 > histogram.txt

The -h400 flag records a latency histogram covering 0 to 400 us (one bucket per microsecond). This tells you not just the max, but the shape of the latency distribution.


Reading a cyclictest Histogram

  Latency (us)   |  Count
 ----------------+-------------------------------------------
      0 -  10    |  ################################  89,200
     10 -  20    |  #####                              4,100
     20 -  50    |  ###                                2,500
     50 - 100    |  ##                                 1,800
    100 - 200    |  |                                    350
    200+         |  .                                     50

Good histogram: tall and narrow. Most samples cluster near minimum. Short tail.

The tail determines your worst-case guarantee. A single spike at 500 us means your guarantee is 500 us, regardless of the 50 us average.
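
That rule can be expressed directly in code: the guarantee is set by the highest bucket that saw any samples at all, not by the average (an illustrative sketch over pre-binned counts):

```c
// Given histogram bucket upper bounds (us) and per-bucket sample
// counts, the worst-case guarantee is the largest bucket that saw
// at least one sample -- a single outlier sets the bound.
int worst_case_us(const int *bucket_us, const long *count, int n)
{
    int worst = 0;
    for (int i = 0; i < n; i++)
        if (count[i] > 0)
            worst = bucket_us[i];
    return worst;
}
```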


Good vs Bad Histograms

  GOOD (PREEMPT_RT + isolcpus):       BAD (Standard kernel, shared cores):

  ################################     ########
  ###                                  ######
  ##                                   #####
  #                                    ####
  |                                    ###
                                       ##
                                       #
                                       |  <-- long tail = missed deadlines
  |---|---|---|---|---|---|---|          |---|---|---|---|---|---|---|---|---|
  0  20  40  60  80 100    us          0  50 100 200 400 800 1600    us

Good: tight cluster, short tail. Predictable. Bad: wide spread, long tail. Cannot make guarantees.


The Prediction Question

Which change helps more for a sensor-display system?

A) Switching from standard kernel to PREEMPT_RT kernel
B) Keeping standard kernel but adding CPU isolation (isolcpus)

Think about it before the next slide.


Testing the Prediction

Four configurations to test with cyclictest:

Config  Kernel      CPU Isolation  Expected Max Latency
------  ----------  -------------  --------------------
1       Standard    None           ~1-10 ms
2       Standard    isolcpus=3     ~0.5-2 ms
3       PREEMPT_RT  None           ~50-150 us
4       PREEMPT_RT  isolcpus=3     ~20-80 us

# Config 1: Standard, no isolation
sudo cyclictest -t1 -p99 -a0 -i1000 -l100000

# Config 4: RT kernel + isolated core
sudo cyclictest -t1 -p99 -a3 -i1000 -l100000

Interpreting the Results

  Max latency (us, log scale):

  Config 1: |======================================| ~5000 us
  Config 2: |==============|                         ~1500 us
  Config 3: |==|                                     ~120 us
  Config 4: |=|                                      ~50 us

Answer: PREEMPT_RT helps far more than CPU isolation alone.

  • RT kernel reduces worst case by ~40x (5000 -> 120 us)
  • CPU isolation reduces it by ~3x (5000 -> 1500 us)
  • Combined gives the best result (~50 us)

CPU isolation is good. RT kernel is essential. Use both.


Why RT Kernel Wins

CPU isolation removes scheduler contention but does not fix:

  • Non-preemptible interrupt handlers (can block for ms)
  • Spinlock-held critical sections
  • RCU callback storms

PREEMPT_RT fixes all three:

  • Interrupt handlers become schedulable threads
  • Spinlocks become sleeping mutexes
  • RCU callbacks are offloaded

Isolation reduces competition. RT kernel reduces non-preemptible time. They address different problems.


The Complete Architecture

  +-----------------------------------------------------------+
  |  KERNEL: PREEMPT_RT   BOOT: isolcpus=1-3 nohz_full=1-3   |
  +-----------------------------------------------------------+
  |                                                             |
  |  Core 0 (General)              Core 1 (Isolated, RT)       |
  |  +------------------------+    +------------------------+  |
  |  | Render thread (normal) |    | Sensor thread (p=80)   |  |
  |  | DRM/KMS page flip      |<---| atomic_store(angle)    |  |
  |  | GPU + display IRQs     |    | SPI IRQ pinned here    |  |
  |  | systemd, networking    |    | Timer tick disabled    |  |
  |  +------------------------+    +------------------------+  |
  |                                                             |
  |  Core 2-3 (Isolated, RT)                                   |
  |  +------------------------------------------------------+  |
  |  | Data logger, control, spare RT tasks                  |  |
  |  +------------------------------------------------------+  |
  +-----------------------------------------------------------+

Profiling Your Own Code

You know how to measure jitter with cyclictest. But when your project is slow, you need to find where in your code the time goes.

Three tools, three questions:

Tool                   Question                  Use When
---------------------  ------------------------  -------------------------
mpstat -P ALL 1        Which cores are busy?     Always — first tool
perf record -g -p PID  Which functions eat CPU?  CPU-bound (high %)
strace -c -p PID       Which syscalls block?     I/O-bound (low CPU, slow)

Live Demo: perf record → perf report

# Step 1: Start your app
sudo SDL_VIDEODRIVER=kmsdrm ./level_sdl2 &

# Step 2: Record 10 seconds of CPU samples
sudo perf record -g -p $(pgrep level_sdl2) -- sleep 10

# Step 3: See where time goes
sudo perf report
  Overhead  Symbol
  --------  ------------------------------------------
    32.50%  render_horizon     ← hot function
    18.20%  memcpy             ← unnecessary copies?
    12.80%  imu_read_spi       ← expected
     9.40%  SDL_RenderPresent  ← flip/VSync wait

Rule: Optimize the top line first. Ignore anything below 5%.


Live Demo: strace -c

When CPU is low but app is still slow — something is blocking:

sudo strace -c -p $(pgrep my_app)
  % time     seconds  usecs/call     calls  syscall
  ------ ----------- ----------- --------- --------
   45.32    0.234000         234      1000  read      ← sensor blocking
   22.10    0.114000          57      2000  ioctl     ← DRM calls
    8.50    0.043900          44      1000  nanosleep ← idle (good)

If read() dominates → sensor I/O is blocking the render loop. Fix: Move sensor read to a separate thread.


The Profiling Workflow

Always the same five steps:

1. MEASURE    "How slow?"           FPS counter, mpstat
2. PROFILE    "Where?"              perf or strace
3. IDENTIFY   "Why?"                Blocking I/O? Copies? VSync?
4. FIX        "One change"          Move thread, remove stage, pin core
5. VERIFY     "Did it help?"        Same measurement as step 1

The most common mistake: jumping to step 4 without steps 1-3.


Optimization Priority — Cheapest First

#  Optimization             Effort   Example
-  -----------------------  -------  --------------------------------------
1  Remove unnecessary work  Minutes  Disable morphology in ball detection
2  Fix architecture         Hours    Move sensor off render thread
3  Use the right API        Hours    DRM page flip instead of fbdev memcpy
4  Core partitioning        Minutes  taskset -c 1 for sensor thread
5  DMA for peripherals      Days     SPI DMA for high-rate sensors
6  Cache / SIMD             Days     NEON intrinsics, struct packing

Do not start at #5. Most slow projects need #1 or #2.

Reference: Performance Profiling — full tool guide with examples.


Class Exercise: Profile the Ball Detection Pipeline

Open two terminals on the Pi:

# Terminal 1: Run ball detection with FPS display
python3 ball_detection.py

# Terminal 2: Profile it
sudo perf record -g -p $(pgrep python3) -- sleep 10
sudo perf report

Questions to answer:

  1. Which OpenCV function takes the most CPU time?
  2. If you disable morphology (m key), how does the profile change?
  3. What is the FPS before and after?
  4. Draw the pipeline with timing per stage.


Mini Exercise — Estimate Your Latency

You have a Pi 4 with PREEMPT_RT. IMU at 200 Hz over SPI, HDMI display at 60 fps, DRM/KMS double buffering.

Stage                        Best Case  Worst Case
---------------------------  ---------  ----------
SPI read (14 bytes @ 1 MHz)  ? ms       ? ms
Filter (3-tap FIR)           ? ms       ? ms
Render (SDL2 software)       ? ms       ? ms
VSync wait                   ? ms       ? ms
Scan-out to center           ? ms       ? ms
Total                        ? ms       ? ms

Fill in your estimates. We will compare answers.


Mini Exercise — Reference Answers

Stage               Best Case  Worst Case  Reasoning
------------------  ---------  ----------  -------------------------
SPI read            0.1 ms     0.2 ms      14 bytes x 8 bits / 1 MHz
Filter (3-tap FIR)  0.01 ms    0.05 ms     3 multiplies + adds
Render (SDL2)       2 ms       5 ms        Software blit
VSync wait          0 ms       16.7 ms     Alignment luck
Scan-out to center  0 ms       8.3 ms      Half of 16.7 ms
Total               ~2 ms      ~30 ms      Nearly 2 frames

The VSync wait is the wild card. Everything else is bounded and small.


Quick Checks

  1. What causes tearing, and how does double buffering prevent it?
  2. Why is DRM/KMS preferred over fbdev for real-time display?
  3. What is the worst-case VSync wait at 60 Hz?
  4. Does PREEMPT_RT always help display performance? Why or why not?
  5. Why does CPU isolation alone not match the benefit of an RT kernel?

Key Takeaways

  • Tearing is a display pipeline problem solved by VSync-aligned page flipping, not by faster rendering.
  • Total input-to-display latency is the sum of every pipeline stage; the VSync wait often dominates.
  • PREEMPT_RT improves sensor determinism but can introduce micro-delays in the GPU/DRM path.
  • Core partitioning (isolcpus) separates deterministic sensor work from best-effort display work.
  • RT kernel + CPU isolation together give the best result. RT kernel alone helps more than isolation alone.
  • Use lock-free shared state (atomic variables) between sensor and render threads.

Hands-On Next

Three labs connect to this theory:

DRM/KMS Test Use page flipping and VSync events to build a tear-free display application.

Display Applications Build interactive sensor-driven displays with SDL2 and DRM/KMS.

PREEMPT_RT Latency Measure jitter with and without core isolation. Build histograms. Compare the four configurations.