Lesson 6: Graphics Applications & Profiling
Óbuda University -- Linux in Embedded Systems
"The display works but stutters under load. Why?"
Problem First
Your level display application works on the bench: the IMU reads tilt and a bubble moves on screen.
But under CPU load, the bubble stutters and the image tears. The sensor loop still runs at the correct rate.
The problem is somewhere between "data ready" and "pixels visible."
This is a display pipeline problem, not a sensor problem.
Today's Map
- Block 1 (45 min): Display pipeline: how displays work, VSync/page flip, double/triple buffering, fbdev vs DRM/KMS, sensor-to-pixel pipeline, latency breakdown.
- Block 2 (45 min): RT kernel plus display: PREEMPT_RT and graphics, CPU core partitioning, `isolcpus` and IRQ affinity, DMA for peripherals, `cyclictest` under load, architecture patterns.
How a Display Works
A display panel scans out pixels line by line at a fixed rate.
At 60 Hz, one full frame is scanned every 16.7 ms.
Line 0 ████████████████████████████████ <-- scan starts here
Line 1 ████████████████████████████████
...
Line 539 ████████████████████████████████
--------------------------------
VBlank interval (~1-2 ms) <-- gap between frames
Line 0 ████████████████████████████████ <-- next frame starts
The VBlank interval is the short gap between the last line of one frame and the first line of the next.
What Is Tearing?
If you write new pixel data while the display is scanning, the top half shows the old frame and the bottom half shows the new one.
+--------------------------------+
| |
| OLD FRAME (already scanned) |
| |
|================================| <-- scan position when buffer changed
| |
| NEW FRAME (written too early) |
| |
+--------------------------------+
This visible seam is called tearing. It happens because the write and the scan-out are not synchronized.
VSync — The Synchronization Point
VSync = synchronize buffer updates to the VBlank interval.
The rule: never change the displayed buffer while the panel is scanning it.
Wait for VBlank, then swap. The panel always reads a complete, consistent frame.
Time -->
| Render | Wait | Flip | Render | Wait | Flip |
| frame | VBlank | ptrs | frame | VBlank | ptrs |
|__________|________|________|__________|________|________|
^ ^
VBlank VBlank
Double Buffering
Double buffering is the mechanism that makes VSync possible.
Two buffers exist in memory:
| Buffer | Role | CPU Access | Display Access |
|---|---|---|---|
| Back buffer | Being rendered | Write | None |
| Front buffer | Being displayed | None | Read (scan-out) |
The render loop writes to the back buffer. At VBlank, the pointers swap. No data is copied -- only the pointer changes.
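The pointer swap can be sketched in a few lines of C. This is illustrative bookkeeping only (the struct and names are ours, not a kernel API): the flip exchanges two pointers, and no pixel data moves.

```c
#include <stdint.h>

// Two buffers, two roles. At VBlank the roles swap.
typedef struct {
    uint32_t *front;   // panel scans this buffer out
    uint32_t *back;    // CPU renders into this buffer
} framebuffers;

// Called at VBlank: exchange the pointers -- no pixels are copied.
static void flip(framebuffers *fb) {
    uint32_t *tmp = fb->front;
    fb->front = fb->back;
    fb->back  = tmp;
}
```

Because only pointers change, the flip takes nanoseconds regardless of resolution.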
Double Buffering — Step by Step
Step 1: Render to back buffer Step 2: Wait for VBlank
+----------+ +----------+ +----------+ +----------+
| Back | | Front | | Back | | Front |
| [drawing]| | [display]| ---> | [done] | | [display]|
+----------+ +----------+ +----------+ +----------+
CPU writes Panel reads CPU idle Panel reads
Step 3: Flip (swap pointers) Step 4: Render next frame
+----------+ +----------+ +----------+ +----------+
| NEW Front| | NEW Back | | Front | | Back |
| [display]| | [free] | ---> | [display]| | [drawing]|
+----------+ +----------+ +----------+ +----------+
Panel reads CPU can write Panel reads CPU writes
Atomic from the display's perspective: zero tearing.
Triple Buffering — The Trade-Off
Triple buffering adds a third buffer so the CPU never stalls waiting for VBlank.
+----------+ +----------+ +----------+
| Buffer A | | Buffer B | | Buffer C |
| [display]| | [ready] | | [drawing]|
+----------+ +----------+ +----------+
Panel reads Queued next CPU writes
| Property | Double Buffering | Triple Buffering |
|---|---|---|
| Max latency | 1 frame (16.7 ms) | 2 frames (33.4 ms) |
| CPU stall on VSync | Yes (waits) | No (extra buffer) |
| Memory usage | 2x framebuffer | 3x framebuffer |
| Use case | Low latency | Smooth throughput |
Trade-off: smoothness vs responsiveness.
fbdev — The Legacy Interface
The framebuffer device (/dev/fb0) exposes a single memory-mapped buffer.
// fbdev memory-mapped framebuffer access (minimal; error handling omitted)
int fd = open("/dev/fb0", O_RDWR);
char *fb = mmap(NULL, screen_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
memcpy(fb, pixels, screen_size); // Write appears immediately -- and may tear
| Property | fbdev |
|---|---|
| Page flip | None (manual copy) |
| VSync notification | No standard API |
| Atomic commit | No |
| Tear-free | No |
| Status | Deprecated |
Writing to the mapped buffer during scan-out will tear. No way around it with fbdev alone.
DRM/KMS — The Modern Interface
DRM = Direct Rendering Manager. KMS = Kernel Mode Setting.
DRM/KMS provides hardware-assisted display pipeline control.
Userspace Kernel (DRM/KMS) Hardware
+----------------+ +------------------+ +--------+
| Render to | ioctl | Schedule flip | IRQ | Panel |
| dumb buffer |--------->| at next VBlank |------>| scans |
| | event | Notify complete | | out |
| |<---------| | | |
+----------------+ +------------------+ +--------+
The kernel handles the timing. You just say "flip when ready."
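The "flip when ready" flow can be sketched with libdrm's event API (`drmModePageFlip`, `drmHandleEvent`, `drmEventContext`). Only the handler below is real, compilable code; the loop is shown as comments because it needs a DRM device to run, and the flag name `waiting_for_flip` is ours.

```c
#include <stdbool.h>

static bool waiting_for_flip;

// libdrm invokes this from drmHandleEvent() once the flip lands at VBlank.
static void page_flip_handler(int fd, unsigned int sequence,
                              unsigned int tv_sec, unsigned int tv_usec,
                              void *user_data) {
    (void)fd; (void)sequence; (void)tv_sec; (void)tv_usec; (void)user_data;
    waiting_for_flip = false;   // old front buffer is now free to render into
}

/*
 * Typical loop (sketch, assuming libdrm):
 *   drmEventContext ev = { .version = DRM_EVENT_CONTEXT_VERSION,
 *                          .page_flip_handler = page_flip_handler };
 *   render_into(back_fb);
 *   drmModePageFlip(fd, crtc_id, back_fb_id, DRM_MODE_PAGE_FLIP_EVENT, NULL);
 *   waiting_for_flip = true;
 *   while (waiting_for_flip)
 *       drmHandleEvent(fd, &ev);   // blocks until the VBlank event arrives
 */
```

The application never touches the scan-out timing itself: it queues the flip, and the kernel applies it at the next VBlank.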
fbdev vs DRM/KMS Comparison
| Feature | fbdev | DRM/KMS |
|---|---|---|
| Page flip | Manual (copy) | Hardware-assisted |
| VSync notification | None | drmWaitVBlank / event |
| Atomic commit | No | Yes |
| Tear-free guarantee | No | Yes (with page flip) |
| Multi-plane support | No | Yes |
Rule: For any application where visual smoothness matters, use DRM/KMS.
Reserve fbdev for quick prototypes where tearing is acceptable.
The Sensor-to-Pixel Pipeline
In a real-time display application (e.g., IMU-driven level indicator), data flows through multiple stages. Each stage adds latency.
+----------+ +--------+ +--------+ +---------+ +---------+
| IMU |---->| Filter |---->| Shared |---->| Render |---->| Page |
| Read | | | | State | | Frame | | Flip |
| ~1 ms | | ~0.1ms | | | | ~2-5 ms | | 0-17 ms |
+----------+ +--------+ +--------+ +---------+ +---------+
| |
| I2C/SPI Display scan-out
| ~8 ms to center
Total input-to-display latency = sum of all stages.
Latency Breakdown — Best vs Worst Case
| Stage | Best Case | Worst Case | Notes |
|---|---|---|---|
| IMU read (SPI) | 1 ms | 1 ms | Fixed by clock rate |
| Filter | 0.1 ms | 0.1 ms | Deterministic |
| Render | 2 ms | 5 ms | Depends on scene |
| VSync wait | 0 ms | 16.7 ms | Largest variable |
| Scan-out to center | 0 ms | 8 ms | Half frame time |
| Total | ~3 ms | ~31 ms | Nearly 2 frames |
The VSync wait dominates. If you just missed VBlank, you wait a full frame period.
Best case: render finishes just before VBlank. Worst case: render finishes just after VBlank.
Why VSync Wait Dominates
VBlank VBlank VBlank VBlank
| | | |
v v v v
--|-------------|-------------|-------------|---> time
16.7 ms 16.7 ms 16.7 ms
Case A (best): Render done here | Flip!
^--- 0 ms wait
Case B (worst): Render done here | Flip!
^--- 16.7 ms wait
You cannot control when the render finishes relative to VBlank.
This is why the VSync wait is 0 to 16.7 ms -- it is purely a timing alignment issue.
Block 1 Summary
- Tearing is caused by writing to the display buffer during scan-out
- VSync + double buffering = tear-free display
- DRM/KMS provides hardware-assisted page flipping; fbdev cannot
- Sensor-to-pixel latency is the sum of all pipeline stages
- VSync wait is the largest variable: 0 to 16.7 ms at 60 Hz
- Triple buffering trades latency for throughput
Block 2 — RT Kernel and Display
"Which change helps more -- RT kernel or CPU isolation?"
PREEMPT_RT and Graphics — Not All Good News
PREEMPT_RT improves sensor loop determinism. But it also changes how the display pipeline behaves.
Key insight: PREEMPT_RT makes interrupt handlers preemptible -- including GPU and DRM interrupts.
Your high-priority sensor thread can now preempt the display interrupt handler.
This can introduce micro-delays in the display path that would not exist on a standard kernel.
RT Kernel Effects — Both Sides
| Change | Sensor Effect | Display Effect |
|---|---|---|
| Threaded IRQs | Sensor IRQ has schedulable priority | GPU IRQ preemptible |
| Sleeping spinlocks | Sensor driver is preemptible | GPU driver sees micro-delays |
| Priority inheritance | Prevents sensor mutex inversion | Prevents display mutex inversion |
| Deterministic scheduler | Jitter < 50 us | Consistent render scheduling |
PREEMPT_RT helps sensors significantly. On the display side, it can introduce micro-latency because GPU/DRM interrupts are now preemptible.
The Fundamental Conflict
The sensor loop needs determinism: guaranteed scheduling within microseconds.
The display loop needs throughput: render a full frame within 16.7 ms.
Sensor thread (RT prio 80):
|--run--|--sleep--|--run--|--sleep--|--run--|
Render thread (normal prio):
|=======render========|flip|=======render========|flip|
What happens when they share a core:
|==render==|PREEMPT|==render===|PREEMPT|==render==|flip|
sensor sensor
runs runs
here here
On a shared core, the sensor preempts the renderer. The frame may not finish before VBlank.
The Solution: CPU Core Partitioning
Do not share. Dedicate specific cores to specific tasks.
+---------------------------------------------------+
| Raspberry Pi 4 — 4 Cores |
| |
| Core 0 (General) Core 1 (Isolated RT) |
| +-----------------+ +---------------------+ |
| | Display + Render| | Sensor Read (p=80) | |
| | Hardware IRQs | | Filter + Control | |
| | systemd, logs | | (p=70) | |
| +-----------------+ +---------------------+ |
| |
| Core 2 (Isolated) Core 3 (Isolated) |
| +-----------------+ +---------------------+ |
| | Data Logger | | Spare / additional | |
| | (p=50) | | RT tasks | |
| +-----------------+ +---------------------+ |
+---------------------------------------------------+
isolcpus — Removing Cores from the Scheduler
This builds on the CPU isolation concepts from Lesson 7 (Real-Time Systems). Review those slides if you need a refresher on why isolation matters for determinism.
Kernel boot parameters control core partitioning:
| Parameter | Purpose |
|---|---|
| `isolcpus=1-3` | Remove cores 1-3 from the general scheduler |
| `nohz_full=1-3` | Disable timer ticks on isolated cores |
| `rcu_nocbs=1-3` | Offload RCU callbacks from isolated cores |
After boot, the kernel will only schedule tasks on Core 0 unless you explicitly pin a task to an isolated core.
Pinning Tasks to Isolated Cores
Use taskset to pin RT threads to isolated cores:
# Pin sensor thread to Core 1, RT priority 80
sudo taskset -c 1 chrt -f 80 ./sensor_loop
# Pin data logger to Core 2, RT priority 50
sudo taskset -c 2 chrt -f 50 ./data_logger
# Render thread stays on Core 0 (default scheduler)
./render_app
Or from C code:
#define _GNU_SOURCE        // required for pthread_setaffinity_np
#include <pthread.h>
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(1, &cpuset); // Pin to Core 1
int rc = pthread_setaffinity_np(thread, sizeof(cpuset), &cpuset);
// rc != 0 means the call failed (e.g. the core does not exist)
IRQ Affinity — Controlling Interrupts
Pin hardware interrupts to specific cores so they do not disturb isolated RT tasks:
# Find the IRQ number for SPI (sensor)
cat /proc/interrupts | grep spi
# Pin SPI interrupt to Core 1 (where sensor runs)
echo 0x2 > /proc/irq/42/smp_affinity # 0x2 = binary 0010 = Core 1
# Pin GPU/display interrupt to Core 0
echo 0x1 > /proc/irq/56/smp_affinity # 0x1 = binary 0001 = Core 0
| IRQ | Pin To | Reason |
|---|---|---|
| SPI (sensor) | Core 1 (isolated) | Low-latency wakeup |
| GPU / DRM | Core 0 (general) | Keep with render thread |
| USB, Ethernet | Core 0 (general) | Non-RT traffic |
Shared State Between Cores
Sensor (Core 1) and Render (Core 0) must exchange data. Use lock-free techniques:
Core 1 (Sensor) Core 0 (Render)
+------------------+ +------------------+
| Read IMU | | Read shared_angle|
| Filter | | Render frame |
| atomic_store( | -----> | using angle |
| shared_angle) | | Page flip |
+------------------+ +------------------+
// Shared declaration (C11 atomics, <stdatomic.h>):
_Atomic float shared_angle;
// Sensor side (Core 1):
atomic_store(&shared_angle, filtered_angle);
// Render side (Core 0):
float angle = atomic_load(&shared_angle);
No mutex. No lock. No priority inversion. The render thread never blocks the sensor thread.
DMA for Peripheral Access
When sensors are read at high rates or displays need large transfers, the CPU access method matters:
| Method | CPU Load (1 kHz IMU) | Latency | Jitter |
|---|---|---|---|
| Polling (busy-wait) | ~30-50% of core | Lowest | Lowest |
| Interrupt-driven | ~5-10% | Low | Low |
| DMA transfer | ~1-2% | Medium | Medium |
DMA = Direct Memory Access. The hardware moves data without CPU involvement.
When DMA Matters vs Overkill
Use DMA when:
- High sample rates (> 1 kHz) where interrupt overhead accumulates
- Large block transfers (display framebuffers over SPI)
- Bulk sensor reads (accelerometer FIFO burst reads)
Skip DMA when:
- Low-rate sensors (temperature every second)
- Small transfers (single register reads)
- One-shot configuration writes
Rule of thumb: For IMU at 100-500 Hz, interrupt-driven SPI is sufficient. For SPI display at 30 fps, DMA frees the CPU for other work.
DMA Transfer — How It Works
Without DMA: With DMA:
+-----+ +-----+ +-----+ +-----+
| CPU |<--->| SPI | | CPU | | SPI |
| | | dev | | | | dev |
+-----+ +-----+ +--+--+ +--+--+
CPU reads each byte | |
one at a time. | +-----+ |
CPU is 100% busy | | DMA | |
during transfer. | | eng.| |
| +--+--+ |
| | |
1.Setup 2.Transfer
(CPU) (no CPU)
3.IRQ done
CPU sets up the transfer, then is free to do other work. DMA engine handles the byte-by-byte movement.
cyclictest — Measuring Scheduling Latency
cyclictest measures the time between when a thread should wake and when it actually wakes.
| Flag | Meaning |
|---|---|
| `-t1` | One thread |
| `-p99` | RT priority 99 (highest) |
| `-a3` | Pin to CPU 3 |
| `-i1000` | 1000 us interval (1 kHz) |
| `-l100000` | 100,000 loops |
Output: min, avg, max latency. The max is what matters.
Running cyclictest Under Load
Always test under stress. Idle system latency is meaningless.
# Terminal 1: Generate CPU + I/O stress
stress-ng --cpu 4 --io 2 --vm 1 \
--vm-bytes 128M --timeout 120s &
# Terminal 2: Measure while stressed
sudo cyclictest -t1 -p99 -a3 -i1000 \
-l100000 -h400 > histogram.txt
The `-h400` flag records a latency histogram with one bucket per microsecond, tracking latencies up to 400 us. This tells you not just the max, but the shape of the latency distribution.
Reading a cyclictest Histogram
Latency (us) | Count
----------------+-------------------------------------------
0 - 10 | ################################ 89,200
10 - 20 | ##### 4,100
20 - 50 | ### 2,500
50 - 100 | ## 1,800
100 - 200 | | 350
200+ | . 50
Good histogram: tall and narrow. Most samples cluster near minimum. Short tail.
The tail determines your worst-case guarantee. A single spike at 500 us means your guarantee is 500 us, regardless of the 50 us average.
Good vs Bad Histograms
GOOD (PREEMPT_RT + isolcpus): BAD (Standard kernel, shared cores):
################################ ########
### ######
## #####
# ####
| ###
##
#
| <-- long tail = missed deadlines
|---|---|---|---|---|---|---| |---|---|---|---|---|---|---|---|---|
0 20 40 60 80 100 us 0 50 100 200 400 800 1600 us
Good: tight cluster, short tail. Predictable. Bad: wide spread, long tail. Cannot make guarantees.
The Prediction Question
Which change helps more for a sensor-display system?
A) Switching from standard kernel to PREEMPT_RT kernel
B) Keeping standard kernel but adding CPU isolation (isolcpus)
Think about it before the next slide.
Testing the Prediction
Four configurations to test with cyclictest:
| Config | Kernel | CPU Isolation | Expected Max Latency |
|---|---|---|---|
| 1 | Standard | None | ~1-10 ms |
| 2 | Standard | `isolcpus=3` | ~0.5-2 ms |
| 3 | PREEMPT_RT | None | ~50-150 us |
| 4 | PREEMPT_RT | `isolcpus=3` | ~20-80 us |
# Config 1: Standard, no isolation
sudo cyclictest -t1 -p99 -a0 -i1000 -l100000
# Config 4: RT kernel + isolated core
sudo cyclictest -t1 -p99 -a3 -i1000 -l100000
Interpreting the Results
Max latency (us, log scale):
Config 1: |======================================| ~5000 us
Config 2: |==============| ~1500 us
Config 3: |==| ~120 us
Config 4: |=| ~50 us
Answer: PREEMPT_RT helps far more than CPU isolation alone.
- RT kernel reduces worst case by ~40x (5000 -> 120 us)
- CPU isolation reduces it by ~3x (5000 -> 1500 us)
- Combined gives the best result (~50 us)
CPU isolation is good. RT kernel is essential. Use both.
Why RT Kernel Wins
CPU isolation removes scheduler contention but does not fix:
- Non-preemptible interrupt handlers (can block for ms)
- Spinlock-held critical sections
- RCU callback storms
PREEMPT_RT fixes all three:
- Interrupt handlers become schedulable threads
- Spinlocks become sleeping mutexes
- RCU callbacks are offloaded
Isolation reduces competition. RT kernel reduces non-preemptible time. They address different problems.
The Complete Architecture
+-----------------------------------------------------------+
| KERNEL: PREEMPT_RT BOOT: isolcpus=1-3 nohz_full=1-3 |
+-----------------------------------------------------------+
| |
| Core 0 (General) Core 1 (Isolated, RT) |
| +------------------------+ +------------------------+ |
| | Render thread (normal) | | Sensor thread (p=80) | |
| | DRM/KMS page flip |<---| atomic_store(angle) | |
| | GPU + display IRQs | | SPI IRQ pinned here | |
| | systemd, networking | | Timer tick disabled | |
| +------------------------+ +------------------------+ |
| |
| Core 2-3 (Isolated, RT) |
| +------------------------------------------------------+ |
| | Data logger, control, spare RT tasks | |
| +------------------------------------------------------+ |
+-----------------------------------------------------------+
Profiling Your Own Code
You know how to measure jitter with cyclictest. But when your project is slow, you need to find where in your code the time goes.
Three tools, three questions:
| Tool | Question | Use When |
|---|---|---|
| `mpstat -P ALL 1` | Which cores are busy? | Always — first tool |
| `perf record -g -p PID` | Which functions eat CPU? | CPU-bound (high %) |
| `strace -c -p PID` | Which syscalls block? | I/O-bound (low CPU, slow) |
Live Demo: perf record → perf report
# Step 1: Start your app
sudo SDL_VIDEODRIVER=kmsdrm ./level_sdl2 &
# Step 2: Record 10 seconds of CPU samples
sudo perf record -g -p $(pgrep level_sdl2) -- sleep 10
# Step 3: See where time goes
sudo perf report
Overhead Symbol
-------- ------------------------------------------
32.50% render_horizon ← hot function
18.20% memcpy ← unnecessary copies?
12.80% imu_read_spi ← expected
9.40% SDL_RenderPresent ← flip/VSync wait
Rule: Optimize the top line first. Ignore anything below 5%.
Live Demo: strace -c
When CPU is low but app is still slow — something is blocking:
% time seconds usecs/call calls syscall
------ ----------- ----------- --------- --------
45.32 0.234000 234 1000 read ← sensor blocking
22.10 0.114000 57 2000 ioctl ← DRM calls
8.50 0.043900 44 1000 nanosleep ← idle (good)
If read() dominates → sensor I/O is blocking the render loop.
Fix: Move sensor read to a separate thread.
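That fix can be sketched in a few lines: the blocking read runs in its own thread and publishes the latest value through an atomic, so the render loop never waits on I/O. This is a hedged sketch; `read_imu_blocking()` and `shared_angle` are placeholder names, not a real API.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic float shared_angle;
static atomic_bool   running = true;

static float read_imu_blocking(void) {
    return 1.0f;   // placeholder: real code would block on the SPI read here
}

// Dedicated sensor thread: blocks on I/O, publishes the newest value.
static void *sensor_thread(void *arg) {
    (void)arg;
    while (atomic_load(&running))
        atomic_store(&shared_angle, read_imu_blocking());
    return NULL;
}

// Render loop side, each frame: float angle = atomic_load(&shared_angle);
```

After this change, `strace -c` on the render process should show `read()` disappear from the top of the table.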
The Profiling Workflow
Always the same five steps:
1. MEASURE "How slow?" FPS counter, mpstat
↓
2. PROFILE "Where?" perf or strace
↓
3. IDENTIFY "Why?" Blocking I/O? Copies? VSync?
↓
4. FIX "One change" Move thread, remove stage, pin core
↓
5. VERIFY "Did it help?" Same measurement as step 1
The most common mistake: jumping to step 4 without steps 1-3.
Optimization Priority — Cheapest First
| # | Optimization | Effort | Example |
|---|---|---|---|
| 1 | Remove unnecessary work | Minutes | Disable morphology in ball detection |
| 2 | Fix architecture | Hours | Move sensor off render thread |
| 3 | Use the right API | Hours | DRM page flip instead of fbdev memcpy |
| 4 | Core partitioning | Minutes | taskset -c 1 for sensor thread |
| 5 | DMA for peripherals | Days | SPI DMA for high-rate sensors |
| 6 | Cache / SIMD | Days | NEON intrinsics, struct packing |
Do not start at #5. Most slow projects need #1 or #2.
Reference: Performance Profiling — full tool guide with examples.
Class Exercise: Profile the Ball Detection Pipeline
Open two terminals on the Pi:
# Terminal 1: Run ball detection with FPS display
python3 ball_detection.py
# Terminal 2: Profile it
sudo perf record -g -p $(pgrep python3) -- sleep 10
sudo perf report
Questions to answer:
1. Which OpenCV function takes the most CPU time?
2. If you disable morphology (m key), how does the profile change?
3. What is the FPS before and after?
4. Draw the pipeline with timing per stage.
Mini Exercise — Estimate Your Latency
You have a Pi 4 with PREEMPT_RT. IMU at 200 Hz over SPI, HDMI display at 60 fps, DRM/KMS double buffering.
| Stage | Best Case | Worst Case |
|---|---|---|
| SPI read (14 bytes @ 1 MHz) | ? ms | ? ms |
| Filter (3-tap FIR) | ? ms | ? ms |
| Render (SDL2 software) | ? ms | ? ms |
| VSync wait | ? ms | ? ms |
| Scan-out to center | ? ms | ? ms |
| Total | ? ms | ? ms |
Fill in your estimates. We will compare answers.
Mini Exercise — Reference Answers
| Stage | Best Case | Worst Case | Reasoning |
|---|---|---|---|
| SPI read | 0.1 ms | 0.2 ms | 14 bytes x 8 bits / 1 MHz |
| Filter (3-tap FIR) | 0.01 ms | 0.05 ms | 3 multiplies + adds |
| Render (SDL2) | 2 ms | 5 ms | Software blit |
| VSync wait | 0 ms | 16.7 ms | Alignment luck |
| Scan-out to center | 0 ms | 8.3 ms | Half of 16.7 ms |
| Total | ~2 ms | ~30 ms | Nearly 2 frames |
The VSync wait is the wild card. Everything else is bounded and small.
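The SPI row in the table is simple arithmetic: transfer time equals bits sent divided by the clock rate. A one-line helper (our own, for illustration) reproduces it:

```c
// SPI transfer time in ms: bits sent / clock rate.
// 14 bytes at 1 MHz -> 14 * 8 / 1e6 s = 112 us, i.e. ~0.1 ms.
static double spi_transfer_ms(int bytes, double clock_hz) {
    return bytes * 8.0 / clock_hz * 1000.0;
}
```

The same formula gives the worst-case row if you include protocol overhead such as address bytes.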
Quick Checks
- What causes tearing, and how does double buffering prevent it?
- Why is DRM/KMS preferred over fbdev for real-time display?
- What is the worst-case VSync wait at 60 Hz?
- Does PREEMPT_RT always help display performance? Why or why not?
- Why does CPU isolation alone not match the benefit of an RT kernel?
Key Takeaways
- Tearing is a display pipeline problem solved by VSync-aligned page flipping, not by faster rendering.
- Total input-to-display latency is the sum of every pipeline stage; the VSync wait often dominates.
- PREEMPT_RT improves sensor determinism but can introduce micro-delays in the GPU/DRM path.
- Core partitioning (`isolcpus`) separates deterministic sensor work from best-effort display work.
- RT kernel + CPU isolation together give the best result. RT kernel alone helps more than isolation alone.
- Use lock-free shared state (atomic variables) between sensor and render threads.
Hands-On Next
Three labs connect to this theory:
DRM/KMS Test Use page flipping and VSync events to build a tear-free display application.
Display Applications Build interactive sensor-driven displays with SDL2 and DRM/KMS.
PREEMPT_RT Latency Measure jitter with and without core isolation. Build histograms. Compare the four configurations.