
Lesson 6: Graphics Applications & Profiling

Óbuda University -- Linux in Embedded Systems

"The display works but stutters under load. Why?"


Problem First

Your level display application works on the bench: the IMU reads tilt and a bubble moves on screen.

But under CPU load, the bubble stutters and the image tears. The sensor loop still runs at the correct rate.

The problem is somewhere between "data ready" and "pixels visible."

This is a display pipeline problem, not a sensor problem.


Today's Map

  • Block 1 (45 min): Display pipeline: how displays work, VSync/page flip, double/triple buffering, fbdev vs DRM/KMS, sensor-to-pixel pipeline, latency breakdown.
  • Block 2 (45 min): RT kernel plus display: PREEMPT_RT and graphics, CPU core partitioning, isolcpus and IRQ affinity, DMA for peripherals, cyclictest under load, architecture patterns.

How a Display Works

A display panel scans out pixels line by line at a fixed rate.

At 60 Hz, one full frame is scanned every 16.7 ms.

  Line 0   ████████████████████████████████  <-- scan starts here
  Line 1   ████████████████████████████████
  ...
  Line 539 ████████████████████████████████
            --------------------------------
            VBlank interval (~1-2 ms)         <-- gap between frames
  Line 0   ████████████████████████████████  <-- next frame starts

The VBlank interval is the short gap between the last line of one frame and the first line of the next.


What Is Tearing?

If you write new pixel data while the display is scanning, the top half shows the old frame and the bottom half shows the new one.

  +--------------------------------+
  |                                |
  |  OLD FRAME (already scanned)   |
  |                                |
  |================================| <-- scan position when buffer changed
  |                                |
  |  NEW FRAME (written too early) |
  |                                |
  +--------------------------------+

This visible seam is called tearing. It happens because the write and the scan-out are not synchronized.


VSync — The Synchronization Point

VSync = synchronize buffer updates to the VBlank interval.

The rule: never change the displayed buffer while the panel is scanning it.

Wait for VBlank, then swap. The panel always reads a complete, consistent frame.

  Time -->
  |  Render  |  Wait  |  Flip  |  Render  |  Wait  |  Flip  |
  |  frame   | VBlank |  ptrs  |  frame   | VBlank |  ptrs  |
  |__________|________|________|__________|________|________|
             ^                            ^
          VBlank                       VBlank

Double Buffering

Double buffering is the mechanism that makes VSync possible.

Two buffers exist in memory:

Buffer        Role             CPU Access  Display Access
------------  ---------------  ----------  ---------------
Back buffer   Being rendered   Write       None
Front buffer  Being displayed  None        Read (scan-out)

The render loop writes to the back buffer. At VBlank, the pointers swap. No data is copied -- only the pointer changes.
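
The swap can be sketched in a few lines of C (an illustrative sketch; the buffer names and sizes are made up, and a real driver performs the swap in the VBlank handler):

```c
#include <stdint.h>

// Two framebuffers; sizes are illustrative (640x480, 32 bpp).
#define FB_PIXELS (640 * 480)
uint32_t buf_a[FB_PIXELS];
uint32_t buf_b[FB_PIXELS];

uint32_t *front = buf_a;  // panel scans out from here
uint32_t *back  = buf_b;  // CPU renders into here

// Called at VBlank: exchange the roles. No pixel data is copied,
// which is why the flip is cheap and atomic for the display.
void page_flip(void)
{
    uint32_t *tmp = front;
    front = back;
    back  = tmp;
}
```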


Double Buffering — Step by Step

  Step 1: Render to back buffer        Step 2: Wait for VBlank
  +----------+    +----------+         +----------+    +----------+
  | Back     |    | Front    |         | Back     |    | Front    |
  | [drawing]|    | [display]|  --->   | [done]   |    | [display]|
  +----------+    +----------+         +----------+    +----------+
   CPU writes      Panel reads          CPU idle        Panel reads

  Step 3: Flip (swap pointers)         Step 4: Render next frame
  +----------+    +----------+         +----------+    +----------+
  | NEW Front|    | NEW Back |         | Front    |    | Back     |
  | [display]|    | [free]   |  --->   | [display]|    | [drawing]|
  +----------+    +----------+         +----------+    +----------+
   Panel reads     CPU can write        Panel reads     CPU writes

Atomic from the display's perspective: zero tearing.


Triple Buffering — The Trade-Off

Triple buffering adds a third buffer so the CPU never stalls waiting for VBlank.

  +----------+    +----------+    +----------+
  | Buffer A |    | Buffer B |    | Buffer C |
  | [display]|    | [ready]  |    | [drawing]|
  +----------+    +----------+    +----------+
   Panel reads     Queued next     CPU writes

Property            Double Buffering   Triple Buffering
------------------  -----------------  ------------------
Max latency         1 frame (16.7 ms)  2 frames (33.4 ms)
CPU stall on VSync  Yes (waits)        No (extra buffer)
Memory usage        2x framebuffer     3x framebuffer
Use case            Low latency        Smooth throughput

Trade-off: smoothness vs responsiveness.
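
The rotation of roles can be sketched as index juggling (an illustrative sketch; the struct and field names are made up):

```c
// Triple buffering as index rotation: one buffer is displayed, the
// other two alternate between "being drawn" and "queued for flip".
typedef struct {
    int display;  // buffer the panel is scanning out
    int ready;    // completed frame queued for the next VBlank
    int draw;     // buffer the CPU is rendering into
} TripleBuf;

// Renderer finished a frame: it becomes the queued one, and the
// previously queued buffer is recycled for drawing. CPU never waits.
void frame_done(TripleBuf *t)
{
    int tmp = t->ready;
    t->ready = t->draw;
    t->draw = tmp;
}

// VBlank: display the queued frame; the old display buffer frees up.
void vblank_flip(TripleBuf *t)
{
    int tmp = t->display;
    t->display = t->ready;
    t->ready = tmp;
}
```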


fbdev — The Legacy Interface

The framebuffer device (/dev/fb0) exposes a single memory-mapped buffer.

// fbdev memory-mapped framebuffer access (error handling omitted;
// screen_size comes from ioctl(fd, FBIOGET_VSCREENINFO, ...) in real code)
int fd = open("/dev/fb0", O_RDWR);
char *fb = mmap(NULL, screen_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
memcpy(fb, pixels, screen_size);  // Write = instant display (may tear)

Property            fbdev
------------------  ------------------
Page flip           None (manual copy)
VSync notification  No standard API
Atomic commit       No
Tear-free           No
Status              Deprecated

Writing to the mapped buffer during scan-out will tear. No way around it with fbdev alone.


DRM/KMS — The Modern Interface

DRM = Direct Rendering Manager. KMS = Kernel Mode Setting.

DRM/KMS provides hardware-assisted display pipeline control.

  Userspace                    Kernel (DRM/KMS)          Hardware
  +----------------+          +------------------+       +--------+
  | Render to      |  ioctl   | Schedule flip    | IRQ   | Panel  |
  | dumb buffer    |--------->| at next VBlank   |------>| scans  |
  |                |  event   | Notify complete  |       | out    |
  |                |<---------|                  |       |        |
  +----------------+          +------------------+       +--------+

The kernel handles the timing. You just say "flip when ready."


fbdev vs DRM/KMS Comparison

Feature              fbdev          DRM/KMS
-------------------  -------------  ---------------------
Page flip            Manual (copy)  Hardware-assisted
VSync notification   None           drmWaitVBlank / event
Atomic commit        No             Yes
Tear-free guarantee  No             Yes (with page flip)
Multi-plane support  No             Yes

Rule: For any application where visual smoothness matters, use DRM/KMS.

Reserve fbdev for quick prototypes where tearing is acceptable.


The Sensor-to-Pixel Pipeline

In a real-time display application (e.g., IMU-driven level indicator), data flows through multiple stages. Each stage adds latency.

  +----------+     +--------+     +--------+     +---------+     +---------+
  |   IMU    |---->| Filter |---->| Shared |---->| Render  |---->|  Page   |
  |   Read   |     |        |     | State  |     |  Frame  |     |  Flip   |
  | ~1 ms    |     | ~0.1ms |     |        |     | ~2-5 ms |     | 0-17 ms |
  +----------+     +--------+     +--------+     +---------+     +---------+
       |                                                               |
       |  I2C/SPI                                              Display scan-out
       |                                                         ~8 ms to center

Total input-to-display latency = sum of all stages.


Latency Breakdown — Best vs Worst Case

Stage               Best Case  Worst Case  Notes
------------------  ---------  ----------  -------------------
IMU read (SPI)      1 ms       1 ms        Fixed by clock rate
Filter              0.1 ms     0.1 ms      Deterministic
Render              2 ms       5 ms        Depends on scene
VSync wait          0 ms       16.7 ms     Largest variable
Scan-out to center  0 ms       8 ms        Half frame time
Total               ~3 ms      ~31 ms      Nearly 2 frames

The VSync wait dominates. If you just missed VBlank, you wait a full frame period.

Best case: render finishes just before VBlank. Worst case: render finishes just after VBlank.
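
The table's totals are a plain sum of the stages; as a quick check (values taken from the table above, in milliseconds):

```c
// Input-to-display latency = sum of all pipeline stages (ms).
double total_latency_ms(double imu, double filter, double render,
                        double vsync_wait, double scanout)
{
    return imu + filter + render + vsync_wait + scanout;
}
// Best case:  total_latency_ms(1, 0.1, 2, 0, 0)    -> ~3 ms
// Worst case: total_latency_ms(1, 0.1, 5, 16.7, 8) -> ~31 ms
```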


Why VSync Wait Dominates

  VBlank        VBlank        VBlank        VBlank
    |             |             |             |
    v             v             v             v
  --|-------------|-------------|-------------|---> time
       16.7 ms       16.7 ms       16.7 ms

  Case A (best):  Render done here |         Flip!
                                   ^--- 0 ms wait

  Case B (worst): Render done here  |                    Flip!
                                    ^--- 16.7 ms wait

You cannot control when the render finishes relative to VBlank.

This is why the VSync wait is 0 to 16.7 ms -- it is purely a timing alignment issue.
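
The 0 to 16.7 ms spread is pure modular arithmetic: the wait is the distance from render-finish to the next VBlank boundary. A sketch in integer microseconds (the helper name is illustrative):

```c
// Wait from render completion until the next VBlank boundary.
// finish_us: time the frame finished; period_us: 16700 at ~60 Hz.
long vsync_wait_us(long finish_us, long period_us)
{
    long phase = finish_us % period_us;   // position within the frame
    return (phase == 0) ? 0 : period_us - phase;
}
// Case A (just before VBlank): vsync_wait_us(16600, 16700) -> 100
// Case B (just after VBlank):  vsync_wait_us(16800, 16700) -> 16600
```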


Block 1 Summary

  • Tearing is caused by writing to the display buffer during scan-out
  • VSync + double buffering = tear-free display
  • DRM/KMS provides hardware-assisted page flipping; fbdev cannot
  • Sensor-to-pixel latency is the sum of all pipeline stages
  • VSync wait is the largest variable: 0 to 16.7 ms at 60 Hz
  • Triple buffering trades latency for throughput

Block 2 — RT Kernel and Display

"Which change helps more -- RT kernel or CPU isolation?"


PREEMPT_RT and Graphics — Not All Good News

PREEMPT_RT improves sensor loop determinism. But it also changes how the display pipeline behaves.

Key insight: PREEMPT_RT makes interrupt handlers preemptible -- including GPU and DRM interrupts.

Your high-priority sensor thread can now preempt the display interrupt handler.

This can introduce micro-delays in the display path that would not exist on a standard kernel.


RT Kernel Effects — Both Sides

Change                   Sensor Effect                        Display Effect
-----------------------  -----------------------------------  --------------------------------
Threaded IRQs            Sensor IRQ has schedulable priority  GPU IRQ preemptible
Sleeping spinlocks       Sensor driver is preemptible         GPU driver sees micro-delays
Priority inheritance     Prevents sensor mutex inversion      Prevents display mutex inversion
Deterministic scheduler  Jitter < 50 us                       Consistent render scheduling

PREEMPT_RT helps sensors significantly. On the display side, it can introduce micro-latency because GPU/DRM interrupts are now preemptible.


The Fundamental Conflict

The sensor loop needs determinism: guaranteed scheduling within microseconds.

The display loop needs throughput: render a full frame within 16.7 ms.

  Sensor thread (RT prio 80):
  |--run--|--sleep--|--run--|--sleep--|--run--|

  Render thread (normal prio):
  |=======render========|flip|=======render========|flip|

  What happens when they share a core:
  |==render==|PREEMPT|==render===|PREEMPT|==render==|flip|
              sensor              sensor
              runs                runs
              here                here

On a shared core, the sensor preempts the renderer. The frame may not finish before VBlank.


The Solution: CPU Core Partitioning

Do not share. Dedicate specific cores to specific tasks.

  +---------------------------------------------------+
  |  Raspberry Pi 4 — 4 Cores                         |
  |                                                    |
  |  Core 0 (General)     Core 1 (Isolated RT)        |
  |  +-----------------+  +---------------------+     |
  |  | Display + Render|  | Sensor Read (p=80)  |     |
  |  | Hardware IRQs   |  | Filter + Control    |     |
  |  | systemd, logs   |  | (p=70)              |     |
  |  +-----------------+  +---------------------+     |
  |                                                    |
  |  Core 2 (Isolated)    Core 3 (Isolated)            |
  |  +-----------------+  +---------------------+     |
  |  | Data Logger     |  | Spare / additional  |     |
  |  | (p=50)          |  | RT tasks            |     |
  |  +-----------------+  +---------------------+     |
  +---------------------------------------------------+

isolcpus — Removing Cores from the Scheduler

This builds on the CPU isolation concepts from Lesson 7 (Real-Time Systems). Review those slides if you need a refresher on why isolation matters for determinism.

Kernel boot parameters control core partitioning:

Parameter      Purpose
-------------  -----------------------------------------
isolcpus=1-3   Remove cores 1-3 from general scheduler
nohz_full=1-3  Disable timer ticks on isolated cores
rcu_nocbs=1-3  Offload RCU callbacks from isolated cores

# In /boot/cmdline.txt (Raspberry Pi):
... isolcpus=1-3 nohz_full=1-3 rcu_nocbs=1-3

After boot, the kernel will only schedule tasks on Core 0 unless you explicitly pin a task to an isolated core.


Pinning Tasks to Isolated Cores

Use taskset to pin RT threads to isolated cores:

# Pin sensor thread to Core 1, RT priority 80
sudo taskset -c 1 chrt -f 80 ./sensor_loop

# Pin data logger to Core 2, RT priority 50
sudo taskset -c 2 chrt -f 50 ./data_logger

# Render thread stays on Core 0 (default scheduler)
./render_app

Or from C code:

// Requires: #define _GNU_SOURCE before #include <pthread.h> / <sched.h>
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(1, &cpuset);  // Pin to Core 1
pthread_setaffinity_np(thread, sizeof(cpuset), &cpuset);
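
A self-contained sketch that pins the calling thread and checks where it landed (pinning to Core 0 here so it also works on systems without isolated cores; `pin_self_to_core` is an illustrative helper name):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to the given core; returns 0 on success.
int pin_self_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Report which core the calling thread is currently running on.
int current_core(void)
{
    return sched_getcpu();
}
```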

IRQ Affinity — Controlling Interrupts

Pin hardware interrupts to specific cores so they do not disturb isolated RT tasks:

# Find the IRQ number for SPI (sensor)
cat /proc/interrupts | grep spi

# Pin SPI interrupt to Core 1 (where sensor runs)
echo 0x2 > /proc/irq/42/smp_affinity  # 0x2 = binary 0010 = Core 1

# Pin GPU/display interrupt to Core 0
echo 0x1 > /proc/irq/56/smp_affinity  # 0x1 = binary 0001 = Core 0

IRQ            Pin To             Reason
-------------  -----------------  -----------------------
SPI (sensor)   Core 1 (isolated)  Low-latency wakeup
GPU / DRM      Core 0 (general)   Keep with render thread
USB, Ethernet  Core 0 (general)   Non-RT traffic
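
The smp_affinity value is just a hex bitmask with bit N set for core N. A quick sketch of the computation (helper names are illustrative):

```c
#include <stdio.h>

// smp_affinity bitmask: bit N selects core N.
unsigned int core_mask(int core)
{
    return 1u << core;
}

// Print the value you would echo into /proc/irq/<n>/smp_affinity.
void print_mask(int core)
{
    printf("core %d -> 0x%x\n", core, core_mask(core));
}
```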

Shared State Between Cores

Sensor (Core 1) and Render (Core 0) must exchange data. Use lock-free techniques:

  Core 1 (Sensor)             Core 0 (Render)
  +------------------+        +------------------+
  | Read IMU         |        | Read shared_angle|
  | Filter           |        | Render frame     |
  | atomic_store(    | -----> | using angle      |
  |   shared_angle)  |        | Page flip        |
  +------------------+        +------------------+

// Shared variable (C11 <stdatomic.h>):
_Atomic float shared_angle;

// Sensor side (Core 1):
atomic_store(&shared_angle, filtered_angle);

// Render side (Core 0):
float angle = atomic_load(&shared_angle);

No mutex. No lock. No priority inversion. The render thread never blocks the sensor thread.


DMA for Peripheral Access

When sensors are read at high rates or displays need large transfers, the CPU access method matters:

Method               CPU Load (1 kHz IMU)  Latency  Jitter
-------------------  --------------------  -------  ------
Polling (busy-wait)  ~30-50% of core       Lowest   Lowest
Interrupt-driven     ~5-10%                Low      Low
DMA transfer         ~1-2%                 Medium   Medium

DMA = Direct Memory Access. The hardware moves data without CPU involvement.


When DMA Matters vs Overkill

Use DMA when:

  • High sample rates (> 1 kHz) where interrupt overhead accumulates
  • Large block transfers (display framebuffers over SPI)
  • Bulk sensor reads (accelerometer FIFO burst reads)

Skip DMA when:

  • Low-rate sensors (temperature every second)
  • Small transfers (single register reads)
  • One-shot configuration writes

Rule of thumb: For IMU at 100-500 Hz, interrupt-driven SPI is sufficient. For SPI display at 30 fps, DMA frees the CPU for other work.
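
The rule of thumb is easy to sanity-check with throughput arithmetic (the 320x240, 16-bit panel geometry below is an assumption, not from the lesson hardware):

```c
// Bytes per second a display needs over SPI at a given frame rate.
// 320x240, 2 bytes/pixel, 30 fps -> ~4.6 MB/s: clearly DMA territory.
long display_bytes_per_sec(int width, int height, int bytes_per_px, int fps)
{
    return (long)width * height * bytes_per_px * fps;
}

// Microseconds of SPI wire time per sensor sample (bits / clock rate).
// 14 bytes at 1 MHz -> ~112 us: interrupt-driven is fine.
double spi_sample_us(int bytes, double clock_hz)
{
    return bytes * 8 / clock_hz * 1e6;
}
```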


DMA Transfer — How It Works

  Without DMA:                    With DMA:
  +-----+     +-----+            +-----+     +-----+
  | CPU |<--->| SPI |            | CPU |     | SPI |
  |     |     | dev |            |     |     | dev |
  +-----+     +-----+            +--+--+     +--+--+
  CPU reads each byte                |           |
  one at a time.                     |  +-----+  |
  CPU is 100% busy                   |  | DMA |  |
  during transfer.                   |  | eng.|  |
                                     |  +--+--+  |
                                     |     |     |
                                  1.Setup  2.Transfer
                                  (CPU)    (no CPU)
                                           3.IRQ done

CPU sets up the transfer, then is free to do other work. DMA engine handles the byte-by-byte movement.


cyclictest — Measuring Scheduling Latency

cyclictest measures the time between when a thread should wake and when it actually wakes.

# Measure scheduling latency on Core 3:
sudo cyclictest -t1 -p99 -a3 -i1000 -l100000

Flag      Meaning
--------  ------------------------
-t1       One thread
-p99      RT priority 99 (highest)
-a3       Pin to CPU 3
-i1000    1000 us interval (1 kHz)
-l100000  100,000 loops

Output: min, avg, max latency. The max is what matters.


Running cyclictest Under Load

Always test under stress. Idle system latency is meaningless.

# Terminal 1: Generate CPU + I/O stress
stress-ng --cpu 4 --io 2 --vm 1 \
    --vm-bytes 128M --timeout 120s &

# Terminal 2: Measure while stressed
sudo cyclictest -t1 -p99 -a3 -i1000 \
    -l100000 -h400 > histogram.txt

The -h400 flag records a latency histogram covering 0 to 400 us (one bucket per microsecond). This tells you not just the max, but the shape of the latency distribution.


Reading a cyclictest Histogram

  Latency (us)   |  Count
 ----------------+-------------------------------------------
      0 -  10    |  ################################  89,200
     10 -  20    |  #####                              4,100
     20 -  50    |  ###                                2,500
     50 - 100    |  ##                                 1,800
    100 - 200    |  |                                    350
    200+         |  .                                     50

Good histogram: tall and narrow. Most samples cluster near minimum. Short tail.

The tail determines your worst-case guarantee. A single spike at 500 us means your guarantee is 500 us, regardless of the 50 us average.
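
That rule can be expressed directly in code: the guarantee is set by the highest bucket that saw any samples at all, not by the average (an illustrative sketch over pre-binned counts):

```c
// Given histogram bucket upper bounds (us) and per-bucket sample
// counts, the worst-case guarantee is the largest bucket that saw
// at least one sample -- a single outlier sets the bound.
int worst_case_us(const int *bucket_us, const long *count, int n)
{
    int worst = 0;
    for (int i = 0; i < n; i++)
        if (count[i] > 0)
            worst = bucket_us[i];
    return worst;
}
```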


Good vs Bad Histograms

  GOOD (PREEMPT_RT + isolcpus):       BAD (Standard kernel, shared cores):

  ################################     ########
  ###                                  ######
  ##                                   #####
  #                                    ####
  |                                    ###
                                       ##
                                       #
                                       |  <-- long tail = missed deadlines
  |---|---|---|---|---|---|---|          |---|---|---|---|---|---|---|---|---|
  0  20  40  60  80 100    us          0  50 100 200 400 800 1600    us

Good: tight cluster, short tail. Predictable. Bad: wide spread, long tail. Cannot make guarantees.


The Prediction Question

Which change helps more for a sensor-display system?

A) Switching from standard kernel to PREEMPT_RT kernel
B) Keeping standard kernel but adding CPU isolation (isolcpus)

Think about it before the next slide.


Testing the Prediction

Four configurations to test with cyclictest:

Config  Kernel      CPU Isolation  Expected Max Latency
------  ----------  -------------  --------------------
1       Standard    None           ~1-10 ms
2       Standard    isolcpus=3     ~0.5-2 ms
3       PREEMPT_RT  None           ~50-150 us
4       PREEMPT_RT  isolcpus=3     ~20-80 us

# Config 1: Standard, no isolation
sudo cyclictest -t1 -p99 -a0 -i1000 -l100000

# Config 4: RT kernel + isolated core
sudo cyclictest -t1 -p99 -a3 -i1000 -l100000

Interpreting the Results

  Max latency (us, log scale):

  Config 1: |======================================| ~5000 us
  Config 2: |==============|                         ~1500 us
  Config 3: |==|                                     ~120 us
  Config 4: |=|                                      ~50 us

Answer: PREEMPT_RT helps far more than CPU isolation alone.

  • RT kernel reduces worst case by ~40x (5000 -> 120 us)
  • CPU isolation reduces it by ~3x (5000 -> 1500 us)
  • Combined gives the best result (~50 us)

CPU isolation is good. RT kernel is essential. Use both.


Why RT Kernel Wins

CPU isolation removes scheduler contention but does not fix:

  • Non-preemptible interrupt handlers (can block for ms)
  • Spinlock-held critical sections
  • RCU callback storms

PREEMPT_RT fixes all three:

  • Interrupt handlers become schedulable threads
  • Spinlocks become sleeping mutexes
  • RCU callbacks are offloaded

Isolation reduces competition. RT kernel reduces non-preemptible time. They address different problems.


The Complete Architecture

  +-----------------------------------------------------------+
  |  KERNEL: PREEMPT_RT   BOOT: isolcpus=1-3 nohz_full=1-3   |
  +-----------------------------------------------------------+
  |                                                             |
  |  Core 0 (General)              Core 1 (Isolated, RT)       |
  |  +------------------------+    +------------------------+  |
  |  | Render thread (normal) |    | Sensor thread (p=80)   |  |
  |  | DRM/KMS page flip      |<---| atomic_store(angle)    |  |
  |  | GPU + display IRQs     |    | SPI IRQ pinned here    |  |
  |  | systemd, networking    |    | Timer tick disabled    |  |
  |  +------------------------+    +------------------------+  |
  |                                                             |
  |  Core 2-3 (Isolated, RT)                                   |
  |  +------------------------------------------------------+  |
  |  | Data logger, control, spare RT tasks                  |  |
  |  +------------------------------------------------------+  |
  +-----------------------------------------------------------+

Profiling Your Own Code

You know how to measure jitter with cyclictest. But when your project is slow, you need to find where in your code the time goes.

Three tools, three questions:

Tool                   Question                  Use When
---------------------  ------------------------  -------------------------
mpstat -P ALL 1        Which cores are busy?     Always — first tool
perf record -g -p PID  Which functions eat CPU?  CPU-bound (high %)
strace -c -p PID       Which syscalls block?     I/O-bound (low CPU, slow)

Live Demo: perf record → perf report

# Step 1: Start your app
sudo SDL_VIDEODRIVER=kmsdrm ./level_sdl2 &

# Step 2: Record 10 seconds of CPU samples
sudo perf record -g -p $(pgrep level_sdl2) -- sleep 10

# Step 3: See where time goes
sudo perf report
  Overhead  Symbol
  --------  ------------------------------------------
    32.50%  render_horizon     ← hot function
    18.20%  memcpy             ← unnecessary copies?
    12.80%  imu_read_spi       ← expected
     9.40%  SDL_RenderPresent  ← flip/VSync wait

Rule: Optimize the top line first. Ignore anything below 5%.


Live Demo: strace -c

When CPU is low but app is still slow — something is blocking:

sudo strace -c -p $(pgrep my_app)
  % time     seconds  usecs/call     calls  syscall
  ------ ----------- ----------- --------- --------
   45.32    0.234000         234      1000  read      ← sensor blocking
   22.10    0.114000          57      2000  ioctl     ← DRM calls
    8.50    0.043900          44      1000  nanosleep ← idle (good)

If read() dominates → sensor I/O is blocking the render loop. Fix: Move sensor read to a separate thread.


The Profiling Workflow

Always the same five steps:

1. MEASURE    "How slow?"           FPS counter, mpstat
2. PROFILE    "Where?"              perf or strace
3. IDENTIFY   "Why?"                Blocking I/O? Copies? VSync?
4. FIX        "One change"          Move thread, remove stage, pin core
5. VERIFY     "Did it help?"        Same measurement as step 1

The most common mistake: jumping to step 4 without steps 1-3.


Optimization Priority — Cheapest First

#  Optimization             Effort   Example
-  -----------------------  -------  --------------------------------------
1  Remove unnecessary work  Minutes  Disable morphology in ball detection
2  Fix architecture         Hours    Move sensor off render thread
3  Use the right API        Hours    DRM page flip instead of fbdev memcpy
4  Core partitioning        Minutes  taskset -c 1 for sensor thread
5  DMA for peripherals      Days     SPI DMA for high-rate sensors
6  Cache / SIMD             Days     NEON intrinsics, struct packing

Do not start at #5. Most slow projects need #1 or #2.

Reference: Performance Profiling — full tool guide with examples.


Class Exercise: Profile the Ball Detection Pipeline

Open two terminals on the Pi:

# Terminal 1: Run ball detection with FPS display
python3 ball_detection.py

# Terminal 2: Profile it
sudo perf record -g -p $(pgrep python3) -- sleep 10
sudo perf report

Questions to answer:

  1. Which OpenCV function takes the most CPU time?
  2. If you disable morphology (m key), how does the profile change?
  3. What is the FPS before and after?
  4. Draw the pipeline with timing per stage.


Mini Exercise — Estimate Your Latency

You have a Pi 4 with PREEMPT_RT. IMU at 200 Hz over SPI, HDMI display at 60 fps, DRM/KMS double buffering.

Stage                        Best Case  Worst Case
---------------------------  ---------  ----------
SPI read (14 bytes @ 1 MHz)  ? ms       ? ms
Filter (3-tap FIR)           ? ms       ? ms
Render (SDL2 software)       ? ms       ? ms
VSync wait                   ? ms       ? ms
Scan-out to center           ? ms       ? ms
Total                        ? ms       ? ms

Fill in your estimates. We will compare answers.


Mini Exercise — Reference Answers

Stage               Best Case  Worst Case  Reasoning
------------------  ---------  ----------  -------------------------
SPI read            0.1 ms     0.2 ms      14 bytes x 8 bits / 1 MHz
Filter (3-tap FIR)  0.01 ms    0.05 ms     3 multiplies + adds
Render (SDL2)       2 ms       5 ms        Software blit
VSync wait          0 ms       16.7 ms     Alignment luck
Scan-out to center  0 ms       8.3 ms      Half of 16.7 ms
Total               ~2 ms      ~30 ms      Nearly 2 frames

The VSync wait is the wild card. Everything else is bounded and small.


Quick Checks

  1. What causes tearing, and how does double buffering prevent it?
  2. Why is DRM/KMS preferred over fbdev for real-time display?
  3. What is the worst-case VSync wait at 60 Hz?
  4. Does PREEMPT_RT always help display performance? Why or why not?
  5. Why does CPU isolation alone not match the benefit of an RT kernel?

Key Takeaways

  • Tearing is a display pipeline problem solved by VSync-aligned page flipping, not by faster rendering.
  • Total input-to-display latency is the sum of every pipeline stage; the VSync wait often dominates.
  • PREEMPT_RT improves sensor determinism but can introduce micro-delays in the GPU/DRM path.
  • Core partitioning (isolcpus) separates deterministic sensor work from best-effort display work.
  • RT kernel + CPU isolation together give the best result. RT kernel alone helps more than isolation alone.
  • Use lock-free shared state (atomic variables) between sensor and render threads.

Hands-On Next

Three labs connect to this theory:

DRM/KMS Test Use page flipping and VSync events to build a tear-free display application.

Display Applications Build interactive sensor-driven displays with SDL2 and DRM/KMS.

PREEMPT_RT Latency Measure jitter with and without core isolation. Build histograms. Compare the four configurations.