Lesson 10: Real-Time Systems
Óbuda University — Linux in Embedded Systems
Problem First
You have a sensor driver that reads the IMU at 200 Hz — but sometimes a read takes 50 ms instead of 5 ms.
The display stutters. The control loop misses its deadline. The motor overshoots.
Can Linux do real-time?
It depends. Standard Linux was never designed for guaranteed response times. But with the right patches and architecture, it can get close — and when it cannot, there are well-established patterns for moving time-critical work elsewhere.
Today's Map
- Block 1 (45 min): "Predictable, not fast", RT categories, where latency hides, threaded IRQs, sleeping spinlocks, priority inheritance, heterogeneous SoCs, `cyclictest` demo.
- Block 2 (45 min): Latency budget exercise: control loop pipeline, budget calculation, 1 kHz example, choose approach, histogram analysis.
Predictable, Not Fast
Real-time does not mean fast.
A system that always responds in 10 ms is more real-time than one that usually responds in 1 ms but sometimes takes 100 ms.
Predictability is the metric, not speed.
A 1 MHz processor with deterministic timing beats a 4 GHz processor with unpredictable scheduling — for real-time purposes.
The question is never "how fast on average?" but "how slow in the worst case?"
RT Categories
| Category | Example | Deadline | Consequence of Miss |
|---|---|---|---|
| Hard RT | Flight controller, ABS | 0 tolerance | Crash, injury |
| Firm RT | Audio, motor control | Small tolerance | Glitch, vibration |
| Soft RT | Video, UI refresh | Flexible | Frame drop, lag |
| Best-effort | Downloads, logging | None | Slower experience |
Most embedded Linux applications fall into firm or soft real-time.
If you need hard real-time, you almost certainly need a dedicated MCU or RTOS.
Standard Linux Isn't RT
Standard Linux is designed for servers and desktops where the goal is to maximize total throughput — move as many bytes, serve as many requests, compile as many files as possible.
Real-time needs the opposite: respond to each individual event within a guaranteed time bound.
These are fundamentally different design goals:
- Throughput-oriented: batch work, defer interrupts, coalesce I/O
- Latency-oriented: respond immediately, never defer, never batch
The default Linux scheduler optimizes for fairness and throughput, not for worst-case latency.
Where Latency Hides
In standard Linux, unpredictable latency comes from places where the kernel cannot be interrupted:
- Inside interrupt handlers — non-preemptible, blocks everything
- Inside spinlock-protected critical sections — preemption disabled
- During RCU callbacks — deferred work that must complete
- Memory allocation — may trigger page reclaim
- Page faults — may require disk I/O
Each of these can add milliseconds of uncontrollable delay. They are invisible to userspace — your task simply does not get scheduled.
Cache Hierarchy and Latency
Memory access time varies by 100x depending on where data lives:
| Level | Typical Size | Latency | Analogy |
|---|---|---|---|
| L1 cache | 32-64 KB | ~1 ns | Book on your desk |
| L2 cache | 256 KB - 1 MB | ~5 ns | Bookshelf in your office |
| L3 cache | 2-8 MB | ~20 ns | Library down the hall |
| DRAM | 1-8 GB | ~100 ns | Warehouse across town |
A single cache miss in a 1 kHz control loop adds 100 ns — harmless.
1,000 cache misses add 100 us — that is 10% of your 1 ms period, and your latency budget is blown.
Cache behavior is invisible to your source code but dominates real-time performance. You cannot optimize what you do not measure.
Working Set and Cache Residency
Your control loop's working set = code + data it touches each iteration.
| If Working Set Fits In | Expected Behavior |
|---|---|
| L1 cache (32 KB) | Best case — every access ~1 ns |
| L2 cache (256 KB) | Good — occasional 5 ns penalties |
| L3 / DRAM | Bad — frequent 20-100 ns stalls |
How to keep your loop in cache:
- Keep data structures compact — avoid bloated structs with unused fields
- Access memory sequentially — arrays beat linked lists for cache prefetch
- CPU isolation (`isolcpus`) — no other process runs on your core, so nothing evicts your cache lines
- Lock pages — `mlockall(MCL_CURRENT | MCL_FUTURE)` prevents page faults
The goal: your 1 kHz loop should be a "hot" resident in L1/L2, never evicted between iterations.
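The page-locking step can be sketched in a few lines of C (the 64 KB stack reserve and 4 KB page size are illustrative assumptions, and `mlockall` needs root or an adequate `RLIMIT_MEMLOCK`):

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#define STACK_RESERVE (64 * 1024)  /* illustrative: pre-fault 64 KB of stack */

/* Touch the stack now so later growth causes no page faults mid-loop */
static void prefault_stack(void) {
    volatile unsigned char dummy[STACK_RESERVE];
    for (size_t i = 0; i < sizeof(dummy); i += 4096)
        dummy[i] = 0;                  /* one write per (assumed) 4 KB page */
}

/* Lock current and future pages into RAM: no swap-out, no demand paging.
 * Returns 0 on success, -errno otherwise. */
static int lock_memory(void) {
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return -errno;
    prefault_stack();
    return 0;
}
```

Call `lock_memory()` once at startup, before the loop enters its steady state.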
Cache-Line False Sharing
Two threads on different cores write to adjacent memory locations:
Core 0 Core 1
┌──────────┐ ┌──────────┐
│ writes A │ │ writes B │
└─────┬────┘ └─────┬────┘
│ │
▼ ▼
┌──────────────────────────────────┐
│ Cache Line (64 bytes) │
│ [ A ][ B ][ ... unused ... ] │
└──────────────────────────────────┘
↕ bounces between cores ↕
A and B are independent variables, but they share a 64-byte cache line. Each write invalidates the other core's copy → 100x slowdown.
Fix: Align critical data to cache-line boundaries:
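A minimal C sketch of the fix (64 bytes is a typical line size on x86 and many ARM cores; verify yours, and note the struct and counter names are illustrative):

```c
#include <stdint.h>

#define CACHE_LINE 64   /* typical x86/ARM line size; check your CPU */

/* Each counter is forced onto its own cache line, so writes from
 * different cores never invalidate each other's copy. */
struct percore_counter {
    _Alignas(CACHE_LINE) uint64_t value;
    char pad[CACHE_LINE - sizeof(uint64_t)];  /* fill the rest of the line */
};

/* One slot per core: counters[0] and counters[1] cannot share a line */
struct percore_counter counters[2];
```

In pre-C11 code, gcc's `__attribute__((aligned(64)))` achieves the same layout.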
This ensures each core's data occupies its own cache line — no sharing, no bouncing.
The PREEMPT_RT Idea
A set of kernel patches — merged into mainline as of Linux 6.12 — that reshapes internal locking and interrupt handling.
Goal: make scheduling more deterministic.
PREEMPT_RT does not make Linux faster. It makes Linux more predictable.
Three key mechanisms:
- Threaded interrupts
- Sleeping spinlocks
- Priority inheritance
Mechanism 1 — Threaded Interrupts
Before (standard Linux):
Hardware IRQ handlers run at the highest priority. They cannot be preempted. A long interrupt handler blocks everything.
After (PREEMPT_RT):
IRQ handlers become kernel threads with configurable priorities. A high-priority RT task can preempt an interrupt handler.
This means your 500 Hz control loop can have higher priority than the network card's interrupt handler — something impossible in standard Linux.
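On a kernel with threaded IRQs (PREEMPT_RT, or a standard kernel booted with the `threadirqs` parameter), the handlers are visible as ordinary threads. A quick check:

```shell
# IRQ handler threads show up as irq/<number>-<name> with an RT priority
ps -eLo pid,rtprio,comm | grep 'irq/' || echo "no threaded IRQs visible"
```

The `rtprio` column is what you would lower (or raise) relative to your own control loop's priority.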
Mechanism 2 — Sleeping Spinlocks
Before (standard Linux):
Spinlocks disable preemption. While any kernel code holds a spinlock, no task can be scheduled — even if a higher-priority task is ready.
After (PREEMPT_RT):
Most spinlocks are converted to RT mutexes that allow sleeping. Critical sections become preemptible.
The result: a high-priority task waiting to run is no longer blocked by an unrelated low-priority kernel path holding a lock.
Mechanism 3 — Priority Inheritance
The problem: Priority inversion.
A high-priority task (H) blocks on a mutex held by a low-priority task (L). A medium-priority task (M) preempts L. Now H is blocked behind M — even though H has higher priority.
The solution: When H blocks on L's mutex, L temporarily inherits H's priority. L runs at high priority until it releases the mutex, then drops back.
This prevents unbounded priority inversion — the scenario that famously caused the Mars Pathfinder reset in 1997.
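Userspace opts into priority inheritance per mutex. A minimal POSIX-threads sketch (`make_pi_mutex` is an illustrative helper name, not a standard API):

```c
#define _GNU_SOURCE
#include <pthread.h>

/* Initialize a mutex with the priority-inheritance protocol. While a
 * higher-priority thread waits on it, the holder is temporarily boosted
 * to the waiter's priority. Returns 0 on success. */
int make_pi_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

Plain `pthread_mutex_init(m, NULL)` gives you `PTHREAD_PRIO_NONE`: no inheritance, and the inversion scenario above remains possible.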
What PREEMPT_RT Doesn't Fix
PREEMPT_RT improves the kernel's scheduling determinism. It does not fix:
- Hardware latency — DMA transfers, cache misses, memory bus contention
- GPU scheduling — the display pipeline has its own timing
- Worst-case interrupt latency floor — ~10-50 us, set by hardware
- Broken driver code — a driver that disables interrupts for 2 ms
- Userspace that doesn't use RT priorities — `SCHED_FIFO`/`SCHED_RR` must be requested explicitly
PREEMPT_RT gives you the tools. You still have to use them correctly.
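Opting a thread into `SCHED_FIFO` is a single syscall. A minimal C sketch (priority 90 is an arbitrary example; the call needs root or `CAP_SYS_NICE`):

```c
#include <sched.h>

/* Ask the kernel to run the calling thread under SCHED_FIFO.
 * Returns 0 on success, -1 (errno set) without sufficient privileges. */
int become_rt(int priority) {
    struct sched_param sp = { .sched_priority = priority };
    return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```

On Linux, `SCHED_FIFO` priorities range from 1 to 99; without this call your "RT" task competes under CFS like any other process.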
Latency Comparison
| Approach | Worst-Case Latency | Certification | Complexity |
|---|---|---|---|
| Standard Linux | ~1-10 ms | None | Low |
| PREEMPT_RT (default) | ~50-200 us | Possible (IEC 62443) | Medium |
| PREEMPT_RT + isolcpus | ~20-80 us | Possible (IEC 62443) | Medium |
| Xenomai (dual-kernel) | ~5-15 us | Possible | High |
| Bare-metal / RTOS | ~1 us | IEC 61508 possible | App-dependent |
Each step down trades ecosystem richness (networking, filesystems, UI) for tighter timing guarantees.
The right choice depends on your deadline, not on what sounds most impressive.
When PREEMPT_RT Is Not Enough
PREEMPT_RT will not satisfy:
- Sub-10 us requirements — kernel complexity creates irreducible jitter
- Safety-critical certification (IEC 61508 SIL 2+) — Linux is too complex to formally verify
- Control loops at 10 kHz+ — 100 us period leaves no room for kernel overhead
The kernel's complexity means there are always edge cases where latency exceeds the budget.
Solution: move time-critical work off Linux.
Three Off-Linux Options
1. External MCU (Arduino, STM32 via UART/SPI) Linux configures and logs. MCU runs the RT loop. Simple, proven, easy to debug.
2. Heterogeneous SoC (STM32MP1, i.MX8M) One chip, two cores — Linux on Cortex-A, RTOS on Cortex-M. Shared memory for communication. Lower BOM cost than separate MCU.
3. FPGA Hardware-level determinism. Sub-us response. Used in motion control, high-speed data acquisition. Highest development effort.
Heterogeneous SoC Pattern
+----------------------------------------------------+
| SoC / Board |
| |
| +-------------------+ +---------------------+ |
| | Linux (Cortex-A) | | RTOS (Cortex-M) | |
| | | | | |
| | UI / Dashboard | | Sensor Read | |
| | Data Logger |<-->| Control Algorithm | |
| | Configuration | | Actuator Drive | |
| | | | | |
| +-------------------+ +---------------------+ |
| ^ ^ |
| | Shared Memory / | |
| | UART / SPI | |
| +------------------------+ |
+----------------------------------------------------+
Linux handles what it is good at (networking, UI, storage). The RTOS handles what requires determinism (sensing, control, actuation).
The Design Rule
Linux is the SUPERVISOR — configure, monitor, log, display. The MCU is the WORKER — read sensors, compute control, drive actuators.
Clean separation of concerns.
Communication is simple: setpoints flow down (Linux to MCU), telemetry flows up (MCU to Linux). The MCU never waits for Linux to respond.
If the Linux side crashes, the MCU can enter a safe state independently.
ROS 2 — When You Need Middleware (Brief)
What it is: DDS-based publish/subscribe framework. Nodes, topics, QoS profiles. micro-ROS extends it to MCUs.
When to use: multi-node robotic systems, standard sensor/actuator interfaces, teams that need a common framework.
When it is overkill: single-board sensor-to-display pipeline, simple control loops, resource-constrained systems.
Not covered further in this course — but worth knowing it exists if you move into robotics.
CPU Isolation: How It Works
isolcpus=3 removes core 3 from the CFS scheduler. No normal process will be scheduled there — only tasks you explicitly pin.
But the kernel still intrudes with timers and RCU callbacks. Remove those too:
| Parameter | What It Removes |
|---|---|
| `isolcpus=3` | CFS scheduler — no normal tasks |
| `nohz_full=3` | Timer tick — no periodic interrupts |
| `rcu_nocbs=3` | RCU callbacks — deferred work moved to other cores |
After all three: Core 3 is a "bare metal" core inside Linux. Only your pinned RT task runs there.
Without nohz_full: The timer tick fires every 1-4 ms, adding ~1-5 us of jitter each time. For a 1 kHz loop, that is a jitter spike every 1-4 iterations.
CPU Isolation: Practical Setup
Full kernel command line:
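For example, isolating core 3 end to end (the core number is illustrative; which file carries the command line is board-specific, e.g. /boot/cmdline.txt on a Raspberry Pi or GRUB_CMDLINE_LINUX on a PC):

```
isolcpus=3 nohz_full=3 rcu_nocbs=3
```

Reboot, then confirm the parameters took effect with `cat /proc/cmdline`.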
Then pin your RT task to the isolated core:
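A sketch of the pin, assuming core 3 is the isolated core (`./control_loop` stands in for your own RT binary):

```shell
# Run on isolated core 3 with SCHED_FIFO priority 90 (needs root)
sudo taskset -c 3 chrt -f 90 ./control_loop
```

Verify from another shell with `taskset -p <pid>` and `chrt -p <pid>`, substituting the PID of your running loop.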
Before/after cyclictest comparison:
| Configuration | Max Latency | 99.9th %ile |
|---|---|---|
| PREEMPT_RT, no isolation | ~120 us | ~80 us |
| PREEMPT_RT + isolcpus=3 | ~60 us | ~40 us |
| PREEMPT_RT + full isolation | ~35 us | ~25 us |
Each layer of isolation removes a source of jitter. Full isolation cuts worst-case latency by ~3x compared to PREEMPT_RT alone.
These numbers are representative for a Raspberry Pi 4 under `stress-ng` load. Your hardware will differ — always measure.
Try It Now: Check CPU Isolation (5 min)
Verify your kernel's isolation settings and pin a process to an isolated core:
# Check current kernel command line for isolation parameters
cat /proc/cmdline
# See which CPUs the scheduler uses
cat /sys/devices/system/cpu/online
# Pin a process to a specific core
taskset -c 3 echo "Running on core 3"
# Check which core a process runs on
taskset -p $$
Is `isolcpus` present in your command line? What happens when you `taskset` a process onto an isolated core?
Tutorial: PREEMPT_RT Latency — Section 3: CPU Isolation
Theory: Section 4: CPU Isolation
cyclictest Demo
cyclictest measures scheduling latency: the time between when a thread should wake and when it actually wakes.
# Measure scheduling latency:
# -t1 one thread
# -p99 RT priority 99
# -a3 pin to CPU 3
# -i1000 1000 us interval (1 kHz)
# -l100000 100,000 loops
sudo cyclictest -t1 -p99 -a3 -i1000 -l100000
Key output: min, avg, max latency.
The max is what matters. Everything else is noise.
Try It Now: cyclictest (10 min)
Run a baseline latency measurement, then add stress and compare:
sudo cyclictest -t1 -p99 -a3 -i1000 -l10000 # quiet system
stress-ng --cpu 4 --io 2 --timeout 30s &
sudo cyclictest -t1 -p99 -a3 -i1000 -l10000 # under load
Compare the max values. How much worse is the loaded result?
Tutorial: PREEMPT_RT Latency — Section 2: Baseline
Tutorial: Jitter Measurement — Section 2: Baseline Test
Theory: Section 6: Debugging and Validating RT
ftrace and trace-cmd
Trace kernel events: function calls, scheduling switches, interrupt handlers.
# Record scheduling events for 10 seconds
sudo trace-cmd record -p function_graph \
-e sched_switch sleep 10
# View the trace
sudo trace-cmd report | head -50
Use this to find what is causing latency — which function, which driver, which interrupt handler is holding the CPU too long.
ftrace is built into the kernel. No extra packages needed.
Try It Now: Function Trace (5 min)
Use ftrace to see what the kernel does during a scheduling event (run these as root, since /sys/kernel/debug is root-only):
# Enable function tracing
echo function > /sys/kernel/debug/tracing/current_tracer
# Filter for scheduler functions
echo 'schedule' > /sys/kernel/debug/tracing/set_ftrace_filter
# Read a few lines of trace output
cat /sys/kernel/debug/tracing/trace | head -20
# Disable tracing when done
echo nop > /sys/kernel/debug/tracing/current_tracer
Can you spot context switches in the trace? Which process was scheduled?
Tutorial: Jitter Measurement — Section 4: Tracing
Theory: Section 6: Debugging and Validating RT
Test Methodology — Never Report Average
Average latency means nothing if the worst case is 10 ms.
Rules for RT measurement:
- Always test under load (use stress-ng)
- Report 99.9th percentile (or maximum)
- Run for at least 100,000 cycles
# Generate CPU + I/O stress
stress-ng --cpu 4 --io 2 --vm 1 \
--vm-bytes 128M --timeout 60s &
# Measure under stress
sudo cyclictest -t1 -p99 -a3 -i1000 \
-l100000 -h400 > histogram.txt
"Average latency is 50 us" is not useful. "99.9th percentile is 180 us" is useful.
Hardware Counters with perf
perf reads CPU hardware performance counters — cache misses, context switches, CPU migrations:
$ sudo perf stat -e cache-misses,context-switches,cpu-migrations \
taskset -c 3 chrt -f 90 ./control_loop
Performance counter stats:
2,341 cache-misses
12 context-switches
0 cpu-migrations
10.001 seconds time elapsed
Interpreting the results:
| Counter | Good | Bad | What To Do |
|---|---|---|---|
| cache-misses | < 1,000/s | > 100,000/s | Reduce working set, fix access patterns |
| context-switches | < 10/s | > 1,000/s | Check priorities, isolate CPU |
| cpu-migrations | 0 | > 0 | Pin with taskset |
"10,000 cache misses per second is fine. 10,000 context switches per second is not."
DVFS and Latency
CPU frequency governors dynamically adjust clock speed to save power. Each transition takes 1-2 ms.
If your control loop runs during a frequency transition, it experiences a latency spike.
Fix for RT workloads: Lock the governor to maximum frequency on the isolated core:
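On most boards this is one sysfs write per core (needs root; the path exists only when cpufreq is enabled for that CPU):

```shell
# Pin core 3 at maximum frequency by selecting the performance governor
echo performance | sudo tee /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor

# Confirm the change
cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
```

The setting does not survive a reboot; make it persistent via your init system or device tree.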
| Governor | Behavior | RT Impact |
|---|---|---|
| `ondemand` | Scale up on load, scale down on idle | 1-2 ms spikes during transitions |
| `powersave` | Always minimum frequency | Slow but stable — no transitions |
| `performance` | Always maximum frequency | No transitions — best for RT |
| `schedutil` | Kernel scheduler-driven | Better than `ondemand`, still transitions |
Trade-off: performance governor wastes power on an idle core. Acceptable for a mains-powered system, not for battery. Device tree can set the default governor per CPU.
Block 1 Summary
- RT = predictable, not fast
- PREEMPT_RT: threaded IRQs + sleeping spinlocks + priority inheritance
- Result: ~20-80 us worst-case scheduling latency
- Below ~10 us: use external MCU or heterogeneous SoC
- Always validate under load; report worst case, never average
Block 2 — Latency Budget Exercise
"Design a 500 Hz Controller"
The Control Loop Pipeline
+----------+ +----------+ +----------+ +----------+
| Sensor |--->| Filter |--->| Control |--->| Actuator |
| Read | | | | Law | | Write |
+----------+ +----------+ +----------+ +----------+
| |
+------------------------------------------------+
Next cycle
Each stage has a maximum execution time. The sum of all stages must be less than the loop period.
If any single iteration exceeds the period, you miss the deadline.
The Budget Concept
Total available time = 1 / frequency.
For 500 Hz: period = 2 ms = 2000 us.
Every stage of the pipeline must complete — plus OS overhead — within that 2000 us.
If you exceed it, the control loop misses its deadline. The motor gets stale commands. The system becomes unstable.
A latency budget is like a financial budget: allocate before you spend.
1 kHz Example Budget
| Stage | Budget | Measured | Margin |
|---|---|---|---|
| IMU read (SPI) | 200 us | 150 us | 50 us |
| Kalman filter | 300 us | 220 us | 80 us |
| PID compute | 100 us | 60 us | 40 us |
| PWM write | 100 us | 80 us | 20 us |
| Subtotal | 700 us | 510 us | 190 us |
| OS overhead + slack | 300 us | — | — |
| Total period | 1000 us | — | — |
The margin column tells you where you have room — and where you do not.
The 70% Rule
If your measured total exceeds 70% of the period, you have insufficient margin for worst-case jitter.
- 510 / 1000 = 51% --- OK, healthy margin
- 750 / 1000 = 75% --- danger zone, spikes will miss deadlines
Margin absorbs the worst-case events that never show up in averages:
- Cache misses
- Page faults
- Kernel scheduling jitter
- DMA bus contention
- Interrupt storms
Industry experience shows that keeping utilization at or below 70% leaves enough margin for these spikes in most embedded systems.
Design for the worst case, not the average case.
Exercise Step 1 — Calculate the Period
Your robotic arm must update motor positions at 500 Hz.
Period = 1 / 500 = 2 ms = 2000 us
This is your total time budget. Everything — sensor read, computation, actuator write, and OS overhead — must fit inside 2000 us.
Every. Single. Cycle.
Exercise Step 2 — Allocate Budgets
Fill in the table for your 500 Hz robotic arm:
| Stage | Your Budget (us) |
|---|---|
| Sensor read (SPI IMU) | ? |
| Filter (complementary) | ? |
| PID compute | ? |
| Motor PWM write | ? |
| Subtotal | ? |
| OS overhead + slack | ? |
| Total | 2000 us |
Rule: subtotal must be ≤ 70% of period = ≤ 1400 us
The remaining 600 us absorbs OS jitter, cache misses, and scheduling delays.
Exercise Step 3 — Choose Your Approach
Given your budget, which approach fits?
PREEMPT_RT — worst case ~50 us OS jitter Does your 600 us slack absorb this? Yes, easily.
Standard Linux — worst case ~5 ms jitter 5000 us > 2000 us period. Impossible. A single jitter spike exceeds the entire period.
External MCU — guaranteed <10 us overhead Always works, but adds hardware complexity and communication latency.
Which do you choose?
Exercise Step 4 — Justify Your Choice
Write 3-4 sentences answering:
- What is the loop rate and period?
- What is the timing margin after allocating budgets?
- Which approach did you choose and why?
- What is the risk if your margin is too small?
Example: "The loop runs at 500 Hz (2 ms period). My pipeline subtotal is 1200 us, leaving 800 us for OS overhead. I chose PREEMPT_RT because its ~50 us worst-case jitter fits within my 800 us margin. If margin were smaller (<100 us), I would move to an external MCU."
Latency Histogram — What a Good Result Looks Like
Latency (us) | Count
----------------+--------------------------------------------
0 - 10 | ████████████████████████████████ 89,200
10 - 20 | █████ 4,100
20 - 50 | ███ 2,500
50 - 100 | ██ 1,800
100 - 200 | ▌ 350
200+ | ▏ 50
A good histogram is tall and narrow: most samples cluster near the minimum, with a very short tail. The tail is what determines your worst-case guarantee.
Good vs Bad Histograms
GOOD (PREEMPT_RT): BAD (Standard Linux):
████████████████████ ██████
███ █████
██ ████
█ ███
▌ ██
█
█
▌ <-- long tail = missed deadlines
Good: tight cluster, short tail. The 99.9th percentile is close to the average.
Bad: wide spread, long unpredictable tail. You cannot make guarantees.
Debugging Recipe
When latency is too high:
- Measure — run `cyclictest` under `stress-ng` load
- Check priorities — use `chrt` to verify RT scheduling
- Trace — use `ftrace` to find the longest non-preemptible section
- Isolate CPUs — `isolcpus=3` in the kernel cmdline, pin the RT task to the isolated core
- Disable noise — turn off unused kernel features (USB, Wi-Fi if not needed)
Work from the outside in: measure first, then trace, then isolate.
What Percentile Matters?
| RT Category | Required Percentile | Meaning |
|---|---|---|
| Hard RT | 100th (every sample) | Zero misses allowed |
| Firm RT | 99.99th | 1 miss per 10,000 cycles |
| Soft RT | 99th | 1 miss per 100 cycles |
The percentile you choose defines your reliability guarantee.
For the exercise: your 500 Hz arm needs the 99.9th percentile within 2 ms — that means at most 1 deadline miss per 1,000 cycles (once every 2 seconds).
Quick Checks
- What is the difference between hard and soft real-time?
- Name two things PREEMPT_RT changes in the kernel.
- Why is average latency misleading for RT systems?
- When would you add an external MCU instead of using PREEMPT_RT?
- What does `cyclictest` measure?
Key Takeaways
- Real-time means predictable, not fast.
- PREEMPT_RT gets Linux to ~50 us worst case — enough for many industrial applications.
- Below ~10 us or safety-critical: use external MCU or heterogeneous SoC.
- Always validate under load; report worst case (99.9th percentile), never average.
- Design with a latency budget — if measured > 70% of period, you need more margin.
Hands-On Next
Two upcoming labs connect to this theory:
PREEMPT_RT: Latency Measurement
Measure scheduling latency with cyclictest on standard vs PREEMPT_RT kernels. Build latency histograms. Quantify the difference.
Jitter Measurement Analyze timing jitter in sensor read loops. Measure how consistent your loop timing actually is under load. Compare isolated vs non-isolated CPU cores.