Lesson 10: Real-Time Systems
Óbuda University — Linux in Embedded Systems
Problem First
You have a sensor driver that reads the IMU at 200 Hz — but sometimes a read takes 50 ms instead of 5 ms.
The display stutters. The control loop misses its deadline. The motor overshoots.
Can Linux do real-time?
It depends. Standard Linux was never designed for guaranteed response times. But with the right patches and architecture, it can get close — and when it cannot, there are well-established patterns for moving time-critical work elsewhere.
Today's Map
- Block 1 (45 min): "Predictable, not fast", RT categories, where latency hides, threaded IRQs, sleeping spinlocks, priority inheritance, heterogeneous SoCs, `cyclictest` demo.
- Block 2 (45 min): Latency budget exercise: control loop pipeline, budget calculation, 1 kHz example, choose approach, histogram analysis.
Predictable, Not Fast
Real-time does not mean fast.
A system that always responds in 10 ms is more real-time than one that usually responds in 1 ms but sometimes takes 100 ms.
Predictability is the metric, not speed.
A 1 MHz processor with deterministic timing beats a 4 GHz processor with unpredictable scheduling — for real-time purposes.
The question is never "how fast on average?" but "how slow in the worst case?"
RT Categories
| Category | Example | Deadline | Consequence of Miss |
|---|---|---|---|
| Hard RT | Flight controller, ABS | 0 tolerance | Crash, injury |
| Firm RT | Audio, motor control | Small tolerance | Glitch, vibration |
| Soft RT | Video, UI refresh | Flexible | Frame drop, lag |
| Best-effort | Downloads, logging | None | Slower experience |
Most embedded Linux applications fall into firm or soft real-time.
If you need hard real-time, you almost certainly need a dedicated MCU or RTOS.
Standard Linux Isn't RT
Standard Linux is designed for servers and desktops where the goal is to maximize total throughput — move as many bytes, serve as many requests, compile as many files as possible.
Real-time needs the opposite: respond to each individual event within a guaranteed time bound.
These are fundamentally different design goals:
- Throughput-oriented: batch work, defer interrupts, coalesce I/O
- Latency-oriented: respond immediately, never defer, never batch
The default Linux scheduler optimizes for fairness and throughput, not for worst-case latency.
Where Latency Hides
In standard Linux, unpredictable latency comes from places where the kernel cannot be interrupted:
- Inside interrupt handlers — non-preemptible, blocks everything
- Inside spinlock-protected critical sections — preemption disabled
- During RCU callbacks — deferred work that must complete
- Memory allocation — may trigger page reclaim
- Page faults — may require disk I/O
Each of these can add milliseconds of uncontrollable delay. They are invisible to userspace — your task simply does not get scheduled.
Cache Hierarchy and Latency
Memory access time varies by 100x depending on where data lives:
| Level | Typical Size | Latency | Analogy |
|---|---|---|---|
| L1 cache | 32-64 KB | ~1 ns | Book on your desk |
| L2 cache | 256 KB - 1 MB | ~5 ns | Bookshelf in your office |
| L3 cache | 2-8 MB | ~20 ns | Library down the hall |
| DRAM | 1-8 GB | ~100 ns | Warehouse across town |
A single cache miss in a 1 kHz control loop adds 100 ns — harmless.
1,000 cache misses add 100 us — that is 10% of your 1 ms period, and your latency budget is blown.
Cache behavior is invisible to your source code but dominates real-time performance. You cannot optimize what you do not measure.
Working Set and Cache Residency
Your control loop's working set = code + data it touches each iteration.
| If Working Set Fits In | Expected Behavior |
|---|---|
| L1 cache (32 KB) | Best case — every access ~1 ns |
| L2 cache (256 KB) | Good — occasional 5 ns penalties |
| L3 / DRAM | Bad — frequent 20-100 ns stalls |
How to keep your loop in cache:
- Keep data structures compact — avoid bloated structs with unused fields
- Access memory sequentially — arrays beat linked lists for cache prefetch
- CPU isolation (`isolcpus`) — no other process runs on your core, so nothing evicts your cache lines
- Lock pages — `mlockall(MCL_CURRENT | MCL_FUTURE)` prevents page faults
The goal: your 1 kHz loop should be a "hot" resident in L1/L2, never evicted between iterations.
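The page-locking step can be sketched in a few lines of C (the 64 KB stack reserve and 4 KB page size are illustrative assumptions, and `mlockall` needs root or an adequate `RLIMIT_MEMLOCK`):

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#define STACK_RESERVE (64 * 1024)  /* illustrative: pre-fault 64 KB of stack */

/* Touch the stack now so later growth causes no page faults mid-loop */
static void prefault_stack(void) {
    volatile unsigned char dummy[STACK_RESERVE];
    for (size_t i = 0; i < sizeof(dummy); i += 4096)
        dummy[i] = 0;                  /* one write per (assumed) 4 KB page */
}

/* Lock current and future pages into RAM: no swap-out, no demand paging.
 * Returns 0 on success, -errno otherwise. */
static int lock_memory(void) {
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return -errno;
    prefault_stack();
    return 0;
}
```

Call `lock_memory()` once at startup, before the loop enters its steady state.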
Cache-Line False Sharing
Two threads on different cores write to adjacent memory locations:
Core 0 Core 1
┌──────────┐ ┌──────────┐
│ writes A │ │ writes B │
└─────┬────┘ └─────┬────┘
│ │
▼ ▼
┌──────────────────────────────────┐
│ Cache Line (64 bytes) │
│ [ A ][ B ][ ... unused ... ] │
└──────────────────────────────────┘
↕ bounces between cores ↕
A and B are independent variables, but they share a 64-byte cache line. Each write invalidates the other core's copy → 100x slowdown.
Fix: Align critical data to cache-line boundaries:
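A minimal C sketch of the fix (64 bytes is a typical line size on x86 and many ARM cores; verify yours, and note the struct and counter names are illustrative):

```c
#include <stdint.h>

#define CACHE_LINE 64   /* typical x86/ARM line size; check your CPU */

/* Each counter is forced onto its own cache line, so writes from
 * different cores never invalidate each other's copy. */
struct percore_counter {
    _Alignas(CACHE_LINE) uint64_t value;
    char pad[CACHE_LINE - sizeof(uint64_t)];  /* fill the rest of the line */
};

/* One slot per core: counters[0] and counters[1] cannot share a line */
struct percore_counter counters[2];
```

In pre-C11 code, gcc's `__attribute__((aligned(64)))` achieves the same layout.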
This ensures each core's data occupies its own cache line — no sharing, no bouncing.
The PREEMPT_RT Idea
A set of kernel patches — merged into mainline as of Linux 6.12 — that reshapes internal locking and interrupt handling.
Goal: make scheduling more deterministic.
PREEMPT_RT does not make Linux faster. It makes Linux more predictable.
Three key mechanisms:
- Threaded interrupts
- Sleeping spinlocks
- Priority inheritance
Mechanism 1 — Threaded Interrupts
Before (standard Linux):
Hardware IRQ handlers run at the highest priority. They cannot be preempted. A long interrupt handler blocks everything.
After (PREEMPT_RT):
IRQ handlers become kernel threads with configurable priorities. A high-priority RT task can preempt an interrupt handler.
This means your 500 Hz control loop can have higher priority than the network card's interrupt handler — something impossible in standard Linux.
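On a kernel with threaded IRQs (PREEMPT_RT, or a standard kernel booted with the `threadirqs` parameter), the handlers are visible as ordinary threads. A quick check:

```shell
# IRQ handler threads show up as irq/<number>-<name> with an RT priority
ps -eLo pid,rtprio,comm | grep 'irq/' || echo "no threaded IRQs visible"
```

The `rtprio` column is what you would lower (or raise) relative to your own control loop's priority.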
Mechanism 2 — Sleeping Spinlocks
Before (standard Linux):
Spinlocks disable preemption. While any kernel code holds a spinlock, no task can be scheduled — even if a higher-priority task is ready.
After (PREEMPT_RT):
Most spinlocks are converted to RT mutexes that allow sleeping. Critical sections become preemptible.
The result: a high-priority task waiting to run is no longer blocked by an unrelated low-priority kernel path holding a lock.
Mechanism 3 — Priority Inheritance
The problem: Priority inversion.
A high-priority task (H) blocks on a mutex held by a low-priority task (L). A medium-priority task (M) preempts L. Now H is blocked behind M — even though H has higher priority.
The solution: When H blocks on L's mutex, L temporarily inherits H's priority. L runs at high priority until it releases the mutex, then drops back.
This prevents unbounded priority inversion — the scenario that famously caused the Mars Pathfinder reset in 1997.
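Userspace opts into priority inheritance per mutex. A minimal POSIX-threads sketch (`make_pi_mutex` is an illustrative helper name, not a standard API):

```c
#define _GNU_SOURCE
#include <pthread.h>

/* Initialize a mutex with the priority-inheritance protocol. While a
 * higher-priority thread waits on it, the holder is temporarily boosted
 * to the waiter's priority. Returns 0 on success. */
int make_pi_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

Plain `pthread_mutex_init(m, NULL)` gives you `PTHREAD_PRIO_NONE`: no inheritance, and the inversion scenario above remains possible.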
What PREEMPT_RT Doesn't Fix
PREEMPT_RT improves the kernel's scheduling determinism. It does not fix:
- Hardware latency — DMA transfers, cache misses, memory bus contention
- GPU scheduling — the display pipeline has its own timing
- Worst-case interrupt latency floor — ~10-50 us, set by hardware
- Broken driver code — a driver that disables interrupts for 2 ms
- Userspace that doesn't use RT priorities — `SCHED_FIFO`/`SCHED_RR` must be requested explicitly
PREEMPT_RT gives you the tools. You still have to use them correctly.
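Opting a thread into `SCHED_FIFO` is a single syscall. A minimal C sketch (priority 90 is an arbitrary example; the call needs root or `CAP_SYS_NICE`):

```c
#include <sched.h>

/* Ask the kernel to run the calling thread under SCHED_FIFO.
 * Returns 0 on success, -1 (errno set) without sufficient privileges. */
int become_rt(int priority) {
    struct sched_param sp = { .sched_priority = priority };
    return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```

On Linux, `SCHED_FIFO` priorities range from 1 to 99; without this call your "RT" task competes under CFS like any other process.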
Latency Comparison
| Approach | Worst-Case Latency | Certification | Complexity |
|---|---|---|---|
| Standard Linux | ~1-10 ms | None | Low |
| PREEMPT_RT (default) | ~50-200 us | Possible (IEC 62443) | Medium |
| PREEMPT_RT + isolcpus | ~20-80 us | Possible (IEC 62443) | Medium |
| Xenomai (dual-kernel) | ~5-15 us | Possible | High |
| Bare-metal / RTOS | ~1 us | IEC 61508 possible | App-dependent |
Each step down trades ecosystem richness (networking, filesystems, UI) for tighter timing guarantees.
The right choice depends on your deadline, not on what sounds most impressive.
When PREEMPT_RT Is Not Enough
PREEMPT_RT will not satisfy:
- Sub-10 us requirements — kernel complexity creates irreducible jitter
- Safety-critical certification (IEC 61508 SIL 2+) — Linux is too complex to formally verify
- Control loops at 10 kHz+ — 100 us period leaves no room for kernel overhead
The kernel's complexity means there are always edge cases where latency exceeds the budget.
Solution: move time-critical work off Linux.
Three Off-Linux Options
1. External MCU (Arduino, STM32 via UART/SPI) Linux configures and logs. MCU runs the RT loop. Simple, proven, easy to debug.
2. Heterogeneous SoC (STM32MP1, i.MX8M) One chip, two cores — Linux on Cortex-A, RTOS on Cortex-M. Shared memory for communication. Lower BOM cost than separate MCU.
3. FPGA Hardware-level determinism. Sub-us response. Used in motion control, high-speed data acquisition. Highest development effort.
Heterogeneous SoC Pattern
+----------------------------------------------------+
| SoC / Board |
| |
| +-------------------+ +---------------------+ |
| | Linux (Cortex-A) | | RTOS (Cortex-M) | |
| | | | | |
| | UI / Dashboard | | Sensor Read | |
| | Data Logger |<-->| Control Algorithm | |
| | Configuration | | Actuator Drive | |
| | | | | |
| +-------------------+ +---------------------+ |
| ^ ^ |
| | Shared Memory / | |
| | UART / SPI | |
| +------------------------+ |
+----------------------------------------------------+
Linux handles what it is good at (networking, UI, storage). The RTOS handles what requires determinism (sensing, control, actuation).
The Design Rule
Linux is the SUPERVISOR — configure, monitor, log, display. The MCU is the WORKER — read sensors, compute control, drive actuators.
Clean separation of concerns.
Communication is simple: setpoints flow down (Linux to MCU), telemetry flows up (MCU to Linux). The MCU never waits for Linux to respond.
If the Linux side crashes, the MCU can enter a safe state independently.
ROS 2 — When You Need Middleware (Brief)
What it is: DDS-based publish/subscribe framework. Nodes, topics, QoS profiles. micro-ROS extends it to MCUs.
When to use: multi-node robotic systems, standard sensor/actuator interfaces, teams that need a common framework.
When it is overkill: single-board sensor-to-display pipeline, simple control loops, resource-constrained systems.
Not covered further in this course — but worth knowing it exists if you move into robotics.
CPU Isolation: How It Works
isolcpus=3 removes core 3 from the CFS scheduler. No normal process will be scheduled there — only tasks you explicitly pin.
But the kernel still intrudes with timers and RCU callbacks. Remove those too:
| Parameter | What It Removes |
|---|---|
| `isolcpus=3` | CFS scheduler — no normal tasks |
| `nohz_full=3` | Timer tick — no periodic interrupts |
| `rcu_nocbs=3` | RCU callbacks — deferred work moved to other cores |
After all three: Core 3 is a "bare metal" core inside Linux. Only your pinned RT task runs there.
Without nohz_full: The timer tick fires every 1-4 ms, adding ~1-5 us of jitter each time. For a 1 kHz loop, that is a jitter spike every 1-4 iterations.
CPU Isolation: Practical Setup
Full kernel command line:
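For example, isolating core 3 end to end (the core number is illustrative; which file carries the command line is board-specific, e.g. /boot/cmdline.txt on a Raspberry Pi or GRUB_CMDLINE_LINUX on a PC):

```
isolcpus=3 nohz_full=3 rcu_nocbs=3
```

Reboot, then confirm the parameters took effect with `cat /proc/cmdline`.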
Then pin your RT task to the isolated core:
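A sketch of the pin, assuming core 3 is the isolated core (`./control_loop` stands in for your own RT binary):

```shell
# Run on isolated core 3 with SCHED_FIFO priority 90 (needs root)
sudo taskset -c 3 chrt -f 90 ./control_loop
```

Verify from another shell with `taskset -p <pid>` and `chrt -p <pid>`, substituting the PID of your running loop.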
Before/after cyclictest comparison:
| Configuration | Max Latency | 99.9th %ile |
|---|---|---|
| PREEMPT_RT, no isolation | ~120 us | ~80 us |
| PREEMPT_RT + isolcpus=3 | ~60 us | ~40 us |
| PREEMPT_RT + full isolation | ~35 us | ~25 us |
Each layer of isolation removes a source of jitter. Full isolation cuts worst-case latency by ~3x compared to PREEMPT_RT alone.
These numbers are representative for a Raspberry Pi 4 under `stress-ng` load. Your hardware will differ — always measure.
Try It Now: Check CPU Isolation (5 min)
Verify your kernel's isolation settings and pin a process to an isolated core:
# Check current kernel command line for isolation parameters
cat /proc/cmdline
# See which CPUs the scheduler uses
cat /sys/devices/system/cpu/online
# Pin a process to a specific core
taskset -c 3 echo "Running on core 3"
# Check which core a process runs on
taskset -p $$
Is `isolcpus` present in your command line? What happens when you `taskset` a process onto an isolated core?
Tutorial: PREEMPT_RT Latency — Section 3: CPU Isolation
Theory: Section 4: CPU Isolation
cyclictest Demo
cyclictest measures scheduling latency: the time between when a thread should wake and when it actually wakes.
# Measure scheduling latency:
# -t1 one thread
# -p99 RT priority 99
# -a3 pin to CPU 3
# -i1000 1000 us interval (1 kHz)
# -l100000 100,000 loops
sudo cyclictest -t1 -p99 -a3 -i1000 -l100000
Key output: min, avg, max latency.
The max is what matters. Everything else is noise.
Try It Now: cyclictest (10 min)
Run a baseline latency measurement, then add stress and compare:
sudo cyclictest -t1 -p99 -a3 -i1000 -l10000 # quiet system
stress-ng --cpu 4 --io 2 --timeout 30s &
sudo cyclictest -t1 -p99 -a3 -i1000 -l10000 # under load
Compare the max values. How much worse is the loaded result?
Tutorial: PREEMPT_RT Latency — Section 2: Baseline
Tutorial: Jitter Measurement — Section 2: Baseline Test
Theory: Section 6: Debugging and Validating RT
ftrace and trace-cmd
Trace kernel events: function calls, scheduling switches, interrupt handlers.
# Record scheduling events for 10 seconds
sudo trace-cmd record -p function_graph \
-e sched_switch sleep 10
# View the trace
sudo trace-cmd report | head -50
Use this to find what is causing latency — which function, which driver, which interrupt handler is holding the CPU too long.
ftrace is built into the kernel. No extra packages needed.
Try It Now: Function Trace (5 min)
Use ftrace to see what the kernel does during a scheduling event (run these as root, since /sys/kernel/debug is root-only):
# Enable function tracing
echo function > /sys/kernel/debug/tracing/current_tracer
# Filter for scheduler functions
echo 'schedule' > /sys/kernel/debug/tracing/set_ftrace_filter
# Read a few lines of trace output
cat /sys/kernel/debug/tracing/trace | head -20
# Disable tracing when done
echo nop > /sys/kernel/debug/tracing/current_tracer
Can you spot context switches in the trace? Which process was scheduled?
Tutorial: Jitter Measurement — Section 4: Tracing
Theory: Section 6: Debugging and Validating RT
Test Methodology — Never Report Average
Average latency means nothing if the worst case is 10 ms.
Rules for RT measurement:
- Always test under load (use stress-ng)
- Report 99.9th percentile (or maximum)
- Run for at least 100,000 cycles
# Generate CPU + I/O stress
stress-ng --cpu 4 --io 2 --vm 1 \
--vm-bytes 128M --timeout 60s &
# Measure under stress
sudo cyclictest -t1 -p99 -a3 -i1000 \
-l100000 -h400 > histogram.txt
"Average latency is 50 us" is not useful. "99.9th percentile is 180 us" is useful.
Hardware Counters with perf
perf reads CPU hardware performance counters — cache misses, context switches, CPU migrations:
$ sudo perf stat -e cache-misses,context-switches,cpu-migrations \
taskset -c 3 chrt -f 90 ./control_loop
Performance counter stats:
2,341 cache-misses
12 context-switches
0 cpu-migrations
10.001 seconds time elapsed
Interpreting the results:
| Counter | Good | Bad | What To Do |
|---|---|---|---|
| cache-misses | < 1,000/s | > 100,000/s | Reduce working set, fix access patterns |
| context-switches | < 10/s | > 1,000/s | Check priorities, isolate CPU |
| cpu-migrations | 0 | > 0 | Pin with taskset |
"10,000 cache misses per second is fine. 10,000 context switches per second is not."
DVFS and Latency
CPU frequency governors dynamically adjust clock speed to save power. Each transition takes 1-2 ms.
If your control loop runs during a frequency transition, it experiences a latency spike.
Fix for RT workloads: Lock the governor to maximum frequency on the isolated core:
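On most boards this is one sysfs write per core (needs root; the path exists only when cpufreq is enabled for that CPU):

```shell
# Pin core 3 at maximum frequency by selecting the performance governor
echo performance | sudo tee /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor

# Confirm the change
cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
```

The setting does not survive a reboot; make it persistent via your init system or device tree.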
| Governor | Behavior | RT Impact |
|---|---|---|
| `ondemand` | Scale up on load, scale down on idle | 1-2 ms spikes during transitions |
| `powersave` | Always minimum frequency | Slow but stable — no transitions |
| `performance` | Always maximum frequency | No transitions — best for RT |
| `schedutil` | Kernel scheduler-driven | Better than `ondemand`, still transitions |
Trade-off: performance governor wastes power on an idle core. Acceptable for a mains-powered system, not for battery. Device tree can set the default governor per CPU.
Block 1 Summary
- RT = predictable, not fast
- PREEMPT_RT: threaded IRQs + sleeping spinlocks + priority inheritance
- Result: ~20-80 us worst-case scheduling latency
- Below ~10 us: use external MCU or heterogeneous SoC
- Always validate under load; report worst case, never average
Block 2 — Latency Budget Exercise
"Design a 500 Hz Controller"
The Control Loop Pipeline
+----------+ +----------+ +----------+ +----------+
| Sensor |--->| Filter |--->| Control |--->| Actuator |
| Read | | | | Law | | Write |
+----------+ +----------+ +----------+ +----------+
| |
+------------------------------------------------+
Next cycle
Each stage has a maximum execution time. The sum of all stages must be less than the loop period.
If any single iteration exceeds the period, you miss the deadline.
The Budget Concept
Total available time = 1 / frequency.
For 500 Hz: period = 2 ms = 2000 us.
Every stage of the pipeline must complete — plus OS overhead — within that 2000 us.
If you exceed it, the control loop misses its deadline. The motor gets stale commands. The system becomes unstable.
A latency budget is like a financial budget: allocate before you spend.
1 kHz Example Budget
| Stage | Budget | Measured | Margin |
|---|---|---|---|
| IMU read (SPI) | 200 us | 150 us | 50 us |
| Kalman filter | 300 us | 220 us | 80 us |
| PID compute | 100 us | 60 us | 40 us |
| PWM write | 100 us | 80 us | 20 us |
| Subtotal | 700 us | 510 us | 190 us |
| OS overhead + slack | 300 us | — | — |
| Total period | 1000 us | — | — |
The margin column tells you where you have room — and where you do not.
The 70% Rule
If your measured total exceeds 70% of the period, you have insufficient margin for worst-case jitter.
- 510 / 1000 = 51% --- OK, healthy margin
- 750 / 1000 = 75% --- danger zone, spikes will miss deadlines
Margin absorbs the worst-case events that never show up in averages:
- Cache misses
- Page faults
- Kernel scheduling jitter
- DMA bus contention
- Interrupt storms
Industry experience shows that keeping utilization at or below 70% leaves enough margin for these spikes in most embedded systems.
Design for the worst case, not the average case.
Exercise Step 1 — Calculate the Period
Your robotic arm must update motor positions at 500 Hz.
Period = 1 / 500 = 2 ms = 2000 us
This is your total time budget. Everything — sensor read, computation, actuator write, and OS overhead — must fit inside 2000 us.
Every. Single. Cycle.
Exercise Step 2 — Allocate Budgets
Fill in the table for your 500 Hz robotic arm:
| Stage | Your Budget (us) |
|---|---|
| Sensor read (SPI IMU) | ? |
| Filter (complementary) | ? |
| PID compute | ? |
| Motor PWM write | ? |
| Subtotal | ? |
| OS overhead + slack | ? |
| Total | 2000 us |
Rule: subtotal must be ≤ 70% of period = ≤ 1400 us
The remaining 600 us absorbs OS jitter, cache misses, and scheduling delays.
Exercise Step 3 — Choose Your Approach
Given your budget, which approach fits?
PREEMPT_RT — worst case ~50 us OS jitter Does your 600 us slack absorb this? Yes, easily.
Standard Linux — worst case ~5 ms jitter 5000 us > 2000 us period. Impossible. A single jitter spike exceeds the entire period.
External MCU — guaranteed <10 us overhead Always works, but adds hardware complexity and communication latency.
Which do you choose?
Exercise Step 4 — Justify Your Choice
Write 3-4 sentences answering:
- What is the loop rate and period?
- What is the timing margin after allocating budgets?
- Which approach did you choose and why?
- What is the risk if your margin is too small?
Example: "The loop runs at 500 Hz (2 ms period). My pipeline subtotal is 1200 us, leaving 800 us for OS overhead. I chose PREEMPT_RT because its ~50 us worst-case jitter fits within my 800 us margin. If margin were smaller (<100 us), I would move to an external MCU."
Latency Histogram — What a Good Result Looks Like
Latency (us) | Count
----------------+--------------------------------------------
0 - 10 | ████████████████████████████████ 89,200
10 - 20 | █████ 4,100
20 - 50 | ███ 2,500
50 - 100 | ██ 1,800
100 - 200 | ▌ 350
200+ | ▏ 50
A good histogram is tall and narrow: most samples cluster near the minimum, with a very short tail. The tail is what determines your worst-case guarantee.
Good vs Bad Histograms
GOOD (PREEMPT_RT): BAD (Standard Linux):
████████████████████ ██████
███ █████
██ ████
█ ███
▌ ██
█
█
▌ <-- long tail = missed deadlines
Good: tight cluster, short tail. The 99.9th percentile is close to the average.
Bad: wide spread, long unpredictable tail. You cannot make guarantees.
Debugging Recipe
When latency is too high:
- Measure — run `cyclictest` under `stress-ng` load
- Check priorities — use `chrt` to verify RT scheduling
- Trace — use `ftrace` to find the longest non-preemptible section
- Isolate CPUs — `isolcpus=3` in the kernel cmdline, pin the RT task to the isolated core
- Disable noise — turn off unused kernel features (USB, Wi-Fi if not needed)
Work from the outside in: measure first, then trace, then isolate.
What Percentile Matters?
| RT Category | Required Percentile | Meaning |
|---|---|---|
| Hard RT | 100th (every sample) | Zero misses allowed |
| Firm RT | 99.99th | 1 miss per 10,000 cycles |
| Soft RT | 99th | 1 miss per 100 cycles |
The percentile you choose defines your reliability guarantee.
For the exercise: your 500 Hz arm needs the 99.9th percentile within 2 ms — that means at most 1 deadline miss per 1,000 cycles (once every 2 seconds).
Quick Checks
- What is the difference between hard and soft real-time?
- Name two things PREEMPT_RT changes in the kernel.
- Why is average latency misleading for RT systems?
- When would you add an external MCU instead of using PREEMPT_RT?
- What does `cyclictest` measure?
Key Takeaways
- Real-time means predictable, not fast.
- PREEMPT_RT gets Linux to ~50 us worst case — enough for many industrial applications.
- Below ~10 us or safety-critical: use external MCU or heterogeneous SoC.
- Always validate under load; report worst case (99.9th percentile), never average.
- Design with a latency budget — if measured > 70% of period, you need more margin.
Hands-On Next
Two upcoming labs connect to this theory:
PREEMPT_RT: Latency Measurement
Measure scheduling latency with cyclictest on standard vs PREEMPT_RT kernels. Build latency histograms. Quantify the difference.
Jitter Measurement Analyze timing jitter in sensor read loops. Measure how consistent your loop timing actually is under load. Compare isolated vs non-isolated CPU cores.