Real-Time Systems Design
Goal: Understand when Linux can meet real-time requirements and when you need external help.
Related Tutorials
For hands-on practice, see: PREEMPT_RT Latency | Jitter Measurement | MCU Real-Time Controller
Your sensor driver reads the IMU at 200 Hz — but sometimes a read takes 50 ms instead of 5 ms. The display stutters, the control loop misses its deadline. Can Linux do real-time? The answer is: it depends.
1. What Is Real-Time?
| Category | Example | Deadline Tolerance | Consequence of Miss |
|---|---|---|---|
| Hard RT | Flight controller, ABS brakes | 0 — deadline = system failure | Crash, injury, hardware damage |
| Firm RT | Audio playback, motor control | Small — occasional miss degrades output | Audible glitch, vibration |
| Soft RT | Video streaming, UI refresh | Flexible — user notices but system recovers | Frame drop, lag |
| Best-effort | File download, logging | None — finish eventually | Slower experience |
Real-time ≠ fast. It means predictable. A system that always responds in 10 ms is more real-time than one that usually responds in 1 ms but sometimes takes 100 ms.
2. PREEMPT_RT Deep Dive
Standard Linux was designed for servers and desktops, where the goal is to maximize total throughput — process as many requests as possible per second. Real-time systems have a fundamentally different goal: respond to each individual event within a guaranteed time. PREEMPT_RT is a set of kernel patches (now merged into mainline as of Linux 6.12) that reshapes the kernel's internal locking and interrupt handling to make scheduling more predictable. It does not make Linux faster — it makes Linux more deterministic.
The key insight behind PREEMPT_RT is that unpredictable latency in standard Linux comes from places where the kernel cannot be interrupted: inside interrupt handlers, inside spinlock-protected critical sections, and during RCU (Read-Copy-Update) callbacks. PREEMPT_RT converts these non-preemptible sections into preemptible ones, so a high-priority real-time task can interrupt almost anything.
What PREEMPT_RT changes:
- Threaded interrupts: Hardware IRQ handlers become kernel threads with priorities — a high-priority RT task can preempt an interrupt handler
- Sleeping spinlocks: Critical sections use mutexes instead of spinlocks, allowing preemption inside kernel code
- Priority inheritance: If a high-priority task blocks on a mutex held by a low-priority task, the low-priority task temporarily inherits the high priority (avoids priority inversion)
What these mean in practice:
- Threaded interrupts: Your 500 Hz control loop can be assigned a higher priority than the network interrupt handler — so an incoming packet burst cannot delay your sensor read.
- Sleeping spinlocks: A high-priority RT task is no longer blocked indefinitely while an unrelated low-priority task holds a kernel lock.
- Priority inheritance: Prevents unbounded priority inversion — the scenario that caused the Mars Pathfinder spacecraft to reset repeatedly in 1997 when a low-priority task held a lock needed by the high-priority bus controller.
What it does NOT change:
- Hardware latency (DMA, cache misses)
- GPU scheduling (display pipeline has its own timing)
- Worst-case interrupt latency floor (~10-50 us, set by hardware)
| Approach | Typical Worst-Case Latency | Certification | Complexity |
|---|---|---|---|
| Standard Linux kernel | ~1-10 ms | None | Low |
| PREEMPT_RT | ~20-80 us | Possible (IEC 62443) | Medium |
| Xenomai/RTAI (dual-kernel) | ~5-15 us | Possible | High |
| Bare-metal / RTOS | ~1 us | IEC 61508 possible | Application-dependent |
Cache Hierarchy and Latency
Memory access time is often the hidden bottleneck in real-time systems. Modern CPUs use a hierarchy of caches to bridge the gap between CPU speed and DRAM latency:
| Level | Typical Size (Cortex-A72) | Access Latency | What Lives Here |
|---|---|---|---|
| L1 cache | 32 KB per core | ~1 ns | Hot loop code + immediate data |
| L2 cache | 512 KB per core | ~5 ns | Working set of active task |
| L3 cache | 1 MB shared | ~20 ns | Shared data across cores |
| DRAM | 1-8 GB | ~100 ns | Everything else |
Working set: For a real-time control loop, the code and data touched in each iteration should fit in L1/L2 cache. If your control loop code is 20 KB and your sensor buffer is 4 KB, the total 24 KB working set fits comfortably in L1 (32 KB). If it spills into L2 or DRAM, latency becomes unpredictable because cache misses add 5-100 ns per access.
Cache-line false sharing: A cache line is typically 64 bytes. If two threads on different cores write to variables that happen to reside in the same 64-byte cache line, the hardware bounces the line between cores on every write — even though the threads are accessing different variables. This causes unexpected latency spikes in RT tasks. The fix is to align shared structures to cache-line boundaries (e.g., `__attribute__((aligned(64)))` in C).
Live Demo: cyclictest on Your Pi
Run cyclictest on the Raspberry Pi and observe scheduling latency. See PREEMPT_RT Latency Tutorial for the full hands-on.
3. When PREEMPT_RT Is Not Enough
PREEMPT_RT brings Linux into the ~20-80 us worst-case latency range, which is sufficient for many industrial applications: motor control at 1 kHz, sensor sampling at 500 Hz, and audio processing at 48 kHz all fit comfortably. But some applications need tighter guarantees. A safety-critical system that must respond within 5 us, a system that requires formal certification (IEC 61508 SIL 2+), or a control loop running at 10 kHz or faster cannot rely on Linux alone — the kernel's complexity means there are always edge cases where latency exceeds the budget.
For these cases, the solution is to move the time-critical work off Linux entirely and onto dedicated real-time hardware. Linux becomes the supervisor — configuring, monitoring, and logging — while a dedicated MCU or FPGA handles the hard real-time control loop.
For sub-10 us requirements, safety-critical certification (IEC 61508 SIL 2+), or guaranteed hard deadlines, Linux alone is not enough. Solutions:
- External MCU (Arduino, STM32 via UART/SPI): Linux configures and logs, MCU runs the RT control loop
- Heterogeneous SoC (STM32MP1, i.MX8M): one chip, two cores — Linux on Cortex-A, RTOS on Cortex-M. See Boot Architectures Section 6 for details on heterogeneous boot.
- FPGA: hardware-level determinism, sub-microsecond response
Architecture patterns:
graph LR
subgraph "Linux (Cortex-A)"
UI[UI / Dashboard]
LOG[Data Logger]
CFG[Configuration]
end
subgraph "Bridge"
UART[UART / SPI / Shared Memory]
end
subgraph "MCU / Cortex-M"
SENSE[Sensor Read]
CTRL[Control Algorithm]
ACT[Actuator Drive]
end
CFG --> UART
UART --> SENSE
SENSE --> CTRL
CTRL --> ACT
ACT -->|telemetry| UART
UART --> LOG
UART --> UI
Design rule: Linux is the supervisor (configure, monitor, log, display). The MCU is the worker (read sensors, compute control, drive actuators).
4. ROS 2 — When You Need Middleware
ROS 2 (Robot Operating System 2) is a DDS-based publish/subscribe middleware for robotic systems:
- Nodes: independent processes (sensor node, planner node, motor node)
- Topics: named data channels (e.g., `/imu/data`, `/cmd_vel`)
- QoS profiles: configure reliability, deadline, liveliness per topic
- micro-ROS: lightweight ROS 2 for MCUs (runs on FreeRTOS/Zephyr, communicates with Linux host)
When to use: Multi-node robotic systems, teams developing different subsystems, need for standard interfaces. When overkill: Single-board sensor-to-display pipeline (like our level display), simple two-device setups.
Info
ROS 2 is not covered further in this course. If your project involves multi-node robotics, see ros.org for tutorials.
5. RT Control Loop Design
A real-time control loop follows the Sensor → Filter → Control → Actuator pipeline:
graph LR
S[Sensor Read<br/>Budget: 2 ms] --> F[Filter<br/>Budget: 1 ms]
F --> C[Control Law<br/>Budget: 0.5 ms]
C --> A[Actuator Write<br/>Budget: 0.5 ms]
A -->|Next cycle| S
style S fill:#e1f5fe
style F fill:#fff3e0
style C fill:#e8f5e9
style A fill:#fce4ec
Latency budget: Each stage has a maximum execution time. The sum must be less than the loop period.
Example: Robot arm with 1 kHz control loop (1 ms period):
| Stage | Budget | Measured | Margin |
|---|---|---|---|
| IMU read (SPI) | 200 us | 150 us | 50 us |
| Kalman filter | 300 us | 220 us | 80 us |
| PID compute | 100 us | 60 us | 40 us |
| PWM write | 100 us | 80 us | 20 us |
| Total | 700 us | 510 us | 190 us (27%) |
| OS overhead + slack | 300 us | — | — |
Rule of thumb: If your measured total exceeds 70% of the period, you have insufficient margin for worst-case jitter.
5A. Control Theory Foundations
The PID controller is the workhorse of real-time control. Nearly every tutorial in this course — ball balancing, plate balancing, IMU controller — uses one. This section derives the mathematics behind it.
PID Transfer Function
In the Laplace domain, the PID controller is:

\[ C(s) = K_p + \frac{K_i}{s} + K_d\, s \]

In the time domain, the controller output \(u(t)\) given error \(e(t)\) is:

\[ u(t) = K_p\, e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d\, \frac{d e(t)}{d t} \]
Discrete-Time PID (What the Code Implements)
Microcontrollers and Linux control loops run at a fixed sample period \(T_s\). The discrete-time PID replaces integration with summation and differentiation with finite differences:

\[ u[n] = K_p\, e[n] + K_i\, T_s \sum_{k=0}^{n} e[k] + K_d\, \frac{e[n] - e[n-1]}{T_s} \]
In code, the integral term is accumulated incrementally (`integral += error * dt`) and the derivative term uses the difference between the current and previous error.
What Each Term Does
| Term | Analogy | Role | Too little | Too much |
|---|---|---|---|---|
| P (proportional) | Spring | Pushes output toward setpoint | Sluggish response | Oscillation |
| I (integral) | Memory | Eliminates steady-state error | Permanent offset | Windup, slow oscillation |
| D (derivative) | Damper | Resists rapid changes, reduces overshoot | Overshoot, ringing | Noise amplification |
Plant Modeling: Ball-on-Beam
For the ball-balancing tutorial, the plant (ball on a tilted beam) is a double integrator. For a solid ball rolling without slipping, the ball's position \(x\) along the beam relates to beam angle \(\theta\) by:

\[ \ddot{x}(t) = \tfrac{5}{7}\, g \sin\theta(t) \;\approx\; \tfrac{5}{7}\, g\, \theta(t) \quad \text{for small angles} \]

where \(g \approx 9.81\text{ m/s}^2\), the factor \(5/7\) comes from the rolling ball's moment of inertia, and the ball must stay within the beam length \(L\). A double integrator is marginally stable — without control, any disturbance causes the ball to accelerate off the beam. This is why derivative action (damping) is essential.
Ziegler-Nichols Tuning
A classical method for finding initial PID gains:
- Set \(K_i = 0\), \(K_d = 0\)
- Increase \(K_p\) until the system oscillates with constant amplitude — this is the ultimate gain \(K_u\)
- Measure the oscillation period \(T_u\)
- Set gains using the Ziegler-Nichols table:
| Controller | \(K_p\) | \(K_i\) | \(K_d\) |
|---|---|---|---|
| P only | \(0.5\,K_u\) | — | — |
| PI | \(0.45\,K_u\) | \(1.2\,K_p / T_u\) | — |
| PID | \(0.6\,K_u\) | \(2\,K_p / T_u\) | \(K_p \cdot T_u / 8\) |
These are starting values — manual fine-tuning is almost always needed.
Anti-Windup
When the actuator saturates (e.g., servo at its physical limit), the error persists and the integral term accumulates without bound. When the error finally reverses, the bloated integral causes massive overshoot. Anti-windup clamps the integral term:

\[ I[n] = \operatorname{clamp}\left(I[n-1] + e[n]\, T_s,\; -I_{\max},\; I_{\max}\right) \]
Alternatively, stop accumulating when the output is saturated (conditional integration):

\[ I[n] = \begin{cases} I[n-1] + e[n]\, T_s & \text{if } u_{\min} < u[n] < u_{\max} \\ I[n-1] & \text{otherwise} \end{cases} \]
Worked Example: Ball-on-Beam
Using the parameters from the ball-balancing tutorial:
- \(K_p = 25\), \(K_i = 0.5\), \(K_d = 12\), \(T_s = 20\text{ ms}\)
- Distance sensor reads ball position, setpoint is beam center
- At \(e[n] = 3\text{ cm}\), \(e[n-1] = 3.5\text{ cm}\), integral \(= 12\text{ cm-s}\):

\[ u[n] = 25(3) + 0.5(12) + 12 \cdot \frac{3 - 3.5}{0.02} = 75 + 6 - 300 = -219 \]

The large negative derivative term (ball is moving toward center) reduces the output, preventing overshoot. After clamping to the servo range, this becomes the servo angle command.
5B. Schedulability Analysis
The "70% rule of thumb" from Section 5 has a formal basis: Rate Monotonic Analysis (RMA), the foundational theory of real-time scheduling.
Rate Monotonic Scheduling (RMS)
RMS assigns priorities by period: shorter period = higher priority. For \(n\) periodic tasks, each with worst-case execution time \(C_i\) and period \(T_i\), the system is guaranteed schedulable if:

\[ U = \sum_{i=1}^{n} \frac{C_i}{T_i} \;\le\; n\left(2^{1/n} - 1\right) \]
This is the Liu & Layland bound (1973). The right-hand side converges:
| Tasks (\(n\)) | Utilization bound |
|---|---|
| 1 | 1.000 (100%) |
| 2 | 0.828 (82.8%) |
| 3 | 0.780 (78.0%) |
| 4 | 0.757 (75.7%) |
| \(\infty\) | \(\ln 2 \approx 0.693\) (69.3%) |
The "70% rule" from Section 5 is the asymptotic RMA bound. If your total CPU utilization stays below 69.3%, the system is schedulable under RMS regardless of the number of tasks.
Note
The Liu & Layland bound is sufficient but not necessary — a task set with \(U > 0.693\) may still be schedulable. Exact analysis (response time analysis) can verify schedulability up to \(U = 1.0\) for specific task sets. But the bound gives a safe, quick check.
Earliest Deadline First (EDF)
EDF is a dynamic-priority scheduler: at each scheduling point, the task with the nearest deadline runs. EDF is provably optimal — if any scheduler can meet all deadlines, EDF can too. The schedulability condition is simply:

\[ \sum_{i=1}^{n} \frac{C_i}{T_i} \;\le\; 1 \]
EDF achieves higher utilization than RMS but is harder to implement and analyze in practice. Linux's SCHED_DEADLINE policy implements EDF.
Response Time Analysis (Two-Task Example)
For exact schedulability, compute the worst-case response time \(R_i\) of each task. For the highest-priority task: \(R_1 = C_1\). For lower-priority tasks, \(R_i\) includes interference from higher-priority tasks:

\[ R_i = C_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i}{T_j} \right\rceil C_j \]
This is solved iteratively. Example with two tasks:
- Task 1 (sensor): \(C_1 = 0.2\text{ ms}\), \(T_1 = 1\text{ ms}\) (1 kHz)
- Task 2 (logger): \(C_2 = 3\text{ ms}\), \(T_2 = 10\text{ ms}\) (100 Hz)
\(R_1 = 0.2\text{ ms}\) (no interference). For Task 2:
- Iteration 1: \(R_2 = 3 + \lceil 3/1 \rceil \times 0.2 = 3 + 0.6 = 3.6\text{ ms}\)
- Iteration 2: \(R_2 = 3 + \lceil 3.6/1 \rceil \times 0.2 = 3 + 0.8 = 3.8\text{ ms}\)
- Iteration 3: \(R_2 = 3 + \lceil 3.8/1 \rceil \times 0.2 = 3 + 0.8 = 3.8\text{ ms}\) (converged)
\(R_2 = 3.8\text{ ms} < T_2 = 10\text{ ms}\) — Task 2 meets its deadline.
Priority Inversion (Formal)
Priority inversion occurs when a high-priority task \(\tau_H\) is blocked because a low-priority task \(\tau_L\) holds a resource that \(\tau_H\) needs, and a medium-priority task \(\tau_M\) preempts \(\tau_L\) — extending the blocking time of \(\tau_H\) indefinitely. This is the scenario that caused the Mars Pathfinder resets in 1997.
Priority inheritance (the fix used for Mars Pathfinder, and what PREEMPT_RT provides) temporarily raises \(\tau_L\)'s priority to \(\tau_H\)'s level while it holds the shared resource, preventing \(\tau_M\) from preempting.
Worked Example: Three-Task System
| Task | Function | \(C_i\) | \(T_i\) | \(U_i\) |
|---|---|---|---|---|
| \(\tau_1\) | Sensor read | 0.15 ms | 1 ms (1 kHz) | 0.150 |
| \(\tau_2\) | Filter | 0.30 ms | 2 ms (500 Hz) | 0.150 |
| \(\tau_3\) | Data logger | 5.0 ms | 100 ms (10 Hz) | 0.050 |
Total utilization: \(U = 0.150 + 0.150 + 0.050 = 0.350\)
Liu & Layland bound for \(n=3\): \(3(2^{1/3} - 1) = 0.780\)
Since \(0.350 < 0.780\), the system is guaranteed schedulable under RMS with significant margin.
6. Debugging and Validating RT
cyclictest
cyclictest measures scheduling latency — the time between when a thread should wake up and when it actually wakes up:
sudo cyclictest -t1 -p99 -a3 -i1000 -l100000
Flag breakdown:
| Flag | Meaning |
|---|---|
| `-t1` | Use 1 measurement thread |
| `-p99` | Run at real-time priority 99 (highest SCHED_FIFO priority) |
| `-a3` | Pin the thread to CPU core 3 (avoids core-migration jitter) |
| `-i1000` | Loop interval of 1000 us (1 ms = 1 kHz measurement rate) |
| `-l100000` | Run for 100,000 iterations (100 seconds at 1 kHz) |
The output shows min, avg, and max latency. The max value is what matters — it represents the worst-case scheduling latency observed during the test. An average of 10 us means nothing if the max is 5 ms; your deadline is determined by worst-case, not typical behavior.
Good vs Bad Histograms
When you add -h400 to cyclictest, it produces a histogram. The shape of the histogram tells you whether your RT configuration is working:
- Good (tall, narrow peak): 95%+ of samples within 10-20 us, no samples beyond 100 us. The system is well-tuned.
- Bad (wide distribution, long tail): Samples scattered from 10 us to 5 ms. The tail determines your worst-case deadline miss — investigate the source of the outliers.
A long tail is usually caused by one of: CPU frequency transitions (DVFS), unmasked interrupts, cache misses from working set overflow, or a non-RT kernel preempting your thread.
ftrace and trace-cmd
ftrace traces kernel events: function calls, scheduling switches, interrupt handlers. trace-cmd provides a user-friendly wrapper:
# Record scheduling events for 10 seconds
sudo trace-cmd record -p function_graph -e sched_switch sleep 10
sudo trace-cmd report | head -50
Methodology: Test Under Load
Warning
Never report average latency. Always test under stress-ng load and report the 99.9th percentile (or maximum).
"Average latency is 50 us" means nothing if the worst case is 10 ms. The worst case is your deadline.
# Generate CPU + I/O stress
stress-ng --cpu 4 --io 2 --vm 1 --vm-bytes 128M --timeout 60s &
# Measure latency under stress
sudo cyclictest -t1 -p99 -a3 -i1000 -l100000 -h400 > histogram.txt
Latency histogram
Plot the histogram to see the distribution. A well-tuned RT system shows a tight cluster with no long tail:
Latency (us) | Count
─────────────────┼────────
0 - 10 | ████████████████████ 89,200
10 - 20 | ████ 4,100
20 - 50 | ██ 2,500
50 - 100 | █ 1,800
100 - 200 | ▌ 350
200+ | ▏ 50 ← investigate these!
Jitter Statistics
Understanding why worst-case matters requires a statistical perspective on RT latency.
Why RT Latency Is Not Gaussian
Scheduling latency has a hard lower bound (the minimum time for a context switch) but no hard upper bound — rare events (cache flush, TLB shootdown, IRQ storm) create an asymmetric long tail. The distribution is right-skewed, meaning the mean and median understate the worst case.
Percentile Analysis
The \(P_{99.9}\) (99.9th percentile) is the latency value below which 99.9% of all samples fall. For a 1 kHz control loop running 24/7:
| Percentile | Meaning | Misses per day (at 1 kHz) |
|---|---|---|
| \(P_{99}\) | 1 in 100 exceeds this | ~864,000 |
| \(P_{99.9}\) | 1 in 1,000 exceeds this | ~86,400 |
| \(P_{99.99}\) | 1 in 10,000 exceeds this | ~8,640 |
| Max | Absolute worst observed | 0 — by definition, no sample exceeds it |
For firm real-time (motor control, audio), design to \(P_{99.9}\). For hard real-time (safety-critical), design to the observed maximum — or better, prove a worst-case bound analytically.
How Many Samples Do You Need?
To estimate the \(p\)-th tail percentile with confidence, you need at least:
| Target percentile | Minimum samples |
|---|---|
| \(P_{99}\) | 1,000 |
| \(P_{99.9}\) | 10,000 |
| \(P_{99.99}\) | 100,000 |
A 100-second cyclictest at 1 kHz gives 100,000 samples — sufficient for \(P_{99.99}\). A 10-second test only supports \(P_{99.9}\) claims.
Outlier Classification
A sample more than \(3\sigma\) from the median (not mean — the median is robust to skew) warrants investigation. Common causes:
- Single spike, non-recurring: CPU frequency transition (DVFS) — fix with the `performance` governor
- Periodic spikes: timer tick interference — fix with `nohz_full`
- Clustered spikes under load: cache thrashing — fix with CPU isolation or a smaller working set
7. CPU Isolation for Real-Time
On a standard Linux system, the kernel runs housekeeping tasks (timers, RCU callbacks, workqueues) on every CPU core. These tasks cause latency spikes on RT threads. CPU isolation reserves one or more cores exclusively for your RT workload:
| Parameter | Effect |
|---|---|
| `isolcpus=3` | Removes core 3 from the general scheduler — only explicitly pinned tasks run there |
| `nohz_full=3` | Disables periodic timer ticks on core 3 when only one task is running (reduces interruptions) |
| `rcu_nocbs=3` | Moves RCU callback processing off core 3 to other cores |
Before isolation (cyclictest on shared core): max latency ~200 us, frequent spikes from OS housekeeping. After isolation (cyclictest on isolated core): max latency ~30 us, clean histogram with no tail.
Pin your RT application to the isolated core with taskset -c 3 ./rt_app or cyclictest -a3.
8. DVFS and Frequency Scaling
DVFS (Dynamic Voltage and Frequency Scaling) saves power by reducing CPU frequency and voltage during idle periods. However, transitioning between frequency states takes 1-2 ms — which appears as a latency spike in your cyclictest histogram.
For real-time workloads, lock the CPU to maximum frequency:
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Set performance governor (fixed frequency, no transitions)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
The performance governor holds the CPU at maximum frequency permanently. This uses more power but eliminates frequency-transition latency spikes. For battery-powered RT systems, this is a trade-off: accept higher power consumption during the RT window, or accept occasional 1-2 ms latency spikes from DVFS transitions.
Quick Checks
- What is the difference between hard and soft real-time?
- Name two things PREEMPT_RT changes in the kernel.
- Why is average latency a misleading metric for RT systems?
- When would you add an external MCU instead of using PREEMPT_RT?
- What does `cyclictest` measure?
Mini Exercise
Scenario
Your team is building a robotic arm that must update motor positions at 500 Hz.
- Calculate the loop period.
- Create a latency budget (sensor read, filter, control, actuator write).
- Which approach would you use: PREEMPT_RT only, or Linux + external MCU? Justify your choice using the comparison table from Section 2.
- What percentile of `cyclictest` results must fit within your period for the system to be considered reliable?
Key Takeaways
- Real-time means predictable, not fast
- PREEMPT_RT gets Linux to ~50 us worst-case — enough for many industrial applications
- Below ~10 us or for safety-critical systems, use an external MCU or heterogeneous SoC
- Always validate under load; report worst-case (99.9th percentile), never average
- Design with a latency budget — if measured time exceeds 70% of the period, you need more margin
Hands-On
- PREEMPT_RT: Latency Measurement — measure scheduling latency with `cyclictest`
- Jitter Measurement — analyze timing jitter in sensor loops
- MCU Real-Time Controller — external Pico 2 W running the control loop, supervised from Linux