Real-Time Systems Design
Goal: Understand when Linux can meet real-time requirements and when you need external help.
Related Tutorials
For hands-on practice, see: PREEMPT_RT Latency | Jitter Measurement | MCU Real-Time Controller
Your sensor driver reads the IMU at 200 Hz — but sometimes a read takes 50 ms instead of 5 ms. The display stutters, the control loop misses its deadline. Can Linux do real-time? The answer is: it depends.
1. What Is Real-Time?
| Category | Example | Deadline Tolerance | Consequence of Miss |
|---|---|---|---|
| Hard RT | Flight controller, ABS brakes | 0 — deadline = system failure | Crash, injury, hardware damage |
| Firm RT | Audio playback, motor control | Small — occasional miss degrades output | Audible glitch, vibration |
| Soft RT | Video streaming, UI refresh | Flexible — user notices but system recovers | Frame drop, lag |
| Best-effort | File download, logging | None — finish eventually | Slower experience |
Real-time ≠ fast. It means predictable. A system that always responds in 10 ms is more real-time than one that usually responds in 1 ms but sometimes takes 100 ms.
2. PREEMPT_RT Deep Dive
Standard Linux was designed for servers and desktops, where the goal is to maximize total throughput — process as many requests as possible per second. Real-time systems have a fundamentally different goal: respond to each individual event within a guaranteed time. PREEMPT_RT is a set of kernel patches (now merged into mainline as of Linux 6.12) that reshapes the kernel's internal locking and interrupt handling to make scheduling more predictable. It does not make Linux faster — it makes Linux more deterministic.
The key insight behind PREEMPT_RT is that unpredictable latency in standard Linux comes from places where the kernel cannot be interrupted: inside interrupt handlers, inside spinlock-protected critical sections, and during RCU (Read-Copy-Update) callbacks. PREEMPT_RT converts these non-preemptible sections into preemptible ones, so a high-priority real-time task can interrupt almost anything.
What PREEMPT_RT changes:
- Threaded interrupts: Hardware IRQ handlers become kernel threads with priorities — a high-priority RT task can preempt an interrupt handler
- Sleeping spinlocks: Critical sections use mutexes instead of spinlocks, allowing preemption inside kernel code
- Priority inheritance: If a high-priority task blocks on a mutex held by a low-priority task, the low-priority task temporarily inherits the high priority (avoids priority inversion)
What these mean in practice:
- Threaded interrupts: Your 500 Hz control loop can be assigned a higher priority than the network interrupt handler — so an incoming packet burst cannot delay your sensor read.
- Sleeping spinlocks: A high-priority RT task is no longer blocked indefinitely while an unrelated low-priority task holds a kernel lock.
- Priority inheritance: Prevents unbounded priority inversion — the scenario that caused the Mars Pathfinder spacecraft to reset repeatedly in 1997 when a low-priority task held a lock needed by the high-priority bus controller.
What it does NOT change:
- Hardware latency (DMA, cache misses)
- GPU scheduling (display pipeline has its own timing)
- Worst-case interrupt latency floor (~10-50 us, set by hardware)
| Approach | Typical Worst-Case Latency | Certification | Complexity |
|---|---|---|---|
| Standard Linux kernel | ~1-10 ms | None | Low |
| PREEMPT_RT | ~20-80 us | Possible (IEC 62443) | Medium |
| Xenomai/RTAI (dual-kernel) | ~5-15 us | Possible | High |
| Bare-metal / RTOS | ~1 us | IEC 61508 possible | Application-dependent |
Cache Hierarchy and Latency
Memory access time is often the hidden bottleneck in real-time systems. Modern CPUs use a hierarchy of caches to bridge the gap between CPU speed and DRAM latency:
| Level | Typical Size (Cortex-A72) | Access Latency | What Lives Here |
|---|---|---|---|
| L1 cache | 32 KB per core | ~1 ns | Hot loop code + immediate data |
| L2 cache | 512 KB per core | ~5 ns | Working set of active task |
| L3 cache | 1 MB shared | ~20 ns | Shared data across cores |
| DRAM | 1-8 GB | ~100 ns | Everything else |
Working set: For a real-time control loop, the code and data touched in each iteration should fit in L1/L2 cache. If your control loop code is 20 KB and your sensor buffer is 4 KB, the total 24 KB working set fits comfortably in L1 (32 KB). If it spills into L2 or DRAM, latency becomes unpredictable because cache misses add 5-100 ns per access.
Cache-line false sharing: A cache line is typically 64 bytes. If two threads on different cores write to variables that happen to reside in the same 64-byte cache line, the hardware bounces the line between cores on every write — even though the threads are accessing different variables. This causes unexpected latency spikes in RT tasks. The fix is to align shared structures to cache-line boundaries (e.g., `__attribute__((aligned(64)))` in C).
Live Demo: cyclictest on Your Pi
Run cyclictest on the Raspberry Pi and observe scheduling latency. See PREEMPT_RT Latency Tutorial for the full hands-on.
3. When PREEMPT_RT Is Not Enough
PREEMPT_RT brings Linux into the ~20-80 us worst-case latency range, which is sufficient for many industrial applications: motor control at 1 kHz, sensor sampling at 500 Hz, and audio processing at 48 kHz all fit comfortably. But some applications need tighter guarantees. A safety-critical system that must respond within 5 us, a system that requires formal certification (IEC 61508 SIL 2+), or a control loop running at 10 kHz or faster cannot rely on Linux alone — the kernel's complexity means there are always edge cases where latency exceeds the budget.
For these cases, the solution is to move the time-critical work off Linux entirely and onto dedicated real-time hardware. Linux becomes the supervisor — configuring, monitoring, and logging — while a dedicated MCU or FPGA handles the hard real-time control loop.
For sub-10 us requirements, safety-critical certification (IEC 61508 SIL 2+), or guaranteed hard deadlines, Linux alone is not enough. Solutions:
- External MCU (Arduino, STM32 via UART/SPI): Linux configures and logs, MCU runs the RT control loop
- Heterogeneous SoC (STM32MP1, i.MX8M): one chip, two cores — Linux on Cortex-A, RTOS on Cortex-M. See Boot Architectures Section 6 for details on heterogeneous boot.
- FPGA: hardware-level determinism, sub-microsecond response
Architecture patterns:
graph LR
subgraph "Linux (Cortex-A)"
UI[UI / Dashboard]
LOG[Data Logger]
CFG[Configuration]
end
subgraph "Bridge"
UART[UART / SPI / Shared Memory]
end
subgraph "MCU / Cortex-M"
SENSE[Sensor Read]
CTRL[Control Algorithm]
ACT[Actuator Drive]
end
CFG --> UART
UART --> SENSE
SENSE --> CTRL
CTRL --> ACT
ACT -->|telemetry| UART
UART --> LOG
UART --> UI
Design rule: Linux is the supervisor (configure, monitor, log, display). The MCU is the worker (read sensors, compute control, drive actuators).
4. ROS 2 — When You Need Middleware
ROS 2 (Robot Operating System 2) is a DDS-based publish/subscribe middleware for robotic systems:
- Nodes: independent processes (sensor node, planner node, motor node)
- Topics: named data channels (e.g., `/imu/data`, `/cmd_vel`)
- QoS profiles: configure reliability, deadline, liveliness per topic
- micro-ROS: lightweight ROS 2 for MCUs (runs on FreeRTOS/Zephyr, communicates with Linux host)
When to use: Multi-node robotic systems, teams developing different subsystems, need for standard interfaces. When overkill: Single-board sensor-to-display pipeline (like our level display), simple two-device setups.
Info
ROS 2 is not covered further in this course. If your project involves multi-node robotics, see ros.org for tutorials.
5. RT Control Loop Design
A real-time control loop follows the Sensor → Filter → Control → Actuator pipeline:
graph LR
S[Sensor Read<br/>Budget: 2 ms] --> F[Filter<br/>Budget: 1 ms]
F --> C[Control Law<br/>Budget: 0.5 ms]
C --> A[Actuator Write<br/>Budget: 0.5 ms]
A -->|Next cycle| S
style S fill:#e1f5fe
style F fill:#fff3e0
style C fill:#e8f5e9
style A fill:#fce4ec
Latency budget: Each stage has a maximum execution time. The sum must be less than the loop period.
Example: Robot arm with 1 kHz control loop (1 ms period):
| Stage | Budget | Measured | Margin |
|---|---|---|---|
| IMU read (SPI) | 200 us | 150 us | 50 us |
| Kalman filter | 300 us | 220 us | 80 us |
| PID compute | 100 us | 60 us | 40 us |
| PWM write | 100 us | 80 us | 20 us |
| Total | 700 us | 510 us | 190 us (27%) |
| OS overhead + slack | 300 us | — | — |
Rule of thumb: If your measured total exceeds 70% of the period, you have insufficient margin for worst-case jitter.
5A. Control Theory Foundations
The PID controller is the workhorse of real-time control. Nearly every tutorial in this course — ball balancing, plate balancing, IMU controller — uses one. This section derives the mathematics behind it.
PID Transfer Function
In the Laplace domain, the PID controller is:

\[ C(s) = K_p + \frac{K_i}{s} + K_d\, s \]

In the time domain, the controller output \(u(t)\) given error \(e(t)\) is:

\[ u(t) = K_p\, e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d\, \frac{d e(t)}{d t} \]
Discrete-Time PID (What the Code Implements)
Microcontrollers and Linux control loops run at a fixed sample period \(T_s\). The discrete-time PID replaces integration with summation and differentiation with finite differences:

\[ u[n] = K_p\, e[n] + K_i\, T_s \sum_{k=0}^{n} e[k] + K_d\, \frac{e[n] - e[n-1]}{T_s} \]
In code, the integral term is accumulated incrementally (`integral += error * dt`) and the derivative term uses the difference between the current and previous error.
What Each Term Does
| Term | Analogy | Role | Too little | Too much |
|---|---|---|---|---|
| P (proportional) | Spring | Pushes output toward setpoint | Sluggish response | Oscillation |
| I (integral) | Memory | Eliminates steady-state error | Permanent offset | Windup, slow oscillation |
| D (derivative) | Damper | Resists rapid changes, reduces overshoot | Overshoot, ringing | Noise amplification |
Plant Modeling: Ball-on-Beam
For the ball-balancing tutorial, the plant (ball on a tilted beam) is a double integrator. For a solid ball rolling without slipping, the ball's position \(x\) along the beam relates to beam angle \(\theta\) by:

\[ \ddot{x}(t) = \tfrac{5}{7}\, g \sin\theta(t) \;\approx\; \tfrac{5}{7}\, g\, \theta(t) \quad \text{for small angles} \]

where \(g \approx 9.81\text{ m/s}^2\), the factor \(5/7\) comes from the rolling ball's moment of inertia, and the ball must stay within the beam length \(L\). A double integrator is marginally stable — without control, any disturbance causes the ball to accelerate off the beam. This is why derivative action (damping) is essential.
Ziegler-Nichols Tuning
A classical method for finding initial PID gains:
- Set \(K_i = 0\), \(K_d = 0\)
- Increase \(K_p\) until the system oscillates with constant amplitude — this is the ultimate gain \(K_u\)
- Measure the oscillation period \(T_u\)
- Set gains using the Ziegler-Nichols table:
| Controller | \(K_p\) | \(K_i\) | \(K_d\) |
|---|---|---|---|
| P only | \(0.5\,K_u\) | — | — |
| PI | \(0.45\,K_u\) | \(1.2\,K_p / T_u\) | — |
| PID | \(0.6\,K_u\) | \(2\,K_p / T_u\) | \(K_p \cdot T_u / 8\) |
These are starting values — manual fine-tuning is almost always needed.
Anti-Windup
When the actuator saturates (e.g., servo at its physical limit), the error persists and the integral term accumulates without bound. When the error finally reverses, the bloated integral causes massive overshoot. Anti-windup clamps the integral term:

\[ I[n] = \operatorname{clamp}\left(I[n-1] + e[n]\, T_s,\; -I_{\max},\; I_{\max}\right) \]
Alternatively, stop accumulating when the output is saturated (conditional integration):

\[ I[n] = \begin{cases} I[n-1] + e[n]\, T_s & \text{if } u_{\min} < u[n] < u_{\max} \\ I[n-1] & \text{otherwise} \end{cases} \]
Worked Example: Ball-on-Beam
Using the parameters from the ball-balancing tutorial:
- \(K_p = 25\), \(K_i = 0.5\), \(K_d = 12\), \(T_s = 20\text{ ms}\)
- Distance sensor reads ball position, setpoint is beam center
- At \(e[n] = 3\text{ cm}\), \(e[n-1] = 3.5\text{ cm}\), integral \(= 12\text{ cm-s}\):

\[ u[n] = 25(3) + 0.5(12) + 12 \cdot \frac{3 - 3.5}{0.02} = 75 + 6 - 300 = -219 \]

The large negative derivative term (ball is moving toward center) reduces the output, preventing overshoot. After clamping to the servo range, this becomes the servo angle command.
5B. Schedulability Analysis
The "70% rule of thumb" from Section 5 has a formal basis: Rate Monotonic Analysis (RMA), the foundational theory of real-time scheduling.
Rate Monotonic Scheduling (RMS)
RMS assigns priorities by period: shorter period = higher priority. For \(n\) periodic tasks, each with worst-case execution time \(C_i\) and period \(T_i\), the system is guaranteed schedulable if:

\[ U = \sum_{i=1}^{n} \frac{C_i}{T_i} \;\le\; n\left(2^{1/n} - 1\right) \]
This is the Liu & Layland bound (1973). The right-hand side converges:
| Tasks (\(n\)) | Utilization bound |
|---|---|
| 1 | 1.000 (100%) |
| 2 | 0.828 (82.8%) |
| 3 | 0.780 (78.0%) |
| 4 | 0.757 (75.7%) |
| \(\infty\) | \(\ln 2 \approx 0.693\) (69.3%) |
The "70% rule" from Section 5 is the asymptotic RMA bound. If your total CPU utilization stays below 69.3%, the system is schedulable under RMS regardless of the number of tasks.
Note
The Liu & Layland bound is sufficient but not necessary — a task set with \(U > 0.693\) may still be schedulable. Exact analysis (response time analysis) can verify schedulability up to \(U = 1.0\) for specific task sets. But the bound gives a safe, quick check.
Earliest Deadline First (EDF)
EDF is a dynamic-priority scheduler: at each scheduling point, the task with the nearest deadline runs. EDF is provably optimal — if any scheduler can meet all deadlines, EDF can too. The schedulability condition is simply:

\[ \sum_{i=1}^{n} \frac{C_i}{T_i} \;\le\; 1 \]
EDF achieves higher utilization than RMS but is harder to implement and analyze in practice. Linux's SCHED_DEADLINE policy implements EDF.
Response Time Analysis (Two-Task Example)
For exact schedulability, compute the worst-case response time \(R_i\) of each task. For the highest-priority task: \(R_1 = C_1\). For lower-priority tasks, \(R_i\) includes interference from higher-priority tasks:

\[ R_i = C_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i}{T_j} \right\rceil C_j \]
This is solved iteratively. Example with two tasks:
- Task 1 (sensor): \(C_1 = 0.2\text{ ms}\), \(T_1 = 1\text{ ms}\) (1 kHz)
- Task 2 (logger): \(C_2 = 3\text{ ms}\), \(T_2 = 10\text{ ms}\) (100 Hz)
\(R_1 = 0.2\text{ ms}\) (no interference). For Task 2:
- Iteration 1: \(R_2 = 3 + \lceil 3/1 \rceil \times 0.2 = 3 + 0.6 = 3.6\text{ ms}\)
- Iteration 2: \(R_2 = 3 + \lceil 3.6/1 \rceil \times 0.2 = 3 + 0.8 = 3.8\text{ ms}\)
- Iteration 3: \(R_2 = 3 + \lceil 3.8/1 \rceil \times 0.2 = 3 + 0.8 = 3.8\text{ ms}\) (converged)
\(R_2 = 3.8\text{ ms} < T_2 = 10\text{ ms}\) — Task 2 meets its deadline.
Priority Inversion (Formal)
Priority inversion occurs when a high-priority task \(\tau_H\) is blocked because a low-priority task \(\tau_L\) holds a resource that \(\tau_H\) needs, and a medium-priority task \(\tau_M\) preempts \(\tau_L\) — extending the blocking time of \(\tau_H\) indefinitely. This is the scenario that caused the Mars Pathfinder resets in 1997.
Priority inheritance (the fix used for Mars Pathfinder, and what PREEMPT_RT provides) temporarily raises \(\tau_L\)'s priority to \(\tau_H\)'s level while it holds the shared resource, preventing \(\tau_M\) from preempting.
Worked Example: Three-Task System
| Task | Function | \(C_i\) | \(T_i\) | \(U_i\) |
|---|---|---|---|---|
| \(\tau_1\) | Sensor read | 0.15 ms | 1 ms (1 kHz) | 0.150 |
| \(\tau_2\) | Filter | 0.30 ms | 2 ms (500 Hz) | 0.150 |
| \(\tau_3\) | Data logger | 5.0 ms | 100 ms (10 Hz) | 0.050 |
Total utilization: \(U = 0.150 + 0.150 + 0.050 = 0.350\)
Liu & Layland bound for \(n=3\): \(3(2^{1/3} - 1) = 0.780\)
Since \(0.350 < 0.780\), the system is guaranteed schedulable under RMS with significant margin.
6. Debugging and Validating RT
cyclictest
cyclictest measures scheduling latency — the time between when a thread should wake up and when it actually wakes up:
sudo cyclictest -t1 -p99 -a3 -i1000 -l100000
Flag breakdown:
| Flag | Meaning |
|---|---|
| `-t1` | Use 1 measurement thread |
| `-p99` | Run at real-time priority 99 (highest SCHED_FIFO priority) |
| `-a3` | Pin the thread to CPU core 3 (avoids core-migration jitter) |
| `-i1000` | Loop interval of 1000 us (1 ms = 1 kHz measurement rate) |
| `-l100000` | Run for 100,000 iterations (100 seconds at 1 kHz) |
The output shows min, avg, and max latency. The max value is what matters — it represents the worst-case scheduling latency observed during the test. An average of 10 us means nothing if the max is 5 ms; your deadline is determined by worst-case, not typical behavior.
Good vs Bad Histograms
When you add -h400 to cyclictest, it produces a histogram. The shape of the histogram tells you whether your RT configuration is working:
- Good (tall, narrow peak): 95%+ of samples within 10-20 us, no samples beyond 100 us. The system is well-tuned.
- Bad (wide distribution, long tail): Samples scattered from 10 us to 5 ms. The tail determines your worst-case deadline miss — investigate the source of the outliers.
A long tail is usually caused by one of: CPU frequency transitions (DVFS), unmasked interrupts, cache misses from working set overflow, or a non-RT kernel preempting your thread.
ftrace and trace-cmd
ftrace traces kernel events: function calls, scheduling switches, interrupt handlers. trace-cmd provides a user-friendly wrapper:
# Record scheduling events for 10 seconds
sudo trace-cmd record -p function_graph -e sched_switch sleep 10
sudo trace-cmd report | head -50
Methodology: Test Under Load
Warning
Never report average latency. Always test under stress-ng load and report the 99.9th percentile (or maximum).
"Average latency is 50 us" means nothing if the worst case is 10 ms. The worst case is your deadline.
# Generate CPU + I/O stress
stress-ng --cpu 4 --io 2 --vm 1 --vm-bytes 128M --timeout 60s &
# Measure latency under stress
sudo cyclictest -t1 -p99 -a3 -i1000 -l100000 -h400 > histogram.txt
Latency histogram
Plot the histogram to see the distribution. A well-tuned RT system shows a tight cluster with no long tail:
Latency (us) | Count
─────────────────┼────────
0 - 10 | ████████████████████ 89,200
10 - 20 | ████ 4,100
20 - 50 | ██ 2,500
50 - 100 | █ 1,800
100 - 200 | ▌ 350
200+ | ▏ 50 ← investigate these!
Jitter Statistics
Understanding why worst-case matters requires a statistical perspective on RT latency.
Why RT Latency Is Not Gaussian
Scheduling latency has a hard lower bound (the minimum time for a context switch) but no hard upper bound — rare events (cache flush, TLB shootdown, IRQ storm) create an asymmetric long tail. The distribution is right-skewed, meaning the mean and median understate the worst case.
Percentile Analysis
The \(P_{99.9}\) (99.9th percentile) is the latency value below which 99.9% of all samples fall. For a 1 kHz control loop running 24/7:
| Percentile | Meaning | Misses per day (at 1 kHz) |
|---|---|---|
| \(P_{99}\) | 1 in 100 exceeds this | ~864,000 |
| \(P_{99.9}\) | 1 in 1,000 exceeds this | ~86,400 |
| \(P_{99.99}\) | 1 in 10,000 exceeds this | ~8,640 |
| Max | Absolute worst observed | 0 — by definition, no sample exceeds it |
For firm real-time (motor control, audio), design to \(P_{99.9}\). For hard real-time (safety-critical), design to the observed maximum — or better, prove a worst-case bound analytically.
How Many Samples Do You Need?
To estimate the \(p\)-th tail percentile with confidence, you need at least:
| Target percentile | Minimum samples |
|---|---|
| \(P_{99}\) | 1,000 |
| \(P_{99.9}\) | 10,000 |
| \(P_{99.99}\) | 100,000 |
A 100-second cyclictest at 1 kHz gives 100,000 samples — sufficient for \(P_{99.99}\). A 10-second test only supports \(P_{99.9}\) claims.
Outlier Classification
A sample more than \(3\sigma\) from the median (not mean — the median is robust to skew) warrants investigation. Common causes:
- Single spike, non-recurring: CPU frequency transition (DVFS) — fix with the `performance` governor
- Periodic spikes: timer tick interference — fix with `nohz_full`
- Clustered spikes under load: cache thrashing — fix with CPU isolation or a smaller working set
7. CPU Isolation for Real-Time
On a standard Linux system, the kernel runs housekeeping tasks (timers, RCU callbacks, workqueues) on every CPU core. These tasks cause latency spikes on RT threads. CPU isolation reserves one or more cores exclusively for your RT workload:
| Parameter | Effect |
|---|---|
| `isolcpus=3` | Removes core 3 from the general scheduler — only explicitly pinned tasks run there |
| `nohz_full=3` | Disables periodic timer ticks on core 3 when only one task is running (reduces interruptions) |
| `rcu_nocbs=3` | Moves RCU callback processing off core 3 to other cores |
Before isolation (cyclictest on shared core): max latency ~200 us, frequent spikes from OS housekeeping. After isolation (cyclictest on isolated core): max latency ~30 us, clean histogram with no tail.
Pin your RT application to the isolated core with taskset -c 3 ./rt_app or cyclictest -a3.
8. DVFS and Frequency Scaling
DVFS (Dynamic Voltage and Frequency Scaling) saves power by reducing CPU frequency and voltage during idle periods. However, transitioning between frequency states takes 1-2 ms — which appears as a latency spike in your cyclictest histogram.
For real-time workloads, lock the CPU to maximum frequency:
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Set performance governor (fixed frequency, no transitions)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
The performance governor holds the CPU at maximum frequency permanently. This uses more power but eliminates frequency-transition latency spikes. For battery-powered RT systems, this is a trade-off: accept higher power consumption during the RT window, or accept occasional 1-2 ms latency spikes from DVFS transitions.
Quick Checks
- What is the difference between hard and soft real-time?
- Name two things PREEMPT_RT changes in the kernel.
- Why is average latency a misleading metric for RT systems?
- When would you add an external MCU instead of using PREEMPT_RT?
- What does `cyclictest` measure?
Mini Exercise
Scenario
Your team is building a robotic arm that must update motor positions at 500 Hz.
- Calculate the loop period.
- Create a latency budget (sensor read, filter, control, actuator write).
- Which approach would you use: PREEMPT_RT only, or Linux + external MCU? Justify your choice using the comparison table from Section 2.
- What percentile of `cyclictest` results must fit within your period for the system to be considered reliable?
Key Takeaways
- Real-time means predictable, not fast
- PREEMPT_RT gets Linux to ~50 us worst-case — enough for many industrial applications
- Below ~10 us or for safety-critical systems, use an external MCU or heterogeneous SoC
- Always validate under load; report worst-case (99.9th percentile), never average
- Design with a latency budget — if measured time exceeds 70% of the period, you need more margin
Hands-On
- PREEMPT_RT: Latency Measurement — measure scheduling latency with `cyclictest`
- Jitter Measurement — analyze timing jitter in sensor loops
- MCU Real-Time Controller — external Pico 2 W running the control loop, supervised from Linux