Performance Profiling for Embedded Linux
Goal: Learn a systematic workflow for finding and fixing performance bottlenecks in your own applications — not just following a guided optimization tutorial, but knowing what tool to use when your project is slow and you do not know why.
Related Tutorials
For guided measurement practice, see: Jitter Measurement | PREEMPT_RT Latency | SPI DMA Optimization
For architectural context, see: Software Architecture | Real-Time Systems | Real-Time Graphics
Your level display stutters. Your ball detection runs at 17 FPS instead of 30. Your launcher takes 3 seconds to reappear after a child exits. Your dashboard's CPU usage is 40% for what should be a trivial render.
The tutorials taught you how to measure jitter and how to apply DMA optimization. But those were guided — someone told you where the problem was and what to optimize. In your own project, nobody tells you. You need a workflow.
1. The Profiling Workflow
Every performance problem follows the same cycle:
┌──────────────┐
│ 1. MEASURE   │  "How slow is it?"
│    Baseline  │  FPS, latency, CPU%, memory
└──────┬───────┘
       │
┌──────▼───────┐
│ 2. PROFILE   │  "Where is time spent?"
│ Find hotspot │  perf, strace, ftrace
└──────┬───────┘
       │
┌──────▼───────┐
│ 3. IDENTIFY  │  "Why is it slow there?"
│  Root cause  │  Blocking I/O? Missed VSync? Copy?
└──────┬───────┘
       │
┌──────▼───────┐
│ 4. FIX       │  "Change one thing"
│   Targeted   │  DMA, cache, algorithm, architecture
└──────┬───────┘
       │
┌──────▼───────┐
│ 5. VERIFY    │  "Did it help?"
│  Re-measure  │  Same baseline test, compare
└──────────────┘
The most common mistake: skipping to step 4. Students guess the bottleneck, apply a "fix," and wonder why nothing changed. Always measure first. Always profile before optimizing.
2. Step 1 — Measure: Establish a Baseline
Before changing anything, quantify the problem with numbers.
Frame rate (display applications)
Add a frame counter to your render loop:
struct timespec now, prev;
clock_gettime(CLOCK_MONOTONIC, &prev);
int frame_count = 0;

while (running) {
    render();
    frame_count++;
    clock_gettime(CLOCK_MONOTONIC, &now);
    double elapsed = (now.tv_sec - prev.tv_sec) +
                     (now.tv_nsec - prev.tv_nsec) / 1e9;
    if (elapsed >= 1.0) {
        printf("FPS: %d\n", frame_count);
        frame_count = 0;
        prev = now;
    }
}
Or in Python:
import time

t0 = time.monotonic()
frames = 0
while True:
    process_frame()
    frames += 1
    if time.monotonic() - t0 >= 1.0:
        print(f"FPS: {frames}")
        frames = 0
        t0 = time.monotonic()
CPU usage
# Overall system CPU
mpstat 1
# Per-process CPU
pidstat -p $(pgrep my_app) 1
# Per-thread CPU (inside a process)
pidstat -t -p $(pgrep my_app) 1
Memory usage
# Per-process memory (RSS = physical, VSZ = virtual)
ps -o pid,rss,vsz,comm -p $(pgrep my_app)
# System-wide
free -m
Timing individual stages
Instrument your pipeline by timing each stage:
// Helper: difference between two timespecs, in milliseconds
static double diff_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

struct timespec t0, t1, t2, t3;
clock_gettime(CLOCK_MONOTONIC, &t0);
sensor_read();
clock_gettime(CLOCK_MONOTONIC, &t1);
process_data();
clock_gettime(CLOCK_MONOTONIC, &t2);
render();
clock_gettime(CLOCK_MONOTONIC, &t3);
printf("sensor: %.1f ms  process: %.1f ms  render: %.1f ms\n",
       diff_ms(t0, t1), diff_ms(t1, t2), diff_ms(t2, t3));
Tip
Write timing data to a CSV file, not just the terminal. This lets you plot histograms and calculate percentiles later. See the Jitter Measurement tutorial for the CSV logging pattern.
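As a minimal sketch of that pattern (the file name and per-frame work are placeholders), log one frame duration per line and compute percentiles offline:

```python
import csv
import time

def log_frame_times(process_frame, n_frames, path="frame_times.csv"):
    """Run n_frames iterations and log each frame's duration (ns) to CSV."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["frame", "duration_ns"])
        for i in range(n_frames):
            t0 = time.monotonic_ns()
            process_frame()          # your render / pipeline stage here
            writer.writerow([i, time.monotonic_ns() - t0])
```

The resulting CSV can be fed straight into a histogram or percentile script, which is far more informative than eyeballing terminal output.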
3. Step 2 — Profile: Find Where Time Is Spent
Once you know how slow the system is, find where the time goes.
perf — CPU Profiling
perf samples where the CPU is executing and builds a statistical profile. It answers: "which functions consume the most CPU time?"
# Record 10 seconds of CPU samples for your application
sudo perf record -g -p $(pgrep my_app) -- sleep 10
# Show the results — sorted by CPU time
sudo perf report
Example output:
Overhead Command Shared Object Symbol
-------- ------- ---------------- --------------------------
32.50% my_app my_app [.] render_horizon
18.20% my_app libc.so [.] memcpy
12.80% my_app my_app [.] imu_read_spi
9.40% my_app libSDL2.so [.] SDL_RenderPresent
...
Reading the output: render_horizon takes 32.5% of CPU time. If the app is CPU-bound, this is where optimization matters. If memcpy is 18%, you are copying data unnecessarily.
perf — Counting Events
For quick checks without recording:
# Count cache misses, branch mispredictions, instructions
sudo perf stat -p $(pgrep my_app) -- sleep 5
Performance counter stats for process '1234':
1,245,678,901 instructions
42,345,678 cache-misses (3.4% of cache refs)
1,234,567 branch-misses (0.8% of branches)
5.003 seconds time elapsed
High cache-miss ratio (>5%) suggests data layout problems. High branch-miss ratio (>5%) suggests unpredictable conditional logic.
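To make the data-layout point concrete, here is a hedged C sketch (struct and field names are illustrative): summing one field of an array-of-structs drags every struct's unused bytes through the cache, while a struct-of-arrays keeps the summed field contiguous.

```c
#include <stddef.h>

#define N 1024

/* Array of structs: x values are strided in memory, so a loop over x
 * pulls the unused y/z/pad bytes into the cache as well. */
struct sample_aos { float x, y, z; float pad[13]; };

/* Struct of arrays: each field is contiguous, so a loop over x reads
 * only the bytes it actually needs, causing far fewer cache misses. */
struct samples_soa { float x[N], y[N], z[N]; };

float sum_x_aos(const struct sample_aos *s, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += s[i].x;
    return sum;
}

float sum_x_soa(const struct samples_soa *s, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += s->x[i];
    return sum;
}
```

Both functions compute the same result; perf stat on the AoS version will typically show a much higher cache-miss count for large N.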
strace — System Call Profiling
strace traces every system call your application makes. It answers: "is the app spending time in the kernel?" Attach with -c to count syscalls and print a summary when you stop it with Ctrl-C:
# Attach, count syscalls, print summary on Ctrl-C
sudo strace -c -p $(pgrep my_app)
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
45.32 0.234000 234 1000 read
22.10 0.114000 57 2000 ioctl
15.80 0.081600 41 2000 write
8.50 0.043900 44 1000 clock_nanosleep
Reading the output: If read dominates and each call takes 234 us, the sensor read is blocking. If ioctl dominates, the DRM/KMS calls are slow. If clock_nanosleep dominates, the app is mostly idle (good — means it is not CPU-bound).
strace — Individual Calls
# Show every read() call with timestamps and duration
sudo strace -T -e trace=read -p $(pgrep my_app) 2>&1 | head -20
read(4, "\x00\x01\x02...", 14) = 14 <0.001234>
read(4, "\x00\x01\x02...", 14) = 14 <0.001198>
read(4, "\x00\x01\x02...", 14) = 14 <0.045321> ← outlier!
The <0.045321> shows one read took 45 ms — 40x longer than normal. This is a scheduling delay (the thread was preempted during the SPI transfer).
ftrace — Kernel-Level Tracing
For deep investigation of scheduling, interrupts, and driver behavior:
# Trace a running process with the function_graph tracer for 5 seconds
sudo trace-cmd record -p function_graph -P $(pgrep my_app) sleep 5
sudo trace-cmd report | less
Or use ftrace directly:
# Enable function_graph tracer
echo function_graph | sudo tee /sys/kernel/debug/tracing/current_tracer
echo $PID | sudo tee /sys/kernel/debug/tracing/set_ftrace_pid
# Read trace
sudo cat /sys/kernel/debug/tracing/trace | head -50
ftrace shows function call graphs with execution times:
my_app-1234 [001] 1234.567890: |  spi_sync() {
my_app-1234 [001] 1234.567891: |    spi_transfer_one() {
my_app-1234 [001] 1234.567920: |    } /* 29 us */
my_app-1234 [001] 1234.567921: |  } /* 31 us */
Warning
ftrace has significant overhead. Use it for targeted investigation of specific functions, not for always-on monitoring. Disable it after profiling.
4. Step 3 — Identify: Common Bottleneck Patterns
Once profiling points to a function or syscall, identify the root cause:
Pattern: Blocking I/O in the Render Loop
Symptom: strace -c shows read() taking 30%+ of time. FPS drops when sensor is slow.
Root cause: Sensor read is synchronous in the render thread. When the sensor takes longer than expected (bus contention, scheduling delay), the frame is late.
Fix: Move sensor read to a separate thread. Share data via atomic variable or mutex.
Before: [read sensor]──[process]──[render]──[flip] ← serial, blocked
After: Thread 1: [read]──[read]──[read]──[read] ← independent
Thread 2: [render]──[flip]──[render]──[flip] ← uses last value
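A minimal C sketch of that fix, using C11 atomics and pthreads (sensor_read_raw is a placeholder stub for your blocking driver call):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Latest sensor value, shared lock-free between threads. */
static _Atomic double latest_sample;
static atomic_bool running = true;

/* Placeholder: replace with your blocking driver call (e.g. an SPI read) */
static double sensor_read_raw(void) { return 0.0; }

static void *sensor_thread(void *arg) {
    (void)arg;
    while (atomic_load(&running)) {
        /* The blocking read happens here, off the render thread */
        atomic_store(&latest_sample, sensor_read_raw());
    }
    return NULL;
}

pthread_t start_sensor_thread(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, sensor_thread, NULL);
    return tid;
}

/* Render thread: never blocks on the sensor, just uses the last value */
double get_latest_sample(void) {
    return atomic_load(&latest_sample);
}
```

The render loop calls get_latest_sample() each frame and is never delayed by a slow transfer; if the data is larger than one word, use a mutex or a double-buffered pointer swap instead of a single atomic.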
Pattern: Unnecessary Memory Copies
Symptom: perf report shows memcpy in top 5. High CPU usage for simple rendering.
Root cause: Data is copied between buffers instead of processed in-place. Common in camera pipelines (capture buffer → processing buffer → display buffer).
Fix: Use pointer swapping instead of copying. Map buffers once and operate on them directly.
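The swap itself is trivial; the point is that producer and consumer exchange buffer pointers in O(1) instead of copying a whole frame. A sketch (buffer names are illustrative):

```c
#include <stdint.h>

/* Double buffering by pointer swap: after the producer fills the back
 * buffer, exchange the pointers instead of memcpy()ing the pixels. */
static void swap_buffers(uint8_t **front, uint8_t **back) {
    uint8_t *tmp = *front;
    *front = *back;
    *back = tmp;
}

/* Usage sketch:
 *   fill(back_buf);                       // producer writes new frame
 *   swap_buffers(&front_buf, &back_buf);  // display shows the new frame
 *   // the old front buffer becomes the next back buffer -- no copy
 */
```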
Pattern: Missed VSync
Symptom: Render takes 15 ms, but FPS is 30 instead of 60. Frame timing shows alternating 16.7 ms and 33.4 ms gaps.
Root cause: Render time is close to the 16.7 ms budget. Some frames take slightly longer, missing the VBlank by microseconds, and must wait an entire extra frame period.
Fix: Reduce render complexity (fewer draw calls, simpler geometry), or switch to triple buffering (trades latency for throughput).
Pattern: Python GIL Contention
Symptom: Python app uses 100% of one core but FPS is low. Multi-threaded but not faster.
Root cause: The Global Interpreter Lock (GIL) prevents true parallelism. Threads take turns running Python code.
Fix: Use multiprocessing instead of threading for CPU-bound work. Or move hot loops to C (NumPy/OpenCV operations already bypass the GIL).
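A hedged sketch of the multiprocessing fix (function names and the workload are placeholders): CPU-bound work is farmed out to worker processes, each with its own interpreter and its own GIL.

```python
from multiprocessing import Pool

def process_frame(frame):
    """Placeholder for CPU-bound per-frame work. Runs in a worker
    process, so it is not serialized by the parent's GIL."""
    return sum(x * x for x in frame)

def process_batch(frames, workers=4):
    """Process a batch of frames in parallel across worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(process_frame, frames)

if __name__ == "__main__":
    frames = [list(range(100)) for _ in range(8)]
    results = process_batch(frames)
```

Note the trade-off: frames must be picklable and are copied to the workers, so this pays off only when the per-frame compute dominates the transfer cost.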
Pattern: Excessive Logging
Symptom: strace -c shows write() to stderr/journal taking significant time. CPU spikes correlate with log output.
Root cause: print() or printf() in the hot loop. Each call is a syscall, often with string formatting overhead.
Fix: Log at reduced rate (every 100th frame), or use a ring buffer that writes in batch.
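As a sketch of the every-Nth-frame variant (N and the message format are placeholders), the hot loop's common path becomes a single integer increment:

```python
def make_rate_limited_logger(every_n, sink=print):
    """Return a log function that only emits every Nth call."""
    count = 0
    def log(msg):
        nonlocal count
        count += 1
        if count % every_n == 0:      # cheap check on the common path
            sink(f"[frame {count}] {msg}")
    return log

# Usage: log = make_rate_limited_logger(100); call log(...) every frame.
# Only one call in a hundred reaches the (expensive) formatting + syscall.
```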
Pattern: Thermal Throttling
Symptom: Performance degrades over time. CPU frequency drops. dmesg shows throttling warnings.
Root cause: SoC overheats under sustained load. The kernel reduces CPU frequency to stay within thermal limits.
Fix: Add a heatsink. Reduce sustained CPU load. Check with:
# Monitor CPU frequency and temperature
watch -n 1 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq && \
cat /sys/class/thermal/thermal_zone0/temp'
5. Tool Selection Guide
"My app is slow"
│
├── CPU usage high (>50%)? ──► perf record → perf report
│ "Which function eats the CPU?"
│
├── CPU usage low but FPS low? ──► strace -c
│ "Is it blocking on I/O or sleeping?"
│
├── Occasional spikes / jitter? ──► strace -T or ftrace
│ "What causes the outliers?"
│
├── Slow after sustained load? ──► Check thermal throttling
│ "Is the CPU frequency dropping?"
│
└── Everything looks fast but display stutters? ──► Check VSync
"Missed VBlank deadlines?"
Quick Reference: When to Use Each Tool
| Tool | What It Measures | When to Use | Overhead |
|---|---|---|---|
| time ./my_app | Total wall-clock time | Quick sanity check | None |
| mpstat 1 | Per-core CPU utilization | Always — first tool to run | Negligible |
| pidstat -t | Per-thread CPU and I/O | Multi-threaded apps | Low |
| perf stat | Hardware counters (cache, branches) | Suspected CPU-bound issues | Negligible |
| perf record + report | Function-level CPU profiling | Finding hot functions | Low (~2%) |
| strace -c | Syscall time summary | I/O-heavy or blocking issues | Medium (~10%) |
| strace -T -e read | Individual call durations | Finding slow I/O calls | Medium |
| ftrace / trace-cmd | Kernel function timing | Driver or scheduling issues | High |
| cyclictest | Scheduling latency | RT jitter measurement | Dedicated |
| valgrind --tool=callgrind | Instruction-level profiling | Algorithmic optimization | Very high (20-50x) |
Warning
Do not profile with multiple tools simultaneously. Each tool adds overhead that distorts the others' measurements. Run one tool at a time, against the same workload.
6. Embedded-Specific Profiling Considerations
Cross-compilation and debug symbols
perf report needs debug symbols to show function names. When cross-compiling:
# Build with debug symbols (does not affect performance)
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
RelWithDebInfo keeps optimizations on but includes symbol tables. Do not use Debug for profiling — it disables optimizations and gives misleading results.
perf on Raspberry Pi
perf is available on Raspberry Pi OS; on Debian-based images it is typically packaged as linux-perf (sudo apt install linux-perf). On custom Buildroot images, enable BR2_PACKAGE_LINUX_TOOLS_PERF=y.
Profiling without root
Most perf operations require root unless you relax the kernel.perf_event_paranoid sysctl:
# Allow perf for all users (development only, not production)
sudo sysctl kernel.perf_event_paranoid=-1
Remote profiling
When the Pi has no display (headless):
# Record on Pi
sudo perf record -g -p $(pgrep my_app) -- sleep 10
# Copy to host for analysis
scp linux@pi:perf.data .
perf report -i perf.data
7. Example: Profiling a Slow Camera Pipeline
Walk through the workflow with a concrete example from the Ball Detection tutorial.
Problem: Ball detection runs at 17 FPS. Target is 30 FPS.
Step 1: Measure baseline
# Already built into the tutorial — toggle FPS display with 'f'
# Output: "FPS: 17.2 | Pipeline: blur+morph+circ"
Step 2: Profile
# Which functions take CPU time?
sudo perf record -g -p $(pgrep python3) -- sleep 10
sudo perf report
Overhead Symbol
-------- ------------------------------------------
28.30% cv::morphologyEx
22.10% cv::GaussianBlur
18.50% cv::findContours
12.40% cv::cvtColor
8.20% cv::threshold
Step 3: Identify
morphologyEx (28%) and GaussianBlur (22%) dominate. Together they take 50% of frame time. On a matte black surface with a white ball, high-contrast conditions may not need these preprocessing steps.
Step 4: Fix
Disable morphology (press m) and blur (press b) in the pipeline.
Step 5: Verify
Result: 65% FPS improvement by removing two pipeline stages that were unnecessary for the specific setup. No code optimization needed — the fix was doing less work.
Tip
The cheapest optimization is removing unnecessary work. Profile first — you may find that 40% of CPU time goes to a feature that is not needed for your specific use case.
8. Example: Profiling SDL2 Frame Drops
Problem: SDL2 level display runs at 60 FPS on the bench but drops to 45 FPS when the data logger runs in the background.
Step 1: Measure
# Level display with CSV logging
sudo SDL_VIDEODRIVER=kmsdrm ./level_sdl2 -l test.csv &
# Start the data logger
sudo systemctl start data-logger.service
# Check FPS in the CSV
awk -F, 'NR>1 {print 1e9/$3}' test.csv | sort -n | tail -5
Step 2: Profile
# Per-core CPU utilization
mpstat -P ALL 1
CPU %usr %sys %iowait %idle
0 45.2 12.3 8.1 34.4 ← level_sdl2 + data-logger
1 0.1 0.0 0.0 99.9
2 0.1 0.0 0.0 99.9
3 0.1 0.0 0.0 99.9
Both applications share Core 0. The data logger's disk I/O (%iowait = 8.1%) delays the render thread.
Step 3: Identify
CPU contention on Core 0. The logger's write() calls trigger disk I/O that causes scheduling delays for the render thread.
Step 4: Fix
Pin the logger to a different core:
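One way to apply the fix (a sketch; the service name data-logger.service and the core number are taken from this example and may differ on your system):

```shell
# One-off: move the running logger onto core 1
sudo taskset -cp 1 $(pgrep data-logger)

# Persistent: add a CPUAffinity= drop-in for the systemd service
sudo systemctl edit data-logger.service
#   [Service]
#   CPUAffinity=1
sudo systemctl restart data-logger.service
```

The taskset change lasts only until the process restarts; the systemd drop-in makes it permanent.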
Step 5: Verify
Result: Core partitioning solved the problem. No code changes needed.
9. The Optimization Priority List
When profiling reveals a bottleneck, fix in this order — cheapest and most impactful first:
| Priority | Optimization | Effort | Impact | Example |
|---|---|---|---|---|
| 1 | Remove unnecessary work | Minutes | High | Disable morphology in ball detection |
| 2 | Fix architecture | Hours | High | Move blocking I/O off the render thread |
| 3 | Use the right API | Hours | Medium | DRM page flip instead of fbdev memcpy |
| 4 | Core partitioning | Minutes | Medium | isolcpus, taskset |
| 5 | Algorithm improvement | Hours | Medium | Binary search instead of linear scan |
| 6 | Reduce copies | Hours | Medium | Pointer swap instead of memcpy |
| 7 | DMA for peripherals | Days | Medium | SPI DMA for high-rate sensors |
| 8 | Cache optimization | Days | Low-Medium | Structure packing, prefetch |
| 9 | Assembly / SIMD | Days | Low | NEON intrinsics for image processing |
Warning
Do not jump to priority 7-9 without checking 1-3 first. Most student projects are slow because of architecture (blocking I/O, shared cores) or unnecessary work, not because of cache misses.
Summary
| Step | Tool | Question |
|---|---|---|
| Measure | FPS counter, mpstat, pidstat | How slow is it? |
| Profile | perf record, strace -c | Where does time go? |
| Identify | Pattern matching (see Section 4) | Why is it slow there? |
| Fix | Targeted change (see Priority List) | Change one thing |
| Verify | Same baseline measurement | Did it help? |
The workflow is always the same. The tools change depending on whether the bottleneck is CPU-bound (perf), I/O-bound (strace), scheduling-related (ftrace/cyclictest), or thermal.
The single most important rule: measure before optimizing, measure after optimizing, and change only one thing at a time.