
Performance Profiling for Embedded Linux

Goal: Learn a systematic workflow for finding and fixing performance bottlenecks in your own applications — not just following a guided optimization tutorial, but knowing what tool to use when your project is slow and you do not know why.

Related Tutorials

For guided measurement practice, see: Jitter Measurement | PREEMPT_RT Latency | SPI DMA Optimization

For architectural context, see: Software Architecture | Real-Time Systems | Real-Time Graphics


Your level display stutters. Your ball detection runs at 17 FPS instead of 30. Your launcher takes 3 seconds to reappear after a child exits. Your dashboard's CPU usage is 40% for what should be a trivial render.

The tutorials taught you how to measure jitter and how to apply DMA optimization. But those were guided — someone told you where the problem was and what to optimize. In your own project, nobody tells you. You need a workflow.


1. The Profiling Workflow

Every performance problem follows the same cycle:

  ┌──────────────┐
  │  1. MEASURE  │  "How slow is it?"
  │  Baseline    │  FPS, latency, CPU%, memory
  └──────┬───────┘
  ┌──────▼───────┐
  │  2. PROFILE  │  "Where is time spent?"
  │  Find hotspot│  perf, strace, ftrace
  └──────┬───────┘
  ┌──────▼───────┐
  │  3. IDENTIFY │  "Why is it slow there?"
  │  Root cause  │  Blocking I/O? Missed VSync? Copy?
  └──────┬───────┘
  ┌──────▼───────┐
  │  4. FIX      │  "Change one thing"
  │  Targeted    │  DMA, cache, algorithm, architecture
  └──────┬───────┘
  ┌──────▼───────┐
  │  5. VERIFY   │  "Did it help?"
  │  Re-measure  │  Same baseline test, compare
  └──────────────┘

The most common mistake: skipping to step 4. Students guess the bottleneck, apply a "fix," and wonder why nothing changed. Always measure first. Always profile before optimizing.


2. Step 1 — Measure: Establish a Baseline

Before changing anything, quantify the problem with numbers.

Frame rate (display applications)

Add a frame counter to your render loop:

struct timespec now, prev;
clock_gettime(CLOCK_MONOTONIC, &prev);
int frame_count = 0;

while (running) {
    render();
    frame_count++;

    clock_gettime(CLOCK_MONOTONIC, &now);
    double elapsed = (now.tv_sec - prev.tv_sec) +
                     (now.tv_nsec - prev.tv_nsec) / 1e9;
    if (elapsed >= 1.0) {
        printf("FPS: %d\n", frame_count);
        frame_count = 0;
        prev = now;
    }
}

Or in Python:

import time
t0 = time.monotonic()
frames = 0
while True:
    process_frame()
    frames += 1
    if time.monotonic() - t0 >= 1.0:
        print(f"FPS: {frames}")
        frames = 0
        t0 = time.monotonic()

CPU usage

# Overall system CPU
mpstat 1

# Per-process CPU
pidstat -p $(pgrep my_app) 1

# Per-thread CPU (inside a process)
pidstat -t -p $(pgrep my_app) 1

Memory usage

# Per-process memory (RSS = physical, VSZ = virtual)
ps -o pid,rss,vsz,comm -p $(pgrep my_app)

# System-wide
free -m

Timing individual stages

Instrument your pipeline by timing each stage:

// Helper: difference between two timespecs in milliseconds
static double diff_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

struct timespec t0, t1, t2, t3;

clock_gettime(CLOCK_MONOTONIC, &t0);
sensor_read();
clock_gettime(CLOCK_MONOTONIC, &t1);
process_data();
clock_gettime(CLOCK_MONOTONIC, &t2);
render();
clock_gettime(CLOCK_MONOTONIC, &t3);

printf("sensor: %.1f ms  process: %.1f ms  render: %.1f ms\n",
       diff_ms(t0, t1), diff_ms(t1, t2), diff_ms(t2, t3));

Tip

Write timing data to a CSV file, not just the terminal. This lets you plot histograms and calculate percentiles later. See the Jitter Measurement tutorial for the CSV logging pattern.
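Once the intervals are in a file, percentiles take only a few lines of Python. A minimal sketch (the `frame_stats` helper and the nanosecond-interval input format are illustrative, not part of the tutorial's logger):

```python
import statistics

def frame_stats(intervals_ns):
    """Summarize frame intervals (nanoseconds) as millisecond percentiles."""
    ms = sorted(t / 1e6 for t in intervals_ns)
    n = len(ms)

    def pct(p):
        return ms[min(n - 1, int(p / 100 * n))]

    return {
        "mean_ms": statistics.mean(ms),
        "p50_ms": pct(50),
        "p99_ms": pct(99),
        "max_ms": ms[-1],
    }

# 99 clean 16.7 ms frames plus one 33.4 ms outlier
samples = [16_700_000] * 99 + [33_400_000]
stats = frame_stats(samples)
print(stats["p99_ms"])  # 33.4 -- barely visible in the mean, obvious at p99
```

This is why percentiles beat averages for frame timing: a single doubled frame moves the mean by under 0.2 ms but shows up plainly at p99.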


3. Step 2 — Profile: Find Where Time Is Spent

Once you know how slow the system is, find where the time goes.

perf — CPU Profiling

perf samples where the CPU is executing and builds a statistical profile. It answers: "which functions consume the most CPU time?"

# Record 10 seconds of CPU samples for your application
sudo perf record -g -p $(pgrep my_app) -- sleep 10

# Show the results — sorted by CPU time
sudo perf report

Example output:

  Overhead  Command   Shared Object     Symbol
  --------  -------   ----------------  --------------------------
    32.50%  my_app    my_app            [.] render_horizon
    18.20%  my_app    libc.so           [.] memcpy
    12.80%  my_app    my_app            [.] imu_read_spi
     9.40%  my_app    libSDL2.so        [.] SDL_RenderPresent
     ...

Reading the output: render_horizon takes 32.5% of CPU time. If the app is CPU-bound, this is where optimization matters. If memcpy is 18%, you are copying data unnecessarily.

perf — Counting Events

For quick checks without recording:

# Count cache misses, branch mispredictions, instructions
sudo perf stat -p $(pgrep my_app) -- sleep 5

Example output:

  Performance counter stats for process '1234':

       1,245,678,901      instructions
          42,345,678      cache-misses       (3.4% of cache refs)
           1,234,567      branch-misses      (0.8% of branches)
              5.003       seconds time elapsed

High cache-miss ratio (>5%) suggests data layout problems. High branch-miss ratio (>5%) suggests unpredictable conditional logic.

strace — System Call Profiling

strace traces every system call your application makes. It answers: "is the app spending time in the kernel?"

# Summary: which syscalls take the most time
# (let it run for a few seconds, then press Ctrl-C to print the table)
sudo strace -c -p $(pgrep my_app)

Example output:

  % time     seconds  usecs/call     calls    errors syscall
  ------ ----------- ----------- --------- --------- ----------------
   45.32    0.234000         234      1000           read
   22.10    0.114000          57      2000           ioctl
   15.80    0.081600          41      2000           write
    8.50    0.043900          44      1000           clock_nanosleep

Reading the output: If read dominates and each call takes 234 us, the sensor read is blocking. If ioctl dominates, the DRM/KMS calls are slow. If clock_nanosleep dominates, the app is mostly idle (good — means it is not CPU-bound).

strace — Individual Calls

# Show every read() call with timestamps and duration
sudo strace -T -e trace=read -p $(pgrep my_app) 2>&1 | head -20

Example output:

read(4, "\x00\x01\x02...", 14) = 14 <0.001234>
read(4, "\x00\x01\x02...", 14) = 14 <0.001198>
read(4, "\x00\x01\x02...", 14) = 14 <0.045321>  ← outlier!

The <0.045321> shows one read took 45 ms, nearly 40x longer than the typical 1.2 ms. This is most likely a scheduling delay (the thread was preempted during the SPI transfer).
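Eyeballing twenty lines works; for longer captures, the `<seconds>` suffix is easy to parse automatically. A sketch (the `find_outliers` helper and its 10x-median threshold are my own convention, not an strace feature):

```python
import re

# strace -T appends the call duration as "<seconds>" at end of line
DURATION = re.compile(r"<(\d+\.\d+)>\s*$")

def find_outliers(lines, factor=10.0):
    """Return call durations (seconds) exceeding factor x the median."""
    durs = [float(m.group(1)) for m in map(DURATION.search, lines) if m]
    if not durs:
        return []
    median = sorted(durs)[len(durs) // 2]
    return [d for d in durs if d > factor * median]

log = [
    'read(4, "...", 14) = 14 <0.001234>',
    'read(4, "...", 14) = 14 <0.001198>',
    'read(4, "...", 14) = 14 <0.045321>',
]
print(find_outliers(log))  # [0.045321]
```

Pipe the strace output to a file, run this over it, and you get every outlier instead of the first twenty lines.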

ftrace — Kernel-Level Tracing

For deep investigation of scheduling, interrupts, and driver behavior:

# Record function-graph traces for your process while "sleep 5" runs
# (-P filters on an existing PID; -F would launch and trace a new command)
sudo trace-cmd record -p function_graph -P $(pgrep my_app) sleep 5
sudo trace-cmd report | less

Or use ftrace directly:

# Enable function_graph tracer
echo function_graph | sudo tee /sys/kernel/debug/tracing/current_tracer
echo $PID | sudo tee /sys/kernel/debug/tracing/set_ftrace_pid

# Read trace
sudo cat /sys/kernel/debug/tracing/trace | head -50

ftrace shows function call graphs with execution times:

  my_app-1234  [001]  1234.567890: |  spi_sync() {
  my_app-1234  [001]  1234.567891: |    spi_transfer_one() {
  my_app-1234  [001]  1234.567920: |    } /* 29 us */
  my_app-1234  [001]  1234.567921: |  } /* 31 us */
Warning

ftrace has significant overhead. Use it for targeted investigation of specific functions, not for always-on monitoring. Disable it after profiling.


4. Step 3 — Identify: Common Bottleneck Patterns

Once profiling points to a function or syscall, identify the root cause:

Pattern: Blocking I/O in the Render Loop

Symptom: strace -c shows read() taking 30%+ of time. FPS drops when sensor is slow.

Root cause: Sensor read is synchronous in the render thread. When the sensor takes longer than expected (bus contention, scheduling delay), the frame is late.

Fix: Move sensor read to a separate thread. Share data via atomic variable or mutex.

Before: [read sensor]──[process]──[render]──[flip]  ← serial, blocked
After:  Thread 1: [read]──[read]──[read]──[read]    ← independent
        Thread 2: [render]──[flip]──[render]──[flip]  ← uses last value
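In Python the same split looks like this. A sketch, with `read_sensor()` as a hypothetical stand-in for the blocking transfer:

```python
import threading
import time

latest = {"value": 0.0}      # last good sample, shared between threads
lock = threading.Lock()
stop = threading.Event()

def read_sensor():
    """Hypothetical stand-in for a blocking SPI/I2C transfer."""
    time.sleep(0.002)        # pretend the bus takes ~2 ms
    return time.monotonic()

def sensor_loop():
    while not stop.is_set():
        v = read_sensor()    # may block, but only this thread waits
        with lock:
            latest["value"] = v

t = threading.Thread(target=sensor_loop, daemon=True)
t.start()

for _ in range(3):           # render loop: never blocks on the sensor
    with lock:
        v = latest["value"]  # always the most recent completed sample
    # render(v) would go here
    time.sleep(0.016)        # stand-in for one 60 Hz frame

stop.set()
t.join()
```

The render loop now has a constant frame cost regardless of how long any single sensor read takes; a slow read only makes the displayed value one sample staler.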

Pattern: Unnecessary Memory Copies

Symptom: perf report shows memcpy in top 5. High CPU usage for simple rendering.

Root cause: Data is copied between buffers instead of processed in-place. Common in camera pipelines (capture buffer → processing buffer → display buffer).

Fix: Use pointer swapping instead of copying. Map buffers once and operate on them directly.
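The idea translates to any language: exchange references, not bytes. A Python sketch of the double-buffer swap (buffer size and fill pattern are arbitrary):

```python
# Two preallocated buffers; the "swap" exchanges references, not bytes
front = bytearray(64)   # being displayed
back = bytearray(64)    # being filled with the next frame

def fill(buf, value):
    """Stand-in for the processing stage writing into the back buffer."""
    for i in range(len(buf)):
        buf[i] = value

fill(back, 0xAB)
front, back = back, front   # O(1), no copy of the payload
print(hex(front[0]))        # 0xab -- new data visible without a memcpy
```

In C the equivalent is swapping two `uint8_t *` pointers; the cost is independent of buffer size, whereas a memcpy of a full camera frame scales with resolution.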

Pattern: Missed VSync

Symptom: Render takes 15 ms, but FPS is 30 instead of 60. Frame timing shows alternating 16.7 ms and 33.4 ms gaps.

Root cause: Render time is close to the 16.7 ms budget. Some frames take slightly longer, missing the VBlank by microseconds, and must wait an entire extra frame period.

Fix: Reduce render complexity (fewer draw calls, simpler geometry), or switch to triple buffering (trades latency for throughput).
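The alternating-gap signature can be detected automatically from logged timestamps. A sketch (the `missed_vsync_ratio` helper and its 20% tolerance are illustrative choices):

```python
def missed_vsync_ratio(frame_times, period=1 / 60, tol=0.2):
    """Fraction of frame gaps that landed near 2x the refresh period."""
    gaps = [b - a for a, b in zip(frame_times, frame_times[1:])]
    missed = sum(1 for g in gaps if abs(g - 2 * period) < tol * period)
    return missed / len(gaps) if gaps else 0.0

# Synthetic timestamps with alternating 16.7 ms / 33.4 ms gaps
ts = [0.0]
for i in range(9):
    ts.append(ts[-1] + ((1 / 60) if i % 2 == 0 else (2 / 60)))

print(round(missed_vsync_ratio(ts), 2))  # 0.44 -- nearly half the gaps doubled
```

Run this over the CSV from Step 1: a ratio near 0.5 with a mean render time just under 16.7 ms is the classic missed-VBlank pattern.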

Pattern: Python GIL Contention

Symptom: Python app uses 100% of one core but FPS is low. Multi-threaded but not faster.

Root cause: The Global Interpreter Lock (GIL) prevents true parallelism. Threads take turns running Python code.

Fix: Use multiprocessing instead of threading for CPU-bound work. Or move hot loops to C (NumPy/OpenCV operations already bypass the GIL).
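A minimal sketch of the multiprocessing route (the `busy` workload is a placeholder for your CPU-bound stage):

```python
import multiprocessing as mp

def busy(n):
    """CPU-bound placeholder: with threads this would serialize on the GIL."""
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    # Four worker processes, each with its own interpreter and its own GIL
    with mp.Pool(processes=4) as pool:
        results = pool.map(busy, [200_000] * 4)
    print(len(results))  # 4
```

Note the cost: arguments and results are pickled across process boundaries, so this pays off for chunky work items, not for thousands of tiny calls per frame.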

Pattern: Excessive Logging

Symptom: strace -c shows write() to stderr/journal taking significant time. CPU spikes correlate with log output.

Root cause: print() or printf() in the hot loop. Each call is a syscall, often with string formatting overhead.

Fix: Log at reduced rate (every 100th frame), or use a ring buffer that writes in batch.
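A sketch of the reduced-rate approach (`ThrottledLog` is an illustrative helper, not a library class):

```python
import sys

class ThrottledLog:
    """Emit at most one message per `every` calls, keeping I/O off the hot path."""
    def __init__(self, every=100):
        self.every = every
        self.count = 0

    def log(self, msg):
        """Return True when the message was actually written."""
        self.count += 1
        if self.count % self.every == 0:
            print(msg, file=sys.stderr)   # one write() per `every` frames
            return True
        return False

log = ThrottledLog(every=100)
emitted = sum(log.log(f"frame {i}") for i in range(300))
print(emitted)  # 3 -- only every 100th frame reaches the terminal
```

The counter increment costs nanoseconds; the syscall and string formatting it skips cost microseconds each, which matters 60 times per second.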

Pattern: Thermal Throttling

Symptom: Performance degrades over time. CPU frequency drops. dmesg shows throttling warnings.

Root cause: SoC overheats under sustained load. The kernel reduces CPU frequency to stay within thermal limits.

Fix: Add a heatsink. Reduce sustained CPU load. Check with:

# Monitor CPU frequency and temperature
# Monitor CPU frequency (kHz) and temperature (millidegrees C)
watch -n 1 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq /sys/class/thermal/thermal_zone0/temp'
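Reading those sysfs files from your application lets you log temperature alongside FPS. A sketch; the 80 C soft limit is an assumption for Raspberry Pi-class SoCs, so check your board's actual trip points under /sys/class/thermal/:

```python
def read_millidegrees(path="/sys/class/thermal/thermal_zone0/temp"):
    """thermal_zone*/temp holds the temperature in millidegrees Celsius."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0

def is_throttling_risk(temp_c, limit_c=80.0):
    """Flag temperatures near the throttle point (80 C limit is an assumption)."""
    return temp_c >= limit_c
```

Sample this once per second next to the FPS counter and log both columns; if FPS decay tracks rising temperature, throttling is the root cause, not your code.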

5. Tool Selection Guide

"My app is slow"
├── CPU usage high (>50%)? ──► perf record → perf report
│   "Which function eats the CPU?"
├── CPU usage low but FPS low? ──► strace -c
│   "Is it blocking on I/O or sleeping?"
├── Occasional spikes / jitter? ──► strace -T or ftrace
│   "What causes the outliers?"
├── Slow after sustained load? ──► Check thermal throttling
│   "Is the CPU frequency dropping?"
└── Everything looks fast but display stutters? ──► Check VSync
    "Missed VBlank deadlines?"

Quick Reference: When to Use Each Tool

Tool                        What It Measures                      When to Use                    Overhead
time ./my_app               Total wall-clock time                 Quick sanity check             None
mpstat 1                    Per-core CPU utilization              Always — first tool to run     Negligible
pidstat -t                  Per-thread CPU and I/O                Multi-threaded apps            Low
perf stat                   Hardware counters (cache, branches)   Suspected CPU-bound issues     Negligible
perf record + report        Function-level CPU profiling          Finding hot functions          Low (~2%)
strace -c                   Syscall time summary                  I/O-heavy or blocking issues   Medium (~10%)
strace -T -e read           Individual call durations             Finding slow I/O calls         Medium
ftrace / trace-cmd          Kernel function timing                Driver or scheduling issues    High
cyclictest                  Scheduling latency                    RT jitter measurement          Dedicated
valgrind --tool=callgrind   Instruction-level profiling           Algorithmic optimization       Very high (20-50x)
Warning

Do not profile with multiple tools simultaneously. Each tool adds overhead that distorts the others' measurements. Run one tool at a time, against the same workload.


6. Embedded-Specific Profiling Considerations

Cross-compilation and debug symbols

perf report needs debug symbols to show function names. When cross-compiling:

# Build with debug symbols (does not affect performance)
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo

RelWithDebInfo keeps optimizations on but includes symbol tables. Do not use Debug for profiling — it disables optimizations and gives misleading results.

perf on Raspberry Pi

perf is available on Raspberry Pi OS:

sudo apt install linux-perf

On custom Buildroot images, enable BR2_PACKAGE_LINUX_TOOLS_PERF=y.

Profiling without root

Most perf operations require root or the perf_event_paranoid sysctl:

# Allow perf for all users (development only, not production)
sudo sysctl kernel.perf_event_paranoid=-1

Remote profiling

When the Pi has no display (headless):

# Record on Pi
sudo perf record -g -p $(pgrep my_app) -- sleep 10
# Copy to host for analysis
scp linux@pi:perf.data .
perf report -i perf.data

7. Example: Profiling a Slow Camera Pipeline

Walk through the workflow with a concrete example from the Ball Detection tutorial.

Problem: Ball detection runs at 17 FPS. Target is 30 FPS.

Step 1: Measure baseline

# Already built into the tutorial — toggle FPS display with 'f'
# Output: "FPS: 17.2 | Pipeline: blur+morph+circ"

Step 2: Profile

# Which functions take CPU time?
sudo perf record -g -p $(pgrep python3) -- sleep 10
sudo perf report

Example output:

  Overhead  Symbol
  --------  ------------------------------------------
    28.30%  cv::morphologyEx
    22.10%  cv::GaussianBlur
    18.50%  cv::findContours
    12.40%  cv::cvtColor
     8.20%  cv::threshold

Step 3: Identify

morphologyEx (28%) and GaussianBlur (22%) dominate; together they account for about half of the CPU time. On a matte black surface with a white ball, the contrast may be high enough that these preprocessing steps are unnecessary.

Step 4: Fix

Disable morphology (press m) and blur (press b) in the pipeline.

Step 5: Verify

Before: FPS: 17.2 | Pipeline: blur+morph+circ
After:  FPS: 28.4 | Pipeline: threshold+circ

Result: 65% FPS improvement by removing two pipeline stages that were unnecessary for the specific setup. No code optimization needed — the fix was doing less work.

Tip

The cheapest optimization is removing unnecessary work. Profile first — you may find that 40% of CPU time goes to a feature that is not needed for your specific use case.


8. Example: Profiling SDL2 Frame Drops

Problem: SDL2 level display runs at 60 FPS on the bench but drops to 45 FPS when the data logger runs in the background.

Step 1: Measure

# Level display with CSV logging
sudo SDL_VIDEODRIVER=kmsdrm ./level_sdl2 -l test.csv &

# Start the data logger
sudo systemctl start data-logger.service

# Check FPS in the CSV
awk -F, 'NR>1 {print 1e9/$3}' test.csv | sort -n | tail -5

Step 2: Profile

# Per-core CPU usage
mpstat -P ALL 1 5

Example output:

CPU    %usr   %sys   %iowait   %idle
 0     45.2   12.3     8.1      34.4    ← level_sdl2 + data-logger
 1      0.1    0.0     0.0      99.9
 2      0.1    0.0     0.0      99.9
 3      0.1    0.0     0.0      99.9

Both applications share Core 0. The data logger's disk I/O (%iowait = 8.1%) delays the render thread.

Step 3: Identify

CPU contention on Core 0. The logger's write() calls trigger disk I/O that causes scheduling delays for the render thread.

Step 4: Fix

Pin the logger to a different core:

# In data-logger.service
ExecStart=/usr/bin/taskset -c 2 /usr/local/bin/data_logger

Step 5: Verify

Before: 45 FPS, Core 0 at 65% (shared)
After:  60 FPS, Core 0 at 48%, Core 2 at 12% (separated)

Result: Core partitioning solved the problem. No code changes needed.


9. The Optimization Priority List

When profiling reveals a bottleneck, fix in this order — cheapest and most impactful first:

Priority  Optimization             Effort   Impact      Example
1         Remove unnecessary work  Minutes  High        Disable morphology in ball detection
2         Fix architecture         Hours    High        Move blocking I/O off the render thread
3         Use the right API        Hours    Medium      DRM page flip instead of fbdev memcpy
4         Core partitioning        Minutes  Medium      isolcpus, taskset
5         Algorithm improvement    Hours    Medium      Binary search instead of linear scan
6         Reduce copies            Hours    Medium      Pointer swap instead of memcpy
7         DMA for peripherals      Days     Medium      SPI DMA for high-rate sensors
8         Cache optimization       Days     Low-Medium  Structure packing, prefetch
9         Assembly / SIMD          Days     Low         NEON intrinsics for image processing
Warning

Do not jump to priority 7-9 without checking 1-3 first. Most student projects are slow because of architecture (blocking I/O, shared cores) or unnecessary work, not because of cache misses.


Summary

Step      Tool                                 Question
Measure   FPS counter, mpstat, pidstat         How slow is it?
Profile   perf record, strace -c               Where does time go?
Identify  Pattern matching (see Section 4)     Why is it slow there?
Fix       Targeted change (see Priority List)  Change one thing
Verify    Same baseline measurement            Did it help?

The workflow is always the same. The tools change depending on whether the bottleneck is CPU-bound (perf), I/O-bound (strace), scheduling-related (ftrace/cyclictest), or thermal.

The single most important rule: measure before optimizing, measure after optimizing, and change only one thing at a time.