Performance Profiling for Embedded Linux
Goal: Learn a systematic workflow for finding and fixing performance bottlenecks in your own applications — not just following a guided optimization tutorial, but knowing what tool to use when your project is slow and you do not know why.
Related Tutorials
For guided measurement practice, see: Jitter Measurement | PREEMPT_RT Latency | SPI DMA Optimization
For architectural context, see: Software Architecture | Real-Time Systems | Real-Time Graphics
Your level display stutters. Your ball detection runs at 17 FPS instead of 30. Your launcher takes 3 seconds to reappear after a child exits. Your dashboard's CPU usage is 40% for what should be a trivial render.
The tutorials taught you how to measure jitter and how to apply DMA optimization. But those were guided — someone told you where the problem was and what to optimize. In your own project, nobody tells you. You need a workflow.
1. The Profiling Workflow
Every performance problem follows the same cycle:
┌──────────────┐
│ 1. MEASURE   │  "How slow is it?"
│    Baseline  │  FPS, latency, CPU%, memory
└──────┬───────┘
       │
┌──────▼───────┐
│ 2. PROFILE   │  "Where is time spent?"
│ Find hotspot │  perf, strace, ftrace
└──────┬───────┘
       │
┌──────▼───────┐
│ 3. IDENTIFY  │  "Why is it slow there?"
│  Root cause  │  Blocking I/O? Missed VSync? Copy?
└──────┬───────┘
       │
┌──────▼───────┐
│ 4. FIX       │  "Change one thing"
│   Targeted   │  DMA, cache, algorithm, architecture
└──────┬───────┘
       │
┌──────▼───────┐
│ 5. VERIFY    │  "Did it help?"
│  Re-measure  │  Same baseline test, compare
└──────────────┘
The most common mistake: skipping to step 4. Students guess the bottleneck, apply a "fix," and wonder why nothing changed. Always measure first. Always profile before optimizing.
2. Step 1 — Measure: Establish a Baseline
Before changing anything, quantify the problem with numbers.
Frame rate (display applications)
Add a frame counter to your render loop:
struct timespec now, prev;
clock_gettime(CLOCK_MONOTONIC, &prev);
int frame_count = 0;

while (running) {
    render();
    frame_count++;
    clock_gettime(CLOCK_MONOTONIC, &now);
    double elapsed = (now.tv_sec - prev.tv_sec) +
                     (now.tv_nsec - prev.tv_nsec) / 1e9;
    if (elapsed >= 1.0) {
        printf("FPS: %d\n", frame_count);
        frame_count = 0;
        prev = now;
    }
}
Or in Python:
import time

t0 = time.monotonic()
frames = 0
while True:
    process_frame()
    frames += 1
    if time.monotonic() - t0 >= 1.0:
        print(f"FPS: {frames}")
        frames = 0
        t0 = time.monotonic()
CPU usage
# Overall system CPU
mpstat 1
# Per-process CPU
pidstat -p $(pgrep my_app) 1
# Per-thread CPU (inside a process)
pidstat -t -p $(pgrep my_app) 1
Memory usage
# Per-process memory (RSS = physical, VSZ = virtual)
ps -o pid,rss,vsz,comm -p $(pgrep my_app)
# System-wide
free -m
Timing individual stages
Instrument your pipeline by timing each stage:
// Helper: difference between two timespecs, in milliseconds
static double diff_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

struct timespec t0, t1, t2, t3;
clock_gettime(CLOCK_MONOTONIC, &t0);
sensor_read();
clock_gettime(CLOCK_MONOTONIC, &t1);
process_data();
clock_gettime(CLOCK_MONOTONIC, &t2);
render();
clock_gettime(CLOCK_MONOTONIC, &t3);
printf("sensor: %.1f ms  process: %.1f ms  render: %.1f ms\n",
       diff_ms(t0, t1), diff_ms(t1, t2), diff_ms(t2, t3));
Tip
Write timing data to a CSV file, not just the terminal. This lets you plot histograms and calculate percentiles later. See the Jitter Measurement tutorial for the CSV logging pattern.
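As a minimal sketch of that pattern (the file name and per-frame work are placeholders), log one frame duration per line and compute percentiles offline:

```python
import csv
import time

def log_frame_times(process_frame, n_frames, path="frame_times.csv"):
    """Run n_frames iterations and log each frame's duration (ns) to CSV."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["frame", "duration_ns"])
        for i in range(n_frames):
            t0 = time.monotonic_ns()
            process_frame()          # your render / pipeline stage here
            writer.writerow([i, time.monotonic_ns() - t0])
```

The resulting CSV can be fed straight into a histogram or percentile script, which is far more informative than eyeballing terminal output.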
3. Step 2 — Profile: Find Where Time Is Spent
Once you know how slow the system is, find where the time goes.
perf — CPU Profiling
perf samples where the CPU is executing and builds a statistical profile. It answers: "which functions consume the most CPU time?"
# Record 10 seconds of CPU samples for your application
sudo perf record -g -p $(pgrep my_app) -- sleep 10
# Show the results — sorted by CPU time
sudo perf report
Example output:
Overhead Command Shared Object Symbol
-------- ------- ---------------- --------------------------
32.50% my_app my_app [.] render_horizon
18.20% my_app libc.so [.] memcpy
12.80% my_app my_app [.] imu_read_spi
9.40% my_app libSDL2.so [.] SDL_RenderPresent
...
Reading the output: render_horizon takes 32.5% of CPU time. If the app is CPU-bound, this is where optimization matters. If memcpy is 18%, you are copying data unnecessarily.
perf — Counting Events
For quick checks without recording:
# Count cache misses, branch mispredictions, instructions
sudo perf stat -p $(pgrep my_app) -- sleep 5
Performance counter stats for process '1234':
1,245,678,901 instructions
42,345,678 cache-misses (3.4% of cache refs)
1,234,567 branch-misses (0.8% of branches)
5.003 seconds time elapsed
High cache-miss ratio (>5%) suggests data layout problems. High branch-miss ratio (>5%) suggests unpredictable conditional logic.
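To make the data-layout point concrete, here is a hedged C sketch (struct and field names are illustrative): summing one field of an array-of-structs drags every struct's unused bytes through the cache, while a struct-of-arrays keeps the summed field contiguous.

```c
#include <stddef.h>

#define N 1024

/* Array of structs: x values are strided in memory, so a loop over x
 * pulls the unused y/z/pad bytes into the cache as well. */
struct sample_aos { float x, y, z; float pad[13]; };

/* Struct of arrays: each field is contiguous, so a loop over x reads
 * only the bytes it actually needs, causing far fewer cache misses. */
struct samples_soa { float x[N], y[N], z[N]; };

float sum_x_aos(const struct sample_aos *s, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += s[i].x;
    return sum;
}

float sum_x_soa(const struct samples_soa *s, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += s->x[i];
    return sum;
}
```

Both functions compute the same result; perf stat on the AoS version will typically show a much higher cache-miss count for large N.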
strace — System Call Profiling
strace traces every system call your application makes. It answers: "is the app spending time in the kernel?" Attach with -c to count syscalls and print a summary when you stop it with Ctrl-C:
# Attach, count syscalls, print summary on Ctrl-C
sudo strace -c -p $(pgrep my_app)
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
45.32 0.234000 234 1000 read
22.10 0.114000 57 2000 ioctl
15.80 0.081600 41 2000 write
8.50 0.043900 44 1000 clock_nanosleep
Reading the output: If read dominates and each call takes 234 us, the sensor read is blocking. If ioctl dominates, the DRM/KMS calls are slow. If clock_nanosleep dominates, the app is mostly idle (good — means it is not CPU-bound).
strace — Individual Calls
# Show every read() call with timestamps and duration
sudo strace -T -e trace=read -p $(pgrep my_app) 2>&1 | head -20
read(4, "\x00\x01\x02...", 14) = 14 <0.001234>
read(4, "\x00\x01\x02...", 14) = 14 <0.001198>
read(4, "\x00\x01\x02...", 14) = 14 <0.045321> ← outlier!
The <0.045321> shows one read took 45 ms — 40x longer than normal. This is a scheduling delay (the thread was preempted during the SPI transfer).
ftrace — Kernel-Level Tracing
For deep investigation of scheduling, interrupts, and driver behavior:
# Trace a running process with the function_graph tracer for 5 seconds
sudo trace-cmd record -p function_graph -P $(pgrep my_app) sleep 5
sudo trace-cmd report | less
Or use ftrace directly:
# Enable function_graph tracer
echo function_graph | sudo tee /sys/kernel/debug/tracing/current_tracer
echo $PID | sudo tee /sys/kernel/debug/tracing/set_ftrace_pid
# Read trace
sudo cat /sys/kernel/debug/tracing/trace | head -50
ftrace shows function call graphs with execution times:
my_app-1234 [001] 1234.567890: |  spi_sync() {
my_app-1234 [001] 1234.567891: |    spi_transfer_one() {
my_app-1234 [001] 1234.567920: |    } /* 29 us */
my_app-1234 [001] 1234.567921: |  } /* 31 us */
Warning
ftrace has significant overhead. Use it for targeted investigation of specific functions, not for always-on monitoring. Disable it after profiling.
4. Step 3 — Identify: Common Bottleneck Patterns
Once profiling points to a function or syscall, identify the root cause:
Pattern: Blocking I/O in the Render Loop
Symptom: strace -c shows read() taking 30%+ of time. FPS drops when sensor is slow.
Root cause: Sensor read is synchronous in the render thread. When the sensor takes longer than expected (bus contention, scheduling delay), the frame is late.
Fix: Move sensor read to a separate thread. Share data via atomic variable or mutex.
Before: [read sensor]──[process]──[render]──[flip] ← serial, blocked
After: Thread 1: [read]──[read]──[read]──[read] ← independent
Thread 2: [render]──[flip]──[render]──[flip] ← uses last value
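A minimal C sketch of that fix, using C11 atomics and pthreads (sensor_read_raw is a placeholder stub for your blocking driver call):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Latest sensor value, shared lock-free between threads. */
static _Atomic double latest_sample;
static atomic_bool running = true;

/* Placeholder: replace with your blocking driver call (e.g. an SPI read) */
static double sensor_read_raw(void) { return 0.0; }

static void *sensor_thread(void *arg) {
    (void)arg;
    while (atomic_load(&running)) {
        /* The blocking read happens here, off the render thread */
        atomic_store(&latest_sample, sensor_read_raw());
    }
    return NULL;
}

pthread_t start_sensor_thread(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, sensor_thread, NULL);
    return tid;
}

/* Render thread: never blocks on the sensor, just uses the last value */
double get_latest_sample(void) {
    return atomic_load(&latest_sample);
}
```

The render loop calls get_latest_sample() each frame and is never delayed by a slow transfer; if the data is larger than one word, use a mutex or a double-buffered pointer swap instead of a single atomic.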
Pattern: Unnecessary Memory Copies
Symptom: perf report shows memcpy in top 5. High CPU usage for simple rendering.
Root cause: Data is copied between buffers instead of processed in-place. Common in camera pipelines (capture buffer → processing buffer → display buffer).
Fix: Use pointer swapping instead of copying. Map buffers once and operate on them directly.
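The swap itself is trivial; the point is that producer and consumer exchange buffer pointers in O(1) instead of copying a whole frame. A sketch (buffer names are illustrative):

```c
#include <stdint.h>

/* Double buffering by pointer swap: after the producer fills the back
 * buffer, exchange the pointers instead of memcpy()ing the pixels. */
static void swap_buffers(uint8_t **front, uint8_t **back) {
    uint8_t *tmp = *front;
    *front = *back;
    *back = tmp;
}

/* Usage sketch:
 *   fill(back_buf);                       // producer writes new frame
 *   swap_buffers(&front_buf, &back_buf);  // display shows the new frame
 *   // the old front buffer becomes the next back buffer -- no copy
 */
```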
Pattern: Missed VSync
Symptom: Render takes 15 ms, but FPS is 30 instead of 60. Frame timing shows alternating 16.7 ms and 33.4 ms gaps.
Root cause: Render time is close to the 16.7 ms budget. Some frames take slightly longer, missing the VBlank by microseconds, and must wait an entire extra frame period.
Fix: Reduce render complexity (fewer draw calls, simpler geometry), or switch to triple buffering (trades latency for throughput).
Pattern: Python GIL Contention
Symptom: Python app uses 100% of one core but FPS is low. Multi-threaded but not faster.
Root cause: The Global Interpreter Lock (GIL) prevents true parallelism. Threads take turns running Python code.
Fix: Use multiprocessing instead of threading for CPU-bound work. Or move hot loops to C (NumPy/OpenCV operations already bypass the GIL).
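A hedged sketch of the multiprocessing fix (function names and the workload are placeholders): CPU-bound work is farmed out to worker processes, each with its own interpreter and its own GIL.

```python
from multiprocessing import Pool

def process_frame(frame):
    """Placeholder for CPU-bound per-frame work. Runs in a worker
    process, so it is not serialized by the parent's GIL."""
    return sum(x * x for x in frame)

def process_batch(frames, workers=4):
    """Process a batch of frames in parallel across worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(process_frame, frames)

if __name__ == "__main__":
    frames = [list(range(100)) for _ in range(8)]
    results = process_batch(frames)
```

Note the trade-off: frames must be picklable and are copied to the workers, so this pays off only when the per-frame compute dominates the transfer cost.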
Pattern: Excessive Logging
Symptom: strace -c shows write() to stderr/journal taking significant time. CPU spikes correlate with log output.
Root cause: print() or printf() in the hot loop. Each call is a syscall, often with string formatting overhead.
Fix: Log at reduced rate (every 100th frame), or use a ring buffer that writes in batch.
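As a sketch of the every-Nth-frame variant (N and the message format are placeholders), the hot loop's common path becomes a single integer increment:

```python
def make_rate_limited_logger(every_n, sink=print):
    """Return a log function that only emits every Nth call."""
    count = 0
    def log(msg):
        nonlocal count
        count += 1
        if count % every_n == 0:      # cheap check on the common path
            sink(f"[frame {count}] {msg}")
    return log

# Usage: log = make_rate_limited_logger(100); call log(...) every frame.
# Only one call in a hundred reaches the (expensive) formatting + syscall.
```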
Pattern: Thermal Throttling
Symptom: Performance degrades over time. CPU frequency drops. dmesg shows throttling warnings.
Root cause: SoC overheats under sustained load. The kernel reduces CPU frequency to stay within thermal limits.
Fix: Add a heatsink. Reduce sustained CPU load. Check with:
# Monitor CPU frequency and temperature
watch -n 1 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq && \
cat /sys/class/thermal/thermal_zone0/temp'
5. Tool Selection Guide
"My app is slow"
│
├── CPU usage high (>50%)? ──► perf record → perf report
│ "Which function eats the CPU?"
│
├── CPU usage low but FPS low? ──► strace -c
│ "Is it blocking on I/O or sleeping?"
│
├── Occasional spikes / jitter? ──► strace -T or ftrace
│ "What causes the outliers?"
│
├── Slow after sustained load? ──► Check thermal throttling
│ "Is the CPU frequency dropping?"
│
└── Everything looks fast but display stutters? ──► Check VSync
"Missed VBlank deadlines?"
Quick Reference: When to Use Each Tool
| Tool | What It Measures | When to Use | Overhead |
|---|---|---|---|
| time ./my_app | Total wall-clock time | Quick sanity check | None |
| mpstat 1 | Per-core CPU utilization | Always — first tool to run | Negligible |
| pidstat -t | Per-thread CPU and I/O | Multi-threaded apps | Low |
| perf stat | Hardware counters (cache, branches) | Suspected CPU-bound issues | Negligible |
| perf record + report | Function-level CPU profiling | Finding hot functions | Low (~2%) |
| strace -c | Syscall time summary | I/O-heavy or blocking issues | Medium (~10%) |
| strace -T -e read | Individual call durations | Finding slow I/O calls | Medium |
| ftrace / trace-cmd | Kernel function timing | Driver or scheduling issues | High |
| cyclictest | Scheduling latency | RT jitter measurement | Dedicated |
| valgrind --tool=callgrind | Instruction-level profiling | Algorithmic optimization | Very high (20-50x) |
Warning
Do not profile with multiple tools simultaneously. Each tool adds overhead that distorts the others' measurements. Run one tool at a time, against the same workload.
6. Embedded-Specific Profiling Considerations
Cross-compilation and debug symbols
perf report needs debug symbols to show function names. When cross-compiling:
# Build with debug symbols (does not affect performance)
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
RelWithDebInfo keeps optimizations on but includes symbol tables. Do not use Debug for profiling — it disables optimizations and gives misleading results.
perf on Raspberry Pi
perf is available on Raspberry Pi OS; on Debian-based images it is typically packaged as linux-perf (sudo apt install linux-perf). On custom Buildroot images, enable BR2_PACKAGE_LINUX_TOOLS_PERF=y.
Profiling without root
Most perf operations require root unless you relax the kernel.perf_event_paranoid sysctl:
# Allow perf for all users (development only, not production)
sudo sysctl kernel.perf_event_paranoid=-1
Remote profiling
When the Pi has no display (headless):
# Record on Pi
sudo perf record -g -p $(pgrep my_app) -- sleep 10
# Copy to host for analysis
scp linux@pi:perf.data .
perf report -i perf.data
7. Example: Profiling a Slow Camera Pipeline
Walk through the workflow with a concrete example from the Ball Detection tutorial.
Problem: Ball detection runs at 17 FPS. Target is 30 FPS.
Step 1: Measure baseline
# Already built into the tutorial — toggle FPS display with 'f'
# Output: "FPS: 17.2 | Pipeline: blur+morph+circ"
Step 2: Profile
# Which functions take CPU time?
sudo perf record -g -p $(pgrep python3) -- sleep 10
sudo perf report
Overhead Symbol
-------- ------------------------------------------
28.30% cv::morphologyEx
22.10% cv::GaussianBlur
18.50% cv::findContours
12.40% cv::cvtColor
8.20% cv::threshold
Step 3: Identify
morphologyEx (28%) and GaussianBlur (22%) dominate. Together they take 50% of frame time. On a matte black surface with a white ball, high-contrast conditions may not need these preprocessing steps.
Step 4: Fix
Disable morphology (press m) and blur (press b) in the pipeline.
Step 5: Verify
Result: 65% FPS improvement by removing two pipeline stages that were unnecessary for the specific setup. No code optimization needed — the fix was doing less work.
Tip
The cheapest optimization is removing unnecessary work. Profile first — you may find that 40% of CPU time goes to a feature that is not needed for your specific use case.
8. Example: Profiling SDL2 Frame Drops
Problem: SDL2 level display runs at 60 FPS on the bench but drops to 45 FPS when the data logger runs in the background.
Step 1: Measure
# Level display with CSV logging
sudo SDL_VIDEODRIVER=kmsdrm ./level_sdl2 -l test.csv &
# Start the data logger
sudo systemctl start data-logger.service
# Check FPS in the CSV
awk -F, 'NR>1 {print 1e9/$3}' test.csv | sort -n | tail -5
Step 2: Profile
# Per-core CPU utilization
mpstat -P ALL 1
CPU %usr %sys %iowait %idle
0 45.2 12.3 8.1 34.4 ← level_sdl2 + data-logger
1 0.1 0.0 0.0 99.9
2 0.1 0.0 0.0 99.9
3 0.1 0.0 0.0 99.9
Both applications share Core 0. The data logger's disk I/O (%iowait = 8.1%) delays the render thread.
Step 3: Identify
CPU contention on Core 0. The logger's write() calls trigger disk I/O that causes scheduling delays for the render thread.
Step 4: Fix
Pin the logger to a different core:
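One way to apply the fix (a sketch; the service name data-logger.service and the core number are taken from this example and may differ on your system):

```shell
# One-off: move the running logger onto core 1
sudo taskset -cp 1 $(pgrep data-logger)

# Persistent: add a CPUAffinity= drop-in for the systemd service
sudo systemctl edit data-logger.service
#   [Service]
#   CPUAffinity=1
sudo systemctl restart data-logger.service
```

The taskset change lasts only until the process restarts; the systemd drop-in makes it permanent.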
Step 5: Verify
Result: Core partitioning solved the problem. No code changes needed.
9. The Optimization Priority List
When profiling reveals a bottleneck, fix in this order — cheapest and most impactful first:
| Priority | Optimization | Effort | Impact | Example |
|---|---|---|---|---|
| 1 | Remove unnecessary work | Minutes | High | Disable morphology in ball detection |
| 2 | Fix architecture | Hours | High | Move blocking I/O off the render thread |
| 3 | Use the right API | Hours | Medium | DRM page flip instead of fbdev memcpy |
| 4 | Core partitioning | Minutes | Medium | isolcpus, taskset |
| 5 | Algorithm improvement | Hours | Medium | Binary search instead of linear scan |
| 6 | Reduce copies | Hours | Medium | Pointer swap instead of memcpy |
| 7 | DMA for peripherals | Days | Medium | SPI DMA for high-rate sensors |
| 8 | Cache optimization | Days | Low-Medium | Structure packing, prefetch |
| 9 | Assembly / SIMD | Days | Low | NEON intrinsics for image processing |
Warning
Do not jump to priority 7-9 without checking 1-3 first. Most student projects are slow because of architecture (blocking I/O, shared cores) or unnecessary work, not because of cache misses.
Summary
| Step | Tool | Question |
|---|---|---|
| Measure | FPS counter, mpstat, pidstat | How slow is it? |
| Profile | perf record, strace -c | Where does time go? |
| Identify | Pattern matching (see Section 4) | Why is it slow there? |
| Fix | Targeted change (see Priority List) | Change one thing |
| Verify | Same baseline measurement | Did it help? |
The workflow is always the same. The tools change depending on whether the bottleneck is CPU-bound (perf), I/O-bound (strace), scheduling-related (ftrace/cyclictest), or thermal.
The single most important rule: measure before optimizing, measure after optimizing, and change only one thing at a time.