Audio Pipeline Latency — Measure, Understand, Optimize

Time: 60 min | Prerequisites: I2S Audio Visualizer

This tutorial teaches how real-time audio pipelines work on Linux by measuring, calculating, and optimizing end-to-end latency in the Audio Visualizer Full demo. You'll learn where latency comes from, why it matters, and how to trade off latency vs. stability — a core embedded systems skill applicable far beyond audio.

1. The Pipeline

Sound travels through several stages between the microphone and the speaker. Each stage adds delay:

┌─────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌─────────┐
│ I2S Mic │──▶│ ALSA     │──▶│ Ring     │──▶│ DSP      │──▶│ ALSA     │──▶│ Speaker │
│         │   │ Capture  │   │ Buffer   │   │ (filter, │   │ Playback │   │         │
│         │   │ Period   │   │          │   │  EQ, FX) │   │ Buffer   │   │         │
└─────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘   └─────────┘
     │              │              │              │              │              │
     └── analog ────┘── digital ───┘── software ──┘── digital ───┘── analog ────┘

Stage	What happens	Latency source
I2S mic	ADC converts sound to digital	~1 sample (negligible)
ALSA capture period	Kernel collects N samples before waking the app	`N / sample_rate`
Ring buffer	App picks up data on next render frame	0–1 period (depends on timing)
DSP processing	Filters, FFT, EQ, effects	< 1 ms on Pi 4
ALSA playback buffer	Kernel buffers M samples before sending to DAC	`M / sample_rate`
DAC + analog	Digital-to-analog conversion	~1 sample (negligible)

The dominant factors are the ALSA period and buffer sizes, which also set the ring-buffer wait (0–1 period). Conversion and DSP time are negligible by comparison on modern hardware.

2. Calculate Theoretical Latency

The minimum latency is the capture period plus the playback buffer (the ring-buffer wait of 0–1 period comes on top of this):

Latency_min = T_capture + T_playback

T_capture  = period_frames / sample_rate
T_playback = playback_buffer / sample_rate
           ≈ 2 × period_frames / sample_rate  (assuming a 2-period playback buffer)

Latency_min ≈ 3 × period_frames / sample_rate

For the default 1024 frames at 48 kHz:

T_capture  = 1024 / 48000 = 21.3 ms
T_playback = 2048 / 48000 = 42.7 ms
Latency_min ≈ 64 ms
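
The same arithmetic as a small Python sketch, useful for checking other period sizes. The function name and the `playback_periods` parameter are illustrative and not part of the demo; the sketch assumes the playback buffer is a whole number of periods (2 here, as in the formula above).

```python
def pipeline_latency_ms(period_frames, sample_rate=48000, playback_periods=2):
    """Return (capture_ms, playback_ms, total_ms) for one configuration."""
    # One capture period must fill before the app is woken.
    capture_ms = 1000.0 * period_frames / sample_rate
    # Assumption: the playback buffer holds `playback_periods` periods.
    playback_ms = playback_periods * capture_ms
    return capture_ms, playback_ms, capture_ms + playback_ms


if __name__ == "__main__":
    for frames in (2048, 1024, 512, 256, 128, 64):
        cap, buf, total = pipeline_latency_ms(frames)
        print(f"{frames:5d} frames: {cap:5.1f} ms + {buf:5.1f} ms = {total:6.1f} ms")
```

For 1024 frames this gives 21.3 ms + 42.7 ms = 64.0 ms, matching the hand calculation.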

Note

64 ms is perceptible. Clap your hands near the microphone: you'll hear the original clap, then the playback echo ~64 ms later. Musicians typically notice monitoring delays above about 10 ms. For telephony, roughly 150 ms one-way (the ITU-T G.114 guideline) is where conversation starts to feel awkward.

3. Measure It

3.1 Run with default settings

# Build if needed
cd ~/embedded-linux/apps/i2s-audio-viz && make audio_viz_full

# Run in stereo with playback enabled
./audio_viz_full -d hw:1,0 -c 2 -f

Press PLAY to enable audio output. The display shows:

- Vis: capture-to-display latency (visual pipeline)
- FX: capture-to-speaker latency (audio pipeline)

Open the EQ overlay — the bottom status bar shows the latency breakdown:

Cap:21ms + Buf:43ms = 64ms

3.2 Run with low-latency flag

./audio_viz_full -d hw:1,0 -c 2 -f -l

The -l flag reduces the period to 256 samples:

Cap:5ms + Buf:11ms = 16ms

Tip

Hear the difference: with playback ON, clap near the mic.

- Default (64 ms): you hear a distinct echo after your clap
- Low-latency (16 ms): the echo is barely noticeable
- Ultra-low (-n 64, ~4 ms): echo is gone, but you may hear clicks/pops

3.3 Sweep period sizes

Try each and note the latency and audio quality:

Period (`-n`)	Capture	Playback buf	Total	Quality
2048	42.7 ms	~85 ms	~128 ms	Perfect, obvious echo
1024 (default)	21.3 ms	~43 ms	~64 ms	Perfect, noticeable echo
512	10.7 ms	~21 ms	~32 ms	Good, slight echo
256 (`-l`)	5.3 ms	~11 ms	~16 ms	Good, barely perceptible
128	2.7 ms	~5 ms	~8 ms	May glitch on loaded system
64	1.3 ms	~3 ms	~4 ms	Likely glitches (underruns)

Warning

Below ~128 samples, the CPU has less than 2.7 ms to process each block. If the scheduler delays the audio thread (context switch, CPU spike from the display), the playback buffer runs empty → audible pop/click (underrun). This is the fundamental latency–reliability tradeoff.

4. Where the Time Goes

4.1 ALSA Period (Capture)

ALSA captures audio in periods — the kernel fills a buffer of N samples, then wakes the application. The app can't process audio until the period is complete:

Time ──────────────────────────────────────────▶

Mic:  |sample|sample|sample| ... |sample|  ← N samples
      │◄──────── period ────────────►│
                                      │
                                      └─ kernel wakes app HERE
                                         (N/48000 seconds after first sample)

Smaller period = lower latency but higher CPU overhead (more wakeups/second) and more risk of overruns on capture and underruns on playback.
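
To put numbers on the overhead side of that tradeoff, the sketch below computes how often the app is woken and how long it has before the next period arrives. The function name is illustrative, and 48 kHz is assumed as in the rest of this tutorial.

```python
def period_stats(period_frames, sample_rate=48000):
    """Return (wakeups_per_second, deadline_ms) for a given period size."""
    wakeups_per_second = sample_rate / period_frames
    # The app must finish one block before the next period completes.
    deadline_ms = 1000.0 * period_frames / sample_rate
    return wakeups_per_second, deadline_ms


for frames in (1024, 256, 64):
    wakeups, deadline = period_stats(frames)
    print(f"{frames:4d} frames: {wakeups:6.1f} wakeups/s, {deadline:5.2f} ms each")
```

Going from 1024 to 64 frames raises the wakeup rate from about 47 to 750 per second while shrinking each deadline from 21.3 ms to 1.3 ms.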

4.2 ALSA Buffer (Playback)

The playback side has a buffer of typically 2 periods. The app writes a period, ALSA starts playing it. While that plays, the app writes the next period:

ALSA playback buffer: [Period A][Period B]
                       ▲ playing  ▲ app writes here

If the app doesn't write the next period before Period A finishes → underrun (silence gap, audible click).
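
The underrun condition can be sketched as a slack check. This is a deliberately simplified model, not how ALSA accounts for buffer fill: it assumes the buffer is full when the app is woken, treats each late wakeup independently, and uses hypothetical names throughout.

```python
def slack_ms(period_ms, periods_in_buffer=2):
    """Time the app may be late before the buffer runs dry.

    When the app is woken, one period has just finished playing, so
    (periods_in_buffer - 1) periods of audio are still queued.
    """
    return (periods_in_buffer - 1) * period_ms


def count_underruns(period_ms, wake_delays_ms, periods_in_buffer=2):
    """Count wakeups whose scheduling delay exceeds the available slack."""
    limit = slack_ms(period_ms, periods_in_buffer)
    return sum(1 for delay in wake_delays_ms if delay > limit)


# A 3 ms scheduling hiccup is harmless at 1024 frames (21.3 ms period)
# but causes an underrun with a 2 ms period.
print(count_underruns(21.3, [0.5, 3.0, 1.0]))
print(count_underruns(2.0, [0.5, 3.0, 1.0]))
```

Adding a third period to the buffer doubles the slack in this model, at the cost of one more period of latency.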

4.3 Software Processing

All DSP (HP filter, EQ, FFT, effects) runs between capture and playback. At 1024 samples on a Pi 4:

Operation	Typical time
High-pass filter	0.01 ms
8-band biquad EQ	0.05 ms
FFT (1024-point)	0.05 ms
GCC-PHAT	0.2 ms
Voice FX	0.02 ms
SDL2 rendering	2–5 ms
Total	~3 ms

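The DSP stages together take about 0.3 ms, under 2% of the 21.3 ms period budget. Even with SDL2 rendering included (~3 ms total, roughly 14% of the budget), processing is not the bottleneck. The latency is almost entirely ALSA buffering.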
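
As a quick check of the budget, the sketch below sums the DSP rows of the table and compares them to one 1024-frame period at 48 kHz. The numbers are the typical values quoted above, not new measurements.

```python
# Typical per-block DSP times from the table above, in milliseconds.
DSP_MS = {
    "High-pass filter": 0.01,
    "8-band biquad EQ": 0.05,
    "FFT (1024-point)": 0.05,
    "GCC-PHAT": 0.20,
    "Voice FX": 0.02,
}

PERIOD_MS = 1000.0 * 1024 / 48000  # time budget for one period

dsp_total_ms = sum(DSP_MS.values())
dsp_share = dsp_total_ms / PERIOD_MS
print(f"DSP total: {dsp_total_ms:.2f} ms = {dsp_share:.1%} of the period")
```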

5. The Latency–Reliability Tradeoff

This is one of the most important concepts in real-time systems:

Low latency ◄─────────────────────────────────► High reliability
  Small buffers                                   Large buffers
  Tight deadlines                                 Lots of slack
  Glitches possible                               Always smooth

  Professional audio:    Consumer audio:    VoIP:         Streaming:
     ~2-5 ms                ~20-40 ms       ~40-150 ms     ~1-5 sec

Why can't we just use tiny buffers?

Linux is not a hard real-time OS. The scheduler might:

- Run a different process for a few milliseconds
- Handle an interrupt (network, USB, display)
- Stall the audio thread on a page fault or memory reclaim

If the audio thread is delayed by just 3 ms and the period is only 2 ms → underrun.

How professionals solve it

Technique	How	Used in
`SCHED_FIFO`	Real-time scheduler priority	JACK, PipeWire
CPU isolation	Pin audio thread to dedicated core	Pro audio
`PREEMPT_RT`	Real-time kernel patches	Low-latency distros
`memlock`	Prevent audio buffers from being swapped	All pro audio
Kernel tuning	Disable CPU frequency scaling, disable USB polling	Dedicated audio

Tip

Try it: Compare glitch behavior at a small period with and without real-time scheduling:

# Normal scheduling (default)
./audio_viz_full -d hw:1,0 -c 2 -f -n 128

# Real-time scheduling (needs root or rtprio permission)
sudo chrt -f 50 ./audio_viz_full -d hw:1,0 -c 2 -f -n 128

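With chrt -f 50, the program runs under SCHED_FIFO at priority 50: it preempts all normally scheduled tasks and lower-priority real-time tasks, but not interrupt handlers or real-time tasks with a higher priority. You should be able to use smaller periods without glitches.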
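
A program can also request the same policy itself instead of relying on chrt. The sketch below uses Python's `os.sched_setscheduler`, a thin wrapper over the Linux system call of the same name. It is only an illustration of the mechanism: how audio_viz_full sets its own priorities is not covered here, and the function name `try_realtime` is hypothetical.

```python
import os


def try_realtime(priority=50):
    """Try to switch the caller to SCHED_FIFO; return True on success."""
    if not hasattr(os, "sched_setscheduler"):
        return False  # scheduler API not available on this platform
    try:
        # pid 0 means "the caller"; priority range for SCHED_FIFO is 1-99.
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        return False  # typically EPERM: no root or rtprio permission
    return True


if __name__ == "__main__":
    if try_realtime():
        print("running under SCHED_FIFO")
    else:
        print("no real-time permission; staying on the default scheduler")
```

Falling back gracefully matters because most desktop users lack rtprio permission by default, and the program should still run, just with less protection against glitches.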

6. Measuring with a Physical Setup

For a precise end-to-end measurement that includes the analog path:

6.1 Loopback Test

1. Place the speaker close to the microphone (or use a cable)
2. Generate a click/impulse (tap the desk)
3. The visualizer shows the original impulse AND the playback impulse
4. The time difference between them = total pipeline latency
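
If you record the loopback instead of reading it off the display, the time difference can be extracted with a simple threshold detector. The sketch below is one possible approach using a synthetic recording; the function name, threshold, and hold-off values are assumptions to tune for your own setup.

```python
def onset_times_ms(samples, sample_rate=48000, threshold=0.5, holdoff_ms=5.0):
    """Return times (ms) where |sample| reaches the threshold.

    After each detection, further crossings are ignored for holdoff_ms
    so that one impulse's ringing is not counted twice.
    """
    holdoff = int(holdoff_ms * sample_rate / 1000)
    onsets = []
    next_allowed = 0
    for i, value in enumerate(samples):
        if i >= next_allowed and abs(value) >= threshold:
            onsets.append(1000.0 * i / sample_rate)
            next_allowed = i + holdoff
    return onsets


# Synthetic one-second recording: a tap at 10 ms, its playback 64 ms later.
rate = 48000
signal = [0.0] * rate
tap = rate * 10 // 1000           # sample index of the original impulse
echo = tap + rate * 64 // 1000    # sample index of the played-back impulse
signal[tap] = 1.0
signal[echo] = 0.8

first, second = onset_times_ms(signal, rate)
print(f"measured latency: {second - first:.1f} ms")
```

On a real recording the playback impulse is quieter and smeared by the room, so the threshold usually needs lowering and the result is only as precise as the onset detection.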

6.2 Oscilloscope Method

For the most accurate measurement:

1. Feed a square wave into the microphone (or tap to create an impulse)
2. Probe the capture side (the I2S data line or the stimulus signal itself; BCLK toggles continuously and shows no input edge) and the audio output (playback) with an oscilloscope
3. Measure the time between the input edge and the output response

This reveals the true end-to-end latency including kernel scheduling jitter.

7. Connection to Real-Time Systems

This pipeline latency exercise connects directly to the PREEMPT_RT Latency and Jitter Measurement tutorials:

Audio pipeline concept	General RT concept
ALSA period = deadline	Task period
Underrun = missed deadline	Deadline miss
Buffer size = slack	Worst-case execution time margin
`SCHED_FIFO` for audio thread	Real-time scheduling
CPU isolation for audio core	Core affinity
Measuring pipeline latency	Measuring control loop latency

The same principles apply to any real-time embedded system:

- Motor control: sensor → compute → actuator pipeline
- Robotics: camera → vision → motor command pipeline
- Industrial: PLC scan cycle timing
- Automotive: CAN bus message → ECU processing → actuator response

Exercises

Tip

Exercise 1: Latency Budget Table

Fill in this table by running audio_viz_full with different -n values and PLAY enabled. Note the displayed latency and whether you hear glitches:

Period	Calculated latency	Displayed latency	Glitches?
2048	___ ms	___ ms	Y/N
1024	___ ms	___ ms	Y/N
512	___ ms	___ ms	Y/N
256	___ ms	___ ms	Y/N
128	___ ms	___ ms	Y/N
64	___ ms	___ ms	Y/N

Tip

Exercise 2: Real-Time Scheduling

Run with -n 128 both with and without sudo chrt -f 50. Does real-time scheduling eliminate the glitches? Why or why not?

Tip

Exercise 3: CPU Load Test

While audio_viz_full is running with -l (256-frame period), open another terminal and run stress --cpu 4. Does the audio quality degrade? What about with chrt -f 50?

Tip

Exercise 4: Compare Visual vs Audio Latency

With PLAY enabled, clap near the mic and watch the waveform display. The visual response is nearly instant (just the capture period). The audio playback lags behind. Why is the visual latency lower than the audio latency? (Hint: the display doesn't need an output buffer.)

Tip

Exercise 5: Calculate for a Control System

A motor control loop reads an encoder, computes PID, and outputs a PWM signal. The loop runs at 10 kHz (100 µs period). If the computation takes 30 µs:

- What is the theoretical minimum latency from encoder read to PWM update?
- How much slack does the system have?
- What happens if an interrupt takes 80 µs?

Compare this to the audio pipeline: what's the equivalent of "period size" and "buffer size" in the motor controller?