
I2S Microphone Audio Visualizer

Time estimate: ~90 minutes
Prerequisites: SDL2 Level Display, Display Setup

Learning Objectives

By the end of this tutorial you will be able to:

  • Configure an I2S MEMS microphone on Raspberry Pi using Device Tree overlays
  • Capture audio through ALSA in a dedicated thread
  • Apply real-time DSP: high-pass filtering, windowing, FFT
  • Render waveform, spectrum, and scrolling spectrogram with SDL2
  • (Optional) Use GCC-PHAT cross-correlation on two synchronized channels to estimate sound direction

Why I2S?

The Raspberry Pi has no analogue audio input. I2S (Inter-IC Sound) is a digital serial bus designed for audio — it carries a bit clock (BCLK), word select (LRCLK), and data (DOUT) directly from the microphone's ADC. No external codec needed.

Common I2S MEMS microphones like the INMP441 or SPH0645 connect directly to the Pi's PCM/I2S pins and appear as an ALSA capture device once the overlay is loaded. This tutorial builds a real-time audio visualization app on top of that capture stream.


Architecture Overview

┌───────────────┐    ┌──────────────┐    ┌───────────────┐
│  I2S MEMS Mic │───▶│ ALSA Capture │───▶│  Ring Buffer  │
│  (INMP441)    │    │  (thread)    │    │  (lock-free)  │
└───────────────┘    └──────────────┘    └───────┬───────┘
                                        ┌────────────────┐
                                        │  DSP Pipeline  │
                                        │  HP filter     │
                                        │  Hann window   │
                                        │  FFT (FFTW3)   │
                                        │  GCC-PHAT (2ch)│
                                        └────────┬───────┘
                                        ┌────────────────┐
                                        │  SDL2 Renderer │
                                        │  Waveform      │
                                        │  Spectrum bars │
                                        │  Spectrogram   │
                                        │  Direction (2ch)│
                                        └────────────────┘

The design separates audio capture from rendering using a ring buffer and two threads — the same pattern used in the Level Display tutorial for sensor data. The audio thread pushes blocks of samples; the render thread processes and draws the latest block each frame.


1. Hardware Setup

Wiring

Connect an INMP441 (or SPH0645) I2S microphone to the Pi's GPIO header:

| Mic Pin | Pi GPIO | Function |
|---------|---------|----------|
| VDD | 3.3V | Power |
| GND | GND | Ground |
| SCK | GPIO 18 | Bit clock (BCLK) |
| WS | GPIO 19 | Word select / LRCLK |
| SD | GPIO 20 | Data out (DOUT) |
| L/R | GND | Channel select: GND = left, VDD = right |
Warning

The INMP441 is a 3.3V device. Do not connect VDD to 5V.

For two-channel capture (stereo), connect a second microphone to the same BCLK/WS/SD lines but tie its L/R pin to VDD (right channel). Both mics share the same data line — one transmits on the left slot, the other on the right.

Enable the I2S Overlay

Add the microphone overlay to /boot/firmware/config.txt:

# For a simple I2S mic (mono or stereo pair on one data line)
echo "dtoverlay=googlevoicehat-soundcard" | sudo tee -a /boot/firmware/config.txt
Tip

The googlevoicehat-soundcard overlay is the simplest way to enable I2S capture on the Pi — it configures the PCM/I2S interface for digital microphones without an external codec. Alternatively, you can use dtoverlay=i2s-mmap or a custom overlay. Check ls /boot/firmware/overlays/ for available options.

Reboot:

sudo reboot

After reboot, verify the sound card appeared:

arecord -l

Expected output:

card 1: sndrpigooglevoi [snd_rpi_googlevoicehat_soundcar], device 0: ...
  Subdevices: 1/1
Checkpoint — I2S Mic Detected

arecord -l shows the I2S sound card. If not, check wiring and the overlay.


2. Test Capture with ALSA

Before writing code, verify audio capture works:

# Record 5 seconds of audio
arecord -D hw:1,0 -f S32_LE -r 48000 -c 2 -d 5 test.wav

# Play back (connect headphones or use another device)
aplay test.wav
Note

I2S mics typically output 24-bit audio in a 32-bit container (S32_LE). The app converts to float internally. If hw:1,0 does not work, check the card number from arecord -l and adjust.
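The note above is worth making concrete: before any DSP, each 32-bit sample is rescaled to a float. A minimal sketch (the helper name is hypothetical, not from the app's source):

```c
#include <stdint.h>

/* 24-bit audio arrives left-justified in a 32-bit word; dividing by
 * 2^31 maps full scale to the range [-1.0, 1.0). */
static inline float s32_to_float(int32_t s)
{
    return (float)s / 2147483648.0f;   /* 2^31 */
}
```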

Check the signal:

# Show peak levels (Ctrl+C to stop)
arecord -D hw:1,0 -f S32_LE -r 48000 -c 2 -V mono /dev/null

You should see the VU meter respond to sound. Clap your hands — the meter should spike.


3. Install Dependencies

sudo apt install libsdl2-dev libasound2-dev libfftw3-dev

These provide:

| Library | Purpose |
|---------|---------|
| SDL2 | Window, renderer, event loop |
| ALSA (libasound2) | Audio capture API |
| FFTW3 (libfftw3f) | Single-precision FFT |

4. Build the Visualizer

The source is in src/embedded-linux/apps/i2s-audio-viz/audio_viz.c:

cd ~/embedded-linux/apps/i2s-audio-viz
make audio_viz

Or build manually:

gcc -Wall -O2 $(sdl2-config --cflags) -o audio_viz audio_viz.c \
    $(sdl2-config --libs) -lasound -lfftw3f -lm -lpthread

5. Run

# Mono (1 mic); -d overrides the default device hw:0
./audio_viz -d hw:1,0 -c 1

# Stereo with direction detection (2 mics on same I2S line)
./audio_viz -d hw:1,0 -c 2 -m 0.06

# Boost quiet mic signal
./audio_viz -d hw:1,0 -c 1 -g 8.0

# Longer waveform view (200 ms)
./audio_viz -d hw:1,0 -c 1 -w 200

# Fine frequency resolution (4096-point FFT)
./audio_viz -d hw:1,0 -c 1 -n 4096

# Custom sample rate
./audio_viz -d hw:1,0 -r 16000 -n 512

Command-line options:

| Flag | Default | Description |
|------|---------|-------------|
| -d | hw:0 | ALSA device |
| -r | 48000 | Sample rate (Hz) |
| -c | 1 | Channels (1 = mono, 2 = stereo) |
| -n | 1024 | Period size / FFT window |
| -g | 4.0 | Software gain (I2S mics are often quiet) |
| -w | 50 | Waveform display length in ms |
| -m | 0.06 | Mic spacing in metres (for direction) |

Press Q or close the window to exit.

Checkpoint — Visualizer Running

You should see: waveform at top, RMS/peak meter, FFT bar spectrum, and scrolling spectrogram. Speak or clap — the display should react in real time.


6. How It Works

Audio Thread and Ring Buffer

The audio thread runs snd_pcm_readi() in a tight loop, capturing one period (1024 frames) at a time. Each block is written to a ring buffer protected by a mutex:

snd_pcm_sframes_t n = snd_pcm_readi(pcm, tmp, period_frames);

pthread_mutex_lock(&ring_mtx);
memcpy(ring_buf + ring_write * slot_size, tmp, n * channels * sizeof(float));
ring_write = (ring_write + 1) % RING_SLOTS;
pthread_mutex_unlock(&ring_mtx);

If the ring fills up (render thread is too slow), the oldest block is dropped. This prevents the audio thread from blocking — it always keeps capturing.
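A matching consumer for this push loop might look like the following hypothetical sketch, assuming the ring stores float blocks and the reader only wants the newest one:

```c
#include <pthread.h>
#include <string.h>

#define RING_SLOTS 8

/* Hypothetical consumer: copy out the newest block and mark everything
 * read. Stale blocks between ring_read and the newest one are dropped,
 * mirroring the producer's drop-oldest policy. Returns 1 if a block
 * was available, 0 otherwise. */
static int ring_pop_latest(const float *ring_buf, size_t slot_size,
                           int *ring_read, int ring_write,
                           pthread_mutex_t *mtx, float *out)
{
    int got = 0;
    pthread_mutex_lock(mtx);
    if (*ring_read != ring_write) {
        int newest = (ring_write + RING_SLOTS - 1) % RING_SLOTS;
        memcpy(out, ring_buf + (size_t)newest * slot_size,
               slot_size * sizeof(float));
        *ring_read = ring_write;   /* everything up to here is consumed */
        got = 1;
    }
    pthread_mutex_unlock(mtx);
    return got;
}
```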

Why Not SDL_AudioSpec?

SDL2 has its own audio API (SDL_OpenAudioDevice), but for I2S microphones on Linux, using ALSA directly gives you more control: you can specify the exact hardware device, format negotiation, and period size. SDL2's audio backend on Linux wraps ALSA anyway.

DSP Pipeline

Each block goes through these steps:

1. High-pass filter — removes DC offset and low-frequency rumble. I2S mics often have a significant DC component:

/* 1st-order IIR high-pass: y[n] = α(y[n-1] + x[n] - x[n-1]) */
float y = alpha * (*y_prev + x - *x_prev);

With α = 0.995 at 48 kHz, the cutoff is approximately 38 Hz, which removes DC and subsonic rumble while leaving voice untouched.

2. Hann window — multiplies the block by a raised-cosine window to reduce spectral leakage before FFT:

out[i] = in[i] * 0.5f * (1.0f - cosf(2.0f * (float)M_PI * i / (N - 1)));

Without windowing, the FFT of a rectangular block produces wide spectral lobes that smear energy across bins.

3. FFT — FFTW3 computes the real-to-complex DFT. The output has N/2 + 1 complex bins:

fftwf_execute(plan_fwd);  /* out-of-place: fft_in → fft_out */

4. Magnitude in dB — each bin's magnitude is converted to decibels for display:

float mag = sqrtf(re*re + im*im) / N;
float db  = 20.0f * log10f(mag + 1e-10f);

The 1e-10 prevents log10(0).

Visualization

The render loop draws four panels:

| Panel | What it shows |
|-------|---------------|
| Waveform | Raw time-domain signal (amplitude vs. sample index) |
| Level meter | RMS (green bar) and peak (red marker) in dB |
| Spectrum | FFT magnitude bars, colour-coded by level |
| Spectrogram | Scrolling time × frequency heatmap — each column is one FFT frame |

The spectrogram uses an SDL2 texture (SDL_TEXTUREACCESS_STREAMING) that gets one column updated per audio block. The texture is rendered with a scroll offset so the newest data appears on the right edge.


7. Signal Processing Fundamentals

This section covers the core signal processing concepts behind the visualizer. You do not need this to run the app, but it explains why each DSP step exists and what the display is actually showing you.

Tip

See it interactively: Run these demos on your host PC alongside reading this section:

cd ~/embedded-linux/scripts/signal-processing-demo
python sampling_aliasing.py -i     # Section 7.1: why 48 kHz sample rate?
python fft_windowing.py -i         # Section 7.2: why Hann window?
python filter_response.py -i       # Section 7.3: why high-pass filter?
Each demo has sliders to adjust parameters in real time — much easier to build intuition than reading equations alone.

What Is a Signal?

A signal is a quantity that varies over time. For audio, it is air pressure fluctuations captured by the microphone. The MEMS mic converts these tiny pressure changes into a voltage, and the I2S ADC converts that voltage into a stream of digital numbers — one number per sample.

At 48 kHz sample rate, you get 48,000 numbers per second. Each number represents the air pressure at that instant. This stream of numbers is the time-domain representation of the signal.

Sampling and the Nyquist Limit

The sampling theorem says: to faithfully capture a signal, you must sample at least twice the highest frequency present.

f_max = sample_rate / 2    (the "Nyquist frequency")

At 48 kHz, you can capture frequencies up to 24 kHz — well above human hearing (~20 kHz). At 16 kHz, the limit is 8 kHz — enough for speech but not music.

If a frequency above Nyquist is present in the signal, it appears as a lower frequency in the captured data — this artifact is called aliasing. The MEMS mic's internal anti-aliasing filter prevents this.

Sampling and Aliasing — 440 Hz sine at different sample rates

A 440 Hz sine wave sampled at four rates. At 2000 Hz and 1000 Hz (both above the 880 Hz Nyquist rate), the red reconstruction matches the blue original. At 700 Hz (below the Nyquist rate), the reconstructed signal has the wrong frequency — aliasing. At 440 Hz (exactly 1× the signal frequency), only DC is captured. Run locally: python scripts/signal-processing-demo/sampling_aliasing.py or -i for interactive slider

Time Domain vs. Frequency Domain

The waveform shows the signal in the time domain: amplitude on the y-axis, time on the x-axis. You can see:

  • Loud vs. quiet (amplitude)
  • Fast vs. slow oscillations (pitch)
  • Transients (clicks, claps)

But you cannot easily see which frequencies are present — a complex sound (speech, music) looks like a messy wave. This is where the frequency domain helps.

The FFT (Fast Fourier Transform) converts a block of time-domain samples into a set of frequency bins, each telling you how much energy is at that frequency. The result is the spectrum — the bar chart in the app.

How the FFT Works (Intuitively)

The FFT answers: "how much of each sine wave frequency is present in this block?"

Given N samples, the FFT produces N/2 + 1 frequency bins:

| Bin | Frequency | What it represents |
|-----|-----------|--------------------|
| 0 | 0 Hz (DC) | Average value of the block |
| 1 | fs/N Hz | Lowest non-zero frequency (one cycle per block) |
| 2 | 2·fs/N Hz | Two cycles per block |
| ... | ... | ... |
| N/2 | fs/2 Hz (Nyquist) | Highest frequency detectable |
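Converting between bins and frequencies is a pair of one-liners. These hypothetical helpers mirror the table above:

```c
/* Bin spacing is fs/N, so bin k sits at k * fs / N Hz. */
static double bin_to_hz(int k, double fs, int n)
{
    return (double)k * fs / (double)n;
}

/* Nearest bin for a target frequency. */
static int hz_to_bin(double hz, double fs, int n)
{
    return (int)(hz * (double)n / fs + 0.5);   /* round to nearest */
}
```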

Each bin is a complex number (real + imaginary part). The magnitude √(re² + im²) tells you how loud that frequency is. The phase tells you the timing — GCC-PHAT uses this for direction detection.

Why We Use a Window Function

The FFT assumes the input block repeats forever. If the block doesn't start and end at zero, the FFT sees a discontinuity at the block boundary, which spreads energy across all bins — this is spectral leakage.

The Hann window tapers the signal smoothly to zero at both ends:

w[n] = 0.5 · (1 - cos(2π·n / (N-1)))

This eliminates the discontinuity at the cost of slightly reduced frequency resolution. Other windows (Hamming, Blackman, Kaiser) offer different trade-offs between resolution and leakage.

FFT Windowing — spectral leakage comparison

Top: window shapes. Middle: spectrum of 440 Hz + 466 Hz (-20 dB). The rectangular window's leakage completely hides the weak 466 Hz tone. Hann and Blackman reveal it clearly. Bottom: zoomed view showing each window's main lobe width vs side lobe level. Run locally: python scripts/signal-processing-demo/fft_windowing.py or -i for interactive window switching

Decibels (dB)

The spectrum is displayed in decibels because human hearing is logarithmic — we perceive loudness on a ratio scale, not a linear scale.

Level_dB = 20 · log₁₀(amplitude)

| dB value | Meaning |
|----------|---------|
| 0 dB | Full-scale signal (clipping) |
| -20 dB | 10× quieter than full scale |
| -40 dB | 100× quieter |
| -60 dB | 1000× quieter (quiet room) |
| -80 dB | Noise floor |

A change of 6 dB doubles (or halves) the signal amplitude; an increase of about 10 dB is perceived as roughly twice as loud.

Filtering

A filter selectively passes or removes certain frequencies from a signal.

Input signal → [Filter] → Output signal (some frequencies removed)

The three basic filter types:

| Type | Passes | Blocks | Use in this app |
|------|--------|--------|-----------------|
| High-pass | Above cutoff | Below cutoff | Remove DC offset and rumble |
| Low-pass | Below cutoff | Above cutoff | Isolate speech band |
| Band-pass | Between two cutoffs | Outside the band | Focus on a target frequency range |

The app uses a 1st-order IIR (Infinite Impulse Response) high-pass filter — a simple recursive formula that only needs the previous input and output sample:

y[n] = α · (y[n-1] + x[n] - x[n-1])

Higher α → lower cutoff frequency. This is the simplest filter that works in real time with zero latency.

From Single Signal to Two-Microphone Array

With one microphone, you know what sound is present (spectrum) but not where it comes from. Adding a second microphone enables spatial information:

If sound arrives from the left, it reaches the left mic first. The time difference of arrival (TDOA) between the two channels encodes the direction. Section 8 explains how to extract this.


8. Understanding the Display

Reading the Spectrum

The x-axis of the spectrum represents frequency. For a 48 kHz sample rate and 1024-point FFT:

  • Each bin spans 48000 / 1024 ≈ 46.9 Hz
  • Bin 0 = DC (0 Hz)
  • Bin 512 = Nyquist (24 kHz)
  • Human speech fundamental: bins ~2–8 (100–400 Hz)
  • Clap/snap transient: broad peak across many bins

Reading the Spectrogram

  • x-axis: time (scrolling left)
  • y-axis: frequency (0 Hz at bottom, Nyquist at top)
  • colour: magnitude (black = silence, blue = quiet, yellow = loud, white = very loud)

A sustained tone appears as a horizontal line. A clap appears as a vertical bright column (energy at all frequencies simultaneously). Speech shows a characteristic harmonic ladder pattern.

Frequency Resolution vs. Time Resolution

There is a fundamental trade-off:

| FFT size | Freq. resolution | Time resolution | Use case |
|----------|------------------|-----------------|----------|
| 256 | ~188 Hz | ~5.3 ms | Fast transients |
| 512 | ~94 Hz | ~10.7 ms | General purpose |
| 1024 | ~47 Hz | ~21.3 ms | Default — good balance |
| 2048 | ~23 Hz | ~42.7 ms | Fine frequency detail |
| 4096 | ~12 Hz | ~85.3 ms | Music analysis |

Use the -n flag to experiment: -n 256 for snappy response, -n 4096 for detailed frequency analysis.


9. Direction Detection (Stereo Mode)

Warning

Direction detection requires two synchronized microphones on the same I2S data line — one set to L channel (L/R → GND), one to R channel (L/R → VDD). Two separate I2S interfaces will not work reliably because their clocks drift.

Stereo Hardware Setup

Wire both INMP441 mics to the same BCLK/WS/SD lines. The only difference is the L/R pin:

| Pin | Mic 1 (Left) | Mic 2 (Right) |
|-----|--------------|---------------|
| SCK | GPIO 18 | GPIO 18 |
| WS | GPIO 19 | GPIO 19 |
| SD | GPIO 20 | GPIO 20 |
| L/R | GND | VDD (3.3V) |

Both mics share the data line but transmit in different time slots (left slot vs. right slot). ALSA delivers them as interleaved stereo frames.

Mount the mics with a known spacing — measure the distance between the two mic holes. The default is 6 cm (-m 0.06).

The Physics: How Direction Creates a Time Delay

When a sound source is directly in front of both mics (broadside), the sound wave arrives at both mics simultaneously — zero delay.

When the source is to the left, sound reaches the left mic first. The extra distance the wave must travel to reach the right mic creates a time difference of arrival (TDOA):

                  Sound source (to the left)
                      /  \
                     /    \
                    /      \
               Mic L ---- Mic R
                 ←── d ──→

  Extra path length = d · sin(θ)
  Time delay τ = d · sin(θ) / c     (c = 343 m/s speed of sound)

For 6 cm spacing:

  • Source at 90° (hard left): τ = 0.06 / 343 = 175 μs = 8.4 samples at 48 kHz
  • Source at 45°: τ = 0.06 · sin(45°) / 343 = 124 μs = 5.9 samples
  • Source at 0° (broadside/center): τ = 0 μs

The maximum delay is only ~8 samples — this is why synchronized clocks matter. Even one sample of drift shifts the estimate by roughly 7° at broadside.

Cross-Correlation: Finding the Delay

The simplest way to find the TDOA is cross-correlation — slide one signal past the other and measure how well they match at each offset:

R₁₂[k] = Σ x₁[n] · x₂[n + k]     for each lag k

The lag k that produces the highest R₁₂[k] is the estimated delay. This works but produces broad peaks that are hard to pinpoint, especially in reverberant rooms.
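The sliding-match idea translates directly into a double loop. A hypothetical brute-force sketch, fine for the small lag range of a 6 cm mic pair:

```c
/* Brute-force cross-correlation peak search over lags -max_lag..+max_lag.
 * Positive lag means x2 lags (arrives later than) x1.
 * O(n * max_lag); acceptable for lags of about +/-10 samples. */
static int xcorr_peak_lag(const float *x1, const float *x2, int n, int max_lag)
{
    int best_lag = 0;
    float best = -1e30f;
    for (int k = -max_lag; k <= max_lag; k++) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            int j = i + k;
            if (j >= 0 && j < n)
                sum += x1[i] * x2[j];   /* R12[k] = sum x1[n]*x2[n+k] */
        }
        if (sum > best) { best = sum; best_lag = k; }
    }
    return best_lag;
}
```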

GCC-PHAT: A Better Cross-Correlation

Generalized Cross-Correlation with Phase Transform (GCC-PHAT) sharpens the peak dramatically:

1. X₁(f) = FFT(x₁)              — frequency domain of mic 1
2. X₂(f) = FFT(x₂)              — frequency domain of mic 2
3. G₁₂(f) = X₁(f) · X₂*(f)     — cross-power spectrum
4. Ĝ₁₂(f) = G₁₂(f) / |G₁₂(f)| — PHAT: normalize magnitude to 1
5. R₁₂[k] = IFFT(Ĝ₁₂)          — inverse FFT → correlation
6. τ = argmax(R₁₂)              — peak = delay in samples

Step 4 is the key: by dividing out the magnitude, only the phase difference between channels remains. Phase encodes timing; magnitude encodes loudness. By keeping only phase, the result has a sharp spike at the true delay regardless of the signal's spectral shape.

In C, this is:

/* Cross-power with PHAT weighting */
float cre = re1 * re2 + im1 * im2;   /* real part of X1 · conj(X2) */
float cim = im1 * re2 - re1 * im2;   /* imaginary part */
float mag = sqrtf(cre*cre + cim*cim) + 1e-10f;
cross[i][0] = cre / mag;             /* normalize to unit magnitude */
cross[i][1] = cim / mag;

From Delay to Angle

Once you have the delay in samples:

τ = delay_samples / sample_rate          (convert to seconds)
sin(θ) = c · τ / d                       (c = 343 m/s, d = mic spacing)
θ = arcsin(c · τ / d)                    (angle from broadside)

The app smooths the delay estimate with an exponential moving average to avoid jitter:

delay_samples = delay_samples * 0.7f + lag * 0.3f;

Resolution Limits

| Parameter | Value | Effect |
|-----------|-------|--------|
| Mic spacing | 6 cm | Max delay = 8.4 samples |
| Sample rate | 48 kHz | 1 sample = 20.8 μs = 7.1 mm |
| Discrete angles | ~17 | Without interpolation |
| Angular resolution | ~6° | At broadside (best case) |
| Ambiguity | Front/back | Cannot distinguish with 2 mics |

Sub-sample interpolation (parabolic fit around the peak) can improve resolution to ~1°. See the Challenges section.
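Parabolic interpolation fits a parabola through the peak lag and its two neighbours; the vertex gives a fractional correction. A hypothetical sketch:

```c
/* Parabolic (3-point) interpolation around a correlation peak.
 * y_m1, y0, y_p1 are R[k-1], R[k], R[k+1]; the return value is the
 * fractional offset in (-0.5, 0.5) to add to the integer peak lag k. */
static float parabolic_offset(float y_m1, float y0, float y_p1)
{
    float denom = y_m1 - 2.0f * y0 + y_p1;
    if (denom == 0.0f)
        return 0.0f;                 /* degenerate (flat): no refinement */
    return 0.5f * (y_m1 - y_p1) / denom;
}
```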

The Direction Indicator

In stereo mode, the app shows a circular direction indicator in the top-right corner. The yellow line points toward the estimated sound direction; its length represents the confidence (correlation peak height — higher peak = more certain).

./audio_viz -d hw:1,0 -c 2 -m 0.06

Try these experiments:

  • Clap from different sides — the indicator should swing left/right
  • Speak steadily — the indicator should track your position
  • Two people speaking — the indicator jumps between them (it tracks the loudest source)
  • Snap fingers close to one mic — the indicator should point hard left or right

Tip

If the direction seems inverted (left/right swapped), your mic L/R pin assignment is reversed. Either swap the L/R wires or pass a negative mic distance: -m -0.06.


10. Practical Limits of Multi-Mic Direction Finding

Why Not 5 Separate I2S Mics?

The Raspberry Pi's I2S interface supports 2 channels per data line (left/right slot selection). Common I2S MEMS mics like the INMP441 only transmit on one slot.

To use more than 2 microphones, you would need:

  • TDM (Time Division Multiplexing) — some codecs support 4–8 channels on one data line, but the Pi's PCM hardware has limited TDM support
  • Multiple I2S data lines — Pi 5 can stripe stereo pairs across data lines, but Pi 4 cannot
  • USB multichannel audio — a USB mic array (e.g., ReSpeaker) provides 4–8 synchronized channels as a single ALSA device

Connecting 5 separate unsynchronized I2S mics will not give reliable direction detection — even tiny clock drift between interfaces causes the delay estimate to wander over time.

| Setup | Channels | Direction | Notes |
|-------|----------|-----------|-------|
| 1 mic | Mono | No | Waveform + spectrum only |
| 2 mics, same I2S line | Stereo | Left/right axis | Good for basic DOA |
| USB mic array (4–8 ch) | Multi | 2D angle | Best for real direction finding |
| 5 separate I2S mics | 5 × mono | Unreliable | Clocks drift, TDOA breaks down |

What Works for 2 Mics

With 2 synchronized mics you can:

  • Detect left vs. centre vs. right (1D angle on the mic axis)
  • Estimate direction within approximately ±90° from broadside
  • Track a moving source in real time (clap, speech, footsteps)

You cannot determine elevation or distinguish front from back — that requires 3+ mics in a non-collinear arrangement.


11. Filtering Reference

The app applies a simple high-pass filter. Here are other useful filters you can add:

| Filter | Purpose | Implementation |
|--------|---------|----------------|
| High-pass (80 Hz) | Remove DC + rumble | 1-pole IIR (included) |
| Low-pass (8 kHz) | Speech band only | 1-pole IIR: y = α·x + (1-α)·y_prev |
| Band-pass (300–3400 Hz) | Telephone speech band | Chain HP + LP |
| Moving average | Smooth RMS/peak meters | Circular buffer average |
| Exponential smoothing | Smooth spectrum display | y = α·x + (1-α)·y_prev per bin |
| Notch (50/60 Hz) | Remove mains hum | 2nd-order IIR |
| AGC | Auto-gain control | Track RMS, scale to target |

For the 1st-order IIR high-pass used in this app:

/* Cutoff frequency fc, sample rate fs:
 * α ≈ 1 / (1 + 2π·fc/fs)
 *
 * fc = 80 Hz, fs = 48000 → α ≈ 0.9896
 * The app uses α = 0.995, which gives a slightly lower cutoff (≈ 38 Hz).
 */
y[n] = α · (y[n-1] + x[n] - x[n-1])
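Wrapped into a small stateful C filter (hypothetical names, matching the formula above):

```c
/* 1st-order IIR high-pass, initialised from a desired -3 dB cutoff. */
typedef struct { float alpha, x_prev, y_prev; } hp1_t;

static void hp1_init(hp1_t *f, float fc, float fs)
{
    /* alpha ~= 1 / (1 + 2*pi*fc/fs), from the RC high-pass analogy */
    f->alpha = 1.0f / (1.0f + 2.0f * 3.14159265f * fc / fs);
    f->x_prev = f->y_prev = 0.0f;
}

/* One sample of y[n] = alpha * (y[n-1] + x[n] - x[n-1]). */
static float hp1_step(hp1_t *f, float x)
{
    float y = f->alpha * (f->y_prev + x - f->x_prev);
    f->x_prev = x;
    f->y_prev = y;
    return y;
}
```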

Challenges

Tip

Try extending the visualizer. Each challenge below has a guided solution with theory explanations — try solving it first, then check the hints if stuck.

  • Noise gate — only update the direction estimate when the RMS level exceeds a threshold (ignore silence)
  • Peak hold — draw a slowly-decaying peak line above the spectrum bars (classic audio meter style)
  • Musical note tuner — detect the dominant frequency and display the nearest musical note (A4 = 440 Hz)
  • WAV recording — add a key (R) to start/stop recording the raw audio to a WAV file
  • Band-pass filter — add a configurable band-pass filter and show the filtered signal alongside the raw signal
  • Sub-sample TDOA — implement parabolic interpolation around the GCC-PHAT peak for better angular resolution
  • CPU budget analysis — measure how much time each processing stage takes, find the FFT size limit

Guided solutions with theory: Audio Visualizer Challenges

Background reading: Signal Processing Reference — sampling, FFT, filtering, DSP architectures

Pipeline deep-dive: Audio Pipeline Latency — measure and optimize end-to-end latency, understand the latency–reliability tradeoff, connect to real-time systems concepts


Full Pipeline Architecture (audio_viz_full)

The full-featured version (audio_viz_full.c) adds real-time playback with voice effects, an 8-band EQ, and TDOA visualization. Understanding its architecture is a real-time systems case study.

Data Flow

                    ┌────────────────────────────────────────────────────┐
                    │              CAPTURE THREAD (SCHED_FIFO 60)        │
                    │                                                    │
  I2S Mic ─────────▶│ ALSA snd_pcm_readi() ──▶ S32→float ──▶ ×gain     │
  (INMP441)         │                                          │        │
                    └──────────────────────────────────────────┼────────┘
                                                    ┌──────────▼──────────┐
                                                    │   Capture Ring Buf  │
                                                    │   (8 slots, mutex)  │
                                                    └──────────┬──────────┘
            ┌──────────────────────────────────────────────────┼───────────┐
            │              RENDER THREAD (main, 60fps vsync)   │           │
            │                                                  ▼           │
            │  ┌─── drain ALL pending blocks (for loop) ───────────────┐  │
            │  │                                                       │  │
            │  │  ┌─────────────┐  ┌────────────┐  ┌───────────────┐  │  │
            │  │  │ HP Filter   │─▶│ LP Filter  │─▶│ 8-band EQ     │  │  │
            │  │  │ (115 Hz)    │  │ (3 kHz opt)│  │ (biquad peak) │  │  │
            │  │  └─────────────┘  └────────────┘  └───────┬───────┘  │  │
            │  │                                           │          │  │
            │  │  ┌────────────┐     ┌─────────────────────┤          │  │
            │  │  │ Delay/Echo │◀────┘                     │          │  │
            │  │  │ (optional) │                           │          │  │
            │  │  └─────┬──────┘                           │          │  │
            │  │        │                                  │          │  │
            │  │        ▼                                  ▼          │  │
            │  │  ┌──────────────────────────┐   ┌────────────────┐  │  │
            │  │  │ PLAYBACK COPY            │   │ WAV Recording  │  │  │
            │  │  │ ┌──────────────────────┐ │   │ (if active)    │  │  │
            │  │  │ │ Noise Gate (opt)     │ │   └────────────────┘  │  │
            │  │  │ ├──────────────────────┤ │                       │  │
            │  │  │ │ Voice FX             │ │                       │  │
            │  │  │ │ ├ Chipmunk (1.5x)    │ │                       │  │
            │  │  │ │ ├ Deep (0.6x stretch)│ │                       │  │
            │  │  │ │ └ Robot (ring mod)   │ │                       │  │
            │  │  │ ├──────────────────────┤ │                       │  │
            │  │  │ │ Hard clip ±1.0       │ │                       │  │
            │  │  │ └──────────────────────┘ │                       │  │
            │  │  └───────────┬──────────────┘                       │  │
            │  │              ▼                                      │  │
            │  │  ┌───────────────────┐                              │  │
            │  │  │ Playback Ring Buf │──── condvar signal ──────┐   │  │
            │  │  │ (8 slots)         │                          │   │  │
            │  │  └───────────────────┘                          │   │  │
            │  └─── end drain loop (repeat for each block) ──────┼──┘  │
            │                                                    │     │
            │  ┌─── LAST block only ─────────────────────────┐   │     │
            │  │  FFT → magnitude → spectrum + spectrogram   │   │     │
            │  │  GCC-PHAT → TDOA → direction indicator      │   │     │
            │  │  RMS, peak, frequency, note detection       │   │     │
            │  │  Waveform history ring buffer                │   │     │
            │  └──────────────────────────┬──────────────────┘   │     │
            │                             ▼                      │     │
            │                      SDL2 Render                   │     │
            │                      + Overlays (EQ, TDOA)         │     │
            │                      + Button bar                  │     │
            └────────────────────────────────────────────────────┼─────┘
            ┌────────────────────────────────────────────────────┼─────┐
            │              PLAYBACK THREAD (SCHED_FIFO 55)       │     │
            │                                                    ▼     │
            │  condvar wait ──▶ accumulate ──▶ float→S16 ──▶ ALSA     │
            │                   (cap 256 →                  writei    │
            │                    play 480)                            │
            └─────────────────────────────────────────────────────────┘
                                   Headphone Jack / HDMI

Signal Processing Chain

| Stage | Algorithm | What it does | Latency |
|-------|-----------|--------------|---------|
| Gain | sample × gain | Amplify quiet I2S mic output (24-bit in 32-bit word) | 0 |
| HP filter | 1-pole IIR, α=0.985 | Remove DC offset + mains hum (< 115 Hz) | 0 (IIR) |
| LP filter | 1-pole IIR | Optional: remove high-frequency noise (> 3 kHz) | 0 (IIR) |
| 8-band EQ | Biquad peaking (Audio EQ Cookbook) | Boost/cut frequency bands (60 Hz – 16 kHz) | 0 (IIR) |
| Delay | Circular buffer with feedback | Echo effect (0–1000 ms, 0.2 feedback) | = delay time |
| FFT | FFTW 1024-point r2c | Frequency spectrum for display | N/A (display only) |
| GCC-PHAT | Cross-power spectrum + IFFT | Time difference of arrival between mics | N/A (display only) |

Voice FX — Why Each Approach Was Chosen

| Effect | Method | Why this works | Why others failed |
|--------|--------|----------------|-------------------|
| Chipmunk | History buffer (1 s ring), read at 1.5× | Reader is faster than writer → stays close, low latency | Single-block resampling looped every 640 samples → distortion |
| Deep | In-place per-block, read every 0.6th sample | Zero latency, no drift | History buffer at 0.7× → reader drifts behind, 300 ms/s delay buildup, words cut off at safety reset |
| Robot | Bitcrusher (decimate ×6) + 80 Hz ring mod | No resampling needed, clean metallic sound | Plain ring mod at 150 Hz was too buzzy |

Latency Budget

Normal mode (-n 1024):
┌────────────────────┬──────────┬─────────────────────────────────────┐
│ Stage              │ Latency  │ Why                                 │
├────────────────────┼──────────┼─────────────────────────────────────┤
│ ALSA capture period│ 21.3 ms  │ Must collect 1024 samples           │
│ Ring buffer wait   │ 0-16 ms  │ Until next render frame (60fps)     │
│ DSP processing     │ < 1 ms   │ Filters + EQ + FFT                  │
│ Playback ring      │ 0-21 ms  │ Accumulation for period mismatch    │
│ ALSA playback buf  │ ~40 ms   │ 2-4 periods hardware buffer         │
├────────────────────┼──────────┼─────────────────────────────────────┤
│ TOTAL (visual)     │ ~25 ms   │ Capture → screen                    │
│ TOTAL (audio FX)   │ ~80 ms   │ Capture → speaker                   │
└────────────────────┴──────────┴─────────────────────────────────────┘

Low-latency mode (-l, -n 512):
  Capture: 10.7 ms → Total visual: ~15 ms, Total FX: ~50 ms

Lessons Learned — What Went Wrong and Why

Building this playback pipeline required several iterations. Each failure teaches a real-time systems concept:

Lesson 1: Render Loop ≠ Audio Rate

Problem: At 256-sample periods (187 blocks/s), the 60fps render loop only processed 1 block per frame — 127 blocks/s were dropped, starving the playback thread.

Fix: Drain ALL pending capture blocks in a for(;;) loop each frame. Every block goes through filters + EQ + playback. Only the last block feeds FFT/display.

Principle: In a multi-rate system, the consumer must keep up with the fastest producer. Never assume "one item per frame" when the producer runs at a different rate.

Lesson 2: ALSA Period Size Mismatch

Problem: Capture period = 256 samples, playback hardware period = 480. Writing 256 to a 480-period device caused constant underruns.

Fix: Accumulation buffer collects capture blocks until a full playback period is ready. Leftover samples carry over to the next write via memmove.

Principle: Hardware constraints (fixed period sizes) must be bridged in software. The accumulator is a classic rate-adaptation pattern.
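The accumulator is a few lines of bookkeeping. A hypothetical sketch of the 256-in/480-out flow described above:

```c
#include <string.h>

#define ACC_CAP 4096   /* must hold at least period + one capture block */

typedef struct { float buf[ACC_CAP]; int fill; } accum_t;

/* Push a capture block of n samples; when at least 'period' samples have
 * accumulated, copy one full playback period into 'out' and carry the
 * remainder over. Returns 1 each time a full period is ready. */
static int accum_push(accum_t *a, const float *in, int n,
                      float *out, int period)
{
    memcpy(a->buf + a->fill, in, n * sizeof(float));
    a->fill += n;
    if (a->fill < period)
        return 0;                         /* not enough for a period yet */
    memcpy(out, a->buf, period * sizeof(float));
    memmove(a->buf, a->buf + period,      /* leftover moves to the front */
            (a->fill - period) * sizeof(float));
    a->fill -= period;
    return 1;
}
```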

Lesson 3: Don't Loop Within a Single Block for Pitch Shift

Problem: Resampling within a 1024-sample block (21ms) at ratio 1.6x wraps every 640 samples — replaying chunks of audio. Crossfading at the wrap helps with clicks but the fundamental distortion remains.

Fix: Circular history buffer (1 second) that accumulates audio continuously. The read pointer moves through the history at ratio speed, reading across block boundaries.

Principle: Pitch shifting needs continuous audio, not a looped snippet. The minimum "window" must be much larger than the pitch period of the voice (~5-10ms fundamental).

Lesson 4: Slower-Than-1x Reading Accumulates Delay

Problem: Deep voice at 0.7x ratio — reader consumes 717 samples per 1024-sample block. 307 samples accumulate per block → 300ms/s of growing delay. Safety reset snaps the reader forward → words cut off mid-sentence.

Fix: Process the current block in-place: read every 0.6th sample with linear interpolation. Zero added latency, no buffer management.

Principle: For ratio < 1.0, the reader is fundamentally slower than the writer. A ring buffer can't solve this — the data rate mismatch is structural. In-place processing avoids it entirely.
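The in-place approach can be sketched as follows (hypothetical; linear interpolation between neighbouring samples, block length assumed to be at most 4096):

```c
/* In-place slow-ratio read ("deep voice"): resample the current block
 * at ratio < 1.0 with linear interpolation. The read position never
 * leaves the block, so no delay builds up between blocks. */
static void slow_read_block(float *buf, int n, float ratio)
{
    float tmp[4096];                          /* assumes n <= 4096 */
    for (int i = 0; i < n; i++) {
        float pos  = i * ratio;               /* fractional read index */
        int   i0   = (int)pos;
        float frac = pos - (float)i0;
        int   i1   = (i0 + 1 < n) ? i0 + 1 : i0;
        tmp[i] = buf[i0] * (1.0f - frac) + buf[i1] * frac;
    }
    for (int i = 0; i < n; i++)
        buf[i] = tmp[i];
}
```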

Lesson 5: Display Gain Goes Straight to the Speaker

Problem: Gain 32x amplifies the I2S mic signal for visualization. The same amplified signal went to playback. tanhf() soft clipping on every sample distorted even moderate levels (tanhf(0.5) = 0.46 = 8% THD).

Fix: Keep the gain (I2S mics genuinely need it), use hard clip (only fires on actual overs). The gain is part of the signal path, not just display scaling.

Principle: Understand what each gain stage does. I2S microphones output 24-bit data in a 32-bit word — the signal IS that quiet. The gain compensates for hardware, not display preference.

Lesson 6: Spectral Subtraction Is Hard to Get Right

Problem: FFT-based noise removal with overlap-add caused "pipe" sound — the window math was replacing samples instead of blending, and phase artifacts from the FFT round-trip degraded quality.

Fix: Simple adaptive expander gate (no FFT). Tracks noise floor as minimum RMS, attenuates signal near the floor. Transparent to voice — no coloring.

Principle: Simpler is often better. Spectral subtraction requires careful overlap-add normalization, phase continuity, and proper spectral floor calculation. For a demo, the complexity isn't worth the marginal improvement over a well-tuned gate.
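A gate of this style fits in a few lines. A hypothetical sketch with made-up constants (real tuning matters a lot in practice):

```c
/* Adaptive expander gate: track the noise floor as a slowly recovering
 * minimum of block RMS and attenuate anything close to it.
 * No FFT, so no spectral artifacts. */
typedef struct { float floor_rms; } gate_t;

static float gate_gain(gate_t *g, float rms)
{
    if (rms < g->floor_rms)
        g->floor_rms = rms;            /* new quietest block seen */
    else
        g->floor_rms *= 1.0001f;       /* let the floor creep back up */

    float thresh = 2.0f * g->floor_rms;   /* open ~6 dB above the floor */
    if (rms >= thresh)
        return 1.0f;                      /* voice: pass untouched */
    return rms / thresh;                  /* near the floor: attenuate */
}
```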

Connection to Real-Time Systems

These exact patterns appear in every real-time embedded system:

| Audio pipeline concept | Motor control equivalent |
|------------------------|--------------------------|
| ALSA period = processing deadline | Control loop period |
| Underrun = missed deadline | Actuator jitter / stall |
| Ring buffer = rate decoupling | Sensor FIFO |
| SCHED_FIFO for audio threads | RT priority for control task |
| Period size tradeoff | Sample rate tradeoff |
| Accumulation buffer | Rate converter between sensor and actuator |

See Audio Pipeline Latency for hands-on measurement exercises.


What Just Happened?

Compare this app with the other SDL2 applications you have built:

| Aspect | Level Display | Audio Visualizer |
|--------|---------------|------------------|
| Input | BMI160 IMU (SPI/I2C) | I2S microphone (ALSA) |
| Capture | read() from IIO sysfs | snd_pcm_readi() from ALSA |
| Threading | Sensor thread + render thread | Audio + render + playback threads |
| Data flow | Sensor → atomic float → renderer | Mic → ring buffer → DSP → renderer + playback |
| DSP | Complementary filter (roll/pitch) | HP/LP/EQ filters, FFT, GCC-PHAT, voice FX |
| Display | Artificial horizon (geometry) | Waveform, spectrum, spectrogram, TDOA, EQ overlay |
| Output | Display only | Display + audio playback with effects |

The pattern is the same: a capture thread feeds data to a render thread through a shared buffer. The full version adds a playback thread creating a three-thread pipeline — a common architecture in media applications.

