I2S Microphone Audio Visualizer
Time estimate: ~90 minutes
Prerequisites: SDL2 Level Display, Display Setup
Learning Objectives
By the end of this tutorial you will be able to:
- Configure an I2S MEMS microphone on Raspberry Pi using Device Tree overlays
- Capture audio through ALSA in a dedicated thread
- Apply real-time DSP: high-pass filtering, windowing, FFT
- Render waveform, spectrum, and scrolling spectrogram with SDL2
- (Optional) Use GCC-PHAT cross-correlation on two synchronized channels to estimate sound direction
Why I2S?
The Raspberry Pi has no analogue audio input. I2S (Inter-IC Sound) is a digital serial bus designed for audio — it carries a bit clock (BCLK), word select (LRCLK), and data (DOUT) directly from the microphone's ADC. No external codec needed.
Common I2S MEMS microphones like the INMP441 or SPH0645 connect directly to the Pi's PCM/I2S pins and appear as an ALSA capture device once the overlay is loaded. This tutorial builds a real-time audio visualization app on top of that capture stream.
Architecture Overview
┌───────────────┐ ┌──────────────┐ ┌───────────────┐
│ I2S MEMS Mic │───▶│ ALSA Capture │───▶│ Ring Buffer │
│ (INMP441) │ │ (thread) │ │ (lock-free) │
└───────────────┘ └──────────────┘ └───────┬───────┘
│
▼
┌────────────────┐
│ DSP Pipeline │
│ HP filter │
│ Hann window │
│ FFT (FFTW3) │
│ GCC-PHAT (2ch)│
└────────┬───────┘
│
▼
┌────────────────┐
│ SDL2 Renderer │
│ Waveform │
│ Spectrum bars │
│ Spectrogram │
│ Direction (2ch)│
└────────────────┘
The design separates audio capture from rendering using a ring buffer and two threads — the same pattern used in the Level Display tutorial for sensor data. The audio thread pushes blocks of samples; the render thread processes and draws the latest block each frame.
1. Hardware Setup
Wiring
Connect an INMP441 (or SPH0645) I2S microphone to the Pi's GPIO header:
| Mic Pin | Pi GPIO | Function |
|---|---|---|
| VDD | 3.3V | Power |
| GND | GND | Ground |
| SCK | GPIO 18 | Bit clock (BCLK) |
| WS | GPIO 19 | Word select / LRCLK |
| SD | GPIO 20 | Data out (DOUT) |
| L/R | GND | Channel select: GND = left, VDD = right |
Warning
The INMP441 is a 3.3V device. Do not connect VDD to 5V.
For two-channel capture (stereo), connect a second microphone to the same BCLK/WS/SD lines but tie its L/R pin to VDD (right channel). Both mics share the same data line — one transmits on the left slot, the other on the right.
Enable the I2S Overlay
Add the microphone overlay to /boot/firmware/config.txt:
# For a simple I2S mic (mono or stereo pair on one data line)
echo "dtoverlay=googlevoicehat-soundcard" | sudo tee -a /boot/firmware/config.txt
Tip
The googlevoicehat-soundcard overlay is the simplest way to enable I2S capture on the Pi — it configures the PCM/I2S interface for digital microphones without an external codec. Alternatively, you can use dtoverlay=i2s-mmap or a custom overlay. Check ls /boot/firmware/overlays/ for available options.
Reboot:
sudo reboot
After reboot, verify the sound card appeared:
arecord -l
Expected output: a capture card provided by the overlay (the exact name depends on the overlay; for googlevoicehat-soundcard it contains "googlevoicehat").
Checkpoint — I2S Mic Detected
arecord -l shows the I2S sound card. If not, check wiring and the overlay.
2. Test Capture with ALSA
Before writing code, verify audio capture works:
# Record 5 seconds of audio
arecord -D hw:1,0 -f S32_LE -r 48000 -c 2 -d 5 test.wav
# Play back (connect headphones or use another device)
aplay test.wav
Note
I2S mics typically output 24-bit audio in a 32-bit container (S32_LE). The app converts to float internally. If hw:1,0 does not work, check the card number from arecord -l and adjust.
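The S32-to-float conversion mentioned in the note can be sketched as follows (the function name and signature are illustrative, not the app's actual code):

```c
#include <stdint.h>

/* Convert S32_LE samples to float in [-1, 1]. I2S mics deliver 24-bit
   data left-justified in each 32-bit word, so dividing by 2^31 scales
   the full 32-bit range correctly. */
static void s32_to_float(const int32_t *in, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = (float)in[i] / 2147483648.0f;  /* 2^31 */
}
```

Because the 24-bit payload sits in the top bits, the resulting floats are small — which is why the app applies software gain (the -g flag).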
Check the signal with arecord's built-in VU meter:
arecord -D hw:1,0 -f S32_LE -r 48000 -c 2 -V stereo /dev/null
You should see the VU meter respond to sound. Clap your hands — the meter should spike.
3. Install Dependencies
Install the development packages (Debian/Raspberry Pi OS package names):
sudo apt install libsdl2-dev libasound2-dev libfftw3-dev
These provide:
| Library | Purpose |
|---|---|
| SDL2 | Window, renderer, event loop |
| ALSA (libasound2) | Audio capture API |
| FFTW3 (libfftw3f) | Single-precision FFT |
4. Build the Visualizer
The source is in src/embedded-linux/apps/i2s-audio-viz/audio_viz.c:
Or build manually:
gcc -Wall -O2 $(sdl2-config --cflags) -o audio_viz audio_viz.c \
$(sdl2-config --libs) -lasound -lfftw3f -lm -lpthread
5. Run
# Mono (1 mic, default device hw:0)
./audio_viz -d hw:1,0 -c 1
# Stereo with direction detection (2 mics on same I2S line)
./audio_viz -d hw:1,0 -c 2 -m 0.06
# Boost quiet mic signal
./audio_viz -d hw:1,0 -c 1 -g 8.0
# Longer waveform view (200 ms)
./audio_viz -d hw:1,0 -c 1 -w 200
# Fine frequency resolution (4096-point FFT)
./audio_viz -d hw:1,0 -c 1 -n 4096
# Custom sample rate
./audio_viz -d hw:1,0 -r 16000 -n 512
Command-line options:
| Flag | Default | Description |
|---|---|---|
| -d | hw:0 | ALSA device |
| -r | 48000 | Sample rate (Hz) |
| -c | 1 | Channels (1 = mono, 2 = stereo) |
| -n | 1024 | Period size / FFT window |
| -g | 4.0 | Software gain (I2S mics are often quiet) |
| -w | 50 | Waveform display length in ms |
| -m | 0.06 | Mic spacing in metres (for direction) |
Press Q or close the window to exit.
Checkpoint — Visualizer Running
You should see: waveform at top, RMS/peak meter, FFT bar spectrum, and scrolling spectrogram. Speak or clap — the display should react in real time.
6. How It Works
Audio Thread and Ring Buffer
The audio thread runs snd_pcm_readi() in a tight loop, capturing one period (1024 frames) at a time. Each block is written to a ring buffer protected by a mutex:
snd_pcm_sframes_t n = snd_pcm_readi(pcm, tmp, period_frames);
if (n < 0)
    n = snd_pcm_recover(pcm, n, 0);   /* recover from overruns (-EPIPE) */
if (n > 0) {
    pthread_mutex_lock(&ring_mtx);
    memcpy(ring_buf + ring_write * slot_size, tmp, n * channels * sizeof(float));
    ring_write = (ring_write + 1) % RING_SLOTS;
    pthread_mutex_unlock(&ring_mtx);
}
If the ring fills up (render thread is too slow), the oldest block is dropped. This prevents the audio thread from blocking — it always keeps capturing.
Why Not SDL_AudioSpec?
SDL2 has its own audio API (SDL_OpenAudioDevice), but for I2S microphones on Linux, using ALSA directly gives you more control: you can specify the exact hardware device, format negotiation, and period size. SDL2's audio backend on Linux wraps ALSA anyway.
DSP Pipeline
Each block goes through these steps:
1. High-pass filter — removes DC offset and low-frequency rumble. I2S mics often have a significant DC component:
/* 1st-order IIR high-pass: y[n] = α(y[n-1] + x[n] - x[n-1]) */
float y = alpha * (*y_prev + x - *x_prev);
*x_prev = x;   /* carry input and output state to the next sample */
*y_prev = y;
With α = 0.995 at 48 kHz, the cutoff is approximately 40 Hz (α ≈ 0.99 would put it near 80 Hz).
2. Hann window — multiplies the block by a raised-cosine window to reduce spectral leakage before FFT:
Without windowing, the FFT of a rectangular block produces wide spectral lobes that smear energy across bins.
3. FFT — FFTW3 computes the real-to-complex DFT. The output has N/2 + 1 complex bins:
4. Magnitude in dB — each bin's magnitude is converted to decibels for display:
The 1e-10 prevents log10(0).
Visualization
The render loop draws four panels:
| Panel | What it shows |
|---|---|
| Waveform | Raw time-domain signal (amplitude vs. sample index) |
| Level meter | RMS (green bar) and peak (red marker) in dB |
| Spectrum | FFT magnitude bars, colour-coded by level |
| Spectrogram | Scrolling time × frequency heatmap — each column is one FFT frame |
The spectrogram uses an SDL2 texture (SDL_TEXTUREACCESS_STREAMING) that gets one column updated per audio block. The texture is rendered with a scroll offset so the newest data appears on the right edge.
7. Signal Processing Fundamentals
This section covers the core signal processing concepts behind the visualizer. You do not need this to run the app, but it explains why each DSP step exists and what the display is actually showing you.
Tip
See it interactively: Run these demos on your host PC alongside reading this section:
Each demo has sliders to adjust parameters in real time — much easier to build intuition than reading equations alone.
What Is a Signal?
A signal is a quantity that varies over time. For audio, it is air pressure fluctuations captured by the microphone. The MEMS mic converts these tiny pressure changes into a voltage, and the I2S ADC converts that voltage into a stream of digital numbers — one number per sample.
At 48 kHz sample rate, you get 48,000 numbers per second. Each number represents the air pressure at that instant. This stream of numbers is the time-domain representation of the signal.
Sampling and the Nyquist Limit
The sampling theorem says: to faithfully capture a signal, you must sample at least twice the highest frequency present.
At 48 kHz, you can capture frequencies up to 24 kHz — well above human hearing (~20 kHz). At 16 kHz, the limit is 8 kHz — enough for speech but not music.
If a frequency above Nyquist is present in the signal, it appears as a lower frequency in the captured data — this artifact is called aliasing. The MEMS mic's internal anti-aliasing filter prevents this.

A 440 Hz sine wave sampled at 4 rates. At 2000 Hz and 1000 Hz (above the 880 Hz Nyquist rate), the red reconstruction matches the blue original. At 700 Hz (below the Nyquist rate), the reconstructed signal has the wrong frequency — aliasing. At 440 Hz (exactly 1× the signal frequency), only DC is captured.
Run locally: python scripts/signal-processing-demo/sampling_aliasing.py or -i for interactive slider
Time Domain vs. Frequency Domain
The waveform shows the signal in the time domain: amplitude on the y-axis, time on the x-axis. You can see:
- Loud vs. quiet (amplitude)
- Fast vs. slow oscillations (pitch)
- Transients (clicks, claps)
But you cannot easily see which frequencies are present — a complex sound (speech, music) looks like a messy wave. This is where the frequency domain helps.
The FFT (Fast Fourier Transform) converts a block of time-domain samples into a set of frequency bins, each telling you how much energy is at that frequency. The result is the spectrum — the bar chart in the app.
How the FFT Works (Intuitively)
The FFT answers: "how much of each sine wave frequency is present in this block?"
Given N samples, the FFT produces N/2 + 1 frequency bins:
| Bin | Frequency | What it represents |
|---|---|---|
| 0 | 0 Hz (DC) | Average value of the block |
| 1 | fs/N Hz | Lowest frequency detectable |
| 2 | 2·fs/N Hz | Twice the bin spacing |
| ... | ... | ... |
| N/2 | fs/2 Hz (Nyquist) | Highest frequency detectable |
Each bin is a complex number (real + imaginary part). The magnitude √(re² + im²) tells you how loud that frequency is. The phase tells you the timing — GCC-PHAT uses this for direction detection.
Why We Use a Window Function
The FFT assumes the input block repeats forever. If the block doesn't start and end at zero, the FFT sees a discontinuity at the block boundary, which spreads energy across all bins — this is spectral leakage.
The Hann window tapers the signal smoothly to zero at both ends:
w[n] = 0.5 · (1 − cos(2πn / (N−1)))
This eliminates the discontinuity at the cost of slightly reduced frequency resolution. Other windows (Hamming, Blackman, Kaiser) offer different trade-offs between resolution and leakage.

Top: window shapes. Middle: spectrum of 440 Hz + 466 Hz (-20 dB). The rectangular window's leakage completely hides the weak 466 Hz tone. Hann and Blackman reveal it clearly. Bottom: zoomed view showing each window's main lobe width vs side lobe level.
Run locally: python scripts/signal-processing-demo/fft_windowing.py or -i for interactive window switching
Decibels (dB)
The spectrum is displayed in decibels because human hearing is logarithmic — we perceive loudness on a ratio scale, not a linear scale.
| dB value | Meaning |
|---|---|
| 0 dB | Full-scale signal (clipping) |
| -20 dB | 10× quieter than full scale |
| -40 dB | 100× quieter |
| -60 dB | 1000× quieter (quiet room) |
| -80 dB | Noise floor |
A change of 6 dB doubles (or halves) the signal amplitude; an increase of roughly 10 dB is perceived as a doubling of loudness.
Filtering
A filter selectively passes or removes certain frequencies from a signal.
The three basic filter types:
| Type | Passes | Blocks | Use in this app |
|---|---|---|---|
| High-pass | Above cutoff | Below cutoff | Remove DC offset and rumble (80 Hz) |
| Low-pass | Below cutoff | Above cutoff | Isolate speech band |
| Band-pass | Between two cutoffs | Outside the band | Focus on a target frequency range |
The app uses a 1st-order IIR (Infinite Impulse Response) high-pass filter — a simple recursive formula that needs only the previous input and output sample:
y[n] = α · (y[n-1] + x[n] - x[n-1])
Higher α → lower cutoff frequency. This is the simplest filter that works in real time with zero latency.
From Single Signal to Two-Microphone Array
With one microphone, you know what sound is present (spectrum) but not where it comes from. Adding a second microphone enables spatial information:
If sound arrives from the left, it reaches the left mic first. The time difference of arrival (TDOA) between the two channels encodes the direction. Section 8 explains how to extract this.
8. Understanding the Display
Reading the Spectrum
The x-axis of the spectrum represents frequency. For a 48 kHz sample rate and 1024-point FFT:
- Each bin spans 48000 / 1024 ≈ 46.9 Hz
- Bin 0 = DC (0 Hz)
- Bin 512 = Nyquist (24 kHz)
- Human speech fundamental: bins ~2–8 (100–400 Hz)
- Clap/snap transient: broad peak across many bins
Reading the Spectrogram
- x-axis: time (scrolling left)
- y-axis: frequency (0 Hz at bottom, Nyquist at top)
- colour: magnitude (black = silence, blue = quiet, yellow = loud, white = very loud)
A sustained tone appears as a horizontal line. A clap appears as a vertical bright column (energy at all frequencies simultaneously). Speech shows a characteristic harmonic ladder pattern.
Frequency Resolution vs. Time Resolution
There is a fundamental trade-off:
| FFT size | Freq. resolution | Time resolution | Use case |
|---|---|---|---|
| 256 | ~188 Hz | ~5.3 ms | Fast transients |
| 512 | ~94 Hz | ~10.7 ms | General purpose |
| 1024 | ~47 Hz | ~21.3 ms | Default — good balance |
| 2048 | ~23 Hz | ~42.7 ms | Fine frequency detail |
| 4096 | ~12 Hz | ~85.3 ms | Music analysis |
Use the -n flag to experiment: -n 256 for snappy response, -n 4096 for detailed frequency analysis.
9. Direction Detection (Stereo Mode)
Warning
Direction detection requires two synchronized microphones on the same I2S data line — one set to L channel (L/R → GND), one to R channel (L/R → VDD). Two separate I2S interfaces will not work reliably because their clocks drift.
Stereo Hardware Setup
Wire both INMP441 mics to the same BCLK/WS/SD lines. The only difference is the L/R pin:
| Pin | Mic 1 (Left) | Mic 2 (Right) |
|---|---|---|
| SCK | GPIO 18 | GPIO 18 |
| WS | GPIO 19 | GPIO 19 |
| SD | GPIO 20 | GPIO 20 |
| L/R | GND | VDD (3.3V) |
Both mics share the data line but transmit in different time slots (left slot vs. right slot). ALSA delivers them as interleaved stereo frames.
Mount the mics with a known spacing — measure the distance between the two mic holes. The default is 6 cm (-m 0.06).
The Physics: How Direction Creates a Time Delay
When a sound source is directly in front of both mics (broadside), the sound wave arrives at both mics simultaneously — zero delay.
When the source is to the left, sound reaches the left mic first. The extra distance the wave must travel to reach the right mic creates a time difference of arrival (TDOA):
Sound source (to the left)
↓
/ \
/ \
/ \
Mic L ---- Mic R
←── d ──→
Extra path length = d · sin(θ)
Time delay τ = d · sin(θ) / c (c = 343 m/s speed of sound)
For 6 cm spacing:
- Source at 90° (hard left): τ = 0.06 / 343 = 175 μs = 8.4 samples at 48 kHz
- Source at 45°: τ = 0.06 · sin(45°) / 343 = 124 μs = 5.9 samples
- Source at 0° (broadside/center): τ = 0 μs
The maximum delay is only ~8 samples — this is why synchronized clocks matter. Even 1 sample of drift shifts the angle estimate by roughly 7° near broadside.
Cross-Correlation: Finding the Delay
The simplest way to find the TDOA is cross-correlation — slide one signal past the other and measure how well they match at each offset:
R₁₂[k] = Σₙ x₁[n] · x₂[n + k]
The lag k that produces the highest R₁₂[k] is the estimated delay. This works, but it produces broad peaks that are hard to pinpoint, especially in reverberant rooms.
GCC-PHAT: A Better Cross-Correlation
Generalized Cross-Correlation with Phase Transform (GCC-PHAT) sharpens the peak dramatically:
1. X₁(f) = FFT(x₁) — frequency domain of mic 1
2. X₂(f) = FFT(x₂) — frequency domain of mic 2
3. G₁₂(f) = X₁(f) · X₂*(f) — cross-power spectrum
4. Ĝ₁₂(f) = G₁₂(f) / |G₁₂(f)| — PHAT: normalize magnitude to 1
5. R₁₂[k] = IFFT(Ĝ₁₂) — inverse FFT → correlation
6. τ = argmax(R₁₂) — peak = delay in samples
Step 4 is the key: by dividing out the magnitude, only the phase difference between channels remains. Phase encodes timing; magnitude encodes loudness. By keeping only phase, the result has a sharp spike at the true delay regardless of the signal's spectral shape.
In C, this is:
/* Cross-power with PHAT weighting */
float cre = re1 * re2 + im1 * im2; /* real part of X1 · conj(X2) */
float cim = im1 * re2 - re1 * im2; /* imaginary part */
float mag = sqrtf(cre*cre + cim*cim) + 1e-10f;
cross[i][0] = cre / mag; /* normalize to unit magnitude */
cross[i][1] = cim / mag;
From Delay to Angle
Once you have the delay in samples:
τ = delay_samples / sample_rate (convert to seconds)
sin(θ) = c · τ / d (c = 343 m/s, d = mic spacing)
θ = arcsin(c · τ / d) (angle from broadside)
The app smooths the delay estimate with an exponential moving average to avoid jitter:
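A minimal sketch of such smoothing — the 0.8/0.2 weights are illustrative assumptions, not the app's actual constants:

```c
#include <math.h>

/* Exponential moving average over successive delay estimates:
   heavier weight on the previous value suppresses frame-to-frame jitter. */
static float smooth_delay(float prev, float measured)
{
    return 0.8f * prev + 0.2f * measured;  /* weights are illustrative */
}
```

Each frame, the latest GCC-PHAT delay nudges the smoothed value a fifth of the way toward it, so outlier frames barely move the indicator.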
Resolution Limits
| Parameter | Value | Effect |
|---|---|---|
| Mic spacing | 6 cm | Max delay = 8.4 samples |
| Sample rate | 48 kHz | 1 sample = 20.8 μs = 7.1 mm |
| Discrete angles | ~17 | Without interpolation |
| Angular resolution | ~6° | At broadside (best case) |
| Ambiguity | Front/back | Cannot distinguish with 2 mics |
Sub-sample interpolation (parabolic fit around the peak) can improve resolution to ~1°. See the Challenges section.
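The parabolic fit is small enough to sketch here (illustrative only — the challenge asks you to integrate it into the GCC-PHAT peak search):

```c
#include <math.h>

/* Three-point parabolic interpolation around a correlation peak at lag k:
   given the values at k-1, k, k+1, the refined peak lies at
   k + 0.5*(y[k-1] - y[k+1]) / (y[k-1] - 2*y[k] + y[k+1]). */
static float parabolic_offset(float ym1, float y0, float yp1)
{
    float denom = ym1 - 2.0f * y0 + yp1;
    if (denom == 0.0f)
        return 0.0f;                /* flat — no refinement possible */
    return 0.5f * (ym1 - yp1) / denom;
}
```

The returned offset is in the range (-0.5, 0.5) samples; adding it to the integer peak lag gives the sub-sample delay estimate.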
The Direction Indicator
In stereo mode, the app shows a circular direction indicator in the top-right corner. The yellow line points toward the estimated sound direction; its length represents the confidence (correlation peak height — higher peak = more certain).
Try these experiments:
- Clap from different sides — the indicator should swing left/right
- Speak steadily — the indicator should track your position
- Two people speaking — the indicator jumps between them (it tracks the loudest source)
- Snap fingers close to one mic — the indicator should point hard left or right
Tip
If the direction seems inverted (left/right swapped), your mic L/R pin assignment is reversed. Either swap the L/R wires or pass a negative mic distance: -m -0.06.
10. Practical Limits of Multi-Mic Direction Finding
Why Not 5 Separate I2S Mics?
The Raspberry Pi's I2S interface supports 2 channels per data line (left/right slot selection). Common I2S MEMS mics like the INMP441 only transmit on one slot.
To use more than 2 microphones, you would need:
- TDM (Time Division Multiplexing) — some codecs support 4–8 channels on one data line, but the Pi's PCM hardware has limited TDM support
- Multiple I2S data lines — Pi 5 can stripe stereo pairs across data lines, but Pi 4 cannot
- USB multichannel audio — a USB mic array (e.g., ReSpeaker) provides 4–8 synchronized channels as a single ALSA device
Connecting 5 separate unsynchronized I2S mics will not give reliable direction detection — even tiny clock drift between interfaces causes the delay estimate to wander over time.
| Setup | Channels | Direction | Notes |
|---|---|---|---|
| 1 mic | Mono | No | Waveform + spectrum only |
| 2 mics, same I2S line | Stereo | Left/right axis | Good for basic DOA |
| USB mic array (4–8 ch) | Multi | 2D angle | Best for real direction finding |
| 5 separate I2S mics | 5 × mono | Unreliable | Clocks drift, TDOA breaks down |
What Works for 2 Mics
With 2 synchronized mics you can:
- Detect left vs. centre vs. right (1D angle on the mic axis)
- Estimate direction within approximately ±90° from broadside
- Track a moving source in real time (clap, speech, footsteps)
You cannot determine elevation or distinguish front from back — that requires 3+ mics in a non-collinear arrangement.
11. Filtering Reference
The app applies a simple high-pass filter. Here are other useful filters you can add:
| Filter | Purpose | Implementation |
|---|---|---|
| High-pass (80 Hz) | Remove DC + rumble | 1-pole IIR (included) |
| Low-pass (8 kHz) | Speech band only | 1-pole IIR: y = α·x + (1-α)·y_prev |
| Band-pass (300–3400 Hz) | Telephone speech band | Chain HP + LP |
| Moving average | Smooth RMS/peak meters | Circular buffer average |
| Exponential smoothing | Smooth spectrum display | y = α·x + (1-α)·y_prev per bin |
| Notch (50/60 Hz) | Remove mains hum | 2nd-order IIR |
| AGC | Auto-gain control | Track RMS, scale to target |
For the 1st-order IIR high-pass used in this app:
/* Cutoff frequency fc, sample rate fs:
* α ≈ 1 / (1 + 2π·fc/fs)
*
* fc = 80 Hz, fs = 48000 → α ≈ 0.9896
 * We use 0.995, which gives a slightly lower cutoff (≈40 Hz).
*/
y[n] = α · (y[n-1] + x[n] - x[n-1])
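The recurrence above, made runnable. The struct and function names are illustrative — the app inlines this logic — but alpha is derived from the cutoff exactly as in the comment:

```c
#include <math.h>

/* 1st-order IIR high-pass with per-stream state. */
typedef struct { float x_prev, y_prev, alpha; } hp1_t;

static void hp1_init(hp1_t *f, float fc, float fs)
{
    /* alpha ≈ 1 / (1 + 2π·fc/fs) */
    f->alpha = 1.0f / (1.0f + 2.0f * 3.14159265f * fc / fs);
    f->x_prev = 0.0f;
    f->y_prev = 0.0f;
}

static float hp1_process(hp1_t *f, float x)
{
    float y = f->alpha * (f->y_prev + x - f->x_prev);
    f->x_prev = x;   /* carry state to the next sample */
    f->y_prev = y;
    return y;
}
```

Feeding a constant (pure DC) input drives the output to zero — exactly the DC-blocking behaviour the app relies on.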
Challenges
Tip
Try extending the visualizer. Each challenge below has a guided solution with theory explanations — try solving it first, then check the hints if stuck.
- Noise gate — only update the direction estimate when the RMS level exceeds a threshold (ignore silence)
- Peak hold — draw a slowly-decaying peak line above the spectrum bars (classic audio meter style)
- Musical note tuner — detect the dominant frequency and display the nearest musical note (A4 = 440 Hz)
- WAV recording — add a key (R) to start/stop recording the raw audio to a WAV file
- Band-pass filter — add a configurable band-pass filter and show the filtered signal alongside the raw signal
- Sub-sample TDOA — implement parabolic interpolation around the GCC-PHAT peak for better angular resolution
- CPU budget analysis — measure how much time each processing stage takes, find the FFT size limit
Guided solutions with theory: Audio Visualizer Challenges
Background reading: Signal Processing Reference — sampling, FFT, filtering, DSP architectures
Pipeline deep-dive: Audio Pipeline Latency — measure and optimize end-to-end latency, understand the latency–reliability tradeoff, connect to real-time systems concepts
Full Pipeline Architecture (audio_viz_full)
The full-featured version (audio_viz_full.c) adds real-time playback with voice effects, an 8-band EQ, and TDOA visualization. Understanding its architecture is a real-time systems case study.
Data Flow
┌────────────────────────────────────────────────────┐
│ CAPTURE THREAD (SCHED_FIFO 60) │
│ │
I2S Mic ─────────▶│ ALSA snd_pcm_readi() ──▶ S32→float ──▶ ×gain │
(INMP441) │ │ │
└──────────────────────────────────────────┼────────┘
│
┌──────────▼──────────┐
│ Capture Ring Buf │
│ (8 slots, mutex) │
└──────────┬──────────┘
│
┌──────────────────────────────────────────────────┼───────────┐
│ RENDER THREAD (main, 60fps vsync) │ │
│ ▼ │
│ ┌─── drain ALL pending blocks (for loop) ───────────────┐ │
│ │ │ │
│ │ ┌─────────────┐ ┌────────────┐ ┌───────────────┐ │ │
│ │ │ HP Filter │─▶│ LP Filter │─▶│ 8-band EQ │ │ │
│ │ │ (115 Hz) │ │ (3 kHz opt)│ │ (biquad peak) │ │ │
│ │ └─────────────┘ └────────────┘ └───────┬───────┘ │ │
│ │ │ │ │
│ │ ┌────────────┐ ┌─────────────────────┤ │ │
│ │ │ Delay/Echo │◀────┘ │ │ │
│ │ │ (optional) │ │ │ │
│ │ └─────┬──────┘ │ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌──────────────────────────┐ ┌────────────────┐ │ │
│ │ │ PLAYBACK COPY │ │ WAV Recording │ │ │
│ │ │ ┌──────────────────────┐ │ │ (if active) │ │ │
│ │ │ │ Noise Gate (opt) │ │ └────────────────┘ │ │
│ │ │ ├──────────────────────┤ │ │ │
│ │ │ │ Voice FX │ │ │ │
│ │ │ │ ├ Chipmunk (1.5x) │ │ │ │
│ │ │ │ ├ Deep (0.6x stretch)│ │ │ │
│ │ │ │ └ Robot (ring mod) │ │ │ │
│ │ │ ├──────────────────────┤ │ │ │
│ │ │ │ Hard clip ±1.0 │ │ │ │
│ │ │ └──────────────────────┘ │ │ │
│ │ └───────────┬──────────────┘ │ │
│ │ ▼ │ │
│ │ ┌───────────────────┐ │ │
│ │ │ Playback Ring Buf │──── condvar signal ──────┐ │ │
│ │ │ (8 slots) │ │ │ │
│ │ └───────────────────┘ │ │ │
│ └─── end drain loop (repeat for each block) ──────┼──┘ │
│ │ │
│ ┌─── LAST block only ─────────────────────────┐ │ │
│ │ FFT → magnitude → spectrum + spectrogram │ │ │
│ │ GCC-PHAT → TDOA → direction indicator │ │ │
│ │ RMS, peak, frequency, note detection │ │ │
│ │ Waveform history ring buffer │ │ │
│ └──────────────────────────┬──────────────────┘ │ │
│ ▼ │ │
│ SDL2 Render │ │
│ + Overlays (EQ, TDOA) │ │
│ + Button bar │ │
└────────────────────────────────────────────────────┼─────┘
│
┌────────────────────────────────────────────────────┼─────┐
│ PLAYBACK THREAD (SCHED_FIFO 55) │ │
│ ▼ │
│ condvar wait ──▶ accumulate ──▶ float→S16 ──▶ ALSA │
│ (cap 256 → writei │
│ play 480) │
└─────────────────────────────────────────────────────────┘
│
▼
Headphone Jack / HDMI
Signal Processing Chain
| Stage | Algorithm | What it does | Latency |
|---|---|---|---|
| Gain | sample × gain | Amplify quiet I2S mic output (24-bit in 32-bit word) | 0 |
| HP filter | 1-pole IIR, α=0.985 | Remove DC offset + mains hum (< 115 Hz) | 0 (IIR) |
| LP filter | 1-pole IIR | Optional: remove high-frequency noise (> 3 kHz) | 0 (IIR) |
| 8-band EQ | Biquad peaking (Audio EQ Cookbook) | Boost/cut frequency bands (60 Hz – 16 kHz) | 0 (IIR) |
| Delay | Circular buffer with feedback | Echo effect (0–1000 ms, 0.2 feedback) | = delay time |
| FFT | FFTW 1024-point r2c | Frequency spectrum for display | N/A (display only) |
| GCC-PHAT | Cross-power spectrum + IFFT | Time difference of arrival between mics | N/A (display only) |
Voice FX — Why Each Approach Was Chosen
| Effect | Method | Why this works | Why others failed |
|---|---|---|---|
| Chipmunk | History buffer (1s ring), read at 1.5x | Reader is faster than writer → stays close, low latency | Single-block resampling looped every 640 samples → distortion |
| Deep | In-place per-block, read every 0.6th sample | Zero latency, no drift | History buffer at 0.7x → reader drifts behind, 300ms/s delay buildup, words cut off at safety reset |
| Robot | Bitcrusher (decimate ×6) + 80 Hz ring mod | No resampling needed, clean metallic sound | Plain ring mod at 150 Hz was too buzzy |
Latency Budget
Normal mode (-n 1024):
┌────────────────────┬──────────┬─────────────────────────────────────┐
│ Stage │ Latency │ Why │
├────────────────────┼──────────┼─────────────────────────────────────┤
│ ALSA capture period│ 21.3 ms │ Must collect 1024 samples │
│ Ring buffer wait │ 0-16 ms │ Until next render frame (60fps) │
│ DSP processing │ < 1 ms │ Filters + EQ + FFT │
│ Playback ring │ 0-21 ms │ Accumulation for period mismatch │
│ ALSA playback buf │ ~40 ms │ 2-4 periods hardware buffer │
├────────────────────┼──────────┼─────────────────────────────────────┤
│ TOTAL (visual) │ ~25 ms │ Capture → screen │
│ TOTAL (audio FX) │ ~80 ms │ Capture → speaker │
└────────────────────┴──────────┴─────────────────────────────────────┘
Low-latency mode (-l, -n 512):
Capture: 10.7 ms → Total visual: ~15 ms, Total FX: ~50 ms
Lessons Learned — What Went Wrong and Why
Building this playback pipeline required several iterations. Each failure teaches a real-time systems concept:
Lesson 1: Render Loop ≠ Audio Rate
Problem: At 256-sample periods (187 blocks/s), the 60fps render loop only processed 1 block per frame — 127 blocks/s were dropped, starving the playback thread.
Fix: Drain ALL pending capture blocks in a for(;;) loop each frame. Every block goes through filters + EQ + playback. Only the last block feeds FFT/display.
Principle: In a multi-rate system, the consumer must keep up with the fastest producer. Never assume "one item per frame" when the producer runs at a different rate.
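A self-contained sketch of this drain pattern — ring_read/ring_write and RING_SLOTS mirror the tutorial's ring buffer, and the callback stands in for the per-block filter/EQ/playback work (it may be NULL here):

```c
/* Drain every pending block each render frame; return how many were
   handled. The caller keeps only the LAST block for FFT/display. */
#define RING_SLOTS 8

static int drain_pending(int *ring_read, int ring_write,
                         void (*process_block)(int slot))
{
    int n = 0;
    while (*ring_read != ring_write) {       /* consume EVERY pending block */
        if (process_block)
            process_block(*ring_read);
        *ring_read = (*ring_read + 1) % RING_SLOTS;
        n++;
    }
    return n;                                /* blocks handled this frame */
}
```

With a 187 blocks/s producer and a 60 fps consumer, this loop handles ~3 blocks on a typical frame instead of silently dropping two of them.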
Lesson 2: ALSA Period Size Mismatch
Problem: Capture period = 256 samples, playback hardware period = 480. Writing 256 to a 480-period device caused constant underruns.
Fix: Accumulation buffer collects capture blocks until a full playback period is ready. Leftover samples carry over to the next write via memmove.
Principle: Hardware constraints (fixed period sizes) must be bridged in software. The accumulator is a classic rate-adaptation pattern.
Lesson 3: Don't Loop Within a Single Block for Pitch Shift
Problem: Resampling within a 1024-sample block (21ms) at ratio 1.6x wraps every 640 samples — replaying chunks of audio. Crossfading at the wrap helps with clicks but the fundamental distortion remains.
Fix: Circular history buffer (1 second) that accumulates audio continuously. The read pointer moves through the history at ratio speed, reading across block boundaries.
Principle: Pitch shifting needs continuous audio, not a looped snippet. The minimum "window" must be much larger than the pitch period of the voice (~5-10ms fundamental).
Lesson 4: Slower-Than-1x Reading Accumulates Delay
Problem: Deep voice at 0.7x ratio — reader consumes 717 samples per 1024-sample block. 307 samples accumulate per block → 300ms/s of growing delay. Safety reset snaps the reader forward → words cut off mid-sentence.
Fix: Process the current block in-place: read every 0.6th sample with linear interpolation. Zero added latency, no buffer management.
Principle: For ratio < 1.0, the reader is fundamentally slower than the writer. A ring buffer can't solve this — the data rate mismatch is structural. In-place processing avoids it entirely.
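A sketch of the in-place fix under stated assumptions: resample one block at ratio < 1.0 with linear interpolation, reading only the first ratio·n input samples. The scratch copy and 4096-sample cap keep the sketch readable; the app works directly in place:

```c
#include <math.h>
#include <string.h>

/* Per-block pitch drop: out[j] = in[j * ratio], linearly interpolated.
   Samples past ratio*n are discarded — the price of zero latency. */
static void pitch_down_block(float *buf, int n, float ratio)
{
    float in[4096];                    /* assumes n <= 4096 */
    memcpy(in, buf, (size_t)n * sizeof(float));
    for (int j = 0; j < n; j++) {
        float pos = (float)j * ratio;  /* read every ratio-th sample */
        int i = (int)pos;
        float frac = pos - (float)i;
        buf[j] = in[i] * (1.0f - frac) + in[i + 1] * frac;  /* lerp */
    }
}
```

Because ratio < 1.0 guarantees the read index stays below the write index, no ring buffer or drift management is needed — the structural rate mismatch never arises.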
Lesson 5: Display Gain Goes Straight to the Speaker
Problem: Gain 32x amplifies the I2S mic signal for visualization. The same amplified signal went to playback. tanhf() soft clipping on every sample distorted even moderate levels (tanhf(0.5) = 0.46 = 8% THD).
Fix: Keep the gain (I2S mics genuinely need it), use hard clip (only fires on actual overs). The gain is part of the signal path, not just display scaling.
Principle: Understand what each gain stage does. I2S microphones output 24-bit data in a 32-bit word — the signal IS that quiet. The gain compensates for hardware, not display preference.
Lesson 6: Spectral Subtraction Is Hard to Get Right
Problem: FFT-based noise removal with overlap-add caused "pipe" sound — the window math was replacing samples instead of blending, and phase artifacts from the FFT round-trip degraded quality.
Fix: Simple adaptive expander gate (no FFT). Tracks noise floor as minimum RMS, attenuates signal near the floor. Transparent to voice — no coloring.
Principle: Simpler is often better. Spectral subtraction requires careful overlap-add normalization, phase continuity, and proper spectral floor calculation. For a demo, the complexity isn't worth the marginal improvement over a well-tuned gate.
Connection to Real-Time Systems
These exact patterns appear in every real-time embedded system:
| Audio pipeline concept | Motor control equivalent |
|---|---|
| ALSA period = processing deadline | Control loop period |
| Underrun = missed deadline | Actuator jitter / stall |
| Ring buffer = rate decoupling | Sensor FIFO |
| SCHED_FIFO for audio threads | RT priority for control task |
| Period size tradeoff | Sample rate tradeoff |
| Accumulation buffer | Rate converter between sensor and actuator |
See Audio Pipeline Latency for hands-on measurement exercises.
What Just Happened?
Compare this app with the other SDL2 applications you have built:
| | Level Display | Audio Visualizer |
|---|---|---|
| Input | BMI160 IMU (SPI/I2C) | I2S microphone (ALSA) |
| Capture | read() from IIO sysfs | snd_pcm_readi() from ALSA |
| Threading | Sensor thread + render thread | Audio + render + playback threads |
| Data flow | Sensor → atomic float → renderer | Mic → ring buffer → DSP → renderer + playback |
| DSP | Complementary filter (roll/pitch) | HP/LP/EQ filters, FFT, GCC-PHAT, voice FX |
| Display | Artificial horizon (geometry) | Waveform, spectrum, spectrogram, TDOA, EQ overlay |
| Output | Display only | Display + audio playback with effects |
The pattern is the same: a capture thread feeds data to a render thread through a shared buffer. The full version adds a playback thread creating a three-thread pipeline — a common architecture in media applications.