Audio Visualizer — Guided Challenges

Prerequisites: I2S Audio Visualizer (completed and running)

About These Challenges

Each challenge starts with a brief task description, followed by expandable sections: Theory explains the concept, Approach outlines the algorithm, and Solution provides working code. Try solving the challenge yourself first — expand the hints only when stuck.

See also: Signal Processing Reference for the underlying math.


Challenge 1: Noise Gate for Direction Estimation

Only update the direction indicator when someone is actually making sound — ignore silence.

The direction estimate drifts randomly during silence because GCC-PHAT finds correlations in noise. A noise gate checks the signal level and freezes the direction when it's too quiet.

Theory: Noise Gates

A noise gate is the simplest dynamics processor: if the signal level is below a threshold, the output is suppressed. In audio engineering, this removes background noise between phrases. For direction estimation, we use it to freeze the angle display when there's no meaningful signal.

The key decision is where to set the threshold. Too low and noise still triggers updates. Too high and quiet sounds are ignored. A good starting point is -40 dBFS RMS: well above typical MEMS microphone self-noise (-60 to -50 dBFS), yet low enough to catch normal speech (-30 to -10 dBFS).

Signal flow:
  Audio → RMS calculation → Compare to threshold → Gate open/closed
                                         GCC-PHAT runs only when open

Approach: Calculate RMS of the current audio block. If RMS is above the threshold (e.g., -40 dBFS), run GCC-PHAT and update the angle. If below, skip the GCC-PHAT calculation entirely and keep the last known angle. Add a -T command-line flag for the threshold in dB.


Solution:

In the have_data block, wrap the GCC-PHAT section:

/* Noise gate for direction estimation */
float rms_level = compute_rms(ch1_buf, period_frames);
float rms_db = 20.0f * log10f(rms_level + 1e-10f);
int gate_open = (rms_db > gate_threshold_db);

if (channels == 2 && gate_open) {
    gcc_phat(ch1_buf, ch2_buf, corr, fft_size, ...);
    /* ... angle calculation ... */
}
/* When gate is closed, angle_deg and confidence retain their last values */

Add the CLI flag:

static float gate_threshold_db = -40.0f;

/* In getopt: */
case 'T': gate_threshold_db = atof(optarg); break;

Optional enhancement: add a visual indicator — dim the direction circle when the gate is closed:

/* Before drawing the direction indicator */
if (!gate_open)
    SDL_SetRenderDrawColor(ren, 30, 30, 30, 255); /* dim */


Challenge 2: Peak Hold Spectrum

Draw a slowly-decaying peak line above the spectrum bars — the classic audio meter look.

Theory: Peak Hold and Ballistics

Professional audio meters show two things: the current level (fast-moving bar) and the peak level (a thin line that rises instantly to peaks but falls slowly). This lets you see both the current signal and recent peaks simultaneously.

The peak hold algorithm per frequency bin:

  • If the current value exceeds the stored peak: instantly adopt the new value
  • Otherwise: let the peak decay exponentially, never dropping below the current value
peak[i] = max(current[i], peak[i] * decay_rate)

The decay rate controls how fast peaks fall. At this app's update rate (about 47 blocks per second: one 1024-sample period at 48 kHz every 21.3 ms), decay = 0.995 takes about 3 seconds to close half the gap to the current value; decay = 0.98 takes about 0.7 seconds. Musical audio analyzers typically use 1–2 second hold times.

In professional systems, "peak hold" and "peak decay" are separate: the peak holds at its maximum for a fixed time (e.g., 2 seconds), then drops at a fixed rate. The simpler exponential decay is fine for a visualizer.


Approach: Allocate a peak_db array alongside mag_db. After computing the FFT magnitude, update each bin: adopt mag_db[i] instantly if it exceeds peak_db[i]; otherwise decay toward it with peak_db[i] = 0.995f * peak_db[i] + 0.005f * mag_db[i]. (In the dB domain, decay toward the current value; multiplying a negative dB figure by 0.995 would push it up toward 0 dB, not down.) In the spectrum drawing function, after drawing the colored bars, draw a single white pixel at each peak position.


Solution:

Add the peak buffer (allocate alongside mag_db):

float *peak_db = calloc(half_fft, sizeof(float));
/* Initialize to floor */
for (int i = 0; i < half_fft; i++) peak_db[i] = -80.0f;

After computing mag_db in the have_data block:

/* Update peak hold */
for (int i = 0; i < half_fft; i++) {
    if (mag_db[i] > peak_db[i])
        peak_db[i] = mag_db[i];
    else
        peak_db[i] = peak_db[i] * 0.995f + mag_db[i] * 0.005f;
}

After calling draw_spectrum(), draw the peak line:

/* Peak hold line (white) */
SDL_SetRenderDrawColor(ren, 255, 255, 255, 200);
float db_range = db_ceil - db_floor;
for (int i = 0; i < left_w; i++) {
    int bin = i * (half_fft - 1) / left_w;
    float norm = (peak_db[bin] - db_floor) / db_range;
    if (norm < 0) norm = 0;
    if (norm > 1) norm = 1;
    int py = y_left_before + spec_bar_h - (int)(norm * spec_bar_h);
    SDL_RenderDrawPoint(ren, margin + i, py);
    SDL_RenderDrawPoint(ren, margin + i, py + 1);  /* 2px thick */
}

Don't forget to free(peak_db) in cleanup.


Challenge 3: Musical Note Tuner

Detect the dominant frequency and display the nearest musical note name (A4 = 440 Hz).

Theory: Musical Pitch and Equal Temperament

Western music divides each octave into 12 equal steps (semitones). In equal temperament, each semitone is a frequency ratio of 2^(1/12) = 1.05946...

Given a reference pitch A4 = 440 Hz, any frequency can be converted to a semitone number:

semitones_from_A4 = 12 * log2(freq / 440.0)

Round to the nearest integer to find the closest note. The fractional part tells you how many cents sharp or flat the pitch is (1 semitone = 100 cents):

cents_off = (semitones_from_A4 - round(semitones_from_A4)) * 100

The note names cycle: A, A#, B, C, C#, D, D#, E, F, F#, G, G#

Semitones from A4    Note   Frequency
-12                  A3     220.0 Hz
 -9                  C4     261.6 Hz
  0                  A4     440.0 Hz
  3                  C5     523.3 Hz
 12                  A5     880.0 Hz

FFT resolution matters: With 1024 samples at 48 kHz, each bin is 46.9 Hz wide. At 440 Hz, the nearest bins are at 421.9 Hz and 468.8 Hz — that's more than a semitone of uncertainty. For better pitch detection, use a larger FFT (4096+), or interpolate the peak position.


Approach: After finding the dominant frequency (which the app already does), convert to semitones from A4, find the nearest note name, and calculate cents deviation. Display as "A4 +12c" or "C#5 -8c" next to the frequency readout.


Solution:

Add a helper function:

static const char *note_names[] = {
    "A", "A#", "B", "C", "C#", "D",
    "D#", "E", "F", "F#", "G", "G#"
};

static void freq_to_note(float freq, char *buf, int buflen)
{
    if (freq < 20.0f || freq > 20000.0f) {
        snprintf(buf, buflen, "---");
        return;
    }

    /* Semitones from A4 (440 Hz) */
    float semitones = 12.0f * log2f(freq / 440.0f);
    int nearest = (int)roundf(semitones);
    float cents = (semitones - nearest) * 100.0f;

    /* Note name and octave */
    int note_idx = ((nearest % 12) + 12) % 12;
    int octave = 4 + (nearest + 9) / 12;  /* A4 = octave 4 */
    if (nearest + 9 < 0)
        octave = 4 + (nearest + 9 - 11) / 12;

    snprintf(buf, buflen, "%2s%d %+3.0fc",
             note_names[note_idx], octave, cents);
}

In the stats display section:

char note[16];
freq_to_note(dom_freq_avg, note, sizeof(note));
snprintf(stat, sizeof(stat), "Note %s", note);
draw_text(ren, stat, right_x, sy, txt_scale);

For better accuracy with the standard 1024-sample FFT, use parabolic interpolation to refine the peak frequency:

/* Refine peak frequency using parabolic interpolation */
if (peak_bin > 0 && peak_bin < bin_max - 1) {
    float alpha = mag_db[peak_bin - 1];
    float beta  = mag_db[peak_bin];
    float gamma = mag_db[peak_bin + 1];
    float denom = alpha - 2.0f * beta + gamma;
    if (fabsf(denom) > 1e-10f) {  /* guard against a flat neighborhood */
        float p = 0.5f * (alpha - gamma) / denom;
        dom_freq_now = (peak_bin + p) * sample_rate / fft_size;
    }
}


Challenge 4: WAV File Recording

Press R to start/stop recording audio to a WAV file.

Theory: WAV File Format

WAV (RIFF WAVE) is the simplest uncompressed audio format. The file has a 44-byte header followed by raw PCM samples:

Bytes 0-3:   "RIFF"
Bytes 4-7:   File size - 8
Bytes 8-11:  "WAVE"
Bytes 12-15: "fmt "
Bytes 16-19: 16 (chunk size)
Bytes 20-21: 1 (PCM format)
Bytes 22-23: channels
Bytes 24-27: sample rate
Bytes 28-31: byte rate (rate * channels * bits/8)
Bytes 32-33: block align (channels * bits/8)
Bytes 34-35: bits per sample (16)
Bytes 36-39: "data"
Bytes 40-43: data size (filled in when recording stops)
Bytes 44+:   raw PCM samples

The trick is that the file size and data size fields aren't known when recording starts. Write placeholder zeros, then seek back and fill them in when recording stops.

For embedded systems, WAV is ideal: no compression library needed, any audio tool can open it, and you can write samples directly from the capture buffer without conversion.


Approach: Add a FILE *wav_file and int recording flag. On 'R' keypress, open the file and write the 44-byte header with placeholder sizes. In the audio processing loop, write float samples converted to int16_t. On second 'R' press (or quit), seek back to bytes 4 and 40, write the actual sizes, and close the file.


Solution:

static FILE *wav_file = NULL;
static uint32_t wav_data_bytes = 0;

static void wav_start(const char *path, int rate, int ch)
{
    wav_file = fopen(path, "wb");
    if (!wav_file) { perror("fopen wav"); return; }

    /* Write header with placeholder sizes. Use fixed-width copies of
       the int parameters so each fwrite emits exactly the field size. */
    uint16_t bits = 16;
    uint16_t fmt = 1;  /* PCM */
    uint16_t ch16 = (uint16_t)ch;
    uint32_t rate32 = (uint32_t)rate;
    uint32_t byte_rate = rate32 * ch16 * bits / 8;
    uint16_t block_align = ch16 * bits / 8;
    uint32_t zero = 0;

    fwrite("RIFF", 1, 4, wav_file);
    fwrite(&zero, 4, 1, wav_file);         /* file size placeholder */
    fwrite("WAVEfmt ", 1, 8, wav_file);
    uint32_t chunk = 16;
    fwrite(&chunk, 4, 1, wav_file);
    fwrite(&fmt, 2, 1, wav_file);
    fwrite(&ch16, 2, 1, wav_file);
    fwrite(&rate32, 4, 1, wav_file);
    fwrite(&byte_rate, 4, 1, wav_file);
    fwrite(&block_align, 2, 1, wav_file);
    fwrite(&bits, 2, 1, wav_file);
    fwrite("data", 1, 4, wav_file);
    fwrite(&zero, 4, 1, wav_file);         /* data size placeholder */

    wav_data_bytes = 0;
    printf("Recording to %s...\n", path);
}

static void wav_write(const float *buf, int n)
{
    if (!wav_file) return;
    /* Convert float [-1..1] to int16 */
    for (int i = 0; i < n; i++) {
        float s = buf[i];
        if (s > 1.0f) s = 1.0f;
        if (s < -1.0f) s = -1.0f;
        int16_t sample = (int16_t)(s * 32767);
        fwrite(&sample, 2, 1, wav_file);
    }
    wav_data_bytes += n * 2;
}

static void wav_stop(void)
{
    if (!wav_file) return;
    /* Fill in sizes */
    uint32_t file_size = wav_data_bytes + 36;
    fseek(wav_file, 4, SEEK_SET);
    fwrite(&file_size, 4, 1, wav_file);
    fseek(wav_file, 40, SEEK_SET);
    fwrite(&wav_data_bytes, 4, 1, wav_file);
    fclose(wav_file);
    wav_file = NULL;
    printf("Recording stopped (%u bytes)\n", wav_data_bytes);
}

In the event loop, add R key handling:

if (ev.type == SDL_KEYDOWN && ev.key.keysym.sym == SDLK_r) {
    if (wav_file)
        wav_stop();
    else
        wav_start("recording.wav", sample_rate, channels);
}

In the have_data block, after high-pass filtering. Note that WAV stores stereo audio interleaved (L, R, L, R, ...), so for two channels write one frame from each buffer at a time rather than whole channel blocks:

if (wav_file) {
    if (channels == 2) {
        /* WAV requires interleaved frames: L, R, L, R, ... */
        for (int i = 0; i < period_frames; i++) {
            wav_write(&ch1_buf[i], 1);
            wav_write(&ch2_buf[i], 1);
        }
    } else {
        wav_write(ch1_buf, period_frames);
    }
}

In cleanup (before SDL_Quit):

wav_stop();  /* finalize if still recording */


Challenge 5: Band-Pass Filter

Add a configurable band-pass filter and display the filtered signal alongside the raw.

Theory: Band-Pass Filters

A band-pass filter passes frequencies within a range and attenuates everything else. It's the combination of a high-pass (removes low frequencies) and a low-pass (removes high frequencies).

The simplest implementation cascades two 1-pole filters — the same type already used in audio_viz.c for DC removal:

Input → High-pass (f_low) → Low-pass (f_high) → Output

1-pole IIR low-pass filter:

alpha = dt / (RC + dt)      where RC = 1 / (2*pi*f_cutoff)
y[n] = alpha * x[n] + (1 - alpha) * y[n-1]

1-pole IIR high-pass filter (already in the app):

alpha = RC / (RC + dt)
y[n] = alpha * (y[n-1] + x[n] - x[n-1])

For steeper rolloff, cascade multiple stages (each 1-pole = 6 dB/octave). Two stages give 12 dB/octave.

A better approach for precise filtering is the biquad filter (second-order IIR), which gives 12 dB/octave per section with configurable Q (bandwidth). See the Signal Processing Reference for biquad coefficient calculation.


Approach: Add -L (low cutoff) and -H (high cutoff) flags. Apply the existing highpass_1pole at f_low, then add a lowpass_1pole at f_high. Display the filtered waveform in a different color below the raw waveform.


Solution:

Add a low-pass filter function (mirrors the existing high-pass):

static void lowpass_1pole(float *buf, int n, float *y_prev, float alpha)
{
    for (int i = 0; i < n; i++) {
        float y = alpha * buf[i] + (1.0f - alpha) * *y_prev;
        *y_prev = y;
        buf[i] = y;
    }
}

The alpha for low-pass:

/* Compute low-pass alpha from cutoff frequency */
float dt = 1.0f / sample_rate;
float rc_lp = 1.0f / (2.0f * M_PI * lp_cutoff);
float lp_alpha = dt / (rc_lp + dt);

Apply after the high-pass in the have_data block:

if (lp_cutoff > 0) {
    lowpass_1pole(ch1_buf, period_frames, &lp_y1, lp_alpha);
    if (channels == 2)
        lowpass_1pole(ch2_buf, period_frames, &lp_y2, lp_alpha);
}


Challenge 6: Sub-Sample TDOA with Parabolic Interpolation

Improve direction resolution by interpolating the GCC-PHAT peak position between samples.

Theory: Why Sub-Sample Matters

With 6 cm mic spacing at 48 kHz, the maximum delay is about 8.4 samples. Integer-only peak detection therefore distinguishes only about 17 lag values (-8 to +8), so the angle readout moves in coarse steps. Sound from 85 degrees and 90 degrees maps to the same sample delay.

Parabolic interpolation fits a parabola through the peak and its two neighbors to estimate the true peak position between samples:

    *         ← true peak (between samples)
   / \
  /   \
 *     *     ← sampled values (bins k-1, k, k+1)

The fractional offset from the integer peak position is:

p = 0.5 * (R[k-1] - R[k+1]) / (R[k-1] - 2*R[k] + R[k+1])

where R[k] is the correlation value at lag k. The refined lag is then k + p.

This typically improves angular resolution from ~10 degrees to ~2 degrees with the same hardware.


Solution:

In the find_peak_lag function, after finding the integer peak, add interpolation:

static float find_peak_lag_subsample(const float *corr, int n, int max_lag)
{
    /* First find integer peak */
    float best = -1e30f;
    int best_i = 0;
    for (int i = 0; i < max_lag; i++) {
        if (corr[i] > best) { best = corr[i]; best_i = i; }
    }
    for (int i = n - max_lag; i < n; i++) {
        if (corr[i] > best) { best = corr[i]; best_i = i; }
    }

    /* Parabolic interpolation */
    int km1 = (best_i - 1 + n) % n;
    int kp1 = (best_i + 1) % n;
    float denom = corr[km1] - 2*corr[best_i] + corr[kp1];
    float p = 0;
    if (fabsf(denom) > 1e-10f)
        p = 0.5f * (corr[km1] - corr[kp1]) / denom;

    /* Convert to signed lag */
    float lag = (float)best_i + p;
    if (lag > n / 2) lag -= n;
    return lag;
}

Then change the caller to use float lag:

float lag = find_peak_lag_subsample(corr, fft_size, max_lag);
delay_samples = delay_samples * 0.85f + lag * 0.15f;


Challenge 7: Real-Time CPU Budget Analysis

Measure how much CPU time each processing stage takes. Can you double the FFT size? Can you add more filters?

This challenge doesn't add a feature — it builds understanding of real-time constraints.

Theory: The Real-Time Budget

At 48 kHz with a 1024-sample period, you get a new audio block every 21.3 ms. All processing (filter, FFT, GCC-PHAT, rendering) must complete within this window, or you drop audio.

On a Raspberry Pi 4 (Cortex-A72 at 1.5 GHz):

Operation                          Typical time   % of budget
High-pass filter (1024 samples)    ~0.01 ms       0.05%
Hann window (1024 samples)         ~0.01 ms       0.05%
FFT (1024-point, FFTW)             ~0.05 ms       0.2%
Magnitude + dB (513 bins)          ~0.02 ms       0.1%
GCC-PHAT (2x FFT + cross + IFFT)   ~0.2 ms        1%
SDL2 rendering                     ~2-5 ms        10-25%
Total                              ~3-6 ms        15-30%

There's plenty of headroom. But what if you increase the FFT to 16384 points? FFT is O(N log N), so the cost grows by roughly 22x (16x more points, each passing through 14 stages instead of 10), and rendering scales with width. At some point, you'll exceed the budget and hear glitches.

See Signal Processing Reference for architecture decisions when CPU isn't enough.


Approach: Add clock_gettime(CLOCK_MONOTONIC) calls around each processing stage. Print a timing breakdown every 100 frames. Try increasing -n from 1024 to 2048, 4096, 8192, 16384 and observe when audio starts glitching.


Solution:

/* Timing instrumentation */
struct timespec t0, t1, t2, t3;
static float avg_filter_us = 0, avg_fft_us = 0, avg_render_us = 0;
static int timing_count = 0;

clock_gettime(CLOCK_MONOTONIC, &t0);
/* ... filter code ... */
clock_gettime(CLOCK_MONOTONIC, &t1);
/* ... FFT code ... */
clock_gettime(CLOCK_MONOTONIC, &t2);
/* ... render code ... */
clock_gettime(CLOCK_MONOTONIC, &t3);

#define TDIFF_US(a,b) (((b).tv_sec-(a).tv_sec)*1e6 + ((b).tv_nsec-(a).tv_nsec)/1e3)
avg_filter_us = avg_filter_us * 0.99f + TDIFF_US(t0,t1) * 0.01f;
avg_fft_us    = avg_fft_us    * 0.99f + TDIFF_US(t1,t2) * 0.01f;
avg_render_us = avg_render_us * 0.99f + TDIFF_US(t2,t3) * 0.01f;

if (++timing_count % 100 == 0) {
    float budget_us = 1e6f * period_frames / sample_rate;
    printf("Budget: %.0f us | Filter: %.0f | FFT: %.0f | "
           "Render: %.0f | Used: %.1f%%\n",
           budget_us, avg_filter_us, avg_fft_us, avg_render_us,
           (avg_filter_us + avg_fft_us + avg_render_us) / budget_us * 100);
}

Expected results with increasing FFT size:

FFT size   Freq resolution   Budget     CPU used
1024       46.9 Hz           21.3 ms    ~15%
4096       11.7 Hz           85.3 ms    ~5%
8192       5.9 Hz            170.7 ms   ~4%
16384      2.9 Hz            341.3 ms   ~3%

Notice: larger FFT means better frequency resolution but higher latency (the app must wait for more samples to arrive before it can compute).


Back to: I2S Audio Visualizer | Signal Processing Reference