
Signal Processing for Embedded Systems

Goal: Understand the signal processing fundamentals behind audio capture, spectral analysis, and direction estimation -- the theory beneath the DSP pipeline you built in the Audio Visualizer tutorial.

Related Tutorials

For hands-on practice, see: I2S Audio Visualizer | Level Display SDL2 | IIO Buffered Capture

Interactive Simulations (run on host PC)

Visual demos in scripts/signal-processing-demo/ — run with -i for interactive sliders, or without for static PNG output.

cd ~/embedded-linux/scripts/signal-processing-demo
pip3 install numpy matplotlib scipy scikit-learn  # one-time
| Demo | Command | What you'll see |
|---|---|---|
| Sampling & Aliasing | `python sampling_aliasing.py -i` | A 440 Hz sine wave sampled at different rates. Drag the slider below the Nyquist rate (880 Hz) and watch the reconstructed signal become a wrong frequency — that's aliasing. Above Nyquist, reconstruction is perfect. |
| FFT & Windowing | `python fft_windowing.py -i` | Two tones: 440 Hz (strong) + 466 Hz (weak). With a rectangular window, the 440 Hz leakage hides the weak tone. Switch to Hann — the weak tone appears. This is why audio_viz uses Hann windowing. |
| Filter Response | `python filter_response.py -i` | Drag the cutoff frequency and watch the frequency response curve shift. See the time-domain signal change in real time. Compare our 1-pole HP filter (gentle slope) to a 4th-order Butterworth (steep). |
| Mel Spectrogram | `python mel_spectrogram_explorer.py -i` | 6-panel step-by-step: raw waveform → STFT → linear vs mel scale → triangular filterbank → mel spectrogram → normalized CNN input. Adjust FFT size, hop, and mel bands to see how each affects resolution. |
| ML Decision Boundaries | `python ml_decision_boundary.py -i` | Drag the "Samples" slider from 10 to 500 and watch the SVM decision boundary sharpen as more data arrives. Switch to k-NN with k=1 to see extreme overfitting (a jagged boundary that memorizes each point). |

Each script also saves a PNG when run without -i — useful for slides and reports.


1. Sampling and Quantization

Every signal processing chain starts with an analog-to-digital converter (ADC). Understanding what happens at this boundary explains many of the artifacts and limitations you will encounter downstream.

The Sampling Theorem

A continuous signal can be perfectly reconstructed from its samples if and only if the sampling rate is at least twice the highest frequency present in the signal:

f_sample >= 2 * f_max        (Nyquist criterion)

The frequency f_sample / 2 is called the Nyquist frequency. Any signal content above this frequency folds back into the spectrum as aliasing -- a false low-frequency component that cannot be distinguished from real data. This is why every ADC front-end includes an anti-aliasing filter (a low-pass analog filter) before the sampler.

In the audio visualizer, the I2S microphone samples at 44100 Hz, so the highest representable frequency is 22050 Hz -- just above the upper limit of human hearing. The microphone's MEMS element naturally rolls off above ~20 kHz, acting as a built-in anti-aliasing filter.

ADC Architectures

Different ADC types serve different embedded use cases:

| Architecture | Speed | Resolution | Relative Cost | Typical Use |
|---|---|---|---|---|
| SAR (Successive Approximation) | Medium (1-5 MSPS) | 10-18 bits | Low | General-purpose sensor readout, MCU built-in ADCs |
| Delta-Sigma | Low (10-768 kSPS) | 16-32 bits | Medium | Audio (I2S mics), precision measurement, weigh scales |
| Pipeline | High (10-500+ MSPS) | 8-16 bits | High | Video, radar, communications, software-defined radio |
Tip

The INMP441 I2S microphone used in the audio visualizer contains a delta-sigma ADC. Delta-sigma converters trade speed for resolution -- they oversample at a very high rate (often several MHz) and use a digital decimation filter to produce high-resolution samples at the target rate (e.g., 44.1 kHz). This is why they dominate audio applications: 24-bit resolution with excellent noise performance.

Quantization Noise and SNR

An ideal ADC with N bits has a signal-to-noise ratio determined solely by the number of quantization levels:

SNR_ideal = 6.02 * N + 1.76  dB
| Bits | Ideal SNR | Typical Application |
|---|---|---|
| 8 | 49.9 dB | Low-cost sensors, 8-bit MCU ADCs |
| 12 | 74.0 dB | General-purpose MCU ADCs (RP2040, STM32) |
| 16 | 98.1 dB | CD audio, I2S MEMS microphones |
| 24 | 146.2 dB | Studio audio, precision instrumentation |

Real ADCs never achieve ideal SNR. The Effective Number of Bits (ENOB) measures how many bits of actual resolution you get after accounting for noise, distortion, and nonlinearity:

ENOB = (SINAD - 1.76) / 6.02

where SINAD is the signal-to-noise-and-distortion ratio. A "16-bit" ADC might have an ENOB of 13-14 in practice.
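Both relationships are one-liners; a small sketch (function names are illustrative):

```c
/* Ideal SNR of an N-bit quantizer driven by a full-scale sine. */
float ideal_snr_db(int bits) {
    return 6.02f * (float)bits + 1.76f;
}

/* Effective number of bits recovered from a measured SINAD (dB). */
float enob_bits(float sinad_db) {
    return (sinad_db - 1.76f) / 6.02f;
}
```

Feeding the ideal SNR back through enob_bits() recovers the bit count, and a measured SINAD of ~80 dB corresponds to an ENOB of about 13 -- the kind of gap between datasheet bits and effective bits described above.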

I2S Audio Formats

The I2S bus carries digitized samples from the microphone. The format determines how each sample is packed into the bit stream:

| Format | Bytes/Sample | Range | Notes |
|---|---|---|---|
| S16_LE | 2 | -32768 to +32767 | Most common for MEMS mics, sufficient for visualization |
| S24_LE | 3 (packed) or 4 (in a 32-bit word) | -8388608 to +8388607 | Higher dynamic range, often padded to 32-bit |
| S32_LE | 4 | -2147483648 to +2147483647 | Full 32-bit container, actual precision depends on ADC |

The S prefix means signed, LE means little-endian. In the audio visualizer, ALSA delivers samples in the configured format and your code normalizes them to floating-point before processing.
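Normalization divides by the format's full-scale value so every format lands in [-1.0, 1.0). A sketch (the S24 case assumes right-justified data in a 32-bit word, which depends on the driver configuration):

```c
#include <stdint.h>

/* S16_LE: divide by 2^15. */
float s16_to_float(int16_t s) {
    return (float)s / 32768.0f;
}

/* S32_LE: divide by 2^31. */
float s32_to_float(int32_t s) {
    return (float)s / 2147483648.0f;
}

/* S24_LE padded into a 32-bit word, assumed right-justified:
 * sign-extend bit 23 via a shift pair, then divide by 2^23. */
float s24_to_float(int32_t s) {
    int32_t v = (int32_t)((uint32_t)s << 8) >> 8;
    return (float)v / 8388608.0f;
}
```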


2. Time-Domain Processing

Before computing an FFT, you typically need to condition the raw samples. The audio visualizer applies a high-pass filter and windowing -- here is the theory behind those steps.

DC Removal

MEMS microphones often have a DC offset (a nonzero average value in the sample stream). This offset creates a large spike at bin 0 of the FFT that can dominate the display and reduce dynamic range for actual audio content.

Two approaches:

Subtract the block mean -- simple, effective for block processing:

// Remove DC from a block of samples
float mean = 0.0f;
for (int i = 0; i < N; i++) mean += samples[i];
mean /= N;
for (int i = 0; i < N; i++) samples[i] -= mean;

High-pass filter -- better for streaming, removes DC and near-DC rumble continuously. The audio visualizer uses the 1-pole IIR approach described below.

Moving Average Filter

The simplest low-pass filter. The output is the average of the last M input samples:

y[n] = (1/M) * (x[n] + x[n-1] + ... + x[n-M+1])

In C:

// Causal moving average filter (FIR, M taps)
float moving_average(float *buf, int M) {
    float sum = 0.0f;
    for (int i = 0; i < M; i++) sum += buf[i];
    return sum / (float)M;
}

Frequency response: The moving average has a sinc-shaped frequency response with nulls at multiples of f_sample / M. It is excellent at removing random noise but poor at separating nearby frequencies -- it has very wide transition bands. Use it for smoothing sensor data (accelerometer, temperature), not for audio frequency selection.

FIR vs IIR Filters

| Property | FIR (Finite Impulse Response) | IIR (Infinite Impulse Response) |
|---|---|---|
| Stability | Always stable | Can be unstable if poorly designed |
| Phase | Can be exactly linear phase | Nonlinear phase (unless all-pass) |
| Order for sharp cutoff | High (many taps) | Low (few coefficients) |
| Computation per sample | N multiplies + adds | 2*N multiplies + adds (but N is small) |
| Memory | Stores N past inputs | Stores past inputs and outputs |
| Fixed-point behavior | Well-behaved | Sensitive to coefficient quantization |
| Typical use | Audio mastering, linear-phase EQ | Real-time control, embedded DSP, audio effects |

For embedded systems with tight CPU budgets, IIR filters are usually preferred because you get a sharp frequency response with far fewer operations. The audio visualizer's high-pass filter is IIR.

1-Pole High-Pass and Low-Pass Filters

The simplest useful IIR filters. A single coefficient alpha controls the cutoff:

Low-pass (smoothing):

// 1-pole low-pass: y[n] = alpha * x[n] + (1 - alpha) * y[n-1]
float lp_filter(float input, float *state, float alpha) {
    *state = alpha * input + (1.0f - alpha) * (*state);
    return *state;
}

High-pass (DC removal):

// 1-pole high-pass: derived from low-pass
// y[n] = alpha * (y[n-1] + x[n] - x[n-1])
float hp_filter(float input, float *prev_input, float *prev_output, float alpha) {
    float output = alpha * (*prev_output + input - *prev_input);
    *prev_input = input;
    *prev_output = output;
    return output;
}

The relationship between alpha and the -3 dB cutoff frequency:

For high-pass:  alpha = 1 / (1 + 2*pi * f_cutoff / f_sample)    (approximately)
For low-pass:   alpha = (2*pi * f_cutoff / f_sample) / (1 + 2*pi * f_cutoff / f_sample)
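These mappings are worth wrapping in helpers. A sketch using the approximations above (accurate when f_cutoff << f_sample; names are illustrative):

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* alpha for the 1-pole high-pass at a given -3 dB cutoff. */
float hp_alpha(float f_cutoff, float f_sample) {
    float w = 2.0f * (float)M_PI * f_cutoff / f_sample;
    return 1.0f / (1.0f + w);
}

/* alpha for the 1-pole low-pass at a given -3 dB cutoff. */
float lp_alpha(float f_cutoff, float f_sample) {
    float w = 2.0f * (float)M_PI * f_cutoff / f_sample;
    return w / (1.0f + w);
}
```

hp_alpha(25.0f, 44100.0f) gives ~0.996, the DC-removal value quoted in the tip below.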
Tip

Choosing a cutoff frequency for the audio visualizer:

  • For DC removal: set f_cutoff to 20-30 Hz. This removes the DC offset and subsonic rumble without affecting audible content. At 44100 Hz sample rate, alpha ~= 0.996.
  • For smoothing spectrum magnitudes: a low-pass with f_cutoff around 5-10 Hz (applied per-bin across frames) gives a smooth, readable display without losing transient response.
  • Lower alpha means more smoothing (low-pass) or higher cutoff (high-pass). If the display is too jumpy, decrease alpha on the smoother. If bass frequencies are missing, increase alpha on the high-pass.

3. Frequency-Domain Processing

The FFT converts a block of time-domain samples into frequency-domain coefficients. This is the core of the spectrum display and spectrogram in the audio visualizer.

DFT and FFT Intuition

The Discrete Fourier Transform decomposes a block of N samples into N complex coefficients, each representing the amplitude and phase of a specific frequency:

Bin k corresponds to frequency:  f_k = k * f_sample / N

Bin 0   -> DC (0 Hz)
Bin 1   -> f_sample / N
Bin N/2 -> f_sample / 2  (Nyquist)

The FFT is an algorithm that computes the DFT in O(N log N) operations instead of O(N^2). The most common variant (Cooley-Tukey) requires N to be a power of 2. In the audio visualizer, FFTW3 handles this.
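Converting between bins and frequencies comes up constantly when labeling the display or picking out a tone. A minimal sketch (names are illustrative):

```c
/* Center frequency of bin k for an N-point FFT at sample rate f_s. */
float bin_to_freq(int k, int n, float f_s) {
    return (float)k * f_s / (float)n;
}

/* Nearest bin index for a target frequency. */
int freq_to_bin(float f, int n, float f_s) {
    return (int)(f * (float)n / f_s + 0.5f);
}
```

With N = 1024 at 44100 Hz, bin 1 sits at about 43.1 Hz and a 440 Hz tone lands in bin 10.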

Frequency Resolution

delta_f = f_sample / N
| FFT Size (N) | Sample Rate | Frequency Resolution | Block Duration |
|---|---|---|---|
| 256 | 44100 Hz | 172.3 Hz | 5.8 ms |
| 512 | 44100 Hz | 86.1 Hz | 11.6 ms |
| 1024 | 44100 Hz | 43.1 Hz | 23.2 ms |
| 2048 | 44100 Hz | 21.5 Hz | 46.4 ms |
| 4096 | 44100 Hz | 10.8 Hz | 92.9 ms |

There is a fundamental tradeoff: longer blocks give finer frequency resolution but worse time resolution. If you need to track rapid changes (drum hits, speech syllables), use a smaller FFT. If you need to distinguish closely-spaced tones, use a larger FFT.

Windowing

Without windowing, the FFT treats your block of samples as if it repeats infinitely. Since the block boundaries usually cut through waveform cycles, this creates artificial discontinuities that spread energy across all bins -- a phenomenon called spectral leakage.

A window function tapers the samples to zero at the block edges, reducing leakage at the cost of slightly reduced frequency resolution:

| Window | Main Lobe Width | Sidelobe Level | Best For |
|---|---|---|---|
| Rectangular (none) | Narrowest | -13 dB | Transient analysis, when block = exact period |
| Hann | 1.5x | -31 dB | General audio analysis (used in audio visualizer) |
| Hamming | 1.4x | -43 dB | Speech processing, similar to Hann with less ripple |
| Blackman | 1.7x | -58 dB | When sidelobe suppression matters more than resolution |
| Flat-top | 3.8x | -44 dB | Amplitude calibration (flattest passband) |

The audio visualizer applies a Hann window -- a good default for visualization because it provides a clean spectrum without excessive sidelobe artifacts:

// Pre-compute Hann window coefficients
for (int i = 0; i < N; i++) {
    window[i] = 0.5f * (1.0f - cosf(2.0f * M_PI * i / (N - 1)));
}

// Apply window before FFT
for (int i = 0; i < N; i++) {
    windowed[i] = samples[i] * window[i];
}

Magnitude, Phase, and Power Spectrum

The FFT output is complex. To display it:

// For each bin k (0 to N/2):
float re = fft_out[k][0];     // real part
float im = fft_out[k][1];     // imaginary part

float magnitude = sqrtf(re*re + im*im);       // amplitude spectrum
float phase     = atan2f(im, re);              // phase spectrum (radians)
float power     = re*re + im*im;               // power spectrum (magnitude squared)

The magnitude spectrum is what the audio visualizer displays as bars. The phase spectrum is needed for cross-correlation (GCC-PHAT). The power spectrum is used when you need energy measurements.

The dB Scale

Human hearing spans roughly 120 dB of dynamic range. Linear magnitude values compress most of the interesting detail into a tiny range near zero. The dB scale is logarithmic and matches how we perceive loudness:

magnitude_dB = 20 * log10(magnitude / reference)
power_dB     = 10 * log10(power / reference)

Common reference levels:

| Scale | Reference | Meaning |
|---|---|---|
| dBFS (full scale) | Maximum possible sample value | 0 dBFS = loudest possible digital signal |
| dBSPL (sound pressure) | 20 µPa | 0 dBSPL = threshold of hearing |
| dBV | 1 Vrms | Used for analog audio levels |

In the audio visualizer, dBFS is the natural choice -- normalize the FFT magnitude by N/2 (to account for the FFT scaling), then compute 20 * log10(magnitude_normalized). Silence will be a large negative dB value; a full-scale sine wave will be 0 dBFS.

Short-Time FFT (STFT) and Spectrogram

The spectrogram in the audio visualizer is a sequence of FFT frames plotted over time. Each column represents one FFT block, with frequency on the vertical axis and magnitude mapped to color.

Parameters that control the spectrogram:

  • FFT size (N): sets frequency resolution
  • Hop size: how many new samples between successive FFT frames. Hop = N means no overlap; hop = N/2 means 50% overlap
  • Overlap: more overlap gives smoother time resolution but costs more CPU. 50% overlap with a Hann window is a common choice -- it satisfies the "constant overlap-add" condition, meaning no signal energy is lost
Time resolution of spectrogram = hop_size / f_sample
Frequency resolution            = f_sample / N

4. Cross-Correlation and TDOA

With two microphones, you can estimate the direction of a sound source by measuring the time difference of arrival (TDOA). The audio visualizer implements this using GCC-PHAT.

Auto-Correlation vs Cross-Correlation

Auto-correlation measures how similar a signal is to a delayed copy of itself. It peaks at lag 0 and reveals periodicity (useful for pitch detection).

Cross-correlation measures how similar two different signals are at various relative delays. The lag at which the cross-correlation peaks tells you the time difference between the two signals.

In the time domain, cross-correlation is expensive -- O(N^2) for a block of N samples. The frequency domain makes it O(N log N):

Cross-correlation via FFT:
  R_xy = IFFT( FFT(x) * conj(FFT(y)) )

GCC-PHAT Step by Step

Generalized Cross-Correlation with Phase Transform (GCC-PHAT) is a robust variant that whitens the spectrum before correlation, making it less sensitive to reverberations and coloration:

1. Compute FFT of left channel:   X = FFT(x)
2. Compute FFT of right channel:  Y = FFT(y)
3. Compute cross-power spectrum:  G = X * conj(Y)
4. Normalize (PHAT weighting):    G_phat = G / |G|
5. Inverse FFT:                   gcc = IFFT(G_phat)
6. Find the peak lag:             tau = argmax(gcc)

In C, using FFTW3:

// Step 3-4: cross-power spectrum with PHAT weighting
for (int k = 0; k < N/2 + 1; k++) {
    float xr = fft_left[k][0],  xi = fft_left[k][1];
    float yr = fft_right[k][0], yi = fft_right[k][1];

    // G = X * conj(Y)
    float gr = xr * yr + xi * yi;
    float gi = xi * yr - xr * yi;

    // PHAT: normalize by magnitude
    float mag = sqrtf(gr * gr + gi * gi) + 1e-10f;  // epsilon avoids division by zero
    gcc_freq[k][0] = gr / mag;
    gcc_freq[k][1] = gi / mag;
}

// Step 5: IFFT to get GCC-PHAT in time domain
fftwf_execute(ifft_plan);

// Step 6: find peak in valid lag range
int max_lag = (int)(mic_distance / speed_of_sound * sample_rate);
int best_lag = 0;
float best_val = -1.0f;
for (int lag = -max_lag; lag <= max_lag; lag++) {
    int idx = (lag + N) % N;  // wrap negative lags
    if (gcc_time[idx] > best_val) {
        best_val = gcc_time[idx];
        best_lag = lag;
    }
}

Delay-to-Angle Conversion

For two microphones separated by distance d, the time delay maps to an angle of arrival:

tau = delay_samples / f_sample          (delay in seconds)
theta = arcsin(tau * v_sound / d)       (angle from broadside)

where v_sound is approximately 343 m/s at room temperature.

The maximum detectable delay is d / v_sound -- when the sound arrives from directly along the microphone axis (endfire). Sounds arriving from broadside (perpendicular to the axis) produce zero delay.

Resolution Limits and Sub-Sample Interpolation

The raw GCC-PHAT peak has a resolution of one sample period. At 44100 Hz with microphones 10 cm apart:

Maximum delay = 0.10 / 343 = 291 us = 12.8 samples
Angular resolution ~ arcsin(1 / 12.8) ~ 4.5 degrees per sample

Parabolic interpolation improves this by fitting a parabola through the peak and its two neighbors:

// Parabolic interpolation for sub-sample delay estimate
float y_minus  = gcc_time[(best_lag - 1 + N) % N];
float y_center = gcc_time[(best_lag + N) % N];
float y_plus   = gcc_time[(best_lag + 1 + N) % N];

float delta = 0.5f * (y_minus - y_plus) / (y_minus - 2.0f * y_center + y_plus);
float refined_lag = (float)best_lag + delta;

This typically improves angular resolution by a factor of 5-10x with negligible computational cost.

Tip

Practical improvements for the audio visualizer's direction estimation:

  • Apply a band-pass filter (300 Hz - 4000 Hz) before GCC-PHAT to focus on speech frequencies and reduce noise.
  • Average GCC-PHAT results over 3-5 consecutive frames to reduce jitter in the direction display.
  • Use the peak height as a confidence metric -- a sharp peak means a clear single source; a flat or multi-peaked GCC means diffuse sound or multiple sources.

5. Filtering in Practice

The 1-pole filters from Section 2 are useful but limited. For more selective filtering (band-pass, notch, equalization), the biquad filter is the standard building block.

Biquad Filters (Second-Order IIR)

A biquad implements a second-order transfer function with 5 coefficients:

y[n] = (b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]) / a0

In direct form II transposed (numerically better for floating-point):

typedef struct {
    float b0, b1, b2, a1, a2;  // coefficients (a0 normalized to 1)
    float z1, z2;               // state variables
} biquad_t;

float biquad_process(biquad_t *f, float input) {
    float output = f->b0 * input + f->z1;
    f->z1 = f->b1 * input - f->a1 * output + f->z2;
    f->z2 = f->b2 * input - f->a2 * output;
    return output;
}

By choosing different coefficient formulas, the same biquad structure implements many filter types:

| Filter Type | Use Case |
|---|---|
| Low-pass | Anti-aliasing before downsampling, smoothing |
| High-pass | DC removal, rumble suppression |
| Band-pass | Isolating a frequency range (e.g., voice band for GCC-PHAT) |
| Notch (band-reject) | Removing power-line hum (50/60 Hz) |
| Peaking EQ | Boosting or cutting a specific frequency band |
| All-pass | Phase adjustment without changing magnitude |

Cascading Sections

For sharper filters, cascade multiple biquad sections. A 4th-order Butterworth low-pass is two cascaded biquads; an 8th-order is four. Each section has its own state variables, so they process independently:

// 4th-order Butterworth = 2 cascaded biquads
float output = biquad_process(&stage1, input);
output = biquad_process(&stage2, output);
Tip

Audio equalizers are typically 5-10 cascaded biquads, each tuned to a different center frequency. The total cost is still very low: 5-10 multiply-add operations per sample per stage. On an ARM Cortex-A at 1 GHz, you can run hundreds of biquad stages in real time at 44.1 kHz.

Fixed-Point vs Floating-Point Considerations

| Aspect | Fixed-Point (Q15, Q31) | Floating-Point (float32) |
|---|---|---|
| Precision | Limited, manual scaling needed | ~7 decimal digits, wide dynamic range |
| IIR filter stability | Coefficient quantization can cause instability, limit cycles | Rarely an issue |
| Speed on Cortex-M | Fast (hardware multiply) | Slow without FPU |
| Speed on Cortex-A | No advantage | Fast (hardware FPU + NEON) |
| Development effort | High (overflow management, scaling) | Low |

For the Raspberry Pi (Cortex-A with FPU and NEON), always use floating-point. Fixed-point is relevant when targeting microcontrollers without an FPU (Cortex-M0, M3) or when interfacing with DSP-specific hardware.


6. When CPU Isn't Enough -- Architecture Decisions

This is the central question for embedded systems engineers: can the processor handle the DSP workload in real time, and if not, what alternatives exist?

Processing Platform Comparison

| Platform | Example | Peak DSP Throughput | Latency | Power | Flexibility | Typical Audio Role |
|---|---|---|---|---|---|---|
| CPU (general) | Cortex-A72 (Pi 4) | ~10 GFLOPS | Medium (us) | 3-5 W | Highest | Audio effects, visualization, control logic |
| CPU + SIMD | Cortex-A72 NEON | ~20 GFLOPS | Medium (us) | 3-5 W | High | FFT, filters, matrix operations |
| DSP | TI C6748, SHARC | 1-10 GMACS | Low (ns-us) | 0.5-2 W | Medium | Dedicated audio processing, codecs |
| FPGA | Xilinx Zynq, Intel Cyclone | Configurable | Lowest (ns) | 1-10 W | Low (HDL) | Sample-rate processing, multi-channel, radar |
| GPU | Jetson Nano, RPi VideoCore | 100+ GFLOPS | Higher (ms) | 2-15 W | Medium (CUDA/OpenCL) | Batch FFT, neural audio (denoising, separation) |

When to Use Each

CPU is sufficient when:

  • Channel count is low (1-8 channels)
  • Sample rate is moderate (up to 192 kHz)
  • Processing is block-based (FFT, block filtering)
  • Latency tolerance is above 1 ms

The audio visualizer meets all four conditions, which is why it runs comfortably on a Pi 4 CPU.

DSP is warranted when:

  • Deterministic real-time response is required (hearing aids, active noise cancellation)
  • Power budget is very tight
  • The algorithm is well-defined and unlikely to change

FPGA is warranted when:

  • Sample rates exceed what a CPU can handle (>10 MSPS)
  • Per-sample latency must be under 1 us
  • Many identical channels need parallel processing (phased arrays, beamforming)
  • Custom interfaces are needed (non-standard ADC timing)

GPU is warranted when:

  • Large batch operations dominate (training neural networks, batch spectral analysis)
  • Latency tolerance is 5-50 ms (not hard real-time)
  • The algorithm is highly parallel and data-independent

NEON/SIMD on ARM

The Raspberry Pi 4's Cortex-A72 has 128-bit NEON SIMD units that process 4 floats simultaneously. Many DSP operations map naturally to SIMD:

// Without NEON: scalar multiply-accumulate
for (int i = 0; i < N; i++) {
    output[i] = input[i] * coeff[i];
}

// With NEON intrinsics: 4x throughput
#include <arm_neon.h>
for (int i = 0; i < N; i += 4) {
    float32x4_t in  = vld1q_f32(&input[i]);
    float32x4_t c   = vld1q_f32(&coeff[i]);
    float32x4_t out = vmulq_f32(in, c);
    vst1q_f32(&output[i], out);
}

Libraries like FFTW3 and Ne10 automatically use NEON when available. By using FFTW3 in the audio visualizer, you already benefit from NEON-optimized FFT without writing intrinsics.

Compiler Auto-Vectorization

GCC and Clang can automatically vectorize simple loops when compiled with -O2 -mcpu=cortex-a72 (or -mfpu=neon-vfpv4 on 32-bit). Check the compiler output with -fopt-info-vec (GCC) to see which loops were vectorized. For critical inner loops that the compiler misses, use NEON intrinsics directly.

Real-Time Feasibility Check

Can you process one audio block before the next one arrives?

Available time per block = block_size / sample_rate

Example: 1024-sample block at 44100 Hz
  Available time = 1024 / 44100 = 23.2 ms

FFT of 1024 complex floats on Cortex-A72 (FFTW3, NEON): ~15 us
High-pass filter, 1024 samples: ~5 us
Magnitude computation, 512 bins: ~3 us
Total DSP: ~23 us  (< 1% of available time)

The audio visualizer has enormous headroom. You would need to process roughly 1000x more data (higher sample rates, more channels, larger FFTs) before CPU becomes a bottleneck on the Pi 4.
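The same check is easy to automate in a self-test: measure the processing time for one block and compare it to the block period. A trivial sketch (names are illustrative):

```c
/* Fraction of the real-time budget consumed: below 1.0, the DSP
 * keeps up with the capture rate. */
float dsp_load(int block_size, float sample_rate, float processing_time_s) {
    float available_s = (float)block_size / sample_rate;
    return processing_time_s / available_s;
}
```

Plugging in the numbers above -- a 1024-sample block at 44100 Hz processed in ~23 us -- gives a load of about 0.001, i.e. roughly 0.1% of the budget.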

Computation Budget: Cycles per Sample

A useful way to estimate feasibility:

| Operation | Cycles/Sample (Cortex-A72, scalar) | Cycles/Sample (NEON) |
|---|---|---|
| 1-pole IIR filter | ~5 | ~5 (not vectorizable, data-dependent) |
| Biquad filter | ~10 | ~10 (same reason) |
| FIR filter (64 taps) | ~64 | ~16 |
| FFT (per sample, amortized) | ~10*log2(N) | ~3*log2(N) |
| GCC-PHAT (per sample, amortized) | ~30*log2(N) | ~10*log2(N) |

At 1.5 GHz and 44.1 kHz sample rate, you have ~34000 cycles per sample. A full DSP chain (high-pass + window + FFT + magnitude + GCC-PHAT) uses roughly 500-1000 cycles per sample -- well within budget.


7. Common Embedded DSP Patterns

These patterns appear in any real-time signal processing system, from the audio visualizer to industrial control.

Ping-Pong Buffers (Double Buffering)

The audio capture thread and DSP thread must not access the same buffer simultaneously. Double buffering solves this:

Buffer A: [capture writes here] --swap--> [DSP reads here]
Buffer B: [DSP reads here]      --swap--> [capture writes here]

The audio visualizer uses a ring buffer variant of this pattern. The capture thread writes blocks into the ring buffer; the render thread reads the latest complete block. The ring buffer naturally handles the producer-consumer synchronization without explicit buffer swaps.
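For illustration, here is a minimal single-producer/single-consumer ring buffer of the kind described (a sketch, not the visualizer's actual implementation; real producer and consumer threads would also need atomics or a mutex around the indices):

```c
#include <stddef.h>

/* Minimal SPSC ring buffer of floats. One slot is kept empty so
 * a full buffer is distinguishable from an empty one. */
#define RING_SIZE 4096

typedef struct {
    float data[RING_SIZE];
    size_t head;   /* next write position (producer side) */
    size_t tail;   /* next read position (consumer side) */
} ring_t;

int ring_push(ring_t *r, float v) {
    size_t next = (r->head + 1) % RING_SIZE;
    if (next == r->tail) return 0;        /* full: caller drops the sample */
    r->data[r->head] = v;
    r->head = next;
    return 1;
}

int ring_pop(ring_t *r, float *v) {
    if (r->tail == r->head) return 0;     /* empty */
    *v = r->data[r->tail];
    r->tail = (r->tail + 1) % RING_SIZE;
    return 1;
}
```

The capture thread calls ring_push() per sample or block; the render thread drains with ring_pop() at its own pace.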

Block Processing vs Sample-by-Sample

| Approach | Pros | Cons |
|---|---|---|
| Block (process N samples at once) | Cache-friendly, enables FFT, vectorizable | Adds latency equal to block duration |
| Sample-by-sample | Minimum latency, natural for IIR | Cache-unfriendly, cannot use FFT, harder to vectorize |

The audio visualizer uses block processing (FFT requires it). Real-time audio effects that need sub-millisecond latency (e.g., guitar amp simulation) may process sample-by-sample for IIR filters but still use small blocks (32-64 samples) for efficiency.

Overlap-Add / Overlap-Save

When convolving a long input with a long FIR filter, direct convolution is O(N*M). The overlap-add method breaks the input into blocks, convolves each block with the filter using FFT (which is O(N log N)), and stitches the results together:

1. Pad filter to FFT size:  H = FFT(h, N_fft)
2. For each input block b:
   a. B = FFT(b, N_fft)
   b. Y = B * H                    (element-wise multiply)
   c. y = IFFT(Y)
   d. Overlap-add y into output

This is essential when applying room impulse responses (reverb) or long FIR filters -- operations that would be impractical in the time domain.

Look-Up Tables for Trigonometric Functions

Computing sin() and cos() is expensive on embedded processors. For signals with known frequency grids (FFT twiddle factors, tone generation), pre-compute the values:

// Pre-compute a sine table for N points
float sine_table[N];
for (int i = 0; i < N; i++) {
    sine_table[i] = sinf(2.0f * M_PI * i / N);
}
// cos(x) = sin(x + pi/2), so: cosine = sine_table[(i + N/4) % N]

FFTW3 already uses optimized twiddle factor tables internally. This pattern is more relevant when implementing custom oscillators or NCOs (numerically controlled oscillators) for signal generation.

Fixed-Point Arithmetic (Q15, Q31)

On processors without a floating-point unit, fixed-point arithmetic represents fractional values using integers:

| Format | Integer Type | Range | Resolution | Usage |
|---|---|---|---|---|
| Q15 | int16_t | -1.0 to +0.999969 | 1/32768 | ARM CMSIS-DSP, 16-bit audio |
| Q31 | int32_t | -1.0 to +0.999999999 | 1/2147483648 | High-precision fixed-point DSP |
| Q2.14 | int16_t | -2.0 to +1.999939 | 1/16384 | When signals can exceed unity |

Multiplication in Q15:

// Q15 multiply: result = (a * b) >> 15
int16_t q15_mul(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * (int32_t)b) >> 15);
}
When to Care About Fixed-Point

On the Raspberry Pi (Cortex-A with FPU), fixed-point offers no speed advantage and makes code harder to write and debug. It matters when:

  • Targeting a Cortex-M0/M3 (no FPU) for a companion microcontroller -- see MCU RT Controller
  • Using a DSP processor with fixed-point MAC units
  • Implementing DSP in FPGA fabric where floating-point uses many more LUTs
  • Working with CMSIS-DSP library functions that expect Q15/Q31 input

8. Further Reading

Books

  • "The Scientist and Engineer's Guide to Digital Signal Processing" by Steven W. Smith -- freely available online, excellent intuitive explanations with minimal math prerequisites
  • "Understanding Digital Signal Processing" by Richard G. Lyons -- the standard textbook for practical DSP, covers fixed-point, multirate processing, and implementation details
  • "Digital Signal Processing: A Practical Approach" by Ifeachor and Jervis -- strong on IIR/FIR design and embedded implementation

Libraries and Documentation

  • FFTW3 -- the FFT library used in the audio visualizer; its documentation explains planner flags, wisdom, and threading
  • ARM CMSIS-DSP -- ARM's official DSP library for Cortex-M and Cortex-A; includes optimized FFT, filters, matrix operations, and statistical functions in both fixed-point and floating-point
  • Ne10 -- ARM's open-source NEON-optimized math library; useful when FFTW3 is too heavy
  • liquid-dsp -- a lightweight C library for software-defined radio and real-time DSP on embedded Linux

Course Connections

The signal processing concepts on this page connect to several other course topics:

| Concept | Where It Appears |
|---|---|
| DMA for audio buffers | DMA Fundamentals |
| Real-time scheduling for audio threads | Real-Time Systems |
| IIO subsystem for sensor ADCs | IIO Subsystem |
| Block I/O patterns | Software Architecture |
| Profiling DSP performance | Performance Profiling |
| Device tree for I2S overlays | Device Tree and Drivers |

Back to Reference Index | Audio Visualizer Tutorial