
Sound Detection and Beat Recognition

How do we detect beats when we can only take snapshots of a continuous sound wave?


The Analog-to-Digital Problem

Sound is a continuous wave - pressure oscillations in air. But computers work with discrete numbers.

[Diagram: Sound Waves → (pressure → voltage) → Mic Voltage → (ADC samples) → ADC Numbers → (process) → Your Code. Sound waves and mic voltage belong to the continuous world; ADC numbers and your code belong to the discrete world.]

The Bridge: ADC

The Analog-to-Digital Converter (ADC) is the bridge between worlds. It measures voltage at specific moments and converts it to a number.


How Sampling Works

Taking Snapshots

The ADC doesn't see the whole wave. It takes periodic "snapshots" at fixed intervals.

What happens between samples? We don't know! That information is lost.
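To make the snapshot idea concrete, here is a minimal sketch that samples a pure sine wave at evenly spaced instants. The function name `sample_wave` and the chosen frequencies are illustrative assumptions, not part of any particular ADC library.

```python
import math

def sample_wave(freq_hz, sample_rate_hz, duration_s):
    """Return snapshots of a sine wave taken at evenly spaced instants."""
    count = int(duration_s * sample_rate_hz)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(count)]

# A 2 Hz wave sampled at 8 Hz for one second gives 8 snapshots.
samples = sample_wave(freq_hz=2, sample_rate_hz=8, duration_s=1)
print(len(samples), [round(s, 2) for s in samples])
```

Everything the wave did between those 8 instants is absent from the list, which is exactly the information loss described above.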

Sample Rate

How often we sample matters:

Sample Rate    What We Can Catch
10 Hz          Very slow changes only
100 Hz         Hand claps, drum hits
1000 Hz        Most music beats
44100 Hz       Full audio (CD quality)

For beat detection, 50-100 samples/second is usually enough - beats are slow compared to audio.

Nyquist Theorem (Simplified)

To capture a frequency, you need to sample at more than 2× that frequency. Music beats are typically 1-3 Hz (60-180 BPM), so in principle even 10 samples/second would work. We use more so that each beat is timed more precisely.
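The arithmetic behind that claim can be sketched in a few lines. The helper names `bpm_to_hz` and `min_sample_rate` are made up for this illustration.

```python
def bpm_to_hz(bpm):
    """Convert beats per minute to beats per second (Hz)."""
    return bpm / 60

def min_sample_rate(bpm):
    """Nyquist bound: the sample rate must exceed twice the beat frequency."""
    return 2 * bpm_to_hz(bpm)

for bpm in (60, 120, 180):
    print(bpm, "BPM =", bpm_to_hz(bpm), "Hz; sample faster than",
          min_sample_rate(bpm), "Hz")
```

Even the fastest tempo here needs only a few samples per second by this bound, which is why 50-100 samples/second leaves plenty of margin.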


From Raw Samples to Amplitude

The Silence Problem

A microphone outputs voltage centered at a baseline (typically 1.65V = half of 3.3V):

      Voltage
   3.3V ┤                    ┌─ Loud sound
        │         ╱╲      ╱╲ │
   1.65V┤────────╱──╲────╱──╲│───  ← Baseline (silence)
        │       ╱    ╲  ╱    ╲
     0V ┤──────╱──────╲╱──────└─ Loud sound
        └──────────────────────→ Time

Silence = constant 1.65V = ADC reads ~32768 (the midpoint of the 16-bit range 0-65535)
Sound = voltage swings above AND below baseline
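As a rough check of where ~32768 comes from, the sketch below converts between volts and ADC counts. It assumes a 3.3V reference and a 16-bit reading (0-65535); `V_REF`, `ADC_MAX`, and both function names are assumptions for illustration, and a real board may differ.

```python
V_REF = 3.3       # Assumed ADC reference voltage
ADC_MAX = 65535   # Assumed full-scale 16-bit reading

def volts_to_counts(volts):
    """Ideal ADC reading for a given input voltage."""
    return round(volts / V_REF * ADC_MAX)

def counts_to_volts(counts):
    """Voltage that an ADC reading corresponds to."""
    return counts / ADC_MAX * V_REF

print(volts_to_counts(1.65))    # Baseline (silence), close to mid-scale
print(counts_to_volts(38000))   # A sample above the baseline, in volts
```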

Calculating Amplitude

Amplitude = distance from baseline, regardless of direction:

BASELINE = 32768  # ADC reading for silence (midpoint of 0-65535)

def get_amplitude(raw_sample):
    # Distance from the baseline; the direction of the swing does not matter
    return abs(raw_sample - BASELINE)

Raw:       28000  32768  38000  25000
           (low)  (mid)  (high) (low)
             ↓      ↓      ↓      ↓
Amplitude:  4768     0   5232   7768
           (swing) (none) (swing) (bigger swing)

Smoothing

Single samples are noisy. Average a few for stability:

history = [0, 0, 0, 0, 0]  # Last 5 amplitude values

def smoothed_amplitude(raw):
    amp = abs(raw - BASELINE)

    # Shift history, add new value
    history.pop(0)
    history.append(amp)

    # Return average
    return sum(history) // len(history)
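The thresholds in the next section are expressed on a 0-100 scale, so the averaged amplitude has to be scaled first. A minimal sketch, assuming the largest possible swing is 32768 counts; `MAX_AMPLITUDE` and `scale_level` are hypothetical names, and in practice a smaller maximum may suit a quiet microphone better.

```python
MAX_AMPLITUDE = 32768  # Assumed largest swing away from the baseline

def scale_level(amplitude, max_amplitude=MAX_AMPLITUDE):
    """Map an amplitude (0..max_amplitude) onto an integer 0-100 level."""
    level = amplitude * 100 // max_amplitude
    # Clamp in case a reading falls outside the expected range
    return min(100, max(0, level))

print(scale_level(0), scale_level(16384), scale_level(32768))
```

Integer math is used throughout so the result stays an int, matching the integer average returned by the smoothing function.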

Peak Detection: Finding the Beat

The Challenge

A clap or drum hit creates a brief spike in amplitude. We need to detect it!

Simple Threshold

THRESHOLD = 40  # 0-100 scale

def is_beat(level):
    return level > THRESHOLD

Problem: A single spike triggers multiple "beats" because level stays high across several samples!

Adding Cooldown

Ignore beats that happen too close together:

THRESHOLD = 40
COOLDOWN_MS = 100  # Minimum time between beats

last_beat_time = 0

def is_beat(level, current_time):
    global last_beat_time

    if level > THRESHOLD:
        time_since_last = current_time - last_beat_time

        if time_since_last > COOLDOWN_MS:
            last_beat_time = current_time
            return True

    return False
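To see the cooldown at work, here is a sketch that feeds one simulated clap through a class-based variant of the same logic. The class name `CooldownDetector`, the sample values, and the 20 ms spacing (50 samples/second) are assumptions for illustration. Unlike the version above, it starts with no previous beat (`None`), so a beat in the first 100 ms is not suppressed.

```python
class CooldownDetector:
    def __init__(self, threshold=40, cooldown_ms=100):
        self.threshold = threshold
        self.cooldown_ms = cooldown_ms
        self.last_beat_time = None  # No beat seen yet

    def is_beat(self, level, current_time):
        if level <= self.threshold:
            return False
        if (self.last_beat_time is not None
                and current_time - self.last_beat_time <= self.cooldown_ms):
            return False  # Still cooling down from the previous beat
        self.last_beat_time = current_time
        return True

# One clap that stays loud for three consecutive samples, 20 ms apart
levels = [5, 8, 70, 65, 50, 10, 6, 5]

detector = CooldownDetector()
beat_times = []
for i, level in enumerate(levels):
    now_ms = i * 20
    if detector.is_beat(level, now_ms):
        beat_times.append(now_ms)

naive_count = sum(1 for level in levels if level > 40)
print("with cooldown:", beat_times, "threshold only:", naive_count)
```

The threshold alone counts every loud sample of the clap, while the cooldown version reports the clap once.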

Adaptive Threshold

The Problem with Fixed Threshold

Different music, different volumes:
  • Quiet jazz → level barely reaches 20
  • Loud EDM → level constantly above 60

A fixed threshold of 40 fails for both: it never triggers on the jazz and triggers almost constantly on the EDM.

Solution: Adapt to Environment

Track the average level and trigger when significantly above it:

class AdaptiveDetector:
    def __init__(self, sensitivity=1.5):
        self.avg_level = 20        # Running average (starting guess)
        self.sensitivity = sensitivity  # 1.5 = 50% above average

    def update(self, level):
        # Slow-moving average: 95% old value, 5% new sample
        self.avg_level = 0.95 * self.avg_level + 0.05 * level

    def is_beat(self, level):
        self.update(level)
        # The threshold floats above the current average
        threshold = self.avg_level * self.sensitivity
        return level > threshold

Quiet environment:
    Average: 15  →  Threshold: 22.5  →  Beat at level 25 ✓

Loud environment:
    Average: 50  →  Threshold: 75  →  Beat at level 80 ✓
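The two scenarios can be reproduced with a short simulation. This sketch repeats the detector class so it runs on its own; the helper `settled_detector` and the choice of 200 warm-up samples are assumptions made for the example. Warm-up uses `update()` rather than `is_beat()` because the starting average of 20 would otherwise cause false triggers while the detector is still adapting to a loud room.

```python
class AdaptiveDetector:
    def __init__(self, sensitivity=1.5):
        self.avg_level = 20
        self.sensitivity = sensitivity

    def update(self, level):
        self.avg_level = 0.95 * self.avg_level + 0.05 * level

    def is_beat(self, level):
        self.update(level)
        return level > self.avg_level * self.sensitivity

def settled_detector(background_level, samples=200):
    """Return a detector whose average has adapted to a steady background."""
    detector = AdaptiveDetector()
    for _ in range(samples):
        detector.update(background_level)
    return detector

quiet = settled_detector(15)
loud = settled_detector(50)

print("quiet room, level 25:", quiet.is_beat(25))
print("loud room, level 80:", loud.is_beat(80))
print("loud room, level 60:", loud.is_beat(60))
```

Note that the current sample is folded into the average before the comparison, so the effective threshold sits slightly above 1.5× the background.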

Complete Signal Processing Pipeline

[Diagram: Microphone → (voltage) → 50 samples/sec → (0-65535) → abs(raw - baseline) → (0-32768) → 5-sample average → (averaged) → Scale to 0-100 → (0-100) → level > threshold? → (if above) → 100ms since last? → (if ready) → Beat Detected!]
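The stages of this pipeline can be chained in a single class. What follows is a sketch under stated assumptions, not a definitive implementation: `BeatPipeline`, `MAX_AMPLITUDE`, and the simulated raw values are invented for the example, and a real program would read `raw` from the ADC and `now_ms` from a millisecond clock.

```python
BASELINE = 32768       # ADC reading for silence
MAX_AMPLITUDE = 32768  # Assumed largest swing, used for 0-100 scaling
THRESHOLD = 40         # On the 0-100 scale
COOLDOWN_MS = 100      # Minimum time between beats

class BeatPipeline:
    def __init__(self, window=5):
        self.history = [0] * window  # Recent amplitudes
        self.last_beat_ms = None     # No beat seen yet

    def process(self, raw, now_ms):
        """Run one raw sample through every stage; return True on a beat."""
        amplitude = abs(raw - BASELINE)                     # 0-32768
        self.history.pop(0)
        self.history.append(amplitude)
        smoothed = sum(self.history) // len(self.history)   # Moving average
        level = min(100, smoothed * 100 // MAX_AMPLITUDE)   # 0-100
        if level <= THRESHOLD:
            return False
        if (self.last_beat_ms is not None
                and now_ms - self.last_beat_ms <= COOLDOWN_MS):
            return False
        self.last_beat_ms = now_ms
        return True

# Simulated input at 50 samples/sec: silence, one loud burst, silence
raw_samples = [32768] * 5 + [60000, 5000, 60000, 5000, 60000] + [32768] * 5

pipeline = BeatPipeline()
beats = []
for i, raw in enumerate(raw_samples):
    now_ms = i * 20
    if pipeline.process(raw, now_ms):
        beats.append(now_ms)
print(beats)
```

With this input the burst is reported as a single beat, and only once the 5-sample average has risen above the threshold, which is a reminder that smoothing trades a little delay for stability.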

Timing Considerations

How Fast Is Fast Enough?

Stage            Typical Time   Notes
ADC read         5-10 µs        Very fast
Amplitude calc   1-2 µs         Simple math
Smoothing        2-5 µs         Array operations
Beat detection   1-2 µs         Comparisons
Total            ~15 µs         Can run 60,000+/sec!

Even running at 50 Hz (every 20ms), processing takes <0.1% of available time.
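The percentage follows directly from the numbers in the table. A small worked calculation, taking the ~15 µs total at face value (the variable names are illustrative):

```python
PROCESSING_US = 15    # Approximate total processing time per sample
SAMPLE_RATE_HZ = 50   # One sample every 20 ms

period_us = 1000000 / SAMPLE_RATE_HZ              # Time between samples, in µs
cpu_share_pct = PROCESSING_US / period_us * 100   # Share of each period spent working
max_rate_hz = 1000000 / PROCESSING_US             # Rate if samples ran back to back

print("period:", period_us, "µs")
print("CPU share:", cpu_share_pct, "%")
print("max rate:", int(max_rate_hz), "per second")
```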

Latency

From sound to detection:

Source              Latency          Notes
Sound through air   3 ms per meter   Speed of sound
Microphone          <1 ms            Negligible
ADC sampling        up to 10 ms      One sample period at 100 Hz (5 ms on average)
Processing          <1 ms            Fast calculations
LED response        <1 ms            NeoPixel protocol
Total               ~15-20 ms        At about 1 m; imperceptible to humans

Human perception threshold for audio-visual sync is ~50ms. We're well under!
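The latency budget can be recomputed for other distances and sample rates with a short sketch. It assumes a speed of sound of about 343 m/s, charges a full sample period for sampling (the worst case), and lumps microphone, processing, and LED response into a 3 ms upper bound; `total_latency_ms` and its parameters are hypothetical names.

```python
SPEED_OF_SOUND_M_S = 343  # Approximate, in air at room temperature

def total_latency_ms(distance_m, sample_rate_hz, other_ms=3):
    """Rough worst-case delay from sound source to LED, in milliseconds."""
    air_ms = distance_m / SPEED_OF_SOUND_M_S * 1000
    sampling_ms = 1000 / sample_rate_hz  # Wait of up to one sample period
    return air_ms + sampling_ms + other_ms

print(round(total_latency_ms(1, 100), 1))  # 1 m away, 100 Hz sampling
print(round(total_latency_ms(1, 50), 1))   # 1 m away, 50 Hz sampling
```

This estimate leaves out the extra delay added by the 5-sample average, so treat it as a lower bound on what a smoothed pipeline will show.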


Common Problems and Solutions

Problem: Missing Fast Beats

Symptom: Some beats not detected
Cause: Cooldown too long, or sample rate too low
Solution: Decrease cooldown, increase sample rate

Problem: Double Triggers

Symptom: One beat triggers multiple times
Cause: Cooldown too short
Solution: Increase cooldown to 100-150ms

Problem: False Triggers

Symptom: Triggers on non-beat sounds
Cause: Threshold too low
Solution: Increase threshold, or use adaptive detection

Problem: No Triggers at All

Symptom: Never detects beats
Cause: Threshold too high, or mic not working
Solution: Check raw values first, then lower threshold


Further Reading

Physics Connection: Frequency vs Amplitude
  • Amplitude = How loud (height of wave) — what we detect for beats
  • Frequency = Pitch (waves per second) — would need FFT to detect

Beat detection uses amplitude only, which is much simpler than frequency analysis!

Math Connection: Moving Average

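The smoothing we use is a Simple Moving Average (SMA) over the last k samples (k = 5 above):

$$\bar{x}_n = \frac{1}{k} \sum_{i=0}^{k-1} x_{n-i}$$

More advanced: Exponential Moving Average (EMA) gives more weight to recent samples, with a smoothing factor α between 0 and 1:

$$\bar{x}_n = \alpha \cdot x_n + (1-\alpha) \cdot \bar{x}_{n-1}$$

The running average in AdaptiveDetector is an EMA with α = 0.05.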
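Both averages are short to write out. In this sketch the function names `sma` and `ema_series` are illustrative, and seeding the EMA with the first value is one common choice among several.

```python
def sma(values, k):
    """Simple moving average of the last k values."""
    window = values[-k:]
    return sum(window) / len(window)

def ema_series(values, alpha):
    """Exponential moving average after each value, seeded with the first."""
    avg = values[0]
    out = [avg]
    for x in values[1:]:
        avg = alpha * x + (1 - alpha) * avg
        out.append(avg)
    return out

levels = [10, 10, 10, 60, 10, 10]
print("SMA of last 5:", sma(levels, 5))
print("EMA (alpha=0.5):", ema_series(levels, 0.5))
```

The SMA needs the last k values in memory, whereas the EMA only keeps one running number, which is why it suits a long-running average like the adaptive threshold.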