Machine Learning and Signal Processing for Embedded Systems

This reference explains how signal processing and machine learning connect — when to use classical DSP, when ML helps, and how to deploy both on resource-constrained hardware. Examples reference the course's audio and vision tutorials throughout.

Interactive Simulations (run on host PC)

Visual demos in scripts/signal-processing-demo/ — see each concept in action:

cd ~/embedded-linux/scripts/signal-processing-demo
pip3 install numpy matplotlib scipy scikit-learn  # one-time

ML-focused demos:

  • python ml_decision_boundary.py -i — Watch SVM, Random Forest, and k-NN draw decision boundaries on 2D keystroke features. Drag the "Samples" slider to see how data quantity affects the boundary: at 10 samples the boundary is essentially random; at 200 it is stable. Switch to k-NN with k=1 to see overfitting (the decision regions wrap around every training point).

  • python mel_spectrogram_explorer.py -i — Step through spectrogram construction: waveform → STFT frames → mel filterbank → mel spectrogram → normalized CNN input. Adjust FFT size (frequency resolution vs time resolution tradeoff) and mel bands (compression level). Load your own WAV files with --wav recording.wav.

DSP demos (see also Signal Processing Reference):

  • python sampling_aliasing.py -i — Nyquist theorem: watch aliasing appear below 2× signal frequency
  • python fft_windowing.py -i — Why Hann windowing reveals hidden tones that a rectangular window misses
  • python filter_response.py -i — Design HP/LP filters, see frequency response + time domain effect

1. The Fundamental Pattern

Every embedded ML system follows the same pipeline, whether the input is audio, images, or sensor data:

┌──────────┐    ┌──────────────┐    ┌───────────────┐    ┌────────────┐    ┌──────────┐
│ Sensor   │───▶│ Signal       │───▶│ Feature       │───▶│ Classifier │───▶│ Decision │
│          │    │ Conditioning │    │ Extraction    │    │ / Model    │    │          │
│ mic,     │    │ filter,      │    │ spectrogram,  │    │ SVM, CNN,  │    │ "key=a"  │
│ camera,  │    │ normalize,   │    │ histogram,    │    │ threshold  │    │ "ball@   │
│ IMU      │    │ denoise      │    │ edges         │    │            │    │  (x,y)"  │
└──────────┘    └──────────────┘    └───────────────┘    └────────────┘    └──────────┘

The key insight: signal processing and ML are not alternatives — they're stages in the same pipeline. Signal processing prepares the data; ML makes the decision. The boundary between them is where engineering judgment matters most.

Stage                What it does                                      Audio example              Vision example
----------------------------------------------------------------------------------------------------------------
Conditioning         Remove noise, normalize                           HP filter, gain            White balance, denoise
Feature extraction   Convert raw signal to meaningful representation   Mel spectrogram            Color histogram, edges
Classification       Map features to categories                        SVM on spectral features   Threshold on color mask
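The pipeline can be sketched as a chain of small functions. This is a minimal sketch with placeholder implementations, not course code; the per-frame RMS feature stands in for a real spectrogram:

```python
import numpy as np

def condition(signal):
    """Signal conditioning: remove DC offset and normalize amplitude."""
    signal = signal - np.mean(signal)            # crude DC removal
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal

def extract_features(signal, frame_len=256):
    """Feature extraction: per-frame RMS energy (stand-in for a spectrogram)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def classify(features, threshold=0.1):
    """Classifier: the simplest possible decision rule, a threshold."""
    return "event" if np.max(features) > threshold else "silence"

# Sensor -> conditioning -> features -> decision
raw = np.concatenate([np.zeros(1024), 0.5 * np.sin(np.linspace(0, 50, 1024))])
decision = classify(extract_features(condition(raw)))
```

Swapping any one stage (e.g. threshold for SVM, RMS for mel spectrogram) leaves the rest of the pipeline unchanged, which is exactly why the pattern generalizes across sensors.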

Course tutorials that follow this pattern: Acoustic Keystroke Recognition (audio → mel spectrogram → SVM/CNN), Ball Detection (image → color mask → threshold), and the I2S Audio Visualizer (audio → FFT/GCC-PHAT).

2. When to Use What

The most common mistake in embedded ML is reaching for deep learning when classical DSP solves the problem. Here's a decision framework:

                        Is the relationship between
                        input and output KNOWN?
                    ┌─────────┴──────────┐
                    │ YES                │ NO
                    ▼                    ▼
              Classical DSP         Can you define
              (filters, FFT,        useful features
               thresholds)          manually?
                                ┌────────┴────────┐
                                │ YES             │ NO
                                ▼                 ▼
                          Classical ML        Deep Learning
                          (SVM, RF,           (CNN, RNN)
                           k-NN)              learns features
                                              automatically

Classical DSP — Known Relationships

Use when the physics of the problem tells you exactly what to compute:

Problem                   Solution                     Why DSP works
---------------------------------------------------------------------------------
Remove 50 Hz mains hum    Notch filter at 50 Hz        Frequency is known, deterministic
Detect audio onset        Energy threshold             Keystrokes are impulsive, high SNR
Find direction of sound   GCC-PHAT cross-correlation   Physics of wave propagation
Edge detection in image   Sobel/Canny filter           Mathematical definition of edge
Color segmentation        HSV threshold                Color boundaries are definable

Advantages: Deterministic, explainable, no training data needed, runs on any hardware.

Course examples:

  • HP filter for DC removal (Audio Visualizer)
  • GCC-PHAT for TDOA (Audio Visualizer — TDOA section)
  • Color thresholding for ball detection (Ball Detection)
  • Canny edge detection (Camera Pipeline)
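The mains-hum row of the table can be sketched with SciPy (scipy.signal.iirnotch and filtfilt are real APIs; the sample rate and Q value here are illustrative, not course settings):

```python
import numpy as np
from scipy import signal

fs = 1000.0            # sample rate (Hz) -- illustrative
f0, Q = 50.0, 30.0     # notch frequency and quality factor

# Design a narrow stopband centred on 50 Hz
b, a = signal.iirnotch(f0, Q, fs)

# Test signal: 50 Hz hum plus a 200 Hz tone we want to keep
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 200 * t)

# Zero-phase filtering removes the hum and leaves the tone intact
y = signal.filtfilt(b, a, x)
```

No training data, no model: because the hum frequency is known, a fixed filter is the complete solution.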

Classical ML — Known Features, Unknown Decision Boundary

Use when you can define good features but the classification rule is too complex to write by hand:

Problem                  Features                   Classifier      Why ML helps
-----------------------------------------------------------------------------------------------------
Keystroke recognition    Mel spectrogram            SVM             27 classes, subtle spectral differences
Speaker identification   MFCCs                      GMM             Vocal tract shapes vary continuously
Gesture recognition      Accelerometer statistics   Random Forest   Complex motion patterns
Fruit sorting            Color histogram + shape    SVM             Categories overlap in feature space

Advantages: Works with small datasets (20-100 samples), fast inference, interpretable features.

Course example: Acoustic Keystroke — SVM classifier
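The small-dataset workflow can be sketched with scikit-learn. Synthetic 2-D blobs stand in for the real spectral features; the sample counts and parameters are illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# ~30 samples per class: the regime where classical ML shines
X, y = make_blobs(n_samples=90, centers=3, cluster_std=1.0, random_state=0)

# RBF-kernel SVM, evaluated with 5-fold cross-validation
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
scores = cross_val_score(clf, X, y, cv=5)
mean_acc = scores.mean()
```

Cross-validation matters more than usual at this data scale: with under 100 samples, a single train/test split gives a wildly noisy accuracy estimate.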

Deep Learning — Unknown Features, Lots of Data

Use when you can't define good features, or when learned features outperform hand-crafted ones:

Problem                                 Input                 Model         Why DL helps
--------------------------------------------------------------------------------------------------------------
Keystroke recognition (high accuracy)   Raw mel spectrogram   CNN           Learns spectro-temporal patterns humans miss
Object detection                        Raw image             YOLO, SSD     Millions of possible object appearances
Speech-to-text                          Raw audio             Transformer   Language structure is too complex for rules
Anomaly detection                       Sensor time series    Autoencoder   Normal behavior is hard to define

Advantages: Highest accuracy ceiling, learns features automatically, handles complex patterns.
Disadvantages: Needs lots of data, expensive to train, hard to explain, larger models.

Course example: Acoustic Keystroke — CNN


3. Signal Processing as Feature Engineering

The quality of features determines the accuracy ceiling of any classical ML system. Here's how signal processing creates good features for different sensor types:

3.1 Audio Features

Feature              What it captures              Computed from                 Used for
--------------------------------------------------------------------------------------------------------------
Mel spectrogram      Frequency content over time   STFT + mel filterbank         Keystroke ID, speech recognition
MFCCs                Decorrelated spectral shape   DCT of mel spectrogram        Speaker ID, phoneme classification
Spectral centroid    "Brightness" of sound         Weighted mean frequency       Instrument classification
Zero crossing rate   Noisiness vs tonality         Sign changes per frame        Speech vs music detection
Chroma features      Musical pitch class           FFT bins mapped to 12 notes   Music analysis
RMS energy           Loudness over time            Mean squared amplitude        Onset detection, VAD

The mel spectrogram is the most versatile audio feature. It captures both what frequencies are present (spectral shape) and how they evolve (temporal dynamics). The mel scale compresses high frequencies where human perception and most physical phenomena have less detail:

\[\text{mel}(f) = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)\]

See Signal Processing Reference for filter and FFT theory.
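The mel formula above, directly in code (the inverse follows by solving the formula for f):

```python
import math

def hz_to_mel(f):
    """mel(f) = 2595 * log10(1 + f / 700)"""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping: f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The scale is roughly linear below ~1 kHz and logarithmic above:
# doubling the frequency from 4 kHz to 8 kHz adds fewer mels than
# the first 4 kHz did, which is the compression described above.
```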

3.2 Image Features

Feature                                 What it captures            Computed from                   Used for
------------------------------------------------------------------------------------------------------------------------------
Color histogram                         Color distribution          Pixel binning in HSV/RGB        Object detection by color
HOG (Histogram of Oriented Gradients)   Shape/edge structure        Gradient magnitudes in cells    Pedestrian detection, OCR
SIFT/SURF                               Scale-invariant keypoints   Difference of Gaussians         Object matching, panorama stitching
LBP (Local Binary Patterns)             Texture                     Pixel neighborhood comparison   Face recognition, surface inspection
Contour moments                         Shape statistics            Contour pixel coordinates       Object classification
Optical flow                            Motion between frames       Lucas-Kanade or Farneback       Gesture recognition, tracking

Color histograms are the image equivalent of audio spectrograms — they summarize what's present without encoding where. The Ball Detection tutorial uses HSV color space because it separates color (hue) from lighting (value), making detection robust to shadows.
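The shadow-robustness argument can be sketched in NumPy. This is a toy stand-in for the OpenCV thresholding step (cv2.inRange in the real pipeline); the hue band, the floors, and the tiny 2×2 "image" are all illustrative:

```python
import numpy as np

def hue_mask(hsv, hue_lo, hue_hi, sat_min=0.3, val_min=0.2):
    """Boolean mask of pixels whose hue falls in [hue_lo, hue_hi].

    hsv: float array of shape (H, W, 3), hue in [0, 360),
    saturation and value in [0, 1]. The sat/val floors reject grey
    and very dark pixels; because hue barely changes in shadow,
    the mask survives lighting changes that break RGB thresholds.
    """
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    return (h >= hue_lo) & (h <= hue_hi) & (s >= sat_min) & (v >= val_min)

# Toy 2x2 image: bright orange, orange in deep shadow, two blue pixels
img = np.array([[[30.0, 0.9, 0.8], [30.0, 0.9, 0.1]],
                [[220.0, 0.9, 0.8], [220.0, 0.9, 0.8]]])
mask = hue_mask(img, 20, 40)   # orange hue band
```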

3.3 The Feature Engineering → Deep Learning Transition

A key insight: CNNs learn to compute their own features. The first layers of a CNN trained on spectrograms learn filters that resemble mel-scale frequency bands. The first layers of a CNN trained on images learn edge detectors similar to Sobel filters.

Hand-crafted pipeline:            CNN pipeline:
  Audio → Mel spec → SVM           Audio → Mel spec → CNN
  Image → HOG → SVM                Image → Raw pixels → CNN
           ▲                                    ▲
     You design this              Network learns this
     (domain knowledge)           (from data)

This means:

  • With small data, hand-crafted features + classical ML wins (your domain knowledge compensates for limited examples)
  • With large data, CNN wins (it discovers features you wouldn't think of)
  • A hybrid approach often works best: use domain knowledge for preprocessing (mel scale, normalization), and let the CNN learn the rest


4. Common ML Problems in Embedded Systems

4.1 Classification

What: Assign input to one of N categories. Examples: Keystroke → letter, image → object class, vibration → fault type. Models: SVM, Random Forest, CNN.

4.2 Detection

What: Find and locate objects in a stream. Examples: Ball position in image, keystroke onset in audio, face in video frame. Key challenge: Detection combines "is it there?" (classification) with "where?" (localization).

Audio detection is typically onset-based — detect energy spikes, then classify the segment after the onset. See Acoustic Keystroke — Onset Detection.
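The onset-based scheme can be sketched in NumPy. The frame length and the 20 dB threshold are illustrative choices, not the tutorial's exact values:

```python
import numpy as np

def detect_onsets(x, frame_len=256, threshold_db=20.0):
    """Return sample indices where frame energy jumps above the noise floor."""
    n = len(x) // frame_len
    frames = x[:n * frame_len].reshape(n, frame_len)
    energy = np.mean(frames ** 2, axis=1) + 1e-12       # avoid log(0)
    floor = np.median(energy)                           # robust noise estimate
    above = 10 * np.log10(energy / floor) > threshold_db
    # Keep only rising edges: a quiet frame followed by a loud one
    onset_frames = np.flatnonzero(above[1:] & ~above[:-1]) + 1
    return onset_frames * frame_len

# One second of near-silence with an impulsive "keystroke" at sample 4096
fs = 16000
x = 0.001 * np.random.default_rng(0).standard_normal(fs)
x[4096:4096 + 512] += 0.5 * np.random.default_rng(1).standard_normal(512)
onsets = detect_onsets(x)
```

The segment after each returned index is what gets handed to the classifier; the detector itself is pure DSP.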

Image detection uses either:

  • Classical: threshold → contour → centroid (fast, works for simple scenes). See Ball Detection.
  • Deep learning: YOLO, SSD (handles complex scenes, multiple objects, occlusion). Requires a GPU or NPU for real-time.

4.3 Regression

What: Predict a continuous value. Examples: Sound direction angle, ball position coordinates, temperature prediction. Models: Linear regression, neural network with linear output.

The Audio Visualizer's TDOA is effectively regression: cross-correlation → delay → angle. It's computed with DSP (no ML needed) because the physics is well-understood.
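The correlation → delay → angle chain can be sketched with plain cross-correlation (the tutorial's GCC-PHAT adds spectral whitening on top of this; the sample rate, mic spacing, and 7-sample shift below are illustrative):

```python
import numpy as np

def estimate_delay(x_left, x_right, fs):
    """Estimate inter-microphone delay (seconds) via cross-correlation.

    Positive delay means the left channel lags the right one."""
    corr = np.correlate(x_left, x_right, mode="full")
    lag = np.argmax(corr) - (len(x_right) - 1)   # in samples
    return lag / fs

def delay_to_angle(delay, mic_spacing, c=343.0):
    """Map delay to arrival angle via sin(theta) = c * delay / d."""
    s = np.clip(c * delay / mic_spacing, -1.0, 1.0)
    return np.degrees(np.arcsin(s))

fs, d = 48000, 0.1                    # 48 kHz, 10 cm mic spacing
sig = np.random.default_rng(0).standard_normal(4096)
shift = 7                             # left mic hears the wavefront 7 samples late
left, right = sig[:-shift], sig[shift:]
angle = delay_to_angle(estimate_delay(left, right, fs), d)
```

With these numbers the recovered angle is about 30°, straight from the wave-propagation geometry: no training data was involved.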

4.4 Anomaly Detection

What: Detect when something is "unusual" without defining what's normal. Examples: Machine vibration anomaly, unusual network traffic, production quality defect. Models: Autoencoder (learns to reconstruct "normal" — anomalies have high reconstruction error), one-class SVM, isolation forest.

Embedded approach: Train on "normal" data only (easy to collect), deploy on Pi. When reconstruction error exceeds threshold → alert. No need to label fault types.
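The train-on-normal-only pattern, sketched with scikit-learn's isolation forest (the 2-D "vibration features" and the thresholds are synthetic placeholders):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Training set: "normal" machine states only, e.g. (RMS level, peak frequency).
# Easy to collect -- just log the machine running healthily.
normal = rng.normal(loc=[1.0, 120.0], scale=[0.1, 5.0], size=(200, 2))

model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# At runtime: predict() returns +1 for normal, -1 for anomaly
healthy = model.predict([[1.05, 118.0]])[0]
faulty = model.predict([[3.0, 45.0]])[0]
```

Note that no faulty example was ever labeled; the model only learned the shape of "normal" and flags whatever falls outside it.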


5. Deployment on Embedded Hardware

5.1 Inference Frameworks

Framework               Target               Model format     Quantization   Best for
----------------------------------------------------------------------------------------------
scikit-learn (Python)   Pi, any Linux        .pkl             No             SVM, RF, classical ML
ONNX Runtime            Pi, x86, ARM         .onnx            FP16, INT8     PyTorch/sklearn models on Pi
TFLite                  Pi, Android          .tflite          INT8, FP16     TensorFlow/Keras models
TFLite Micro            MCU (ESP32, STM32)   .tflite (INT8)   INT8 only      Tiny models on microcontrollers
OpenCV DNN              Pi, x86              .onnx, .pb       No             Vision models with OpenCV

5.2 Model Size and Latency Budget

Raspberry Pi 4 (1.5 GHz Cortex-A72, 4 cores):

  scikit-learn SVM:     < 1 ms inference, ~2 MB model
  ONNX small CNN:       ~3 ms inference, ~0.5 MB model
  ONNX MobileNetV2:     ~50 ms inference, ~14 MB model
  YOLOv5n (detection):  ~200 ms inference, ~4 MB model

Raspberry Pi Zero 2 W (1 GHz Cortex-A53):
  Everything ~3x slower

ESP32 (240 MHz, 520 KB RAM):
  TFLite Micro INT8:    ~50 ms for tiny model, < 100 KB model
  No Python, no OS

Rule of thumb: if you need < 10 ms inference on a Pi → classical ML or a tiny CNN. If you need real-time video (30 fps) → a Pi 5 with an NPU add-on, or offload to a Coral/Jetson accelerator.
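Whichever model you choose, measure latency on the target device rather than trusting host-PC numbers. A minimal benchmark sketch (the matrix multiply is a stand-in for a real inference call):

```python
import time
import numpy as np

def measure_latency_ms(infer, x, warmup=10, runs=100):
    """Median wall-clock latency of one call to infer(x), in milliseconds."""
    for _ in range(warmup):            # warm caches before timing
        infer(x)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer(x)
        times.append((time.perf_counter() - t0) * 1e3)
    return float(np.median(times))     # median resists scheduler spikes

# Stand-in "model": a dense-layer-sized matrix multiply
W = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
latency_ms = measure_latency_ms(lambda x: x @ W,
                                np.ones((1, 256), dtype=np.float32))
```

Report the median (or a high percentile) rather than the mean: a single OS scheduling hiccup can dominate an averaged figure.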

5.3 Quantization

Reducing model precision from FP32 to INT8 shrinks the model 4x and speeds up inference 2-4x on ARM:

FP32:  each weight = 32 bits → full precision, large model
FP16:  each weight = 16 bits → ~same accuracy, 2x smaller
INT8:  each weight =  8 bits → slight accuracy loss, 4x smaller, 2-4x faster

For our keystroke CNN (32×50 input, 2 conv layers, ~50K parameters):

  • FP32: 200 KB, ~3 ms on Pi 4
  • INT8: 50 KB, ~1 ms on Pi 4

The accuracy loss from INT8 is typically < 1% for well-trained models.
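A minimal sketch of symmetric per-tensor quantization, one common INT8 scheme (real toolchains like TFLite also support per-channel scales and zero points; this shows only the core idea):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric quantization: w ~= scale * q with q in [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=1000).astype(np.float32)   # weight-like values
q, scale = quantize_int8(w)

# Worst-case rounding error is half a quantization step (scale / 2)
err = np.max(np.abs(dequantize(q, scale) - w))
```

The 4x size reduction is exact (8 bits vs 32 per weight); the speedup comes from ARM NEON integer SIMD doing more multiply-accumulates per cycle.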


6. Audio ML vs Vision ML — Parallel Concepts

Students often learn audio and vision ML separately, but the concepts map directly:

Concept                   Audio domain                Vision domain
-----------------------------------------------------------------------------------
Raw signal                Waveform (1D, time)         Image (2D, spatial)
Frequency decomposition   FFT / STFT                  2D FFT / wavelets
Perceptual transform      Mel scale (mimics ear)      Color spaces (HSV, LAB)
Standard feature          Mel spectrogram             HOG, color histogram
Learned feature           CNN on spectrogram          CNN on image
Onset/detection           Energy threshold            Edge/contour detection
Segmentation              Voice activity detection    Image segmentation
Time series               Audio frames → RNN/LSTM     Video frames → 3D CNN
Transfer learning         AudioSet pretrained         ImageNet pretrained
Data augmentation         Time shift, noise, gain     Flip, rotate, crop, color jitter
Noise removal             Spectral subtraction        Gaussian blur, median filter
Real-time constraint      Period budget (21 ms)       Frame budget (33 ms at 30 fps)

The Spectrogram Is an Image

This is perhaps the most important conceptual bridge: a mel spectrogram IS an image. That's why image CNNs work on audio — the 2D convolution operates on frequency × time, which is structurally identical to height × width.

Audio mel spectrogram:          Image:
  ┌─────────────────┐          ┌─────────────────┐
  │ ▓▓░░░░░░░░░░░░  │          │ ░░▓▓▓░░░░░░░░░  │
  │ ▓▓▓░░░░░░░░░░░  │  ←same→  │ ░▓▓▓▓▓░░░░░░░  │
  │ ▓▓▓▓░░░░░░░░░░  │  math    │ ▓▓▓▓▓▓▓░░░░░░  │
  │ ░▓▓▓▓░░░░░░░░░  │          │ ░▓▓▓▓▓░░░░░░░  │
  └─────────────────┘          └─────────────────┘
  freq ↑   time →              y ↑   x →

A 3×3 convolution kernel on a spectrogram detects a pattern spanning 3 frequency bands over 3 time frames. On an image, the same kernel detects a pattern spanning 3 pixels vertically and 3 horizontally.
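The shared math can be shown directly. A minimal sketch of the valid-mode 2-D operation (CNN frameworks actually compute cross-correlation, which is what their "conv" layers implement; the kernel and toy input are illustrative):

```python
import numpy as np

def conv2d_valid(x, k):
    """2-D cross-correlation, valid mode: the core CNN 'conv' operation."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# Vertical-edge kernel: responds where values change along the second axis.
# Whether that axis is "time" (spectrogram) or "x" (image), the math is identical.
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])

# 6x6 freq-x-time array (equally: a y-x image): energy in the right half
spectrogram = np.tile([0., 0., 0., 1., 1., 1.], (6, 1))
response = conv2d_valid(spectrogram, kernel)
```

The response is zero inside the flat regions and large exactly at the step, i.e. at the onset in time (audio) or the edge in x (vision).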


7. Practical Workflow for Embedded ML Projects

1. START WITH DSP
   └─ Can you solve it with filters, thresholds, correlations?
      └─ YES → Done. Ship it. (lowest complexity, most reliable)
      └─ NO → Continue

2. TRY CLASSICAL ML
   └─ Define features from domain knowledge
   └─ Collect 20-50 labeled examples
   └─ Train SVM or RF, evaluate with cross-validation
   └─ Accuracy good enough?
      └─ YES → Done. Export model, deploy.
      └─ NO → Continue

3. COLLECT MORE DATA
   └─ Before adding model complexity, try more data first
   └─ 50 → 200 samples often improves accuracy by 10-15%
   └─ Still not enough?
      └─ Continue

4. TRY CNN / DEEP LEARNING
   └─ Use spectrogram or raw image as input
   └─ Train on host with PyTorch/TensorFlow
   └─ Export to ONNX/TFLite for Pi deployment
   └─ Measure inference latency — fits in real-time budget?
      └─ YES → Deploy
      └─ NO → Quantize (INT8), prune, or use smaller architecture

5. OPTIMIZE FOR DEPLOYMENT
   └─ Quantization: FP32 → INT8 (4x smaller, 2-4x faster)
   └─ Pruning: remove small weights (30-50% smaller)
   └─ Knowledge distillation: train small model to mimic large one
   └─ Hardware acceleration: NPU, Coral TPU, GPU
Warning

The most common failure mode in embedded ML projects is starting at step 4. Students spend weeks training a CNN when a well-tuned threshold would have worked. Always validate the simpler approach first — it's often "good enough" and 100x simpler to deploy and maintain.


Further Reading

Textbooks:

  • Pete Warden & Daniel Situnayake, TinyML (O'Reilly) — the definitive guide to ML on embedded devices
  • Aurélien Géron, Hands-On Machine Learning (O'Reilly) — excellent ML/DL introduction with scikit-learn and TensorFlow

Courses:

  • Andrew Ng, Machine Learning Specialization — foundational ML concepts
  • 3Blue1Brown, Neural Networks — visual intuition for backpropagation
  • Fast.ai, Practical Deep Learning — hands-on CNN training

Deployment:

  • ONNX Runtime — cross-platform inference
  • TFLite — TensorFlow for mobile and embedded
  • Edge Impulse — end-to-end embedded ML platform (free tier)

Course tutorials that apply these concepts:

  • Acoustic Keystroke Recognition — full audio ML pipeline from feature extraction to CNN deployment
  • Ball Detection — classical vision pipeline with OpenCV
  • Camera Pipeline — capture, process, display
  • I2S Audio Visualizer — real-time DSP pipeline
  • Signal Processing Reference — sampling, FFT, filtering foundations
  • Audio Pipeline Latency — real-time constraints for ML inference