Machine Learning and Signal Processing for Embedded Systems
This reference explains how signal processing and machine learning connect — when to use classical DSP, when ML helps, and how to deploy both on resource-constrained hardware. Examples reference the course's audio and vision tutorials throughout.
Interactive Simulations (run on host PC)
Visual demos in scripts/signal-processing-demo/ — see each concept in action:
cd ~/embedded-linux/scripts/signal-processing-demo
pip3 install numpy matplotlib scipy scikit-learn # one-time
ML-focused demos:
- python ml_decision_boundary.py -i — Watch SVM, Random Forest, and k-NN draw decision boundaries on 2D keystroke features. Drag the "Samples" slider to see how data quantity affects the boundary: at 10 samples, the boundary is random; at 200, it's stable. Switch to k-NN with k=1 to see overfitting (the boundary passes through every point).
- python mel_spectrogram_explorer.py -i — Step through spectrogram construction: waveform → STFT frames → mel filterbank → mel spectrogram → normalized CNN input. Adjust FFT size (frequency resolution vs time resolution tradeoff) and mel bands (compression level). Load your own WAV files with --wav recording.wav.
DSP demos (see also Signal Processing Reference):
- python sampling_aliasing.py -i — Nyquist theorem: watch aliasing appear when sampling below 2× the signal frequency
- python fft_windowing.py -i — Why Hann windowing reveals hidden tones that a rectangular window misses
- python filter_response.py -i — Design HP/LP filters, see frequency response + time domain effect
1. The Fundamental Pattern
Every embedded ML system follows the same pipeline, whether the input is audio, images, or sensor data:
┌──────────┐ ┌──────────────┐ ┌───────────────┐ ┌────────────┐ ┌──────────┐
│ Sensor │───▶│ Signal │───▶│ Feature │───▶│ Classifier │───▶│ Decision │
│ │ │ Conditioning │ │ Extraction │ │ / Model │ │ │
│ mic, │ │ filter, │ │ spectrogram, │ │ SVM, CNN, │ │ "key=a" │
│ camera, │ │ normalize, │ │ histogram, │ │ threshold │ │ "ball@ │
│ IMU │ │ denoise │ │ edges │ │ │ │ (x,y)" │
└──────────┘ └──────────────┘ └───────────────┘ └────────────┘ └──────────┘
The key insight: signal processing and ML are not alternatives — they're stages in the same pipeline. Signal processing prepares the data; ML makes the decision. The boundary between them is where engineering judgment matters most.
| Stage | What it does | Audio example | Vision example |
|---|---|---|---|
| Conditioning | Remove noise, normalize | HP filter, gain | White balance, denoise |
| Feature extraction | Convert raw signal to meaningful representation | Mel spectrogram | Color histogram, edges |
| Classification | Map features to categories | SVM on spectral features | Threshold on color mask |
Course tutorials that follow this pattern:
- Acoustic Keystroke Recognition — mic → HP filter → mel spectrogram → SVM/CNN → predicted key
- Ball Detection — camera → color conversion → threshold → contour detection → centroid
- I2S Audio Visualizer — mic → HP filter → FFT → spectrum display + GCC-PHAT → direction
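The four stages can be sketched end to end in a few lines. A toy example on synthetic audio, where the stage functions and the threshold are illustrative stand-ins, not code from any tutorial:

```python
import numpy as np

def condition(signal):
    """Conditioning: remove the DC offset (a crude high-pass)."""
    return signal - signal.mean()

def extract_features(signal, fs):
    """Feature extraction: one-sided FFT magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    return freqs, spectrum

def classify(freqs, spectrum):
    """Classifier: trivial threshold -- strong tone above 500 Hz?"""
    band = spectrum[freqs > 500]
    return "event" if band.max() > 10 * spectrum.mean() else "quiet"

# Simulated sensor: 1 kHz tone plus a DC offset, sampled at 8 kHz
fs = 8000
t = np.arange(fs) / fs
raw = 0.5 + np.sin(2 * np.pi * 1000 * t)

x = condition(raw)
freqs, spec = extract_features(x, fs)
print(classify(freqs, spec))  # -> event
```

Swapping the threshold for an SVM or CNN changes only the classifier stage; the rest of the pipeline is unchanged.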
2. When to Use What
The most common mistake in embedded ML is reaching for deep learning when classical DSP solves the problem. Here's a decision framework:
          Is the relationship between
           input and output KNOWN?
                     │
          ┌──────────┴──────────┐
          │ YES                 │ NO
          ▼                     ▼
    Classical DSP        Can you define
    (filters, FFT,       useful features
    thresholds)          manually?
                               │
                     ┌─────────┴─────────┐
                     │ YES               │ NO
                     ▼                   ▼
              Classical ML         Deep Learning
              (SVM, RF,            (CNN, RNN)
              k-NN)                learns features
                                   automatically
Classical DSP — Known Relationships
Use when the physics of the problem tells you exactly what to compute:
| Problem | Solution | Why DSP works |
|---|---|---|
| Remove 50 Hz mains hum | Notch filter at 50 Hz | Frequency is known, deterministic |
| Detect audio onset | Energy threshold | Keystrokes are impulsive, high SNR |
| Find direction of sound | GCC-PHAT cross-correlation | Physics of wave propagation |
| Edge detection in image | Sobel/Canny filter | Mathematical definition of edge |
| Color segmentation | HSV threshold | Color boundaries are definable |
Advantages: Deterministic, explainable, no training data needed, runs on any hardware.
Course examples:
- HP filter for DC removal (Audio Visualizer)
- GCC-PHAT for TDOA (Audio Visualizer — TDOA section)
- Color thresholding for ball detection (Ball Detection)
- Canny edge detection (Camera Pipeline)
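As a concrete instance of the first table row, a minimal 50 Hz notch filter sketch with scipy. The signal and filter Q here are illustrative, not taken from a course tutorial:

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

fs = 1000  # sample rate, Hz
t = np.arange(fs) / fs

# Signal of interest (5 Hz) buried under 50 Hz mains hum
clean = np.sin(2 * np.pi * 5 * t)
hum = 0.8 * np.sin(2 * np.pi * 50 * t)
x = clean + hum

# Notch centered exactly on 50 Hz; Q controls how narrow it is
b, a = iirnotch(w0=50, Q=30, fs=fs)
y = filtfilt(b, a, x)  # zero-phase filtering (no delay distortion)

# Residual error vs. the clean signal drops dramatically
print(np.std(x - clean), np.std(y - clean))
```

No training data, no tuning loop: the frequency is known, so the filter is fully determined by the physics of the problem.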
Classical ML — Known Features, Unknown Decision Boundary
Use when you can define good features but the classification rule is too complex to write by hand:
| Problem | Features | Classifier | Why ML helps |
|---|---|---|---|
| Keystroke recognition | Mel spectrogram | SVM | 27 classes, subtle spectral differences |
| Speaker identification | MFCCs | GMM | Vocal tract shapes vary continuously |
| Gesture recognition | Accelerometer statistics | Random Forest | Complex motion patterns |
| Fruit sorting | Color histogram + shape | SVM | Categories overlap in feature space |
Advantages: Works with small datasets (20-100 samples), fast inference, interpretable features.
Course example: Acoustic Keystroke — SVM classifier
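A minimal sketch of the classical-ML path: hand-crafted features, a small dataset, an SVM, and cross-validation. The three "spectral features" here are synthetic stand-ins, not the tutorial's real features:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical 2-class problem: 40 samples of 3 features per class
class_a = rng.normal(loc=[1.0, 0.5, 0.0], scale=0.3, size=(40, 3))
class_b = rng.normal(loc=[0.0, 1.0, 0.8], scale=0.3, size=(40, 3))
X = np.vstack([class_a, class_b])
y = np.array([0] * 40 + [1] * 40)

clf = SVC(kernel="rbf", C=1.0)
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
print(scores.mean())
```

With only 80 samples, cross-validation (rather than a single train/test split) is the honest way to estimate accuracy.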
Deep Learning — Unknown Features, Lots of Data
Use when you can't define good features, or when learned features outperform hand-crafted ones:
| Problem | Input | Model | Why DL helps |
|---|---|---|---|
| Keystroke recognition (high accuracy) | Raw mel spectrogram | CNN | Learns spectro-temporal patterns humans miss |
| Object detection | Raw image | YOLO, SSD | Millions of possible object appearances |
| Speech-to-text | Raw audio | Transformer | Language structure is too complex for rules |
| Anomaly detection | Sensor time series | Autoencoder | Normal behavior is hard to define |
Advantages: Highest accuracy ceiling, learns features automatically, handles complex patterns. Disadvantages: Needs lots of data, expensive to train, hard to explain, larger models.
Course example: Acoustic Keystroke — CNN
3. Signal Processing as Feature Engineering
The quality of features determines the accuracy ceiling of any classical ML system. Here's how signal processing creates good features for different sensor types:
3.1 Audio Features
| Feature | What it captures | Computed from | Used for |
|---|---|---|---|
| Mel spectrogram | Frequency content over time | STFT + mel filterbank | Keystroke ID, speech recognition |
| MFCCs | Decorrelated spectral shape | DCT of mel spectrogram | Speaker ID, phoneme classification |
| Spectral centroid | "Brightness" of sound | Weighted mean frequency | Instrument classification |
| Zero crossing rate | Noisiness vs tonality | Sign changes per frame | Speech vs music detection |
| Chroma features | Musical pitch class | FFT bins mapped to 12 notes | Music analysis |
| RMS energy | Loudness over time | Mean squared amplitude | Onset detection, VAD |
The mel spectrogram is the most versatile audio feature. It captures both what frequencies are present (spectral shape) and how they evolve (temporal dynamics). The mel scale compresses high frequencies, where human perception and most physical phenomena have less detail.
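The mel conversion itself is a single formula. A sketch using the common HTK form, mel = 2595·log10(1 + f/700):

```python
import numpy as np

def hz_to_mel(f):
    """HTK mel scale: roughly linear below ~1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps in mel space give narrow bands at low frequencies
# and progressively wider bands at high frequencies
edges_mel = np.linspace(hz_to_mel(0), hz_to_mel(8000), 8)
edges_hz = mel_to_hz(edges_mel)
print(np.round(edges_hz))
```

Placing filterbank edges at equal mel spacing is exactly what produces the "compressed high frequencies" described above.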
See Signal Processing Reference for filter and FFT theory.
3.2 Image Features
| Feature | What it captures | Computed from | Used for |
|---|---|---|---|
| Color histogram | Color distribution | Pixel binning in HSV/RGB | Object detection by color |
| HOG (Histogram of Oriented Gradients) | Shape/edge structure | Gradient magnitudes in cells | Pedestrian detection, OCR |
| SIFT/SURF | Scale-invariant keypoints | Difference of Gaussians | Object matching, panorama stitching |
| LBP (Local Binary Patterns) | Texture | Pixel neighborhood comparison | Face recognition, surface inspection |
| Contour moments | Shape statistics | Contour pixel coordinates | Object classification |
| Optical flow | Motion between frames | Lucas-Kanade or Farneback | Gesture recognition, tracking |
Color histograms are the image equivalent of audio spectrograms — they summarize what's present without encoding where. The Ball Detection tutorial uses HSV color space because it separates color (hue) from lighting (value), making detection robust to shadows.
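A minimal sketch of HSV thresholding on a tiny synthetic frame, written with numpy and the stdlib colorsys module so the mechanics are visible; a real pipeline would use OpenCV's vectorized conversion and masking routines instead:

```python
import colorsys
import numpy as np

def hsv_mask(rgb_image, h_lo, h_hi, s_min=0.5, v_min=0.3):
    """Binary mask of pixels whose hue is in [h_lo, h_hi] (hue in 0..1),
    with minimum saturation/value to reject grey and dark pixels."""
    rows, cols, _ = rgb_image.shape
    mask = np.zeros((rows, cols), dtype=bool)
    for i in range(rows):
        for j in range(cols):
            r, g, b = rgb_image[i, j] / 255.0
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            mask[i, j] = (h_lo <= h <= h_hi) and s >= s_min and v >= v_min
    return mask

# Synthetic frame: grey background with a 2x2 "orange ball"
frame = np.full((6, 6, 3), 128, dtype=np.uint8)
frame[2:4, 2:4] = [255, 140, 0]  # orange pixels

mask = hsv_mask(frame, h_lo=0.02, h_hi=0.15)  # orange hue band
ys, xs = np.nonzero(mask)
print(ys.mean(), xs.mean())  # centroid of the detected blob
```

Note how the saturation floor rejects the grey background regardless of its brightness, which is the robustness-to-lighting argument made above.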
3.3 The Feature Engineering → Deep Learning Transition
A key insight: CNNs learn to compute their own features. The first layers of a CNN trained on spectrograms learn filters that resemble mel-scale frequency bands. The first layers of a CNN trained on images learn edge detectors similar to Sobel filters.
Hand-crafted pipeline:              CNN pipeline:
Audio → Mel spec → SVM              Audio → Mel spec   → CNN
Image → HOG      → SVM              Image → Raw pixels → CNN
        ▲                                   ▲
   You design this                   Network learns this
   (domain knowledge)                (from data)
This means:
- With small data, hand-crafted features + classical ML wins (your domain knowledge compensates for limited examples)
- With large data, the CNN wins (it discovers features you wouldn't think of)
- A hybrid approach often works best: use domain knowledge for preprocessing (mel scale, normalization), let the CNN learn the rest
4. Common ML Problems in Embedded Systems
4.1 Classification
What: Assign input to one of N categories. Examples: Keystroke → letter, image → object class, vibration → fault type. Models: SVM, Random Forest, CNN.
4.2 Detection
What: Find and locate objects in a stream. Examples: Ball position in image, keystroke onset in audio, face in video frame. Key challenge: Detection combines "is it there?" (classification) with "where?" (localization).
Audio detection is typically onset-based — detect energy spikes, then classify the segment after the onset. See Acoustic Keystroke — Onset Detection.
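A sketch of energy-based onset detection on a synthetic click; the frame size and threshold ratio are illustrative choices, not the tutorial's exact parameters:

```python
import numpy as np

def detect_onsets(x, fs, frame_ms=10, threshold_ratio=4.0):
    """Frame-wise RMS energy; flag frames where energy first jumps
    above threshold_ratio times the background (median) level."""
    frame = int(fs * frame_ms / 1000)
    n = len(x) // frame
    energy = np.array([np.sqrt(np.mean(x[i*frame:(i+1)*frame] ** 2))
                       for i in range(n)])
    background = np.median(energy)
    onsets = []
    for i in range(1, n):
        above = energy[i] > threshold_ratio * background
        prev_above = energy[i-1] > threshold_ratio * background
        if above and not prev_above:
            onsets.append(i * frame / fs)  # onset time in seconds
    return onsets

# Synthetic recording: quiet noise with an impulsive click at t = 0.5 s
fs = 8000
rng = np.random.default_rng(1)
x = 0.01 * rng.standard_normal(fs)
x[4000:4080] += 0.5 * rng.standard_normal(80)

print(detect_onsets(x, fs))
```

Using the median as the background estimate keeps the threshold stable even when the clicks themselves inflate the mean energy.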
Image detection uses either:
- Classical: threshold → contour → centroid (fast, works for simple scenes). See Ball Detection.
- Deep learning: YOLO, SSD (handles complex scenes, multiple objects, occlusion). Requires a GPU or NPU for real-time.
4.3 Regression
What: Predict a continuous value. Examples: Sound direction angle, ball position coordinates, temperature prediction. Models: Linear regression, neural network with linear output.
The Audio Visualizer's TDOA is effectively regression: cross-correlation → delay → angle. It's computed with DSP (no ML needed) because the physics is well-understood.
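A sketch of that DSP-only "regression": GCC-PHAT delay estimation on a synthetically delayed signal. The implementation details here are illustrative, not the tutorial's exact code:

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate the delay of sig relative to ref with GCC-PHAT:
    cross-spectrum normalized to unit magnitude (phase only),
    then inverse FFT; the peak location is the delay."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12       # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate((cc[-n//2:], cc[:n//2]))  # center zero lag
    lag = np.argmax(np.abs(cc)) - n // 2
    return lag / fs

fs = 16000
rng = np.random.default_rng(2)
ref = rng.standard_normal(1024)
delay_samples = 7
sig = np.concatenate((np.zeros(delay_samples), ref))[:1024]  # delayed copy

print(gcc_phat_delay(sig, ref, fs))
```

From the delay, the arrival angle follows from microphone spacing and the speed of sound; no model or training data is involved.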
4.4 Anomaly Detection
What: Detect when something is "unusual" without defining what's normal. Examples: Machine vibration anomaly, unusual network traffic, production quality defect. Models: Autoencoder (learns to reconstruct "normal" — anomalies have high reconstruction error), one-class SVM, isolation forest.
Embedded approach: Train on "normal" data only (easy to collect), deploy on Pi. When reconstruction error exceeds threshold → alert. No need to label fault types.
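A sketch of the train-on-normal-only approach using scikit-learn's isolation forest; the three "vibration features" are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)

# Train only on "normal" machine states (easy to collect):
# hypothetical features, e.g. RMS, peak level, dominant frequency
normal = rng.normal(loc=[1.0, 3.0, 3.0], scale=0.1, size=(200, 3))
model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# predict() returns +1 for inliers ("normal") and -1 for outliers
ok = model.predict([[1.0, 3.0, 3.0]])    # near the training center
fault = model.predict([[2.5, 8.0, 3.2]]) # far outside normal behavior
print(ok, fault)
```

No fault labels were needed: anything far from the learned "normal" region triggers the alert.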
5. Deployment on Embedded Hardware
5.1 Inference Frameworks
| Framework | Target | Model format | Quantization | Best for |
|---|---|---|---|---|
| scikit-learn (Python) | Pi, any Linux | .pkl | No | SVM, RF, classical ML |
| ONNX Runtime | Pi, x86, ARM | .onnx | FP16, INT8 | PyTorch/sklearn models on Pi |
| TFLite | Pi, Android | .tflite | INT8, FP16 | TensorFlow/Keras models |
| TFLite Micro | MCU (ESP32, STM32) | .tflite (INT8) | INT8 only | Tiny models on microcontrollers |
| OpenCV DNN | Pi, x86 | .onnx, .pb | No | Vision models with OpenCV |
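For the scikit-learn row, the .pkl path is plain Python pickling. A sketch of the train-on-host, load-on-Pi round trip, using an in-memory buffer in place of a file on disk:

```python
import io
import pickle
import numpy as np
from sklearn.svm import SVC

# Train on the host...
rng = np.random.default_rng(4)
X = rng.standard_normal((60, 4))
y = (X[:, 0] > 0).astype(int)
clf = SVC().fit(X, y)

buf = io.BytesIO()        # stands in for a .pkl file on disk
pickle.dump(clf, buf)
buf.seek(0)

# ...copy the file to the Pi, then load and predict there
deployed = pickle.load(buf)
print((deployed.predict(X) == clf.predict(X)).all())  # -> True
```

The usual caveat applies: the scikit-learn version on the Pi should match the one used for training, since pickles are not guaranteed stable across versions.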
5.2 Model Size and Latency Budget
Raspberry Pi 4 (1.5 GHz Cortex-A72, 4 cores):
scikit-learn SVM: < 1 ms inference, ~2 MB model
ONNX small CNN: ~3 ms inference, ~0.5 MB model
ONNX MobileNetV2: ~50 ms inference, ~14 MB model
YOLOv5n (detection): ~200 ms inference, ~4 MB model
Raspberry Pi Zero 2 W (1 GHz Cortex-A53):
Everything ~3x slower
ESP32 (240 MHz, 520 KB RAM):
TFLite Micro INT8: ~50 ms for tiny model, < 100 KB model
No Python, no OS
Rule of thumb: If you need < 10 ms inference on a Pi → classical ML or a tiny CNN. If you need real-time video (30 fps) → you need a Pi 5 with an NPU, or offload to a Coral/Jetson accelerator.
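Latency numbers like these are easy to measure yourself. A sketch of timing inference with a warm-up run and a median over many repeats; the model and input sizes are arbitrary:

```python
import time
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 64))
y = rng.integers(0, 4, size=100)
clf = SVC().fit(X, y)

sample = X[:1]
clf.predict(sample)  # warm up (first call can pay one-time costs)

times = []
for _ in range(100):
    t0 = time.perf_counter()
    clf.predict(sample)
    times.append(time.perf_counter() - t0)

print(f"median inference: {np.median(times) * 1000:.3f} ms")
```

The median is preferred over the mean here because a single scheduler hiccup can otherwise dominate the measurement.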
5.3 Quantization
Reducing model precision from FP32 to INT8 shrinks the model 4x and speeds up inference 2-4x on ARM:
FP32: each weight = 32 bits → full precision, large model
FP16: each weight = 16 bits → ~same accuracy, 2x smaller
INT8: each weight = 8 bits → slight accuracy loss, 4x smaller, 2-4x faster
For our keystroke CNN (32×50 input, 2 conv layers, ~50K parameters):
- FP32: 200 KB, ~3 ms on Pi 4
- INT8: 50 KB, ~1 ms on Pi 4
The accuracy loss from INT8 is typically < 1% for well-trained models.
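The size and error numbers are easy to see in a few lines. A sketch of symmetric per-tensor INT8 quantization on random weights; real toolchains like TFLite additionally quantize activations and calibrate scales on real data:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: map [-max|w|, +max|w|]
    onto [-127, 127] with a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(6)
weights = rng.normal(0, 0.1, size=50_000).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q.nbytes / weights.nbytes)         # -> 0.25, i.e. 4x smaller
print(np.abs(weights - restored).max())  # worst-case rounding error
```

The worst-case per-weight error is half the scale step, which is why well-trained networks with small, well-distributed weights lose so little accuracy.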
6. Audio ML vs Vision ML — Parallel Concepts
Students often learn audio and vision ML separately, but the concepts map directly:
| Concept | Audio domain | Vision domain |
|---|---|---|
| Raw signal | Waveform (1D, time) | Image (2D, spatial) |
| Frequency decomposition | FFT / STFT | 2D FFT / wavelets |
| Perceptual transform | Mel scale (mimics ear) | Color spaces (HSV, LAB) |
| Standard feature | Mel spectrogram | HOG, color histogram |
| Learned feature | CNN on spectrogram | CNN on image |
| Onset/detection | Energy threshold | Edge/contour detection |
| Segmentation | Voice activity detection | Image segmentation |
| Time series | Audio frames → RNN/LSTM | Video frames → 3D CNN |
| Transfer learning | AudioSet pretrained | ImageNet pretrained |
| Data augmentation | Time shift, noise, gain | Flip, rotate, crop, color jitter |
| Noise removal | Spectral subtraction | Gaussian blur, median filter |
| Real-time constraint | Period budget (21 ms) | Frame budget (33 ms at 30 fps) |
The Spectrogram Is an Image
This is perhaps the most important conceptual bridge: a mel spectrogram IS an image. That's why image CNNs work on audio — the 2D convolution operates on frequency × time, which is structurally identical to height × width.
Audio mel spectrogram: Image:
┌─────────────────┐ ┌─────────────────┐
│ ▓▓░░░░░░░░░░░░ │ │ ░░▓▓▓░░░░░░░░░ │
│ ▓▓▓░░░░░░░░░░░ │ ←same→ │ ░▓▓▓▓▓░░░░░░░ │
│ ▓▓▓▓░░░░░░░░░░ │ math │ ▓▓▓▓▓▓▓░░░░░░ │
│ ░▓▓▓▓░░░░░░░░░ │ │ ░▓▓▓▓▓░░░░░░░ │
└─────────────────┘ └─────────────────┘
freq ↑ time → y ↑ x →
A 3×3 convolution kernel on a spectrogram detects a pattern spanning 3 frequency bands over 3 time frames. On an image, the same kernel detects a pattern spanning 3 pixels vertically and 3 horizontally.
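The equivalence is easy to demonstrate: the same 3×3 kernel applied to a toy "spectrogram" and a toy "image", both synthetic:

```python
import numpy as np
from scipy.signal import convolve2d

# One kernel for both inputs: a vertical-edge / onset detector
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

# "Spectrogram" (freq x time): a tone in band 2 switching on at frame 4
spectrogram = np.zeros((6, 8))
spectrogram[2, 4:] = 1.0

# "Image" (height x width): a bright region starting at column 4
image = np.zeros((6, 8))
image[1:5, 4:] = 1.0

# Identical operation on both -- the math doesn't care which is which
spec_response = convolve2d(spectrogram, kernel, mode="valid")
img_response = convolve2d(image, kernel, mode="valid")
print(np.abs(spec_response).max(), np.abs(img_response).max())
```

On the spectrogram the kernel fires at the tone onset (a change along the time axis); on the image it fires at the vertical edge (a change along x). Same kernel, same arithmetic, different interpretation.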
7. Practical Workflow for Embedded ML Projects
1. START WITH DSP
└─ Can you solve it with filters, thresholds, correlations?
└─ YES → Done. Ship it. (lowest complexity, most reliable)
└─ NO → Continue
2. TRY CLASSICAL ML
└─ Define features from domain knowledge
└─ Collect 20-50 labeled examples
└─ Train SVM or RF, evaluate with cross-validation
└─ Accuracy good enough?
└─ YES → Done. Export model, deploy.
└─ NO → Continue
3. COLLECT MORE DATA
└─ Before adding model complexity, try more data first
└─ 50 → 200 samples often improves accuracy by 10-15%
└─ Still not enough?
└─ Continue
4. TRY CNN / DEEP LEARNING
└─ Use spectrogram or raw image as input
└─ Train on host with PyTorch/TensorFlow
└─ Export to ONNX/TFLite for Pi deployment
└─ Measure inference latency — fits in real-time budget?
└─ YES → Deploy
└─ NO → Quantize (INT8), prune, or use smaller architecture
5. OPTIMIZE FOR DEPLOYMENT
└─ Quantization: FP32 → INT8 (4x smaller, 2-4x faster)
└─ Pruning: remove small weights (30-50% smaller)
└─ Knowledge distillation: train small model to mimic large one
└─ Hardware acceleration: NPU, Coral TPU, GPU
Warning
The most common failure mode in embedded ML projects is starting at step 4. Students spend weeks training a CNN when a well-tuned threshold would have worked. Always validate the simpler approach first — it's often "good enough" and 100x simpler to deploy and maintain.
Further Reading
Textbooks:
- Pete Warden & Daniel Situnayake, TinyML (O'Reilly) — the definitive guide to ML on embedded devices
- Aurélien Géron, Hands-On Machine Learning (O'Reilly) — excellent ML/DL introduction with scikit-learn and TensorFlow

Courses:
- Andrew Ng, Machine Learning Specialization — foundational ML concepts
- 3Blue1Brown, Neural Networks — visual intuition for backpropagation
- Fast.ai, Practical Deep Learning — hands-on CNN training

Deployment:
- ONNX Runtime — cross-platform inference
- TFLite — TensorFlow for mobile and embedded
- Edge Impulse — end-to-end embedded ML platform (free tier)

Course tutorials that apply these concepts:
- Acoustic Keystroke Recognition — full audio ML pipeline from feature extraction to CNN deployment
- Ball Detection — classical vision pipeline with OpenCV
- Camera Pipeline — capture, process, display
- I2S Audio Visualizer — real-time DSP pipeline
- Signal Processing Reference — sampling, FFT, filtering foundations
- Audio Pipeline Latency — real-time constraints for ML inference