Machine Learning and Signal Processing for Embedded Systems

This reference explains how signal processing and machine learning connect — when to use classical DSP, when ML helps, and how to deploy both on resource-constrained hardware. Examples reference the course's audio and vision tutorials throughout.

Interactive Simulations (run on host PC)

Visual demos in scripts/signal-processing-demo/ — see each concept in action:

cd ~/embedded-linux/scripts/signal-processing-demo
pip3 install numpy matplotlib scipy scikit-learn  # one-time

ML-focused demos:

  • python ml_decision_boundary.py -i — Watch SVM, Random Forest, and k-NN draw decision boundaries on 2D keystroke features. Drag the "Samples" slider to see how data quantity affects the boundary: at 10 samples the boundary is essentially random; at 200 it is stable. Switch to k-NN with k=1 to see overfitting (the decision regions wrap around every training point).

  • python mel_spectrogram_explorer.py -i — Step through spectrogram construction: waveform → STFT frames → mel filterbank → mel spectrogram → normalized CNN input. Adjust FFT size (frequency resolution vs time resolution tradeoff) and mel bands (compression level). Load your own WAV files with --wav recording.wav.

DSP demos (see also Signal Processing Reference):

  • python sampling_aliasing.py -i — Nyquist theorem: watch aliasing appear below 2× signal frequency
  • python fft_windowing.py -i — Why Hann windowing reveals hidden tones that a rectangular window misses
  • python filter_response.py -i — Design HP/LP filters, see frequency response + time domain effect

1. The Fundamental Pattern

Every embedded ML system follows the same pipeline, whether the input is audio, images, or sensor data:

┌──────────┐    ┌──────────────┐    ┌───────────────┐    ┌────────────┐    ┌──────────┐
│ Sensor   │───▶│ Signal       │───▶│ Feature       │───▶│ Classifier │───▶│ Decision │
│          │    │ Conditioning │    │ Extraction    │    │ / Model    │    │          │
│ mic,     │    │ filter,      │    │ spectrogram,  │    │ SVM, CNN,  │    │ "key=a"  │
│ camera,  │    │ normalize,   │    │ histogram,    │    │ threshold  │    │ "ball@   │
│ IMU      │    │ denoise      │    │ edges         │    │            │    │  (x,y)"  │
└──────────┘    └──────────────┘    └───────────────┘    └────────────┘    └──────────┘

The key insight: signal processing and ML are not alternatives — they're stages in the same pipeline. Signal processing prepares the data; ML makes the decision. The boundary between them is where engineering judgment matters most.

Stage                What it does                                      Audio example              Vision example
----------------------------------------------------------------------------------------------------------------
Conditioning         Remove noise, normalize                           HP filter, gain            White balance, denoise
Feature extraction   Convert raw signal to meaningful representation   Mel spectrogram            Color histogram, edges
Classification       Map features to categories                        SVM on spectral features   Threshold on color mask
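The pipeline can be sketched as a chain of small functions. This is a minimal sketch with placeholder implementations, not course code; the per-frame RMS feature stands in for a real spectrogram:

```python
import numpy as np

def condition(signal):
    """Signal conditioning: remove DC offset and normalize amplitude."""
    signal = signal - np.mean(signal)            # crude DC removal
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal

def extract_features(signal, frame_len=256):
    """Feature extraction: per-frame RMS energy (stand-in for a spectrogram)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def classify(features, threshold=0.1):
    """Classifier: the simplest possible decision rule, a threshold."""
    return "event" if np.max(features) > threshold else "silence"

# Sensor -> conditioning -> features -> decision
raw = np.concatenate([np.zeros(1024), 0.5 * np.sin(np.linspace(0, 50, 1024))])
decision = classify(extract_features(condition(raw)))
```

Swapping any one stage (e.g. threshold for SVM, RMS for mel spectrogram) leaves the rest of the pipeline unchanged, which is exactly why the pattern generalizes across sensors.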

Course tutorials that follow this pattern: Acoustic Keystroke Recognition (audio → mel spectrogram → SVM/CNN), Ball Detection (image → color mask → threshold), and the I2S Audio Visualizer (audio → FFT/GCC-PHAT).

2. When to Use What

The most common mistake in embedded ML is reaching for deep learning when classical DSP solves the problem. Here's a decision framework:

                        Is the relationship between
                        input and output KNOWN?
                    ┌─────────┴──────────┐
                    │ YES                │ NO
                    ▼                    ▼
              Classical DSP         Can you define
              (filters, FFT,        useful features
               thresholds)          manually?
                                ┌────────┴────────┐
                                │ YES             │ NO
                                ▼                 ▼
                          Classical ML        Deep Learning
                          (SVM, RF,           (CNN, RNN)
                           k-NN)              learns features
                                              automatically

Classical DSP — Known Relationships

Use when the physics of the problem tells you exactly what to compute:

Problem                   Solution                     Why DSP works
---------------------------------------------------------------------------------
Remove 50 Hz mains hum    Notch filter at 50 Hz        Frequency is known, deterministic
Detect audio onset        Energy threshold             Keystrokes are impulsive, high SNR
Find direction of sound   GCC-PHAT cross-correlation   Physics of wave propagation
Edge detection in image   Sobel/Canny filter           Mathematical definition of edge
Color segmentation        HSV threshold                Color boundaries are definable

Advantages: Deterministic, explainable, no training data needed, runs on any hardware.

Course examples:

  • HP filter for DC removal (Audio Visualizer)
  • GCC-PHAT for TDOA (Audio Visualizer — TDOA section)
  • Color thresholding for ball detection (Ball Detection)
  • Canny edge detection (Camera Pipeline)
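The mains-hum row of the table can be sketched with SciPy (scipy.signal.iirnotch and filtfilt are real APIs; the sample rate and Q value here are illustrative, not course settings):

```python
import numpy as np
from scipy import signal

fs = 1000.0            # sample rate (Hz) -- illustrative
f0, Q = 50.0, 30.0     # notch frequency and quality factor

# Design a narrow stopband centred on 50 Hz
b, a = signal.iirnotch(f0, Q, fs)

# Test signal: 50 Hz hum plus a 200 Hz tone we want to keep
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 200 * t)

# Zero-phase filtering removes the hum and leaves the tone intact
y = signal.filtfilt(b, a, x)
```

No training data, no model: because the hum frequency is known, a fixed filter is the complete solution.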

Classical ML — Known Features, Unknown Decision Boundary

Use when you can define good features but the classification rule is too complex to write by hand:

Problem                  Features                   Classifier      Why ML helps
-----------------------------------------------------------------------------------------------------
Keystroke recognition    Mel spectrogram            SVM             27 classes, subtle spectral differences
Speaker identification   MFCCs                      GMM             Vocal tract shapes vary continuously
Gesture recognition      Accelerometer statistics   Random Forest   Complex motion patterns
Fruit sorting            Color histogram + shape    SVM             Categories overlap in feature space

Advantages: Works with small datasets (20-100 samples), fast inference, interpretable features.

Course example: Acoustic Keystroke — SVM classifier
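The small-dataset workflow can be sketched with scikit-learn. Synthetic 2-D blobs stand in for the real spectral features; the sample counts and parameters are illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# ~30 samples per class: the regime where classical ML shines
X, y = make_blobs(n_samples=90, centers=3, cluster_std=1.0, random_state=0)

# RBF-kernel SVM, evaluated with 5-fold cross-validation
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
scores = cross_val_score(clf, X, y, cv=5)
mean_acc = scores.mean()
```

Cross-validation matters more than usual at this data scale: with under 100 samples, a single train/test split gives a wildly noisy accuracy estimate.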

Deep Learning — Unknown Features, Lots of Data

Use when you can't define good features, or when learned features outperform hand-crafted ones:

Problem                                 Input                 Model         Why DL helps
--------------------------------------------------------------------------------------------------------------
Keystroke recognition (high accuracy)   Raw mel spectrogram   CNN           Learns spectro-temporal patterns humans miss
Object detection                        Raw image             YOLO, SSD     Millions of possible object appearances
Speech-to-text                          Raw audio             Transformer   Language structure is too complex for rules
Anomaly detection                       Sensor time series    Autoencoder   Normal behavior is hard to define

Advantages: Highest accuracy ceiling, learns features automatically, handles complex patterns.
Disadvantages: Needs lots of data, expensive to train, hard to explain, larger models.

Course example: Acoustic Keystroke — CNN


3. Signal Processing as Feature Engineering

The quality of features determines the accuracy ceiling of any classical ML system. Here's how signal processing creates good features for different sensor types:

3.1 Audio Features

Feature              What it captures              Computed from                 Used for
--------------------------------------------------------------------------------------------------------------
Mel spectrogram      Frequency content over time   STFT + mel filterbank         Keystroke ID, speech recognition
MFCCs                Decorrelated spectral shape   DCT of mel spectrogram        Speaker ID, phoneme classification
Spectral centroid    "Brightness" of sound         Weighted mean frequency       Instrument classification
Zero crossing rate   Noisiness vs tonality         Sign changes per frame        Speech vs music detection
Chroma features      Musical pitch class           FFT bins mapped to 12 notes   Music analysis
RMS energy           Loudness over time            Mean squared amplitude        Onset detection, VAD

The mel spectrogram is the most versatile audio feature. It captures both what frequencies are present (spectral shape) and how they evolve (temporal dynamics). The mel scale compresses high frequencies where human perception and most physical phenomena have less detail:

\[\text{mel}(f) = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)\]

See Signal Processing Reference for filter and FFT theory.
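The mel formula above, directly in code (the inverse follows by solving the formula for f):

```python
import math

def hz_to_mel(f):
    """mel(f) = 2595 * log10(1 + f / 700)"""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping: f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The scale is roughly linear below ~1 kHz and logarithmic above:
# doubling the frequency from 4 kHz to 8 kHz adds fewer mels than
# the first 4 kHz did, which is the compression described above.
```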

3.2 Image Features

Feature                                 What it captures            Computed from                   Used for
------------------------------------------------------------------------------------------------------------------------------
Color histogram                         Color distribution          Pixel binning in HSV/RGB        Object detection by color
HOG (Histogram of Oriented Gradients)   Shape/edge structure        Gradient magnitudes in cells    Pedestrian detection, OCR
SIFT/SURF                               Scale-invariant keypoints   Difference of Gaussians         Object matching, panorama stitching
LBP (Local Binary Patterns)             Texture                     Pixel neighborhood comparison   Face recognition, surface inspection
Contour moments                         Shape statistics            Contour pixel coordinates       Object classification
Optical flow                            Motion between frames       Lucas-Kanade or Farneback       Gesture recognition, tracking

Color histograms are the image equivalent of audio spectrograms — they summarize what's present without encoding where. The Ball Detection tutorial uses HSV color space because it separates color (hue) from lighting (value), making detection robust to shadows.
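The shadow-robustness argument can be sketched in NumPy. This is a toy stand-in for the OpenCV thresholding step (cv2.inRange in the real pipeline); the hue band, the floors, and the tiny 2×2 "image" are all illustrative:

```python
import numpy as np

def hue_mask(hsv, hue_lo, hue_hi, sat_min=0.3, val_min=0.2):
    """Boolean mask of pixels whose hue falls in [hue_lo, hue_hi].

    hsv: float array of shape (H, W, 3), hue in [0, 360),
    saturation and value in [0, 1]. The sat/val floors reject grey
    and very dark pixels; because hue barely changes in shadow,
    the mask survives lighting changes that break RGB thresholds.
    """
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    return (h >= hue_lo) & (h <= hue_hi) & (s >= sat_min) & (v >= val_min)

# Toy 2x2 image: bright orange, orange in deep shadow, two blue pixels
img = np.array([[[30.0, 0.9, 0.8], [30.0, 0.9, 0.1]],
                [[220.0, 0.9, 0.8], [220.0, 0.9, 0.8]]])
mask = hue_mask(img, 20, 40)   # orange hue band
```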

3.3 The Feature Engineering → Deep Learning Transition

A key insight: CNNs learn to compute their own features. The first layers of a CNN trained on spectrograms learn filters that resemble mel-scale frequency bands. The first layers of a CNN trained on images learn edge detectors similar to Sobel filters.

Hand-crafted pipeline:            CNN pipeline:
  Audio → Mel spec → SVM           Audio → Mel spec → CNN
  Image → HOG → SVM                Image → Raw pixels → CNN
           ▲                                    ▲
     You design this              Network learns this
     (domain knowledge)           (from data)

This means:

  • With small data, hand-crafted features + classical ML wins (your domain knowledge compensates for limited examples)
  • With large data, CNN wins (it discovers features you wouldn't think of)
  • A hybrid approach often works best: use domain knowledge for preprocessing (mel scale, normalization), and let the CNN learn the rest


4. Common ML Problems in Embedded Systems

4.1 Classification

What: Assign input to one of N categories. Examples: Keystroke → letter, image → object class, vibration → fault type. Models: SVM, Random Forest, CNN.

4.2 Detection

What: Find and locate objects in a stream. Examples: Ball position in image, keystroke onset in audio, face in video frame. Key challenge: Detection combines "is it there?" (classification) with "where?" (localization).

Audio detection is typically onset-based — detect energy spikes, then classify the segment after the onset. See Acoustic Keystroke — Onset Detection.
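The onset-based scheme can be sketched in NumPy. The frame length and the 20 dB threshold are illustrative choices, not the tutorial's exact values:

```python
import numpy as np

def detect_onsets(x, frame_len=256, threshold_db=20.0):
    """Return sample indices where frame energy jumps above the noise floor."""
    n = len(x) // frame_len
    frames = x[:n * frame_len].reshape(n, frame_len)
    energy = np.mean(frames ** 2, axis=1) + 1e-12       # avoid log(0)
    floor = np.median(energy)                           # robust noise estimate
    above = 10 * np.log10(energy / floor) > threshold_db
    # Keep only rising edges: a quiet frame followed by a loud one
    onset_frames = np.flatnonzero(above[1:] & ~above[:-1]) + 1
    return onset_frames * frame_len

# One second of near-silence with an impulsive "keystroke" at sample 4096
fs = 16000
x = 0.001 * np.random.default_rng(0).standard_normal(fs)
x[4096:4096 + 512] += 0.5 * np.random.default_rng(1).standard_normal(512)
onsets = detect_onsets(x)
```

The segment after each returned index is what gets handed to the classifier; the detector itself is pure DSP.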

Image detection uses either:

  • Classical: threshold → contour → centroid (fast, works for simple scenes). See Ball Detection.
  • Deep learning: YOLO, SSD (handles complex scenes, multiple objects, occlusion). Requires a GPU or NPU for real-time.

4.3 Regression

What: Predict a continuous value. Examples: Sound direction angle, ball position coordinates, temperature prediction. Models: Linear regression, neural network with linear output.

The Audio Visualizer's TDOA is effectively regression: cross-correlation → delay → angle. It's computed with DSP (no ML needed) because the physics is well-understood.
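The correlation → delay → angle chain can be sketched with plain cross-correlation (the tutorial's GCC-PHAT adds spectral whitening on top of this; the sample rate, mic spacing, and 7-sample shift below are illustrative):

```python
import numpy as np

def estimate_delay(x_left, x_right, fs):
    """Estimate inter-microphone delay (seconds) via cross-correlation.

    Positive delay means the left channel lags the right one."""
    corr = np.correlate(x_left, x_right, mode="full")
    lag = np.argmax(corr) - (len(x_right) - 1)   # in samples
    return lag / fs

def delay_to_angle(delay, mic_spacing, c=343.0):
    """Map delay to arrival angle via sin(theta) = c * delay / d."""
    s = np.clip(c * delay / mic_spacing, -1.0, 1.0)
    return np.degrees(np.arcsin(s))

fs, d = 48000, 0.1                    # 48 kHz, 10 cm mic spacing
sig = np.random.default_rng(0).standard_normal(4096)
shift = 7                             # left mic hears the wavefront 7 samples late
left, right = sig[:-shift], sig[shift:]
angle = delay_to_angle(estimate_delay(left, right, fs), d)
```

With these numbers the recovered angle is about 30°, straight from the wave-propagation geometry: no training data was involved.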

4.4 Anomaly Detection

What: Detect when something is "unusual" without defining what's normal. Examples: Machine vibration anomaly, unusual network traffic, production quality defect. Models: Autoencoder (learns to reconstruct "normal" — anomalies have high reconstruction error), one-class SVM, isolation forest.

Embedded approach: Train on "normal" data only (easy to collect), deploy on Pi. When reconstruction error exceeds threshold → alert. No need to label fault types.
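The train-on-normal-only pattern, sketched with scikit-learn's isolation forest (the 2-D "vibration features" and the thresholds are synthetic placeholders):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Training set: "normal" machine states only, e.g. (RMS level, peak frequency).
# Easy to collect -- just log the machine running healthily.
normal = rng.normal(loc=[1.0, 120.0], scale=[0.1, 5.0], size=(200, 2))

model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# At runtime: predict() returns +1 for normal, -1 for anomaly
healthy = model.predict([[1.05, 118.0]])[0]
faulty = model.predict([[3.0, 45.0]])[0]
```

Note that no faulty example was ever labeled; the model only learned the shape of "normal" and flags whatever falls outside it.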


5. Deployment on Embedded Hardware

5.1 Inference Frameworks

Framework               Target               Model format     Quantization   Best for
----------------------------------------------------------------------------------------------
scikit-learn (Python)   Pi, any Linux        .pkl             No             SVM, RF, classical ML
ONNX Runtime            Pi, x86, ARM         .onnx            FP16, INT8     PyTorch/sklearn models on Pi
TFLite                  Pi, Android          .tflite          INT8, FP16     TensorFlow/Keras models
TFLite Micro            MCU (ESP32, STM32)   .tflite (INT8)   INT8 only      Tiny models on microcontrollers
OpenCV DNN              Pi, x86              .onnx, .pb       No             Vision models with OpenCV

5.2 Model Size and Latency Budget

Raspberry Pi 4 (1.5 GHz Cortex-A72, 4 cores):

  scikit-learn SVM:     < 1 ms inference, ~2 MB model
  ONNX small CNN:       ~3 ms inference, ~0.5 MB model
  ONNX MobileNetV2:     ~50 ms inference, ~14 MB model
  YOLOv5n (detection):  ~200 ms inference, ~4 MB model

Raspberry Pi Zero 2 W (1 GHz Cortex-A53):
  Everything ~3x slower

ESP32 (240 MHz, 520 KB RAM):
  TFLite Micro INT8:    ~50 ms for tiny model, < 100 KB model
  No Python, no OS

Rule of thumb: if you need < 10 ms inference on a Pi → classical ML or a tiny CNN. If you need real-time video (30 fps) → a Pi 5 with an NPU add-on, or offload to a Coral/Jetson accelerator.
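Whichever model you choose, measure latency on the target device rather than trusting host-PC numbers. A minimal benchmark sketch (the matrix multiply is a stand-in for a real inference call):

```python
import time
import numpy as np

def measure_latency_ms(infer, x, warmup=10, runs=100):
    """Median wall-clock latency of one call to infer(x), in milliseconds."""
    for _ in range(warmup):            # warm caches before timing
        infer(x)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer(x)
        times.append((time.perf_counter() - t0) * 1e3)
    return float(np.median(times))     # median resists scheduler spikes

# Stand-in "model": a dense-layer-sized matrix multiply
W = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
latency_ms = measure_latency_ms(lambda x: x @ W,
                                np.ones((1, 256), dtype=np.float32))
```

Report the median (or a high percentile) rather than the mean: a single OS scheduling hiccup can dominate an averaged figure.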

5.3 Quantization

Reducing model precision from FP32 to INT8 shrinks the model 4x and speeds up inference 2-4x on ARM:

FP32:  each weight = 32 bits → full precision, large model
FP16:  each weight = 16 bits → ~same accuracy, 2x smaller
INT8:  each weight =  8 bits → slight accuracy loss, 4x smaller, 2-4x faster

For our keystroke CNN (32×50 input, 2 conv layers, ~50K parameters):

  • FP32: 200 KB, ~3 ms on Pi 4
  • INT8: 50 KB, ~1 ms on Pi 4

The accuracy loss from INT8 is typically < 1% for well-trained models.
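A minimal sketch of symmetric per-tensor quantization, one common INT8 scheme (real toolchains like TFLite also support per-channel scales and zero points; this shows only the core idea):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric quantization: w ~= scale * q with q in [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=1000).astype(np.float32)   # weight-like values
q, scale = quantize_int8(w)

# Worst-case rounding error is half a quantization step (scale / 2)
err = np.max(np.abs(dequantize(q, scale) - w))
```

The 4x size reduction is exact (8 bits vs 32 per weight); the speedup comes from ARM NEON integer SIMD doing more multiply-accumulates per cycle.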


6. Audio ML vs Vision ML — Parallel Concepts

Students often learn audio and vision ML separately, but the concepts map directly:

Concept                   Audio domain                Vision domain
-----------------------------------------------------------------------------------
Raw signal                Waveform (1D, time)         Image (2D, spatial)
Frequency decomposition   FFT / STFT                  2D FFT / wavelets
Perceptual transform      Mel scale (mimics ear)      Color spaces (HSV, LAB)
Standard feature          Mel spectrogram             HOG, color histogram
Learned feature           CNN on spectrogram          CNN on image
Onset/detection           Energy threshold            Edge/contour detection
Segmentation              Voice activity detection    Image segmentation
Time series               Audio frames → RNN/LSTM     Video frames → 3D CNN
Transfer learning         AudioSet pretrained         ImageNet pretrained
Data augmentation         Time shift, noise, gain     Flip, rotate, crop, color jitter
Noise removal             Spectral subtraction        Gaussian blur, median filter
Real-time constraint      Period budget (21 ms)       Frame budget (33 ms at 30 fps)

The Spectrogram Is an Image

This is perhaps the most important conceptual bridge: a mel spectrogram IS an image. That's why image CNNs work on audio — the 2D convolution operates on frequency × time, which is structurally identical to height × width.

Audio mel spectrogram:          Image:
  ┌─────────────────┐          ┌─────────────────┐
  │ ▓▓░░░░░░░░░░░░  │          │ ░░▓▓▓░░░░░░░░░  │
  │ ▓▓▓░░░░░░░░░░░  │  ←same→  │ ░▓▓▓▓▓░░░░░░░  │
  │ ▓▓▓▓░░░░░░░░░░  │  math    │ ▓▓▓▓▓▓▓░░░░░░  │
  │ ░▓▓▓▓░░░░░░░░░  │          │ ░▓▓▓▓▓░░░░░░░  │
  └─────────────────┘          └─────────────────┘
  freq ↑   time →              y ↑   x →

A 3×3 convolution kernel on a spectrogram detects a pattern spanning 3 frequency bands over 3 time frames. On an image, the same kernel detects a pattern spanning 3 pixels vertically and 3 horizontally.
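The shared math can be shown directly. A minimal sketch of the valid-mode 2-D operation (CNN frameworks actually compute cross-correlation, which is what their "conv" layers implement; the kernel and toy input are illustrative):

```python
import numpy as np

def conv2d_valid(x, k):
    """2-D cross-correlation, valid mode: the core CNN 'conv' operation."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# Vertical-edge kernel: responds where values change along the second axis.
# Whether that axis is "time" (spectrogram) or "x" (image), the math is identical.
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])

# 6x6 freq-x-time array (equally: a y-x image): energy in the right half
spectrogram = np.tile([0., 0., 0., 1., 1., 1.], (6, 1))
response = conv2d_valid(spectrogram, kernel)
```

The response is zero inside the flat regions and large exactly at the step, i.e. at the onset in time (audio) or the edge in x (vision).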


7. Practical Workflow for Embedded ML Projects

1. START WITH DSP
   └─ Can you solve it with filters, thresholds, correlations?
      └─ YES → Done. Ship it. (lowest complexity, most reliable)
      └─ NO → Continue

2. TRY CLASSICAL ML
   └─ Define features from domain knowledge
   └─ Collect 20-50 labeled examples
   └─ Train SVM or RF, evaluate with cross-validation
   └─ Accuracy good enough?
      └─ YES → Done. Export model, deploy.
      └─ NO → Continue

3. COLLECT MORE DATA
   └─ Before adding model complexity, try more data first
   └─ 50 → 200 samples often improves accuracy by 10-15%
   └─ Still not enough?
      └─ Continue

4. TRY CNN / DEEP LEARNING
   └─ Use spectrogram or raw image as input
   └─ Train on host with PyTorch/TensorFlow
   └─ Export to ONNX/TFLite for Pi deployment
   └─ Measure inference latency — fits in real-time budget?
      └─ YES → Deploy
      └─ NO → Quantize (INT8), prune, or use smaller architecture

5. OPTIMIZE FOR DEPLOYMENT
   └─ Quantization: FP32 → INT8 (4x smaller, 2-4x faster)
   └─ Pruning: remove small weights (30-50% smaller)
   └─ Knowledge distillation: train small model to mimic large one
   └─ Hardware acceleration: NPU, Coral TPU, GPU
Warning

The most common failure mode in embedded ML projects is starting at step 4. Students spend weeks training a CNN when a well-tuned threshold would have worked. Always validate the simpler approach first — it's often "good enough" and 100x simpler to deploy and maintain.


Further Reading

Textbooks:

  • Pete Warden & Daniel Situnayake, TinyML (O'Reilly) — the definitive guide to ML on embedded devices
  • Aurélien Géron, Hands-On Machine Learning (O'Reilly) — excellent ML/DL introduction with scikit-learn and TensorFlow

Courses:

  • Andrew Ng, Machine Learning Specialization — foundational ML concepts
  • 3Blue1Brown, Neural Networks — visual intuition for backpropagation
  • Fast.ai, Practical Deep Learning — hands-on CNN training

Deployment:

  • ONNX Runtime — cross-platform inference
  • TFLite — TensorFlow for mobile and embedded
  • Edge Impulse — end-to-end embedded ML platform (free tier)

Course tutorials that apply these concepts:

  • Acoustic Keystroke Recognition — full audio ML pipeline from feature extraction to CNN deployment
  • Ball Detection — classical vision pipeline with OpenCV
  • Camera Pipeline — capture, process, display
  • I2S Audio Visualizer — real-time DSP pipeline
  • Signal Processing Reference — sampling, FFT, filtering foundations
  • Audio Pipeline Latency — real-time constraints for ML inference