Acoustic Keystroke Recognition
Time: 120 min (Sections 1–10) + 60 min extension (Sections 11–12: CNN) | Prerequisites: I2S Audio Visualizer, Python basics
Use the I2S microphone to recognize which key is being typed on a nearby keyboard — purely from audio. This tutorial walks through signal conditioning, feature extraction, classifier training, and real-time inference on the Raspberry Pi.
All source code is in src/embedded-linux/scripts/acoustic-keystroke/.
Why This Works
Every key on a keyboard produces a slightly different sound depending on its position, the mechanical structure underneath, and how it resonates through the chassis. The differences are subtle — usually too small for the ear to pick out — but a spectrogram reveals distinct patterns per key.
Research has demonstrated > 90% accuracy on full keyboards using just a single microphone (Asonov & Agrawal 2004, Zhuang et al. 2009).
Physical model:
Key press → plunger hits membrane/switch → vibration through chassis
→ propagates to microphone → unique spectral signature
Why keys differ:
- Position: corner keys have more chassis damping than center keys
- Mechanism: different spring compression paths
- Travel distance: spacebar vs letter keys
- Finger impact: different fingers, different force profiles
Warning
This tutorial is for educational purposes — understanding audio classification, feature extraction, and embedded ML. Acoustic keystroke attacks are a real security concern. Always get consent before recording anyone's typing.
1. Architecture
┌─────────────┐ ┌───────────┐ ┌─────────────┐ ┌──────────┐ ┌──────────┐
│ I2S Mic │───▶│ Gain + │───▶│ Onset │───▶│ Feature │───▶│ ML Model │
│ (INMP441) │ │ High-pass │ │ Detection │ │ Extract │ │ (SVM / │
│ 48kHz │ │ 80Hz HPF │ │ (energy │ │ (mel │ │ RF) │
│ │ │ 10x gain │ │ ratio) │ │ spec) │ │ │
└─────────────┘ └───────────┘ └─────────────┘ └──────────┘ └──────────┘
│
┌───────────────────┘
▼
┌───────────┐
│ Predicted │
│ Key: 'a' │
└───────────┘
Three phases:
- Data collection — Type each key repeatedly while recording audio. Label each keystroke.
- Training — Extract features from labeled keystrokes, train a classifier.
- Inference — Detect keystrokes in live audio, extract features, classify.
Visual Pipeline Demo
Generate all the visualizations for this tutorial (works with synthetic data, no mic needed):
cd ~/embedded-linux/scripts/acoustic-keystroke
pip3 install numpy matplotlib scipy scikit-learn # one-time
python visualize_pipeline.py # saves PNGs
python visualize_pipeline.py -i # interactive (all plots at once)
1. Onset Detection — Shows the raw waveform with keystroke spikes, the energy-per-block trace, the energy/average ratio with the threshold line, and the 100ms capture windows. You can see exactly how the detector picks out keystrokes from background noise.
2. Keystroke Comparison — 5 different keys shown side by side: waveform, linear spectrogram, and mel spectrogram for each. Notice how each key has a unique resonance pattern — different bright bands at different frequencies. This is what the classifier learns to distinguish.
3. Feature Pipeline — Step-by-step from raw audio through Hann-windowed frames, power spectrogram, mel filterbank, mel spectrogram, to the final flattened feature vector. Annotated to show the attack transient vs resonance decay, and how each step transforms the representation.
4. Feature Space — PCA projection of keystroke features from 5 keys. Each dot is one keystroke. Keys that sound similar cluster together; well-separated clusters are easy to classify. Also shows the average mel spectrogram per key.
2. Setup
Dependencies
# On Pi — collection + inference
sudo apt install libasound2-dev libportaudio2
pip3 install numpy scipy sounddevice
# On host — training (heavier dependencies)
pip3 install numpy scipy scikit-learn sounddevice matplotlib
Verify Microphone
Use mic_test.py to check your mic with a live waveform and optional loopback to headphones:
cd ~/embedded-linux/scripts/acoustic-keystroke
# List audio devices
python mic_test.py --list
# Run with default devices — shows live waveform + level meter
python mic_test.py
# Specify input/output device and boost gain
python mic_test.py -i 4 -o 15 --gain 3.0
Type on the keyboard while watching the waveform. You should see clear spikes for each keystroke. If the signal is barely visible, increase --gain or move the mic closer.
Alternatively, use arecord for a quick check — e.g. arecord -f S32_LE -r 48000 -c 2 -d 3 /tmp/test.wav (add -D to pick a specific device), then play the file back and listen for keystroke clicks.
3. Signal Conditioning
The raw I2S mic signal is weak (typical keystroke energy is 1e-7) and contaminated by low-frequency rumble from vibrations, air conditioning, etc. Two processing steps bring keystrokes above the noise floor:
3.1 Software Gain
The INMP441 outputs a 24-bit signal that maps to very small float32 values. A 10x software gain brings keystroke transients into a usable range without clipping (keyboard sounds rarely exceed 0.1 even after 10x amplification):
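The gain stage itself is a one-liner; a minimal sketch (the function name and the clip bounds are illustrative, not the tutorial's actual code):

```python
import numpy as np

GAIN = 10.0  # software gain from the text above

def apply_gain(samples, gain=GAIN):
    """Amplify, then clip to the valid float range [-1, 1] as a safety net."""
    return np.clip(samples * gain, -1.0, 1.0)

block = np.array([0.005, -0.02, 0.3])  # raw INMP441-scale float samples
boosted = apply_gain(block)            # scaled 10x; the 0.3 outlier clips to 1.0
```

Keyboard transients rarely reach the clip point, so in practice the clip only guards against the occasional bump or loud noise.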
3.2 High-Pass Filter
A first-order IIR high-pass filter at 80 Hz removes rumble while preserving the keystroke transients (which are broadband, 100 Hz – 12 kHz):
import numpy as np

class HighPass:
    """First-order IIR high-pass filter: y[i] = a * (y[i-1] + x[i] - x[i-1])."""
    def __init__(self, cutoff_hz, rate):
        rc = 1.0 / (2.0 * np.pi * cutoff_hz)  # RC time constant for the cutoff
        dt = 1.0 / rate                       # sample period
        self.alpha = rc / (rc + dt)
        self.prev_in = 0.0
        self.prev_out = 0.0

    def process(self, samples):
        """Filter one block, carrying filter state across block boundaries."""
        out = np.empty_like(samples)
        a = self.alpha
        yi, xi_prev = self.prev_out, self.prev_in
        for i in range(len(samples)):
            xi = samples[i]
            yi = a * (yi + xi - xi_prev)
            xi_prev = xi
            out[i] = yi
        self.prev_in = xi_prev
        self.prev_out = yi
        return out
Why not a higher-order filter?
A first-order filter has a gentle -6 dB/octave slope, which is enough for our purpose. Higher-order filters introduce phase distortion near the cutoff that can smear the keystroke onset — the sharp transient is the most important feature for onset detection.
4. Onset Detection

Four panels showing the onset detection pipeline: raw waveform with keystroke spikes (top), block energy vs running average (second), energy/average ratio with 5× threshold (third), and 100ms feature extraction windows (bottom). Red vertical lines mark detected onsets.
Run locally: cd scripts/acoustic-keystroke && python visualize_pipeline.py (generates all 4 figures) or python visualize_pipeline.py -i (interactive)
A keystroke creates a sharp energy spike against a quiet background. We detect it by comparing short-term energy to a running exponential average:
energy = np.sum(audio ** 2) / len(audio)        # mean energy of this 10 ms block
energy_avg = energy_avg * 0.92 + energy * 0.08  # EMA (τ ≈ 120ms)
if energy > energy_avg * threshold and energy > min_energy:
    # keystroke detected! (threshold default: 5.0; min_energy rejects quiet-room spikes)
Key parameters:
| Parameter | Default | Effect |
|---|---|---|
| threshold | 5.0 | Higher = fewer false positives, may miss soft keystrokes |
| min_energy | 1e-5 | Absolute floor — ignores noise spikes when everything is quiet |
| cooldown | 400 ms | Prevents double-detection (press + release echo) |
Training vs Inference Onset Detection
In the training pipeline (features.py), onset detection runs offline over the full recording — it uses a frame-by-frame energy scan with a convolved running average. This is more accurate because it can look forward and backward.
In live inference, detection must be causal (no lookahead). The EMA-based detector runs per audio block (10 ms). When an onset is detected, the system collects 100 ms of audio after the onset before classifying — this captures the resonance tail which contains the most discriminative information.
onset
│
▼
─────┬──────────────────────────┐
5ms│ 100ms post-onset │
pre │ (attack + resonance) │
─────┴──────────────────────────┘
◄──────── feature window ──►
Warning
A common bug: extracting audio before the onset instead of after. The pre-onset audio is just silence/noise — the discriminative spectral content is in the 100 ms resonance tail.
5. Feature Extraction
What Is a Feature?
A machine learning classifier cannot listen to audio the way you do. It needs numbers — a list of values that describe the sound in a way that makes different keys distinguishable. These numbers are called features.
Think of it like describing a person to someone who has never seen them: you wouldn't transmit every pixel of a photo. Instead, you'd say "tall, brown hair, glasses" — a compact description that captures the essential differences. Features are that description for audio.
Why not just feed raw audio samples to the classifier?
A 100 ms keystroke at 48 kHz is 4,800 numbers. You could feed all 4,800 to the classifier, but this is a bad idea:
- Too many dimensions. With 4,800 input features and only 20 training examples per key, the classifier has far more parameters to fit than data to learn from — it memorizes the training examples (overfitting) instead of learning generalizable patterns.
- Irrelevant information. Most of those 4,800 numbers encode information that doesn't help distinguish keys — the exact phase of the wave, the precise timing of the onset, background noise. The classifier wastes capacity trying to learn patterns in noise.
- Not shift-invariant. If the onset is detected 1 ms earlier in one sample, all 4,800 values shift. The classifier sees this as a completely different pattern, even though it's the same keystroke.
Good features solve all three problems: they're compact (hundreds, not thousands of values), they capture only the discriminative information (which frequencies are present, how they decay), and they're invariant to irrelevant variation (onset timing, overall volume).
Raw audio (bad features): Mel spectrogram (good features):
4,800 numbers 1,600 numbers (~3x smaller)
├─ exact waveform shape ├─ frequency content (which resonances)
├─ onset timing ├─ temporal evolution (how they decay)
├─ phase (irrelevant) └─ compressed frequency axis (mel scale)
├─ noise samples ↓
└─ overall amplitude Invariant to phase, onset shift,
↓ and background noise level
Everything is relevant to
the classifier → overfitting
The process of converting raw sensor data into good features is called feature extraction (or feature engineering). It's the single most important step in classical ML — the classifier can only find patterns in what you give it.
Features in Other Domains
This concept applies everywhere, not just audio:
| Domain | Raw data | Good features |
|---|---|---|
| Audio (this tutorial) | 4,800 samples | Mel spectrogram (32 × 50) |
| Images | 640 × 480 pixels | HOG descriptors, color histograms |
| IMU sensor | Accelerometer time series | Mean, std, FFT peaks, zero crossings |
| Text | Raw characters | Word embeddings, TF-IDF vectors |
| Network traffic | Packet bytes | Flow statistics, port distributions |
In deep learning (Section 11), the CNN learns its own features from the mel spectrogram — the first convolutional layers automatically discover patterns like "energy at 800 Hz decaying over 20 ms." But this requires much more data. With small datasets, hand-crafted features + classical ML wins because your domain knowledge compensates for limited examples.
What Makes a Good Feature for Keystrokes?
Look at the spectrograms of different keys:

Five keys compared: waveform (top), linear spectrogram (middle), and mel spectrogram (bottom). Notice how each key excites different resonance frequencies — 'a' has strong energy around 800 Hz, while 'k' resonates higher around 2 kHz. These spectral differences are what the classifier learns to distinguish.
Run locally: python scripts/acoustic-keystroke/visualize_pipeline.py
Each key has a unique spectral signature — a pattern of which frequencies are excited and how they decay over time. This is determined by the physical properties of the key's position on the keyboard:
- Position affects which part of the chassis resonates (corner keys sound different from center keys)
- Key mechanism affects the attack transient (the initial click)
- Finger affects the impact force and angle
The mel spectrogram captures exactly this: which frequencies (vertical axis) are present at each moment (horizontal axis). It's the natural representation for "what does this sound look like?"
Tip
See it interactively: Run the mel spectrogram explorer on your host PC:
python mel_spectrogram_explorer.py -i
Drag the sliders to see how FFT size, hop size, and number of mel bands affect the output.
5.1 Mel Spectrogram — How It's Built
The core feature is a mel-scaled spectrogram — a time-frequency representation that emphasizes the frequency ranges where keystroke differences are most apparent.
Step 1: STFT — Split the segment into overlapping frames, apply a Hanning window, compute the FFT of each frame:
from numpy.fft import rfft

# Parameters
N_FFT = 512  # ~10.7ms frame at 48kHz
hop = 96     # 2ms hop → ~50 frames per 100ms keystroke

hann = np.hanning(N_FFT)
n_frames = (len(segment) - N_FFT) // hop + 1
spec = np.empty((N_FFT // 2 + 1, n_frames))
for i in range(n_frames):
    frame = segment[i * hop : i * hop + N_FFT]
    spec[:, i] = np.abs(rfft(frame * hann)) ** 2  # power spectrum
Step 2: Mel filterbank — Apply triangular filters spaced according to the mel scale. This compresses high frequencies (where our ears and keystroke physics are less discriminative) while preserving detail in the low-mid range:
Hz: 100 200 400 800 1.6k 3.2k 6.4k 12k
│ │ │ │ │ │ │ │
Mel: ├──┤├──┤├──┤├───┤├────┤├──────┤├──────────┤├────────┤
Lots of detail here Less detail here
(100-2kHz) (2-12kHz)
The mel scale conversion: mel(f) = 2595 · log10(1 + f / 700)
We use 32 mel bands from 100 Hz to 12 kHz. Below 100 Hz is rumble (already filtered), above 12 kHz is hiss with no useful keystroke information.
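The filterbank can be sketched with the standard HTK-style formulas (function names and the exact bin mapping here are illustrative, not necessarily what features.py does):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=32, n_fft=512, rate=48000, fmin=100.0, fmax=12000.0):
    """Triangular filters whose centers are evenly spaced on the mel axis."""
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / rate).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):                 # rising slope of the triangle
            fb[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):                 # falling slope
            fb[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fb

fb = mel_filterbank()
# Applying it to a (257, n_frames) power spectrogram: mel_spec = fb @ spec
print(fb.shape)  # (32, 257)
```

Note how the triangle edges are quantized to FFT bins — this quantization is exactly why too many mel bands produce empty filters, as the warning below explains.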
Warning
Why not more mel bands? With FFT size 512 (our default), there are only 257 frequency bins. If you increase mel bands beyond ~32, the upper triangular filters become so narrow that adjacent centers map to the same FFT bin — the filter captures zero energy, producing black lines in the spectrogram. This is not a bug; it's a fundamental resolution limit: you can't have more mel bands than FFT bins support. To use more bands, increase the FFT size (e.g., 64 bands needs FFT ≥ 1024). Try it in the mel_spectrogram_explorer.py -i demo — the info panel warns when this happens.
Step 3: Log compression — Convert power to decibels. This compresses the dynamic range and makes the features more Gaussian (better for classifiers):
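A minimal version of this step (the −80 dB floor is an assumption — a common choice, not necessarily the tutorial's default):

```python
import numpy as np

def power_to_db(mel_spec, floor_db=-80.0):
    """Convert mel power to dB, clamped to a fixed dynamic range."""
    db = 10.0 * np.log10(np.maximum(mel_spec, 1e-10))  # epsilon avoids log(0)
    return np.maximum(db, db.max() + floor_db)         # keep only the top 80 dB

print(power_to_db(np.array([1.0, 0.001, 1e-12])))  # 0 dB, -30 dB, clamped to -80 dB
```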

The complete feature extraction pipeline: raw keystroke with annotated attack/resonance (1), Hann-windowed overlapping frames (2), power spectrogram from STFT (3), triangular mel filterbank showing dense low-freq and sparse high-freq bands (4), mel spectrogram (5), and the final flattened feature vector ready for SVM/CNN input (6).
Run locally: python scripts/acoustic-keystroke/visualize_pipeline.py
5.2 Amplitude Normalization
Different recording sessions may have different mic positions or gain settings. Normalizing each keystroke segment to unit peak amplitude makes the features invariant to overall volume:
This is critical — without it, a model trained at one mic distance fails at another.
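The normalization itself is tiny — a sketch:

```python
import numpy as np

def normalize_peak(segment):
    """Scale a keystroke segment to unit peak amplitude."""
    peak = np.max(np.abs(segment))
    return segment / peak if peak > 0 else segment

near = np.array([0.5, -0.25, 0.1])  # mic close to the keyboard
far = near * 0.1                    # same keystroke, mic farther away
# After normalization both segments are identical → volume-invariant features
assert np.allclose(normalize_peak(near), normalize_peak(far))
```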
5.3 Additional Features
Beyond the raw mel spectrogram, we extract three supplementary features that capture temporal dynamics:
| Feature | Shape | What it captures |
|---|---|---|
| Temporal envelope | (n_frames,) | Mean mel energy per frame — the attack/decay shape |
| Spectral centroid | (n_frames,) | Which frequency band dominates at each time step |
| Delta (first derivative) | (32, n_frames-1) | How the spectrum changes between frames |
# Temporal envelope: energy contour
frame_energy = np.mean(mel_spec, axis=0)

# Spectral centroid: "brightness" over time (energy-weighted mean band, per frame)
bands = np.arange(mel_spec.shape[0])[:, None]
centroid = np.sum(bands * mel_spec, axis=0) / np.sum(mel_spec, axis=0)

# Delta: spectral change rate
delta = np.diff(mel_spec, axis=1)
The final feature vector concatenates all four: mel_spec.flatten() + envelope + centroid + delta.flatten() → ~3100 features per keystroke.
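Concretely, with the shapes used in this tutorial (variable names are illustrative; the exact total depends on the frame count):

```python
import numpy as np

rng = np.random.default_rng(0)
mel_spec = rng.random((32, 50))   # 32 mel bands × 50 frames

envelope = np.mean(mel_spec, axis=0)                                    # (50,)
bands = np.arange(32)[:, None]
centroid = np.sum(bands * mel_spec, axis=0) / np.sum(mel_spec, axis=0)  # (50,)
delta = np.diff(mel_spec, axis=1)                                       # (32, 49)

features = np.concatenate([mel_spec.ravel(), envelope, centroid, delta.ravel()])
print(features.shape)  # (3268,) with exactly 50 frames; the text's "~3100" varies with frame count
```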
Why not MFCC?
MFCCs (Mel-Frequency Cepstral Coefficients) are the standard in speech recognition — they apply a DCT to decorrelate the mel bands. For keystroke recognition, we found that the raw mel spectrogram + delta features work well enough, and the implementation is simpler (no librosa dependency). If you want to experiment, see the Alternatives section.
6. Data Collection
Two collection methods are provided, each with different strengths:
6.1 Per-Key Collection (collect_keystrokes.py)
Press each key 20+ times in isolation. Simple but produces "robotic" data:
Features:
- Mic check at startup — shows live level meter, warns if signal is weak
- Live progress bar with SNR per detection
- 80 Hz high-pass filter applied during recording
- Per-key summary with average SNR
6.2 Typing Practice (typing_practice.py)
Type natural phrases — pangrams, common words, key-focused drills:
Features:
- Shows phrases to type with green/red feedback
- Records audio continuously + timestamps each keypress via raw terminal input
- Mistyped keys are excluded from training
- Produces more natural data — variable speed, finger transitions
Tip
Best approach: Use per-key collection for initial baseline, then augment with typing practice data. Both formats are loaded automatically by train_model.py.
6.3 Workflow: Collect on Pi, Train on Host
The Pi has the I2S mic but is slow for training (SVM with 3000+ samples and 3100 features takes minutes). The recommended workflow:
┌──────────────┐ scp ┌──────────────┐ scp ┌──────────────┐
│ Raspberry Pi │ ─────▶ │ Host PC │ ─────▶ │ Raspberry Pi │
│ │ │ │ │ │
│ 1. Collect │ │ 2. Train │ │ 3. Inference │
│ data │ │ model │ │ (live) │
└──────────────┘ └──────────────┘ └──────────────┘
Step 1 — Collect on Pi (has the I2S mic):
# On Pi
cd ~/embedded-linux/scripts/acoustic-keystroke
python collect_keystrokes.py --gain 10 --presses 30
python typing_practice.py --gain 10 --rounds 10
Step 2 — Train on host (faster CPU, more RAM):
# Copy data from Pi to host
scp -r pi@raspberrypi:~/embedded-linux/scripts/acoustic-keystroke/keystroke_data ./
# Train (much faster on host)
python train_model.py --augment keystroke_data
Step 3 — Deploy back to Pi:
# Copy model back to Pi
scp keystroke_model.pkl pi@raspberrypi:~/embedded-linux/scripts/acoustic-keystroke/
# Run inference on Pi
ssh pi@raspberrypi
cd ~/embedded-linux/scripts/acoustic-keystroke
python live_inference.py --gain 10
Tip
Quick iteration: Use rsync instead of scp to sync only changed files, e.g.:
rsync -av pi@raspberrypi:~/embedded-linux/scripts/acoustic-keystroke/keystroke_data/ ./keystroke_data/
6.4 Collection Tips
- Place mic 5–10 cm from keyboard, same position each session
- Mechanical keyboards produce stronger signals than membrane
- Minimize background noise — fan, music, talking all hurt SNR
- If SNR < 8x, move mic closer or increase --gain
- Collect at least 30 samples per key for decent accuracy
7. Training the Classifier

PCA projection of keystroke features from 5 keys (30 samples each). Each dot is one keystroke. Well-separated clusters (like 'a' vs 'space') are easy to classify; overlapping clusters would confuse the model. Right: average mel spectrogram per key shows why they cluster — each key has a distinct spectral signature.
Run locally: python scripts/acoustic-keystroke/visualize_pipeline.py (requires scikit-learn)
7.1 The Pipeline
Training uses scikit-learn's Pipeline — a chain of preprocessing + classifier that ensures the same transformations are applied at training and inference time:
pipeline = Pipeline([
("scaler", StandardScaler()), # normalize features to zero mean, unit variance
("clf", SVC(kernel='rbf', ...)) # the actual classifier
])
The StandardScaler is critical because our features have very different scales — mel spectrogram values are in dB (–100 to 0), centroids are in band indices (0–32), deltas are small differences. Without scaling, features with large numeric ranges dominate the classifier.
7.2 Support Vector Machine (SVM)
Tip
See it interactively: Run the ML decision boundary demo to visualize how SVM, Random Forest, and k-NN work on 2D data:
Drag the "Samples" slider to see how more data sharpens the decision boundary. Switch between classifiers. This is exactly what happens in 3100D with your keystroke features — you just can't plot it.The default classifier is an SVM with RBF (Radial Basis Function) kernel. Here's why it works well for this problem:
What SVM does: Finds a decision boundary that maximizes the margin between classes. In high-dimensional feature space (~3100 dimensions), there's usually a separable hyperplane even for 27 classes.
The RBF kernel: Maps features into an infinite-dimensional space where linear separation becomes possible. The kernel function measures similarity between two feature vectors: K(x, x′) = exp(−γ ‖x − x′‖²).
Two keystrokes that produce similar spectrograms have high kernel value → classified together.
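The kernel is one line of numpy; a toy 2-D demo (the γ value here is arbitrary — scikit-learn's 'scale' setting picks it from the data):

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

a1 = np.array([1.0, 2.0])  # two keystrokes of the same key → similar features
a2 = np.array([1.1, 2.1])
b = np.array([5.0, -3.0])  # a different key
print(rbf(a1, a2))  # near 1 — "looks the same"
print(rbf(a1, b))   # near 0 — "looks different"
```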
Key hyperparameters:
| Parameter | Value | Role |
|---|---|---|
| C | 10 | Regularization — higher allows more complex boundaries, risk of overfitting |
| gamma | 'scale' | RBF width — auto-scales to 1 / (n_features × variance) |
| probability | True | Enables confidence scores (needed for thresholding in live inference) |
SVM vs Neural Networks
For this dataset size (500–3000 samples, ~3100 features), SVM typically outperforms simple neural networks. SVMs are mathematically guaranteed to find the maximum-margin solution, while neural networks can get stuck in local minima. Deep learning (CNNs) only wins when you have 10,000+ samples and can learn features end-to-end.
7.3 Cross-Validation
We evaluate with 5-fold stratified cross-validation — the dataset is split into 5 parts, the model trains on 4 and tests on the held-out 1, rotating 5 times:
Fold 1: [TEST] [train] [train] [train] [train] → 78%
Fold 2: [train] [TEST] [train] [train] [train] → 82%
Fold 3: [train] [train] [TEST] [train] [train] → 74%
Fold 4: [train] [train] [train] [TEST] [train] → 80%
Fold 5: [train] [train] [train] [train] [TEST] → 79%
Mean: 78.6%
This gives an honest accuracy estimate. If training accuracy is 100% but CV accuracy is 65%, the model is overfitting — memorizing training data rather than learning generalizable patterns. The gap should be < 15%.
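In scikit-learn this is a few lines; a self-contained sketch on synthetic stand-in data (the real pipeline would pass your extracted keystroke features as X and the key labels as y):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for keystroke features: 3 "keys", 30 samples each, 40 dims
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=k, scale=0.5, size=(30, 40)) for k in range(3)])
y = np.repeat(["a", "s", "d"], 30)

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("clf", SVC(kernel="rbf", C=10, gamma="scale"))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv)  # one accuracy per fold
print(f"per-fold: {np.round(scores, 2)}  mean: {scores.mean():.2f}")
```

Stratified splitting keeps each fold's class balance equal to the full dataset's — important when some keys have fewer samples than others.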
7.4 Data Augmentation
Generate synthetic training samples from existing data to improve generalization:
import numpy as np

RATE = 48000  # samples per second

def augment_segment(segment):
    augmented = []
    # Time shift: ±3ms (simulates slightly different onset alignment)
    shift = int(RATE * 0.003)
    augmented.append(np.roll(segment, shift))
    augmented.append(np.roll(segment, -shift))
    # Additive noise (simulates different noise floors)
    augmented.append(segment + np.random.randn(len(segment)) * 0.002)
    # Gain variation ±15% (simulates different mic distances)
    augmented.append(segment * 1.15)
    augmented.append(segment * 0.85)
    return augmented  # 5 extra samples per original
With --augment, each keystroke produces 6 training samples (1 original + 5 augmented), which typically improves accuracy by 5–10%.
7.5 Training Commands
# Train with SVM (default, recommended)
python train_model.py --augment keystroke_data
# Train with Random Forest (faster, slightly lower accuracy)
python train_model.py --model rf --augment keystroke_data
# Output includes confusion pairs — which keys get mixed up
8. Real-Time Inference
8.1 How It Works
The inference loop runs in a sounddevice audio callback at 48 kHz with 10 ms blocks:
State machine:
┌──────────────────┐
┌─────────────────▶│ IDLE │
│ │ (monitor energy) │
│ └────────┬─────────┘
│ │ energy > threshold × avg
│ │ AND energy > min_energy
│ ▼
│ ┌──────────────────┐
│ │ COLLECTING │
│ │ (buffer 100ms │
│ │ post-onset) │
│ └────────┬─────────┘
│ │ 100ms collected
│ ▼
│ ┌──────────────────┐
│ │ CLASSIFY │
│ │ extract_features()│
│ │ model.predict() │
│ └────────┬─────────┘
│ │
│ ┌────────▼─────────┐
└──────────────────│ COOLDOWN │
│ (400ms silence) │
└──────────────────┘
Critical: the system collects audio after the onset, not before. The onset triggers collection; classification happens 100 ms later once the full keystroke resonance is captured.
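The state machine above can be written as a pure-Python class you can unit-test without a microphone (class name and constants are illustrative — see live_inference.py for the real implementation):

```python
import numpy as np

RATE = 48000
BLOCK = 480        # 10 ms audio blocks
POST_ONSET = 4800  # 100 ms collected from the onset block onward

class KeystrokeCapture:
    """IDLE → COLLECTING → (segment ready) → COOLDOWN, one feed() per block."""
    def __init__(self, threshold=5.0, min_energy=1e-5, cooldown_blocks=40):
        self.threshold, self.min_energy = threshold, min_energy
        self.cooldown_blocks = cooldown_blocks  # 40 × 10 ms = 400 ms
        self.energy_avg, self.cooldown = 0.0, 0
        self.collecting, self.buf = False, []

    def feed(self, block):
        """Feed one block; return a 100 ms segment when complete, else None."""
        energy = np.sum(block ** 2) / len(block)
        result = None
        if self.collecting:                      # COLLECTING: buffer post-onset audio
            self.buf.append(block)
            if sum(len(b) for b in self.buf) >= POST_ONSET:
                self.collecting = False
                self.cooldown = self.cooldown_blocks
                result = np.concatenate(self.buf)[:POST_ONSET]
        elif self.cooldown > 0:                  # COOLDOWN: ignore release echo
            self.cooldown -= 1
        elif energy > self.energy_avg * self.threshold and energy > self.min_energy:
            self.collecting = True               # IDLE → COLLECTING on onset
            self.buf = [block]
        self.energy_avg = self.energy_avg * 0.92 + energy * 0.08
        return result
```

Because feed() never looks ahead, the same code works inside a sounddevice callback: call it on each 10 ms block and classify whenever it returns a segment.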
8.2 Running
# Normal mode — predicted keys appear as you type
python live_inference.py --gain 10
# Debug mode — shows energy, detections, confidence scores
python live_inference.py --gain 10 --debug
# Tune detection sensitivity
python live_inference.py --gain 10 --threshold 3.0 --min-energy 5e-6
8.3 Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| No detections at all | Signal too weak | Increase --gain, move mic closer |
| Ghost detections on silence | min-energy too low | Increase --min-energy |
| Double detections per keystroke | Cooldown too short | Increase --cooldown-ms 500 |
| Detects but always wrong key | Model accuracy too low | Re-collect data, more samples, fewer keys |
| Low confidence (< 20%) | Features don't match training | Ensure same --gain in collection and inference |
9. How the Models Work — In Depth
9.1 SVM with RBF Kernel (Default)
Intuition: Imagine each keystroke as a point in 3100-dimensional space. Keys that sound similar cluster together. SVM draws boundaries between clusters with maximum clearance (margin).
The kernel trick: In the original feature space, clusters may overlap. The RBF kernel implicitly maps points to a higher-dimensional space where they become separable:
Multi-class strategy: SVM is inherently binary. For 27 keys, scikit-learn uses one-vs-one — trains 27×26/2 = 351 binary classifiers, each distinguishing one pair of keys. Final prediction is by majority vote.
Probability calibration: SVM outputs are distances from the decision boundary, not probabilities. With probability=True, Platt scaling fits a sigmoid to convert distances → probabilities. This is what enables the confidence threshold in live inference.
Tradeoffs:
| SVM (RBF) | Random Forest | |
|---|---|---|
| Accuracy | Higher (75–90%) | Good (70–85%) |
| Training time | Slower (O(n²) with n samples) | Faster |
| Inference time | Fast (few support vectors) | Fast (tree traversal) |
| Interpretability | Low (black box) | Medium (feature importance) |
| Data requirements | Works with 20+ samples/class | Needs 30+ samples/class |
9.2 Random Forest (Alternative)
How it works: Trains 200 independent decision trees, each on a random subset of the data and features. Each tree votes for a class; the majority wins.
Why it's useful:
- Fast to train and predict
- Provides feature_importances_ — shows which mel bands and time frames matter most
- Less prone to overfitting than a single decision tree
- No hyperparameter tuning needed (works out of the box)
9.3 Why Not Deep Learning?
A CNN could learn features end-to-end from raw spectrograms and likely reach 90%+ accuracy. But for this tutorial:
- Data size: We have ~20–50 samples per class. CNNs need thousands.
- Complexity: A CNN requires PyTorch/TensorFlow, GPU for training, and careful architecture design.
- Inference cost: On a Pi, scikit-learn prediction takes < 1 ms. A CNN takes 10–50 ms.
- Explainability: With mel features + SVM, you can inspect what the model sees. With a CNN, it's opaque.
If you have enough data (500+ samples per key) and want maximum accuracy, see Alternatives.
10. Visualization with Audio Viz
Use the audio visualizer to see the keystroke patterns. Run audio_viz_full and type on the keyboard near the microphone:
- Waveform — Sharp spikes appear at each keystroke
- Spectrogram — Vertical lines (broadband transients) with distinct frequency patterns
- Spectrum — Different keys excite different frequency peaks
This is a great way to build intuition before diving into the ML pipeline. The TDOA overlay also shows interesting effects — each key produces sound from a slightly different position on the keyboard.
Alternatives to Explore
Tip
MFCC Features
Replace the mel spectrogram with MFCCs — apply a Discrete Cosine Transform to the mel bands, keep the first 13 coefficients. This decorrelates the features and works better with Gaussian classifiers (GMM, linear SVM). Requires librosa or a manual DCT implementation.
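The DCT step needs no librosa — SciPy is enough. A sketch on a synthetic log-mel input (13 is the conventional coefficient count):

```python
import numpy as np
from scipy.fft import dct

# Stand-in for a (32, n_frames) log-mel spectrogram
mel_db = np.random.default_rng(0).random((32, 50))

# DCT along the mel axis decorrelates the bands; keep the first 13 coefficients
mfcc = dct(mel_db, type=2, axis=0, norm="ortho")[:13]
print(mfcc.shape)  # (13, 50)
```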
Tip
CNN on Spectrograms Treat each keystroke's mel spectrogram as a (32 × 50) grayscale image. Use a small CNN (2 conv layers, pool, flatten, dense). With PyTorch:
import torch
import torch.nn as nn

class KeyCNN(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * 8 * 12, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):                        # x: (batch, 1, 32, 50)
        return self.fc(self.conv(x).flatten(1))  # two 2x pools: (32, 50) → (8, 12)
Tip
Gaussian Mixture Model (GMM) Train one GMM per key (like speaker verification). At inference, compute the log-likelihood of the keystroke under each GMM; pick the highest. Works well with MFCC features and adapts to new keys without full retraining.
Tip
k-Nearest Neighbors (k-NN) The simplest classifier — compare a new keystroke to all training examples, pick the most common class among the k nearest. No training needed, but slow at inference and accuracy depends entirely on feature quality. Good for quick experiments.
Tip
Embedded Deployment (ONNX) Export the trained scikit-learn model to ONNX format for C/C++ inference:
This enables inference without Python — useful for integrating with the SDL2 audio visualizer.
Challenges
Tip
Challenge 1: Confusion Matrix
After training, print a confusion matrix (sklearn.metrics.confusion_matrix). Which keys are most often confused? Do confused keys share physical proximity on the keyboard?
Tip
Challenge 2: SDL2 Live Display
Extend audio_viz_full.c to show the predicted key on screen. Add a text field that accumulates predicted characters. Hint: run the Python inference script as a subprocess and read its stdout.
Tip
Challenge 3: Two-Microphone Improvement Use stereo capture (-c 2) to add TDOA-based position estimation as an extra feature. Each key has a different position → different arrival time difference. Does this improve accuracy?
Tip
Challenge 4: Security Implications Write a 1-page analysis: if this attack works at 80% accuracy, what are the implications? How would you defend against it? Consider: noise injection, randomized key sounds, on-screen keyboards, keystroke timing randomization.
Tip
Challenge 5: Cross-Session Robustness Collect training data on day 1, test on day 2 with slightly different mic placement. How much does accuracy drop? Experiment with adding data augmentation (gain variation, noise injection) to improve robustness.
11. From Classical ML to Deep Learning
The SVM and Random Forest classifiers work well with hand-crafted features, but they have a ceiling: you must decide what features to extract (mel bands, deltas, centroids), and the classifier can only work with what you give it. Deep learning flips this — the model learns its own features directly from the spectrogram.
This section builds a CNN (Convolutional Neural Network) that takes the raw mel spectrogram as input and learns to recognize keystrokes end-to-end. We'll see when this helps, when it doesn't, and how to deploy it on the Pi.
11.1 Why a CNN for Audio?
A mel spectrogram is a 2D matrix — frequency bins on one axis, time frames on the other. This is structurally identical to a grayscale image. CNNs excel at learning local patterns in images:
Mel spectrogram (32 × 50): What the CNN learns:
░░▓▓▓░░░░░░░░░░░░░░░ "Bright band at 800 Hz
░░▓▓▓▓▓░░░░░░░░░░░░░ that fades over 20 ms"
░░░▓▓▓▓▓▓░░░░░░░░░░░ = key 'a'
░░░░▓▓▓▓▓▓▓░░░░░░░░░
░░░░░░▓▓▓▓▓▓░░░░░░░░ "Two bands at 400+2k Hz
░░░░░░░░▓▓▓▓▓░░░░░░░ with fast decay"
▲ = key 's'
freq
time ──────────▶
The first convolutional layers learn spectro-temporal patterns — combinations of frequency bands that activate together over specific time windows. Deeper layers combine these into higher-level representations. The final layer maps these to key classes.
Convolution in Audio vs Images
In image CNNs, convolution kernels slide over 2D spatial dimensions (height × width). In our audio CNN, the two dimensions are frequency (mel bands) and time (frames). A 3×3 kernel learns relationships between adjacent frequency bands over 3 consecutive time frames — perfect for capturing the spectral evolution of a keystroke.
This is different from 1D CNNs used in raw waveform processing, where convolution operates only over the time axis. The 2D approach leverages the spectrogram's structure.
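Tracing the shapes shows where the 32 × 8 × 12 flatten size in the earlier KeyCNN sketch comes from (illustrative; requires PyTorch):

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 1, 32, 50)  # (batch, channels, mel bands, time frames)
block1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
block2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

h1 = block1(x)   # (1, 16, 16, 25): padding=1 preserves 32×50, pooling halves it
h2 = block2(h1)  # (1, 32, 8, 12): 25 // 2 = 12 → 32·8·12 = 3072 dense inputs
print(h1.shape, h2.shape)
```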
11.2 When Does CNN Beat SVM?
Not always. The tradeoff depends on data quantity:
| Samples per key | SVM accuracy | CNN accuracy | Winner |
|---|---|---|---|
| 20 | 70–80% | 40–60% | SVM — CNN overfits badly |
| 50 | 75–85% | 65–80% | SVM — still not enough data |
| 100 | 78–88% | 80–88% | Tied |
| 200+ | 80–90% | 88–95% | CNN — learned features surpass hand-crafted |
| 500+ | 82–90% | 92–97% | CNN — significant advantage |
Key insight: SVM with hand-crafted features has a lower data floor (works with 20 samples) but a lower accuracy ceiling. CNN has a higher data floor (needs 100+ samples) but a higher ceiling. This is a fundamental pattern in ML — it's not that one method is "better," it's about the data regime you're operating in.
The Bias-Variance Tradeoff
This is one of the most important concepts in machine learning. Every model makes a tradeoff:
- High bias (underfitting): The model is too simple to capture the patterns. SVM with bad features has this problem — it can't learn what it can't see.
- High variance (overfitting): The model is too complex for the data. A CNN with 20 samples memorizes each example instead of learning generalizable patterns.
With small data → prefer simpler models (SVM, RF). With large data → complex models (CNN) can learn richer representations. Adding regularization (dropout, data augmentation) shifts the tradeoff, letting complex models work with less data.
11.3 Collecting More Data
To give the CNN enough data, collect 50+ samples per key using both collection methods:
# Per-key collection: 50 presses per key for all lowercase + space
python collect_keystrokes.py --gain 10 --presses 50 --keys "abcdefghijklmnopqrstuvwxyz "
# Typing practice: 20 rounds of natural typing
python typing_practice.py --gain 10 --rounds 20
Tip
Data quality matters more than quantity. 50 clean samples beats 200 noisy ones. Before collecting:
- Minimize background noise (close windows, turn off fans)
- Keep mic position fixed
- Type at your normal speed (don't artificially slow down)
- Verify with mic_test.py that SNR > 10x before starting
11.4 Building the CNN
We use PyTorch for the CNN because it's explicit — you see every layer, every dimension, every operation. No "magic."
Setup
# On host (training) — Pi is too slow for CNN training
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Network Architecture
The CNN is deliberately small — 2 conv layers, then a fully connected classifier. This is enough for 32×50 spectrograms with 27 classes:
import torch
import torch.nn as nn

class KeystrokeCNN(nn.Module):
    def __init__(self, n_classes, n_mels=32, n_frames=50):
        super().__init__()
        # Feature extraction: two conv blocks
        self.features = nn.Sequential(
            # Block 1: 1 → 16 channels, 3×3 conv + ReLU + pool
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),  # (16, 16, 25)
            # Block 2: 16 → 32 channels
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),  # (32, 8, 12)
        )
        # Classifier
        flat_size = 32 * (n_mels // 4) * (n_frames // 4)
        self.classifier = nn.Sequential(
            nn.Dropout(0.3),  # regularization
            nn.Linear(flat_size, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, n_classes)
        )

    def forward(self, x):
        # x shape: (batch, 1, n_mels, n_frames)
        x = self.features(x)
        x = x.view(x.size(0), -1)  # flatten
        x = self.classifier(x)
        return x
Layer by Layer
Conv2d(1, 16, 3): Takes the single-channel spectrogram and applies 16 different 3×3 filters. Each filter learns to detect a different spectro-temporal pattern (e.g., "energy at 1 kHz decaying over 2 frames"). Output: 16 feature maps of the same size.
BatchNorm2d(16): Normalizes each feature map to zero mean and unit variance. This stabilizes training — without it, deeper layers see wildly varying input ranges and learn slowly.
ReLU: max(0, x) — zeroes out negative activations. This introduces non-linearity, allowing the network to learn complex patterns. Without it, stacking linear layers would be equivalent to a single linear layer.
MaxPool2d(2): Takes every 2×2 block and keeps only the maximum. This halves the spatial dimensions (32×50 → 16×25), making the network invariant to small shifts in onset timing or frequency.
Dropout(0.3): Randomly zeroes 30% of activations during training. Forces the network to not rely on any single neuron — a powerful regularizer that prevents overfitting, especially critical with small datasets.
Linear(flat_size, 64): Fully connected layer that combines all the learned features into 64 abstract representations.
Linear(64, n_classes): Final layer — 27 outputs (one per key). The highest output is the predicted class.
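The shape bookkeeping above can be verified without PyTorch: the 3×3 convs with padding=1 preserve the spatial size, and each MaxPool2d(2) floor-halves both axes. A minimal sketch:

```python
def flat_size(n_mels=32, n_frames=50):
    """Flattened feature size entering the classifier."""
    h, w = n_mels, n_frames
    h, w = h // 2, w // 2   # after block 1 pool → (16 ch, 16, 25)
    h, w = h // 2, w // 2   # after block 2 pool → (32 ch, 8, 12)
    return 32 * h * w       # 32 channels × 8 × 12

print(flat_size())  # 3072 — matches 32 * (n_mels // 4) * (n_frames // 4)
```

This is the same value the `flat_size` expression in `KeystrokeCNN.__init__` computes, which is why the first `Linear` layer takes 3072 inputs.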
Training Loop
def train_cnn(X, y, n_classes, epochs=50, lr=0.001, batch_size=32):
    """Train CNN on mel spectrogram data.

    X: numpy array (n_samples, n_mels, n_frames)
    y: numpy array of integer labels (0..n_classes-1)
    """
    # Convert to PyTorch tensors
    X_tensor = torch.FloatTensor(X).unsqueeze(1)  # add channel dim
    y_tensor = torch.LongTensor(y)
    dataset = torch.utils.data.TensorDataset(X_tensor, y_tensor)

    # 80/20 train/val split
    n_val = len(dataset) // 5
    n_train = len(dataset) - n_val
    train_set, val_set = torch.utils.data.random_split(
        dataset, [n_train, n_val])
    train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=batch_size, shuffle=True)
    val_loader = torch.utils.data.DataLoader(
        val_set, batch_size=batch_size)

    model = KeystrokeCNN(n_classes)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        # Training
        model.train()
        train_loss = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            output = model(X_batch)
            loss = criterion(output, y_batch)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        # Validation
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                output = model(X_batch)
                _, predicted = torch.max(output, 1)
                total += y_batch.size(0)
                correct += (predicted == y_batch).sum().item()
        val_acc = correct / total if total > 0 else 0

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs} "
                  f"loss={train_loss/len(train_loader):.3f} "
                  f"val_acc={val_acc:.1%}")
    return model
What Happens During Training
Each epoch passes through the entire training set once. In each step:
- Forward pass: input spectrogram → conv layers → predicted class probabilities
- Loss calculation: CrossEntropyLoss measures how far the prediction is from the true label. If the model is confident and correct → low loss. Confident and wrong → high loss.
- Backward pass (backpropagation): Compute the gradient of the loss with respect to every weight in the network. This tells each weight "which direction should I change to reduce the loss?"
- Optimizer step: Adam takes a small step against each weight's gradient. The learning rate (0.001) controls how big each step is.
Over 50 epochs, the weights gradually adjust until the network correctly classifies most training examples. The validation accuracy tells us if this generalizes to unseen data.
Signs of trouble:
- Training accuracy high, validation low → overfitting (need more data or more dropout)
- Both accuracies plateau early → underfitting (need more capacity or a lower learning rate)
- Loss oscillates wildly → learning rate too high (reduce by 10×)
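The learning-rate failure mode can be seen on a toy 1-D loss, f(w) = w² (a sketch, not the CNN's actual loss surface):

```python
def gd(lr, steps=20, w=1.0):
    """Run gradient descent on f(w) = w^2 and return the final |w|."""
    for _ in range(steps):
        w -= lr * 2 * w   # gradient of w^2 is 2w
    return abs(w)

print(gd(0.1))   # shrinks steadily toward the minimum at 0
print(gd(1.1))   # each step overshoots and flips sign: |w| grows, training diverges
```

Adam adapts its step size per weight, which softens but does not eliminate this effect; a base learning rate that is too large still makes the loss oscillate.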
11.5 SVM vs CNN: A Fair Comparison
Run both on the same dataset to see the difference:
# Train SVM (baseline)
python train_model.py --augment keystroke_data
# → "SVM accuracy: 82.3% (±3.1%)"
# Train CNN (compare)
python train_model.py --model cnn --augment keystroke_data
# → "CNN accuracy: 87.5% (±2.8%)" (with 100+ samples/key)
The training script handles both models through the --model flag. The CNN uses the raw mel spectrogram (32×50 matrix) while the SVM uses the flattened spectrogram + extra features (3100-element vector).
Note
Why the CNN might not win with small data: With 20 samples per key, the CNN has roughly 200,000 parameters to learn from ~540 examples. That's nearly 400 parameters per example; severe overfitting is guaranteed. The SVM, by contrast, has a mathematically principled regularization (the margin) that works even with very few samples.
The real lesson: Neither model is universally "better." The right choice depends on your data budget, latency requirements, and deployment constraints. This is true across all of ML — not just keystroke recognition.
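The overfitting arithmetic can be tallied directly from the KeystrokeCNN layer sizes. This sketch counts learned weights and biases only (BatchNorm running statistics are not trained by gradient descent and are excluded):

```python
def count_params(n_classes=27, n_mels=32, n_frames=50):
    """Rough learned-parameter count for the KeystrokeCNN architecture."""
    conv1 = 1 * 16 * 3 * 3 + 16           # Conv2d(1, 16, 3): weights + biases
    bn1   = 2 * 16                        # BatchNorm2d(16): scale + shift
    conv2 = 16 * 32 * 3 * 3 + 32          # Conv2d(16, 32, 3)
    bn2   = 2 * 32
    flat  = 32 * (n_mels // 4) * (n_frames // 4)   # 3072
    fc1   = flat * 64 + 64                # Linear(flat, 64) dominates the count
    fc2   = 64 * n_classes + n_classes    # Linear(64, n_classes)
    return conv1 + bn1 + conv2 + bn2 + fc1 + fc2

print(count_params())  # → 203323, i.e. ~200k parameters
```

Almost all of the parameters sit in the first fully connected layer, which is why shrinking the spectrogram with pooling before flattening matters so much for small datasets.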
11.6 Deploying the CNN on Raspberry Pi
The trained PyTorch model can run on the Pi, but the full PyTorch runtime is heavyweight for its ARM CPU. We export the model to ONNX (Open Neural Network Exchange) format and run inference with the lightweight ONNX Runtime:
Export
import torch

# Load trained model
model = KeystrokeCNN(n_classes)
model.load_state_dict(torch.load("keystroke_cnn.pt"))
model.eval()

# Export to ONNX
dummy = torch.randn(1, 1, 32, 50)  # batch=1, channels=1, mels=32, frames=50
torch.onnx.export(model, dummy, "keystroke_cnn.onnx",
                  input_names=["spectrogram"],
                  output_names=["logits"],
                  dynamic_axes={"spectrogram": {0: "batch"}})
Install ONNX Runtime on Pi
pip3 install onnxruntime
Inference
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("keystroke_cnn.onnx")

def predict_cnn(mel_spec):
    """mel_spec: numpy array (32, 50)"""
    input_data = mel_spec[np.newaxis, np.newaxis, :, :].astype(np.float32)
    logits = session.run(None, {"spectrogram": input_data})[0]
    predicted = np.argmax(logits, axis=1)[0]
    exp = np.exp(logits[0] - np.max(logits[0]))  # shift for numerical stability
    confidence = exp / np.sum(exp)               # softmax
    return predicted, confidence[predicted]
Latency Comparison
| | SVM (scikit-learn) | CNN (ONNX Runtime) |
|---|---|---|
| Model size | ~2 MB (.pkl) | ~0.5 MB (.onnx) |
| Inference time (Pi 4) | < 1 ms | ~3 ms |
| Inference time (Pi Zero) | ~5 ms | ~15 ms |
| Dependencies | scikit-learn (~50 MB) | onnxruntime (~5 MB) |
| Training | Host CPU, minutes | Host CPU, minutes (GPU: seconds) |
Both are well within the real-time budget: onset detection gives us 100 ms of collection time plus 400 ms cooldown, while even the slowest case in the table (CNN on a Pi Zero) needs only ~15 ms.
Why ONNX and Not TFLite?
Both are valid deployment formats. ONNX has better scikit-learn interoperability (via skl2onnx) and a simpler Python API. TFLite is better if you're using TensorFlow/Keras for training and want int8 quantization for MCU deployment. For Pi-class hardware, ONNX Runtime is the simpler path.
For MCU deployment (ESP32, STM32), TFLite Micro with int8 quantization would be the right choice — but that's beyond the scope of this tutorial.
12. Understanding What the CNN Learns
One criticism of deep learning is that it's a "black box." But we can peek inside.
12.1 Visualizing Filters
The first conv layer's 16 filters show what low-level patterns the network learned:
import matplotlib.pyplot as plt

# Extract first conv layer weights
weights = model.features[0].weight.detach().numpy()  # (16, 1, 3, 3)

# Plot as 16 small heatmaps
fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for i, ax in enumerate(axes.flat):
    ax.imshow(weights[i, 0], cmap='RdBu', vmin=-0.5, vmax=0.5)
    ax.set_title(f'F{i}')
    ax.axis('off')
Typical patterns you'll see:
- Horizontal edges: frequency band boundaries (specific resonances)
- Vertical edges: onset transients (the "click" of the keypress)
- Diagonal patterns: frequency sweeps (resonance decay)
12.2 Confusion Heatmap
Which keys does the CNN confuse?
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Run predictions on validation set
# (model_predict_all: helper that runs the CNN over all validation samples)
y_pred = model_predict_all(X_val)
cm = confusion_matrix(y_val, y_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d',
            xticklabels=key_names, yticklabels=key_names)
plt.xlabel('Predicted')
plt.ylabel('True')
Expected patterns:
- Adjacent keys (e.g., 'f'/'g') are most confused — similar position, similar resonance
- Keys pressed with the same finger (e.g., 'q'/'a'/'z') share activation patterns
- Spacebar and Enter are rarely confused with letter keys (very different mechanics)
12.3 t-SNE: Visualizing the Feature Space
t-SNE reduces the CNN's internal representation from 64 dimensions to 2D for visualization:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Get features from the layer before the final classifier
features = model.get_intermediate_features(X_all)  # (n_samples, 64)

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedded = tsne.fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=y_all, cmap='tab20', s=5)
If the model works well, you'll see distinct clusters — one per key. Keys that the model confuses will have overlapping clusters. This visualization immediately shows:
- Which keys are easy — well-separated clusters
- Which keys are hard — overlapping clusters
- Whether more data would help — if clusters have clear structure but thin boundaries, more data would sharpen them
13. Background Data Collection
Manual collection ("press A 50 times") is tedious and produces robotic, unnatural data. Real typing has variable speed, finger transitions, and rhythm that a classifier needs to learn. The solution: collect data passively in the background while the user types normally in any application.
13.1 The Idea
A background daemon runs continuously:
1. Listens to the I2S microphone for keystroke onsets
2. Simultaneously reads keyboard events from /dev/input/eventN
3. Matches each audio onset to the nearest key event by timestamp
4. Saves the labeled pair (audio segment + key identity) to disk
Over days of normal use, this accumulates thousands of naturally-labeled samples — enough for a CNN.
┌──────────────┐ ┌──────────────┐
│ I2S Mic │─── onset detect ──▶│ │
│ (always on) │ + 100ms audio │ Timestamp │
└──────────────┘ │ Matcher │──▶ keystroke_data/
│ │ key_a_00142.npy
┌──────────────┐ │ |audio_ts - │ key_e_00143.npy
│ /dev/input/ │─── key events ────▶│ key_ts| │ key_space_00144.npy
│ event0 │ + timestamps │ < 50ms? │ ...
│ (keyboard) │ │ │
└──────────────┘ └──────────────┘
13.2 Timestamp Matching
Both the audio onset and the keyboard event have kernel timestamps. A keystroke produces sound a few milliseconds after the key switch closes (mechanical travel time). The matching window is generous:
MAX_MATCH_MS = 50  # audio onset within 50 ms of key event

def match_onset_to_key(audio_ts, key_events):
    """Find the closest key event to an audio onset timestamp."""
    best_key = None
    best_dt = MAX_MATCH_MS / 1000.0
    for key, ts in key_events:
        dt = abs(audio_ts - ts)
        if dt < best_dt:
            best_dt = dt
            best_key = key
    return best_key  # None if no match within window
Why 50 ms? The key switch closes → sound propagates through chassis → reaches microphone. Total delay is 5–30 ms depending on keyboard type. 50 ms gives margin for timing jitter.
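To see the window behave, the matcher can be exercised with illustrative timestamps (a standalone copy of the function above; the event times are hypothetical):

```python
MAX_MATCH_MS = 50  # audio onset within 50 ms of key event

def match_onset_to_key(audio_ts, key_events):
    """Find the closest key event to an audio onset timestamp."""
    best_key = None
    best_dt = MAX_MATCH_MS / 1000.0
    for key, ts in key_events:
        dt = abs(audio_ts - ts)
        if dt < best_dt:
            best_dt = dt
            best_key = key
    return best_key  # None if no match within window

events = [('a', 10.000), ('s', 10.200)]
print(match_onset_to_key(10.015, events))  # 'a' (onset 15 ms after the key event)
print(match_onset_to_key(10.120, events))  # None (80 ms from both events)
```

Unmatched onsets like the second one are exactly what the privacy section below says to discard: they are probably door slams or speech, not keystrokes.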
13.3 Reading Keyboard Events
The Linux input subsystem (covered in the Input Events tutorial) provides raw key press/release events:
import struct, os

INPUT_EVENT_FORMAT = 'llHHi'  # struct input_event
EVENT_SIZE = struct.calcsize(INPUT_EVENT_FORMAT)
EV_KEY = 0x01
KEY_PRESS = 1

# Map Linux keycodes to characters
KEYCODE_MAP = {
    30: 'a', 48: 'b', 46: 'c', 32: 'd', 18: 'e', 33: 'f',
    34: 'g', 35: 'h', 23: 'i', 36: 'j', 37: 'k', 38: 'l',
    50: 'm', 49: 'n', 24: 'o', 25: 'p', 16: 'q', 19: 'r',
    31: 's', 20: 't', 22: 'u', 47: 'v', 17: 'w', 45: 'x',
    21: 'y', 44: 'z', 57: 'space',
}

def read_key_events(device_path):
    """Generator that yields (key_name, timestamp) for each key press."""
    fd = os.open(device_path, os.O_RDONLY)
    while True:
        data = os.read(fd, EVENT_SIZE)
        tv_sec, tv_usec, ev_type, code, value = struct.unpack(
            INPUT_EVENT_FORMAT, data)
        if ev_type == EV_KEY and value == KEY_PRESS:
            key = KEYCODE_MAP.get(code)
            if key:
                ts = tv_sec + tv_usec / 1e6
                yield key, ts
Note
Reading /dev/input/eventN requires the input group or root. The course setup_pi.sh adds the user to this group. See Input Events for details on the input subsystem.
13.4 The Background Daemon
The daemon combines audio onset detection with key event reading in two threads:
#!/usr/bin/env python3
"""background_collector.py — Passive keystroke data collection daemon.

Runs in the background while the user types normally. Matches audio
onsets from the I2S mic to keyboard events from /dev/input/eventN.
Saves labeled audio segments to keystroke_data/ over time.

Run:  python background_collector.py --gain 10 --input-device /dev/input/event0
Stop: Ctrl+C or kill the process
"""
import threading, queue, time, os, numpy as np, sounddevice as sd

RATE = 48000
BLOCK_MS = 10
BLOCK = int(RATE * BLOCK_MS / 1000)
COLLECT_MS = 100  # post-onset audio to capture
COLLECT_SAMPLES = int(RATE * COLLECT_MS / 1000)
COOLDOWN_S = 0.4
MAX_MATCH_S = 0.05  # 50 ms matching window

# Shared state
key_event_log = []  # [(key_name, timestamp), ...]
log_lock = threading.Lock()
save_dir = "keystroke_data/background"

def audio_callback(indata, frames, time_info, status):
    """Called every 10 ms with audio data."""
    # ... onset detection + segment collection ...
    # When onset detected + segment collected:
    #   match to nearest key event, save if matched
    pass

def key_listener(device_path):
    """Thread: read keyboard events, store with timestamps."""
    for key, ts in read_key_events(device_path):
        with log_lock:
            key_event_log.append((key, ts))
            # Keep only last 2 seconds of events
            cutoff = time.time() - 2.0
            key_event_log[:] = [(k, t) for k, t in key_event_log
                                if t > cutoff]
Running as a systemd service
For always-on collection, create a systemd user service:
# ~/.config/systemd/user/keystroke-collector.service
[Unit]
Description=Background keystroke audio collector
[Service]
ExecStart=/usr/bin/python3 /home/linux/embedded-linux/scripts/acoustic-keystroke/background_collector.py --gain 10
Restart=on-failure
[Install]
WantedBy=default.target
Enable it with systemctl --user enable --now keystroke-collector.service; labeled samples then accumulate in ~/keystroke_data/background/ while you work normally.
13.5 Dataset Growth and Quality Monitoring
Over time, the dataset grows:
Day 1: ~500 labeled samples (casual typing, emails)
Day 3: ~2,000 samples (coding sessions, documentation)
Day 7: ~5,000 samples (enough for CNN baseline)
Day 30: ~20,000+ samples (robust CNN with augmentation)
Monitor collection quality with a simple script:
# check_dataset.py — show collection statistics
import os
from collections import Counter

data_dir = "keystroke_data/background"
counts = Counter()
for f in os.listdir(data_dir):
    if f.endswith('.npy'):
        key = f.split('_')[1]  # key_a_00142.npy → 'a'
        counts[key] += 1

print(f"Total samples: {sum(counts.values())}")
print(f"Keys represented: {len(counts)}/27")
print(f"\nPer-key counts:")
for key, n in counts.most_common():
    bar = '█' * (n // 10)
    print(f"  {key:>6}: {n:5d} {bar}")
Class imbalance is expected — 'e' and space appear far more often than 'z' or 'q' in English text. Solutions:
- Oversampling: duplicate rare-key samples during training
- Data augmentation: generate synthetic variants of rare keys
- Class-weighted loss: tell the CNN to penalize rare-key errors more
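The class-weighted option can be sketched with inverse-frequency weights. This is a standalone toy; in a PyTorch training setup the resulting values, ordered by class index, would typically be passed as nn.CrossEntropyLoss(weight=...) — an assumption about your training loop, not code from the repo:

```python
from collections import Counter

def inverse_freq_weights(labels):
    """Weight each class inversely to its count; a balanced class gets 1.0."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {k: total / (n_classes * c) for k, c in counts.items()}

# Toy labels: 'e' appears 3x as often as 'z'
w = inverse_freq_weights(['e', 'e', 'e', 'z'])
print(w['z'] / w['e'])  # ≈ 3.0: a misclassified rare key is penalized 3x harder
```

Oversampling achieves a similar effect by repetition instead of weighting; weighted loss is usually the cheaper option because it leaves the dataset untouched.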
13.6 Privacy Considerations
Warning
Background keystroke collection raises serious privacy concerns:
- The audio may capture conversations, not just keystrokes
- The key log is literally a keylogger
- Passwords, private messages, and sensitive data pass through
For this course: only collect on your own device with your own typing. Never deploy on shared or public machines without explicit consent.
Good practices:
- Save only the 100 ms audio segments around detected onsets, not continuous audio
- Discard segments that don't match a key event (likely speech)
- Store data locally, never transmit over network
- Add a visible indicator (LED, tray icon) when collection is active
14. Language Model Post-Processing
Acoustic classification alone achieves 80-95% per-character accuracy. But humans don't type random characters — they type words. A language model can correct acoustic errors by finding the most likely word that matches the noisy predictions.
14.1 The Problem
The acoustic model predicts one key at a time. Some keys sound similar (adjacent on the keyboard → similar resonance). Typical confusions:
Acoustic prediction: "thw quicj bropn fox"
Actual typing: "the quick brown fox"
↑ ↑ ↑
'e'→'w' 'k'→'j' 'w'→'p' confused
Without language correction, 3 of the 4 words are wrong: a 75% word error rate. With language correction: "thw" → "the" (obvious), "quicj" → "quick" (one character off), "bropn" → "brown", giving 0% word error rate.
14.2 How Language Models Help
The acoustic model gives a probability distribution over keys for each keystroke. The language model gives a probability for each word given previous words. Combine them:
Score(word) = Acoustic_score × Language_score
For keystroke sequence [t, h, w]:
Acoustic: P(w|audio) = 0.4, P(e|audio) = 0.35, P(s|audio) = 0.15
Candidate words:
"the" → acoustic: P(t)×P(h)×P(e) = 0.9×0.8×0.35 = 0.252
language: P("the") = 0.07 (very common word)
combined: 0.252 × 0.07 = 0.0176
"thw" → acoustic: P(t)×P(h)×P(w) = 0.9×0.8×0.4 = 0.288
language: P("thw") = 0.00001 (not a word)
combined: 0.288 × 0.00001 = 0.0000029
Winner: "the" (6000× more likely than "thw")
The language model doesn't need to be complex. Even a simple word frequency table ("the" appears in 7% of English text) dramatically reduces errors.
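The arithmetic in the worked example above can be checked in a few lines (probabilities copied from the example):

```python
def combined_score(char_probs, word_prob):
    """Product of per-character acoustic probabilities times the word prior."""
    s = 1.0
    for p in char_probs:
        s *= p
    return s * word_prob

the = combined_score([0.9, 0.8, 0.35], 0.07)     # ≈ 0.01764
thw = combined_score([0.9, 0.8, 0.40], 0.00001)  # ≈ 0.00000288
print(the / thw)  # ≈ 6125: "the" wins despite the lower acoustic score
```

Note that the language prior differs by a factor of 7000 while the acoustic scores differ by barely 15%, which is why even a crude word-frequency table is so effective.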
14.3 Simple N-gram Language Model
An n-gram model estimates word probability from the previous n-1 words:
from collections import defaultdict, Counter

class BigramModel:
    """Simple bigram (2-gram) language model.
    P(word | previous_word) estimated from text corpus."""

    def __init__(self):
        self.bigrams = defaultdict(Counter)  # prev → {word: count}
        self.unigrams = Counter()            # word → count
        self.total = 0

    def train(self, text_file):
        """Train on a text file (one sentence per line)."""
        with open(text_file) as f:
            for line in f:
                words = ['<s>'] + line.lower().split() + ['</s>']
                for i in range(1, len(words)):
                    self.bigrams[words[i-1]][words[i]] += 1
                    self.unigrams[words[i]] += 1
                    self.total += 1

    def prob(self, word, prev_word='<s>'):
        """P(word | prev_word) with simple smoothing."""
        bigram_count = self.bigrams[prev_word][word]
        prev_total = sum(self.bigrams[prev_word].values())
        if prev_total > 0:
            # Interpolate bigram and unigram probabilities
            p_bi = bigram_count / prev_total
            p_uni = self.unigrams[word] / self.total
            return 0.7 * p_bi + 0.3 * p_uni  # weighted mix
        else:
            return self.unigrams.get(word, 1) / self.total
Training data: Any large English text works — Wikipedia dumps, Project Gutenberg books, or even /usr/share/dict/words. For domain-specific use (coding), train on source code.
14.4 Beam Search Decoder
Beam search efficiently finds the most likely word by exploring multiple candidates simultaneously:
def beam_decode(acoustic_probs, lm, beam_width=5, prev_word='<s>'):
    """Decode a sequence of acoustic probability distributions into a word.

    acoustic_probs: list of dicts [{char: probability}, ...]
    lm: language model with .prob(word, prev_word)
    beam_width: number of candidates to keep at each step
    """
    # Start with empty candidates
    beams = [('', 1.0)]  # (partial_word, cumulative_score)
    for probs in acoustic_probs:
        new_beams = []
        for partial, score in beams:
            for char, p_acoustic in probs.items():
                if p_acoustic < 0.05:  # prune unlikely characters
                    continue
                new_word = partial + char
                new_score = score * p_acoustic
                new_beams.append((new_word, new_score))
        # Keep only top beam_width candidates
        new_beams.sort(key=lambda x: -x[1])
        beams = new_beams[:beam_width]

    # Score final candidates with language model
    scored = []
    for word, acoustic_score in beams:
        lm_score = lm.prob(word, prev_word)
        combined = acoustic_score * lm_score
        scored.append((word, combined, acoustic_score, lm_score))
    scored.sort(key=lambda x: -x[1])
    return scored[0][0]  # best word
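The pruning loop at the heart of the decoder can be exercised standalone on the "thw"/"the" distributions from Section 14.2. With no language model in play yet, the acoustically best string wins:

```python
def beam_step(beams, probs, width):
    """Extend each beam by every candidate character, keep the top `width`."""
    new = [(word + c, score * p) for word, score in beams
           for c, p in probs.items()]
    new.sort(key=lambda x: -x[1])
    return new[:width]

beams = [('', 1.0)]
for probs in [{'t': 0.9, 'f': 0.1},
              {'h': 0.8, 'j': 0.2},
              {'w': 0.4, 'e': 0.35, 's': 0.25}]:
    beams = beam_step(beams, probs, width=3)

print(beams[0])  # best beam is 'thw' with score ≈ 0.288, before LM rescoring
```

Because "the" (score ≈ 0.252) survives in the beam, the final language-model rescoring in beam_decode can still promote it to the winner — that is the whole point of keeping beam_width candidates instead of greedily taking the top character at each step.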
14.5 Integration Pipeline
The complete system chains acoustic classification, beam search, and language correction:
Audio stream
│
▼
Onset detection ──► 100ms segment ──► Feature extraction
│ │
│ ▼
│ CNN / SVM prediction
│ (per-key probabilities)
│ │
│ ▼
│ Beam search decoder
│ (acoustic × language model)
│ │
│ ┌───────────────────┘
│ ▼
│ Corrected word
│ │
│ ▼
│ Autocomplete candidates
│ (optional: suggest completions
│ from language model)
│ │
▼ ▼
Display: "the quick brown f|ox"
↑ cursor + suggestions
Connection to Speech Recognition
This architecture — acoustic model + language model + beam search — is exactly how automatic speech recognition (ASR) systems like Whisper, DeepSpeech, and Kaldi work. The acoustic model converts sound to character probabilities, and the language model finds the most likely text. In ASR, the acoustic model is a large neural network; here, it's our simpler CNN. The language model and decoder are identical.
This is why this tutorial is a good stepping stone toward understanding speech recognition pipelines — the core architecture is the same, just at different scales.
14.6 Expected Improvement
| Approach | Per-character accuracy | Word accuracy |
|---|---|---|
| SVM only (20 samples/key) | 75% | ~50% |
| CNN only (200 samples/key) | 90% | ~75% |
| CNN + unigram LM | 90% → 94% corrected | ~88% |
| CNN + bigram LM | 90% → 96% corrected | ~93% |
| CNN + bigram LM + autocomplete | 90% → 98% effective | ~97% |
The language model contributes most when the acoustic model is uncertain. If the CNN is 99% accurate, the LM barely helps. If the CNN is 70% accurate, the LM can recover many words that are obvious from context.
Challenges (Extended)
Tip
Challenge 6: Data Budget Experiment Train the SVM with 10, 20, 50, and 100 samples per key. Plot accuracy vs. dataset size. At what point does accuracy plateau? This tells you the minimum data investment needed for your keyboard + mic setup.
Tip
Challenge 7: CNN vs SVM Learning Curves Plot training and validation accuracy vs. epoch for the CNN. Compare the final accuracy to SVM at the same dataset size. Create a table like Section 11.2 with your actual numbers.
Tip
Challenge 8: Feature Importance Analysis
Train a Random Forest and plot feature_importances_. Which mel bands are most important? Which time frames? Map these back to the spectrogram to understand what acoustic properties distinguish keys. Compare to what the CNN's first-layer filters learned.
Tip
Challenge 9: Cross-Keyboard Transfer Train on one keyboard, test on a different one. How much accuracy drops? Can you improve transfer by: - Using only spectral shape (normalized per-frame) instead of absolute magnitudes? - Fine-tuning the CNN on 5 samples per key from the new keyboard?
Tip
Challenge 10: Real-Time CNN Inference
Modify live_inference.py to use the ONNX model instead of scikit-learn. Measure the inference latency on the Pi. Is the CNN fast enough for real-time? What's the accuracy difference in practice?
Further Reading
References
Acoustic keystroke recognition: - Asonov & Agrawal (2004), Keyboard Acoustic Emanations — the original paper - Zhuang et al. (2009), Keyboard Acoustic Emanations Revisited — improved methods, 96% accuracy - Compagno et al. (2017), Don't Skype & Type! — attack over VoIP
Machine learning foundations: - Andrew Ng, Machine Learning Specialization — free course, excellent intuitive explanations - 3Blue1Brown, Neural Networks — visual explanations of backpropagation
Embedded ML deployment: - ONNX Runtime documentation — deployment on ARM/x86 - TFLite Micro — for MCU deployment (ESP32, STM32) - Pete Warden & Daniel Situnayake, TinyML (O'Reilly) — the reference book for ML on embedded devices
Audio signal processing: - Signal Processing Reference — sampling, FFT, filtering - Julius O. Smith III, Mathematics of the DFT — free online textbook
See also: ML and Signal Processing Reference | Signal Processing Reference | Audio Viz Challenges | I2S Audio Visualizer | Audio Pipeline Latency