Acoustic Keystroke Recognition
Time: 120 min (Sections 1–10) + 60 min extension (Sections 11–12: CNN) | Prerequisites: I2S Audio Visualizer, Python basics
Use the I2S microphone to recognize which key is being typed on a nearby keyboard — purely from audio. This tutorial walks through signal conditioning, feature extraction, classifier training, and real-time inference on the Raspberry Pi.
All source code is in src/embedded-linux/scripts/acoustic-keystroke/.
Why This Works
Every key on a keyboard produces a slightly different sound depending on its position, the mechanical structure underneath, and how it resonates through the chassis. The differences are subtle — usually too small for the ear to pick out — but a spectrogram reveals distinct patterns per key.
Research has demonstrated > 90% accuracy on full keyboards using just a single microphone (Asonov & Agrawal 2004, Zhuang et al. 2009).
Physical model:
Key press → plunger hits membrane/switch → vibration through chassis
→ propagates to microphone → unique spectral signature
Why keys differ:
- Position: corner keys have more chassis damping than center keys
- Mechanism: different spring compression paths
- Travel distance: spacebar vs letter keys
- Finger impact: different fingers, different force profiles
Warning
This tutorial is for educational purposes — understanding audio classification, feature extraction, and embedded ML. Acoustic keystroke attacks are a real security concern. Always get consent before recording anyone's typing.
1. Architecture
┌─────────────┐ ┌───────────┐ ┌─────────────┐ ┌──────────┐ ┌──────────┐
│ I2S Mic │───▶│ Gain + │───▶│ Onset │───▶│ Feature │───▶│ ML Model │
│ (INMP441) │ │ High-pass │ │ Detection │ │ Extract │ │ (SVM / │
│ 48kHz │ │ 80Hz HPF │ │ (energy │ │ (mel │ │ RF) │
│ │ │ 10x gain │ │ ratio) │ │ spec) │ │ │
└─────────────┘ └───────────┘ └─────────────┘ └──────────┘ └──────────┘
│
┌───────────────────┘
▼
┌───────────┐
│ Predicted │
│ Key: 'a' │
└───────────┘
Three phases:
- Data collection — Type each key repeatedly while recording audio. Label each keystroke.
- Training — Extract features from labeled keystrokes, train a classifier.
- Inference — Detect keystrokes in live audio, extract features, classify.
Visual Pipeline Demo
Generate all the visualizations for this tutorial (works with synthetic data, no mic needed):
cd ~/embedded-linux/scripts/acoustic-keystroke
pip3 install numpy matplotlib scipy scikit-learn # one-time
python visualize_pipeline.py # saves PNGs
python visualize_pipeline.py -i # interactive (all plots at once)
1. Onset Detection — Shows the raw waveform with keystroke spikes, the energy-per-block trace, the energy/average ratio with the threshold line, and the 100ms capture windows. You can see exactly how the detector picks out keystrokes from background noise.
2. Keystroke Comparison — 5 different keys shown side by side: waveform, linear spectrogram, and mel spectrogram for each. Notice how each key has a unique resonance pattern — different bright bands at different frequencies. This is what the classifier learns to distinguish.
3. Feature Pipeline — Step-by-step from raw audio through Hann-windowed frames, power spectrogram, mel filterbank, mel spectrogram, to the final flattened feature vector. Annotated to show the attack transient vs resonance decay, and how each step transforms the representation.
4. Feature Space — PCA projection of keystroke features from 5 keys. Each dot is one keystroke. Keys that sound similar cluster together; well-separated clusters are easy to classify. Also shows the average mel spectrogram per key.
2. Setup
Dependencies
# On Pi — collection + inference
sudo apt install libasound2-dev libportaudio2
pip3 install numpy scipy sounddevice
# On host — training (heavier dependencies)
pip3 install numpy scipy scikit-learn sounddevice matplotlib
Verify Microphone
Use mic_test.py to check your mic with a live waveform and optional loopback to headphones:
cd ~/embedded-linux/scripts/acoustic-keystroke
# List audio devices
python mic_test.py --list
# Run with default devices — shows live waveform + level meter
python mic_test.py
# Specify input/output device and boost gain
python mic_test.py -i 4 -o 15 --gain 3.0
Type on the keyboard while watching the waveform. You should see clear spikes for each keystroke. If the signal is barely visible, increase --gain or move the mic closer.
Alternatively, use arecord for a quick check — e.g. arecord -f S32_LE -r 48000 -c 2 -d 3 /tmp/test.wav (add -D to pick a specific device), then play the file back and listen for keystroke clicks.
3. Signal Conditioning
The raw I2S mic signal is weak (typical keystroke energy is 1e-7) and contaminated by low-frequency rumble from vibrations, air conditioning, etc. Two processing steps bring keystrokes above the noise floor:
3.1 Software Gain
The INMP441 outputs a 24-bit signal that maps to very small float32 values. A 10x software gain brings keystroke transients into a usable range without clipping (keyboard sounds rarely exceed 0.1 even after 10x amplification):
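The gain stage itself is a one-liner; a minimal sketch (the function name and the clip bounds are illustrative, not the tutorial's actual code):

```python
import numpy as np

GAIN = 10.0  # software gain from the text above

def apply_gain(samples, gain=GAIN):
    """Amplify, then clip to the valid float range [-1, 1] as a safety net."""
    return np.clip(samples * gain, -1.0, 1.0)

block = np.array([0.005, -0.02, 0.3])  # raw INMP441-scale float samples
boosted = apply_gain(block)            # scaled 10x; the 0.3 outlier clips to 1.0
```

Keyboard transients rarely reach the clip point, so in practice the clip only guards against the occasional bump or loud noise.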
3.2 High-Pass Filter
A first-order IIR high-pass filter at 80 Hz removes rumble while preserving the keystroke transients (which are broadband, 100 Hz – 12 kHz):
import numpy as np

class HighPass:
    """First-order IIR high-pass filter: y[i] = a * (y[i-1] + x[i] - x[i-1])."""
    def __init__(self, cutoff_hz, rate):
        rc = 1.0 / (2.0 * np.pi * cutoff_hz)  # RC time constant for the cutoff
        dt = 1.0 / rate                       # sample period
        self.alpha = rc / (rc + dt)
        self.prev_in = 0.0
        self.prev_out = 0.0

    def process(self, samples):
        """Filter one block, carrying filter state across block boundaries."""
        out = np.empty_like(samples)
        a = self.alpha
        yi, xi_prev = self.prev_out, self.prev_in
        for i in range(len(samples)):
            xi = samples[i]
            yi = a * (yi + xi - xi_prev)
            xi_prev = xi
            out[i] = yi
        self.prev_in = xi_prev
        self.prev_out = yi
        return out
Why not a higher-order filter?
A first-order filter has a gentle -6 dB/octave slope, which is enough for our purpose. Higher-order filters introduce phase distortion near the cutoff that can smear the keystroke onset — the sharp transient is the most important feature for onset detection.
4. Onset Detection

Four panels showing the onset detection pipeline: raw waveform with keystroke spikes (top), block energy vs running average (second), energy/average ratio with 5× threshold (third), and 100ms feature extraction windows (bottom). Red vertical lines mark detected onsets.
Run locally: cd scripts/acoustic-keystroke && python visualize_pipeline.py (generates all 4 figures) or python visualize_pipeline.py -i (interactive)
A keystroke creates a sharp energy spike against a quiet background. We detect it by comparing short-term energy to a running exponential average:
energy = np.sum(audio ** 2) / len(audio)        # mean energy of this 10 ms block
energy_avg = energy_avg * 0.92 + energy * 0.08  # EMA (τ ≈ 120ms)
if energy > energy_avg * threshold and energy > min_energy:
    # keystroke detected! (threshold default: 5.0; min_energy rejects quiet-room spikes)
Key parameters:
| Parameter | Default | Effect |
|---|---|---|
| threshold | 5.0 | Higher = fewer false positives, may miss soft keystrokes |
| min_energy | 1e-5 | Absolute floor — ignores noise spikes when everything is quiet |
| cooldown | 400 ms | Prevents double-detection (press + release echo) |
Training vs Inference Onset Detection
In the training pipeline (features.py), onset detection runs offline over the full recording — it uses a frame-by-frame energy scan with a convolved running average. This is more accurate because it can look forward and backward.
In live inference, detection must be causal (no lookahead). The EMA-based detector runs per audio block (10 ms). When an onset is detected, the system collects 100 ms of audio after the onset before classifying — this captures the resonance tail which contains the most discriminative information.
onset
│
▼
─────┬──────────────────────────┐
5ms│ 100ms post-onset │
pre │ (attack + resonance) │
─────┴──────────────────────────┘
◄──────── feature window ──►
Warning
A common bug: extracting audio before the onset instead of after. The pre-onset audio is just silence/noise — the discriminative spectral content is in the 100 ms resonance tail.
5. Feature Extraction
What Is a Feature?
A machine learning classifier cannot listen to audio the way you do. It needs numbers — a list of values that describe the sound in a way that makes different keys distinguishable. These numbers are called features.
Think of it like describing a person to someone who has never seen them: you wouldn't transmit every pixel of a photo. Instead, you'd say "tall, brown hair, glasses" — a compact description that captures the essential differences. Features are that description for audio.
Why not just feed raw audio samples to the classifier?
A 100 ms keystroke at 48 kHz is 4,800 numbers. You could feed all 4,800 to the classifier, but this is a bad idea:
- Too many dimensions. With 4,800 input features and only 20 training examples per key, the classifier has far more parameters to fit than data to learn from — it memorizes the training examples (overfitting) instead of learning generalizable patterns.
- Irrelevant information. Most of those 4,800 numbers encode information that doesn't help distinguish keys — the exact phase of the wave, the precise timing of the onset, background noise. The classifier wastes capacity trying to learn patterns in noise.
- Not shift-invariant. If the onset is detected 1 ms earlier in one sample, all 4,800 values shift. The classifier sees this as a completely different pattern, even though it's the same keystroke.
Good features solve all three problems: they're compact (hundreds, not thousands of values), they capture only the discriminative information (which frequencies are present, how they decay), and they're invariant to irrelevant variation (onset timing, overall volume).
Raw audio (bad features): Mel spectrogram (good features):
4,800 numbers 1,600 numbers (~3x smaller)
├─ exact waveform shape ├─ frequency content (which resonances)
├─ onset timing ├─ temporal evolution (how they decay)
├─ phase (irrelevant) └─ compressed frequency axis (mel scale)
├─ noise samples ↓
└─ overall amplitude Invariant to phase, onset shift,
↓ and background noise level
Everything is relevant to
the classifier → overfitting
The process of converting raw sensor data into good features is called feature extraction (or feature engineering). It's the single most important step in classical ML — the classifier can only find patterns in what you give it.
Features in Other Domains
This concept applies everywhere, not just audio:
| Domain | Raw data | Good features |
|---|---|---|
| Audio (this tutorial) | 4,800 samples | Mel spectrogram (32 × 50) |
| Images | 640 × 480 pixels | HOG descriptors, color histograms |
| IMU sensor | Accelerometer time series | Mean, std, FFT peaks, zero crossings |
| Text | Raw characters | Word embeddings, TF-IDF vectors |
| Network traffic | Packet bytes | Flow statistics, port distributions |
In deep learning (Section 11), the CNN learns its own features from the mel spectrogram — the first convolutional layers automatically discover patterns like "energy at 800 Hz decaying over 20 ms." But this requires much more data. With small datasets, hand-crafted features + classical ML wins because your domain knowledge compensates for limited examples.
What Makes a Good Feature for Keystrokes?
Look at the spectrograms of different keys:

Five keys compared: waveform (top), linear spectrogram (middle), and mel spectrogram (bottom). Notice how each key excites different resonance frequencies — 'a' has strong energy around 800 Hz, while 'k' resonates higher around 2 kHz. These spectral differences are what the classifier learns to distinguish.
Run locally: python scripts/acoustic-keystroke/visualize_pipeline.py
Each key has a unique spectral signature — a pattern of which frequencies are excited and how they decay over time. This is determined by the physical properties of the key's position on the keyboard:
- Position affects which part of the chassis resonates (corner keys sound different from center keys)
- Key mechanism affects the attack transient (the initial click)
- Finger affects the impact force and angle
The mel spectrogram captures exactly this: which frequencies (vertical axis) are present at each moment (horizontal axis). It's the natural representation for "what does this sound look like?"
Tip
See it interactively: Run the mel spectrogram explorer on your host PC:
python mel_spectrogram_explorer.py -i
Drag the sliders to see how FFT size, hop size, and number of mel bands affect the output.
5.1 Mel Spectrogram — How It's Built
The core feature is a mel-scaled spectrogram — a time-frequency representation that emphasizes the frequency ranges where keystroke differences are most apparent.
Step 1: STFT — Split the segment into overlapping frames, apply a Hanning window, compute the FFT of each frame:
from numpy.fft import rfft

# Parameters
N_FFT = 512  # ~10.7ms frame at 48kHz
hop = 96     # 2ms hop → ~50 frames per 100ms keystroke

hann = np.hanning(N_FFT)
n_frames = (len(segment) - N_FFT) // hop + 1
spec = np.empty((N_FFT // 2 + 1, n_frames))
for i in range(n_frames):
    frame = segment[i * hop : i * hop + N_FFT]
    spec[:, i] = np.abs(rfft(frame * hann)) ** 2  # power spectrum
Step 2: Mel filterbank — Apply triangular filters spaced according to the mel scale. This compresses high frequencies (where our ears and keystroke physics are less discriminative) while preserving detail in the low-mid range:
Hz: 100 200 400 800 1.6k 3.2k 6.4k 12k
│ │ │ │ │ │ │ │
Mel: ├──┤├──┤├──┤├───┤├────┤├──────┤├──────────┤├────────┤
Lots of detail here Less detail here
(100-2kHz) (2-12kHz)
The mel scale conversion: mel(f) = 2595 · log10(1 + f / 700)
We use 32 mel bands from 100 Hz to 12 kHz. Below 100 Hz is rumble (already filtered), above 12 kHz is hiss with no useful keystroke information.
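The filterbank can be sketched with the standard HTK-style formulas (function names and the exact bin mapping here are illustrative, not necessarily what features.py does):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=32, n_fft=512, rate=48000, fmin=100.0, fmax=12000.0):
    """Triangular filters whose centers are evenly spaced on the mel axis."""
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / rate).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):                 # rising slope of the triangle
            fb[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):                 # falling slope
            fb[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fb

fb = mel_filterbank()
# Applying it to a (257, n_frames) power spectrogram: mel_spec = fb @ spec
print(fb.shape)  # (32, 257)
```

Note how the triangle edges are quantized to FFT bins — this quantization is exactly why too many mel bands produce empty filters, as the warning below explains.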
Warning
Why not more mel bands? With FFT size 512 (our default), there are only 257 frequency bins. If you increase mel bands beyond ~32, the upper triangular filters become so narrow that adjacent centers map to the same FFT bin — the filter captures zero energy, producing black lines in the spectrogram. This is not a bug; it's a fundamental resolution limit: you can't have more mel bands than FFT bins support. To use more bands, increase the FFT size (e.g., 64 bands needs FFT ≥ 1024). Try it in the mel_spectrogram_explorer.py -i demo — the info panel warns when this happens.
Step 3: Log compression — Convert power to decibels. This compresses the dynamic range and makes the features more Gaussian (better for classifiers):
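A minimal version of this step (the −80 dB floor is an assumption — a common choice, not necessarily the tutorial's default):

```python
import numpy as np

def power_to_db(mel_spec, floor_db=-80.0):
    """Convert mel power to dB, clamped to a fixed dynamic range."""
    db = 10.0 * np.log10(np.maximum(mel_spec, 1e-10))  # epsilon avoids log(0)
    return np.maximum(db, db.max() + floor_db)         # keep only the top 80 dB

print(power_to_db(np.array([1.0, 0.001, 1e-12])))  # 0 dB, -30 dB, clamped to -80 dB
```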

The complete feature extraction pipeline: raw keystroke with annotated attack/resonance (1), Hann-windowed overlapping frames (2), power spectrogram from STFT (3), triangular mel filterbank showing dense low-freq and sparse high-freq bands (4), mel spectrogram (5), and the final flattened feature vector ready for SVM/CNN input (6).
Run locally: python scripts/acoustic-keystroke/visualize_pipeline.py
5.2 Amplitude Normalization
Different recording sessions may have different mic positions or gain settings. Normalizing each keystroke segment to unit peak amplitude makes the features invariant to overall volume:
This is critical — without it, a model trained at one mic distance fails at another.
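The normalization itself is tiny — a sketch:

```python
import numpy as np

def normalize_peak(segment):
    """Scale a keystroke segment to unit peak amplitude."""
    peak = np.max(np.abs(segment))
    return segment / peak if peak > 0 else segment

near = np.array([0.5, -0.25, 0.1])  # mic close to the keyboard
far = near * 0.1                    # same keystroke, mic farther away
# After normalization both segments are identical → volume-invariant features
assert np.allclose(normalize_peak(near), normalize_peak(far))
```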
5.3 Additional Features
Beyond the raw mel spectrogram, we extract three supplementary features that capture temporal dynamics:
| Feature | Shape | What it captures |
|---|---|---|
| Temporal envelope | (n_frames,) | Mean mel energy per frame — the attack/decay shape |
| Spectral centroid | (n_frames,) | Which frequency band dominates at each time step |
| Delta (first derivative) | (32, n_frames-1) | How the spectrum changes between frames |
# Temporal envelope: energy contour
frame_energy = np.mean(mel_spec, axis=0)

# Spectral centroid: "brightness" over time (energy-weighted mean band, per frame)
bands = np.arange(mel_spec.shape[0])[:, None]
centroid = np.sum(bands * mel_spec, axis=0) / np.sum(mel_spec, axis=0)

# Delta: spectral change rate
delta = np.diff(mel_spec, axis=1)
The final feature vector concatenates all four: mel_spec.flatten() + envelope + centroid + delta.flatten() → ~3100 features per keystroke.
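Concretely, with the shapes used in this tutorial (variable names are illustrative; the exact total depends on the frame count):

```python
import numpy as np

rng = np.random.default_rng(0)
mel_spec = rng.random((32, 50))   # 32 mel bands × 50 frames

envelope = np.mean(mel_spec, axis=0)                                    # (50,)
bands = np.arange(32)[:, None]
centroid = np.sum(bands * mel_spec, axis=0) / np.sum(mel_spec, axis=0)  # (50,)
delta = np.diff(mel_spec, axis=1)                                       # (32, 49)

features = np.concatenate([mel_spec.ravel(), envelope, centroid, delta.ravel()])
print(features.shape)  # (3268,) with exactly 50 frames; the text's "~3100" varies with frame count
```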
Why not MFCC?
MFCCs (Mel-Frequency Cepstral Coefficients) are the standard in speech recognition — they apply a DCT to decorrelate the mel bands. For keystroke recognition, we found that the raw mel spectrogram + delta features work well enough, and the implementation is simpler (no librosa dependency). If you want to experiment, see the Alternatives section.
6. Data Collection
Two collection methods are provided, each with different strengths:
6.1 Per-Key Collection (collect_keystrokes.py)
Press each key 20+ times in isolation. Simple but produces "robotic" data:
Features:
- Mic check at startup — shows live level meter, warns if signal is weak
- Live progress bar with SNR per detection
- 80 Hz high-pass filter applied during recording
- Per-key summary with average SNR
6.2 Typing Practice (typing_practice.py)
Type natural phrases — pangrams, common words, key-focused drills:
Features:
- Shows phrases to type with green/red feedback
- Records audio continuously + timestamps each keypress via raw terminal input
- Mistyped keys are excluded from training
- Produces more natural data — variable speed, finger transitions
Tip
Best approach: Use per-key collection for initial baseline, then augment with typing practice data. Both formats are loaded automatically by train_model.py.
6.3 Workflow: Collect on Pi, Train on Host
The Pi has the I2S mic but is slow for training (SVM with 3000+ samples and 3100 features takes minutes). The recommended workflow:
┌──────────────┐ scp ┌──────────────┐ scp ┌──────────────┐
│ Raspberry Pi │ ─────▶ │ Host PC │ ─────▶ │ Raspberry Pi │
│ │ │ │ │ │
│ 1. Collect │ │ 2. Train │ │ 3. Inference │
│ data │ │ model │ │ (live) │
└──────────────┘ └──────────────┘ └──────────────┘
Step 1 — Collect on Pi (has the I2S mic):
# On Pi
cd ~/embedded-linux/scripts/acoustic-keystroke
python collect_keystrokes.py --gain 10 --presses 30
python typing_practice.py --gain 10 --rounds 10
Step 2 — Train on host (faster CPU, more RAM):
# Copy data from Pi to host
scp -r pi@raspberrypi:~/embedded-linux/scripts/acoustic-keystroke/keystroke_data ./
# Train (much faster on host)
python train_model.py --augment keystroke_data
Step 3 — Deploy back to Pi:
# Copy model back to Pi
scp keystroke_model.pkl pi@raspberrypi:~/embedded-linux/scripts/acoustic-keystroke/
# Run inference on Pi
ssh pi@raspberrypi
cd ~/embedded-linux/scripts/acoustic-keystroke
python live_inference.py --gain 10
Tip
Quick iteration: Use rsync instead of scp to sync only changed files, e.g.:
rsync -av pi@raspberrypi:~/embedded-linux/scripts/acoustic-keystroke/keystroke_data/ ./keystroke_data/
6.4 Collection Tips
- Place mic 5–10 cm from keyboard, same position each session
- Mechanical keyboards produce stronger signals than membrane
- Minimize background noise — fan, music, talking all hurt SNR
- If SNR < 8x, move mic closer or increase --gain
- Collect at least 30 samples per key for decent accuracy
7. Training the Classifier

PCA projection of keystroke features from 5 keys (30 samples each). Each dot is one keystroke. Well-separated clusters (like 'a' vs 'space') are easy to classify; overlapping clusters would confuse the model. Right: average mel spectrogram per key shows why they cluster — each key has a distinct spectral signature.
Run locally: python scripts/acoustic-keystroke/visualize_pipeline.py (requires scikit-learn)
7.1 The Pipeline
Training uses scikit-learn's Pipeline — a chain of preprocessing + classifier that ensures the same transformations are applied at training and inference time:
pipeline = Pipeline([
("scaler", StandardScaler()), # normalize features to zero mean, unit variance
("clf", SVC(kernel='rbf', ...)) # the actual classifier
])
The StandardScaler is critical because our features have very different scales — mel spectrogram values are in dB (–100 to 0), centroids are in band indices (0–32), deltas are small differences. Without scaling, features with large numeric ranges dominate the classifier.
7.2 Support Vector Machine (SVM)
Tip
See it interactively: Run the ML decision boundary demo to visualize how SVM, Random Forest, and k-NN work on 2D data:
Drag the "Samples" slider to see how more data sharpens the decision boundary. Switch between classifiers. This is exactly what happens in 3100D with your keystroke features — you just can't plot it.The default classifier is an SVM with RBF (Radial Basis Function) kernel. Here's why it works well for this problem:
What SVM does: Finds a decision boundary that maximizes the margin between classes. In high-dimensional feature space (~3100 dimensions), there's usually a separable hyperplane even for 27 classes.
The RBF kernel: Maps features into an infinite-dimensional space where linear separation becomes possible. The kernel function measures similarity between two feature vectors: K(x, x′) = exp(−γ ‖x − x′‖²).
Two keystrokes that produce similar spectrograms have high kernel value → classified together.
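The kernel is one line of numpy; a toy 2-D demo (the γ value here is arbitrary — scikit-learn's 'scale' setting picks it from the data):

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

a1 = np.array([1.0, 2.0])  # two keystrokes of the same key → similar features
a2 = np.array([1.1, 2.1])
b = np.array([5.0, -3.0])  # a different key
print(rbf(a1, a2))  # near 1 — "looks the same"
print(rbf(a1, b))   # near 0 — "looks different"
```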
Key hyperparameters:
| Parameter | Value | Role |
|---|---|---|
| C | 10 | Regularization — higher allows more complex boundaries, risk of overfitting |
| gamma | 'scale' | RBF width — auto-scales to 1 / (n_features × variance) |
| probability | True | Enables confidence scores (needed for thresholding in live inference) |
SVM vs Neural Networks
For this dataset size (500–3000 samples, ~3100 features), SVM typically outperforms simple neural networks. SVMs are mathematically guaranteed to find the maximum-margin solution, while neural networks can get stuck in local minima. Deep learning (CNNs) only wins when you have 10,000+ samples and can learn features end-to-end.
7.3 Cross-Validation
We evaluate with 5-fold stratified cross-validation — the dataset is split into 5 parts, the model trains on 4 and tests on the held-out 1, rotating 5 times:
Fold 1: [TEST] [train] [train] [train] [train] → 78%
Fold 2: [train] [TEST] [train] [train] [train] → 82%
Fold 3: [train] [train] [TEST] [train] [train] → 74%
Fold 4: [train] [train] [train] [TEST] [train] → 80%
Fold 5: [train] [train] [train] [train] [TEST] → 79%
Mean: 78.6%
This gives an honest accuracy estimate. If training accuracy is 100% but CV accuracy is 65%, the model is overfitting — memorizing training data rather than learning generalizable patterns. The gap should be < 15%.
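In scikit-learn this is a few lines; a self-contained sketch on synthetic stand-in data (the real pipeline would pass your extracted keystroke features as X and the key labels as y):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for keystroke features: 3 "keys", 30 samples each, 40 dims
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=k, scale=0.5, size=(30, 40)) for k in range(3)])
y = np.repeat(["a", "s", "d"], 30)

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("clf", SVC(kernel="rbf", C=10, gamma="scale"))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv)  # one accuracy per fold
print(f"per-fold: {np.round(scores, 2)}  mean: {scores.mean():.2f}")
```

Stratified splitting keeps each fold's class balance equal to the full dataset's — important when some keys have fewer samples than others.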
7.4 Data Augmentation
Generate synthetic training samples from existing data to improve generalization:
import numpy as np

RATE = 48000  # samples per second

def augment_segment(segment):
    augmented = []
    # Time shift: ±3ms (simulates slightly different onset alignment)
    shift = int(RATE * 0.003)
    augmented.append(np.roll(segment, shift))
    augmented.append(np.roll(segment, -shift))
    # Additive noise (simulates different noise floors)
    augmented.append(segment + np.random.randn(len(segment)) * 0.002)
    # Gain variation ±15% (simulates different mic distances)
    augmented.append(segment * 1.15)
    augmented.append(segment * 0.85)
    return augmented  # 5 extra samples per original
With --augment, each keystroke produces 6 training samples (1 original + 5 augmented), which typically improves accuracy by 5–10%.
7.5 Training Commands
# Train with SVM (default, recommended)
python train_model.py --augment keystroke_data
# Train with Random Forest (faster, slightly lower accuracy)
python train_model.py --model rf --augment keystroke_data
# Output includes confusion pairs — which keys get mixed up
8. Real-Time Inference
8.1 How It Works
The inference loop runs in a sounddevice audio callback at 48 kHz with 10 ms blocks:
State machine:
┌──────────────────┐
┌─────────────────▶│ IDLE │
│ │ (monitor energy) │
│ └────────┬─────────┘
│ │ energy > threshold × avg
│ │ AND energy > min_energy
│ ▼
│ ┌──────────────────┐
│ │ COLLECTING │
│ │ (buffer 100ms │
│ │ post-onset) │
│ └────────┬─────────┘
│ │ 100ms collected
│ ▼
│ ┌──────────────────┐
│ │ CLASSIFY │
│ │ extract_features()│
│ │ model.predict() │
│ └────────┬─────────┘
│ │
│ ┌────────▼─────────┐
└──────────────────│ COOLDOWN │
│ (400ms silence) │
└──────────────────┘
Critical: the system collects audio after the onset, not before. The onset triggers collection; classification happens 100 ms later once the full keystroke resonance is captured.
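The state machine above can be written as a pure-Python class you can unit-test without a microphone (class name and constants are illustrative — see live_inference.py for the real implementation):

```python
import numpy as np

RATE = 48000
BLOCK = 480        # 10 ms audio blocks
POST_ONSET = 4800  # 100 ms collected from the onset block onward

class KeystrokeCapture:
    """IDLE → COLLECTING → (segment ready) → COOLDOWN, one feed() per block."""
    def __init__(self, threshold=5.0, min_energy=1e-5, cooldown_blocks=40):
        self.threshold, self.min_energy = threshold, min_energy
        self.cooldown_blocks = cooldown_blocks  # 40 × 10 ms = 400 ms
        self.energy_avg, self.cooldown = 0.0, 0
        self.collecting, self.buf = False, []

    def feed(self, block):
        """Feed one block; return a 100 ms segment when complete, else None."""
        energy = np.sum(block ** 2) / len(block)
        result = None
        if self.collecting:                      # COLLECTING: buffer post-onset audio
            self.buf.append(block)
            if sum(len(b) for b in self.buf) >= POST_ONSET:
                self.collecting = False
                self.cooldown = self.cooldown_blocks
                result = np.concatenate(self.buf)[:POST_ONSET]
        elif self.cooldown > 0:                  # COOLDOWN: ignore release echo
            self.cooldown -= 1
        elif energy > self.energy_avg * self.threshold and energy > self.min_energy:
            self.collecting = True               # IDLE → COLLECTING on onset
            self.buf = [block]
        self.energy_avg = self.energy_avg * 0.92 + energy * 0.08
        return result
```

Because feed() never looks ahead, the same code works inside a sounddevice callback: call it on each 10 ms block and classify whenever it returns a segment.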
8.2 Running
# Normal mode — predicted keys appear as you type
python live_inference.py --gain 10
# Debug mode — shows energy, detections, confidence scores
python live_inference.py --gain 10 --debug
# Tune detection sensitivity
python live_inference.py --gain 10 --threshold 3.0 --min-energy 5e-6
8.3 Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| No detections at all | Signal too weak | Increase --gain, move mic closer |
| Ghost detections on silence | min-energy too low | Increase --min-energy |
| Double detections per keystroke | Cooldown too short | Increase --cooldown-ms 500 |
| Detects but always wrong key | Model accuracy too low | Re-collect data, more samples, fewer keys |
| Low confidence (< 20%) | Features don't match training | Ensure same --gain in collection and inference |
9. How the Models Work — In Depth
9.1 SVM with RBF Kernel (Default)
Intuition: Imagine each keystroke as a point in 3100-dimensional space. Keys that sound similar cluster together. SVM draws boundaries between clusters with maximum clearance (margin).
The kernel trick: In the original feature space, clusters may overlap. The RBF kernel implicitly maps points to a higher-dimensional space where they become separable:
Multi-class strategy: SVM is inherently binary. For 27 keys, scikit-learn uses one-vs-one — trains 27×26/2 = 351 binary classifiers, each distinguishing one pair of keys. Final prediction is by majority vote.
Probability calibration: SVM outputs are distances from the decision boundary, not probabilities. With probability=True, Platt scaling fits a sigmoid to convert distances → probabilities. This is what enables the confidence threshold in live inference.
Tradeoffs:
| SVM (RBF) | Random Forest | |
|---|---|---|
| Accuracy | Higher (75–90%) | Good (70–85%) |
| Training time | Slower (O(n²) with n samples) | Faster |
| Inference time | Fast (few support vectors) | Fast (tree traversal) |
| Interpretability | Low (black box) | Medium (feature importance) |
| Data requirements | Works with 20+ samples/class | Needs 30+ samples/class |
9.2 Random Forest (Alternative)
How it works: Trains 200 independent decision trees, each on a random subset of the data and features. Each tree votes for a class; the majority wins.
Why it's useful:
- Fast to train and predict
- Provides feature_importances_ — shows which mel bands and time frames matter most
- Less prone to overfitting than a single decision tree
- No hyperparameter tuning needed (works out of the box)
9.3 Why Not Deep Learning?
A CNN could learn features end-to-end from raw spectrograms and likely reach 90%+ accuracy. But for this tutorial:
- Data size: We have ~20–50 samples per class. CNNs need thousands.
- Complexity: A CNN requires PyTorch/TensorFlow, GPU for training, and careful architecture design.
- Inference cost: On a Pi, scikit-learn prediction takes < 1 ms. A CNN takes 10–50 ms.
- Explainability: With mel features + SVM, you can inspect what the model sees. With a CNN, it's opaque.
If you have enough data (500+ samples per key) and want maximum accuracy, see Alternatives.
10. Visualization with Audio Viz
Use the audio visualizer to see the keystroke patterns. Run audio_viz_full and type on the keyboard near the microphone:
- Waveform — Sharp spikes appear at each keystroke
- Spectrogram — Vertical lines (broadband transients) with distinct frequency patterns
- Spectrum — Different keys excite different frequency peaks
This is a great way to build intuition before diving into the ML pipeline. The TDOA overlay also shows interesting effects — each key produces sound from a slightly different position on the keyboard.
Alternatives to Explore
Tip
MFCC Features
Replace the mel spectrogram with MFCCs — apply a Discrete Cosine Transform to the mel bands, keep the first 13 coefficients. This decorrelates the features and works better with Gaussian classifiers (GMM, linear SVM). Requires librosa or a manual DCT implementation.
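The DCT step needs no librosa — SciPy is enough. A sketch on a synthetic log-mel input (13 is the conventional coefficient count):

```python
import numpy as np
from scipy.fft import dct

# Stand-in for a (32, n_frames) log-mel spectrogram
mel_db = np.random.default_rng(0).random((32, 50))

# DCT along the mel axis decorrelates the bands; keep the first 13 coefficients
mfcc = dct(mel_db, type=2, axis=0, norm="ortho")[:13]
print(mfcc.shape)  # (13, 50)
```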
Tip
CNN on Spectrograms Treat each keystroke's mel spectrogram as a (32 × 50) grayscale image. Use a small CNN (2 conv layers, pool, flatten, dense). With PyTorch:
import torch
import torch.nn as nn

class KeyCNN(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * 8 * 12, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):                        # x: (batch, 1, 32, 50)
        return self.fc(self.conv(x).flatten(1))  # two 2x pools: (32, 50) → (8, 12)
Tip
Gaussian Mixture Model (GMM) Train one GMM per key (like speaker verification). At inference, compute the log-likelihood of the keystroke under each GMM; pick the highest. Works well with MFCC features and adapts to new keys without full retraining.
Tip
k-Nearest Neighbors (k-NN) The simplest classifier — compare a new keystroke to all training examples, pick the most common class among the k nearest. No training needed, but slow at inference and accuracy depends entirely on feature quality. Good for quick experiments.
Tip
Embedded Deployment (ONNX) Export the trained scikit-learn model to ONNX format for C/C++ inference:
This enables inference without Python — useful for integrating with the SDL2 audio visualizer.
Challenges
Tip
Challenge 1: Confusion Matrix
After training, print a confusion matrix (sklearn.metrics.confusion_matrix). Which keys are most often confused? Do confused keys share physical proximity on the keyboard?
Tip
Challenge 2: SDL2 Live Display
Extend audio_viz_full.c to show the predicted key on screen. Add a text field that accumulates predicted characters. Hint: run the Python inference script as a subprocess and read its stdout.
Tip
Challenge 3: Two-Microphone Improvement Use stereo capture (-c 2) to add TDOA-based position estimation as an extra feature. Each key has a different position → different arrival time difference. Does this improve accuracy?
Tip
Challenge 4: Security Implications Write a 1-page analysis: if this attack works at 80% accuracy, what are the implications? How would you defend against it? Consider: noise injection, randomized key sounds, on-screen keyboards, keystroke timing randomization.
Tip
Challenge 5: Cross-Session Robustness Collect training data on day 1, test on day 2 with slightly different mic placement. How much does accuracy drop? Experiment with adding data augmentation (gain variation, noise injection) to improve robustness.
11. From Classical ML to Deep Learning
The SVM and Random Forest classifiers work well with hand-crafted features, but they have a ceiling: you must decide what features to extract (mel bands, deltas, centroids), and the classifier can only work with what you give it. Deep learning flips this — the model learns its own features directly from the spectrogram.
This section builds a CNN (Convolutional Neural Network) that takes the raw mel spectrogram as input and learns to recognize keystrokes end-to-end. We'll see when this helps, when it doesn't, and how to deploy it on the Pi.
11.1 Why a CNN for Audio?
A mel spectrogram is a 2D matrix — frequency bins on one axis, time frames on the other. This is structurally identical to a grayscale image. CNNs excel at learning local patterns in images:
Mel spectrogram (32 × 50): What the CNN learns:
░░▓▓▓░░░░░░░░░░░░░░░ "Bright band at 800 Hz
░░▓▓▓▓▓░░░░░░░░░░░░░ that fades over 20 ms"
░░░▓▓▓▓▓▓░░░░░░░░░░░ = key 'a'
░░░░▓▓▓▓▓▓▓░░░░░░░░░
░░░░░░▓▓▓▓▓▓░░░░░░░░ "Two bands at 400+2k Hz
░░░░░░░░▓▓▓▓▓░░░░░░░ with fast decay"
▲ = key 's'
freq
time ──────────▶
The first convolutional layers learn spectro-temporal patterns — combinations of frequency bands that activate together over specific time windows. Deeper layers combine these into higher-level representations. The final layer maps these to key classes.
Convolution in Audio vs Images
In image CNNs, convolution kernels slide over 2D spatial dimensions (height × width). In our audio CNN, the two dimensions are frequency (mel bands) and time (frames). A 3×3 kernel learns relationships between adjacent frequency bands over 3 consecutive time frames — perfect for capturing the spectral evolution of a keystroke.
This is different from 1D CNNs used in raw waveform processing, where convolution operates only over the time axis. The 2D approach leverages the spectrogram's structure.
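Tracing the shapes shows where the 32 × 8 × 12 flatten size in the earlier KeyCNN sketch comes from (illustrative; requires PyTorch):

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 1, 32, 50)  # (batch, channels, mel bands, time frames)
block1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
block2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

h1 = block1(x)   # (1, 16, 16, 25): padding=1 preserves 32×50, pooling halves it
h2 = block2(h1)  # (1, 32, 8, 12): 25 // 2 = 12 → 32·8·12 = 3072 dense inputs
print(h1.shape, h2.shape)
```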
11.2 When Does CNN Beat SVM?
Not always. The tradeoff depends on data quantity:
| Samples per key | SVM accuracy | CNN accuracy | Winner |
|---|---|---|---|
| 20 | 70–80% | 40–60% | SVM — CNN overfits badly |
| 50 | 75–85% | 65–80% | SVM — still not enough data |
| 100 | 78–88% | 80–88% | Tied |
| 200+ | 80–90% | 88–95% | CNN — learned features surpass hand-crafted |
| 500+ | 82–90% | 92–97% | CNN — significant advantage |
Key insight: SVM with hand-crafted features has a lower data floor (works with 20 samples) but a lower accuracy ceiling. CNN has a higher data floor (needs 100+ samples) but a higher ceiling. This is a fundamental pattern in ML — it's not that one method is "better," it's about the data regime you're operating in.
The Bias-Variance Tradeoff
This is one of the most important concepts in machine learning. Every model makes a tradeoff:
- High bias (underfitting): The model is too simple to capture the patterns. SVM with bad features has this problem — it can't learn what it can't see.
- High variance (overfitting): The model is too complex for the data. A CNN with 20 samples memorizes each example instead of learning generalizable patterns.
With small data → prefer simpler models (SVM, RF). With large data → complex models (CNN) can learn richer representations. Adding regularization (dropout, data augmentation) shifts the tradeoff, letting complex models work with less data.
11.3 Collecting More Data
To give the CNN enough data, collect 50+ samples per key using both collection methods:
# Per-key collection: 50 presses per key for all lowercase + space
python collect_keystrokes.py --gain 10 --presses 50 --keys "abcdefghijklmnopqrstuvwxyz "
# Typing practice: 20 rounds of natural typing
python typing_practice.py --gain 10 --rounds 20
Tip
Data quality matters more than quantity. 50 clean samples beats 200 noisy ones. Before collecting:
- Minimize background noise (close windows, turn off fans)
- Keep mic position fixed
- Type at your normal speed (don't artificially slow down)
- Verify with mic_test.py that SNR > 10x before starting
11.4 Building the CNN
We use PyTorch for the CNN because it's explicit — you see every layer, every dimension, every operation. No "magic."
Setup
# On host (training) — Pi is too slow for CNN training
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Network Architecture
The CNN is deliberately small — 2 conv layers, then a fully connected classifier. This is enough for 32×50 spectrograms with 27 classes:
import torch
import torch.nn as nn

class KeystrokeCNN(nn.Module):
    def __init__(self, n_classes, n_mels=32, n_frames=50):
        super().__init__()
        # Feature extraction: two conv blocks
        self.features = nn.Sequential(
            # Block 1: 1 → 16 channels, 3×3 conv + ReLU + pool
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),  # (16, 16, 25)
            # Block 2: 16 → 32 channels
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),  # (32, 8, 12)
        )
        # Classifier
        flat_size = 32 * (n_mels // 4) * (n_frames // 4)
        self.classifier = nn.Sequential(
            nn.Dropout(0.3),  # regularization
            nn.Linear(flat_size, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, n_classes)
        )

    def forward(self, x):
        # x shape: (batch, 1, n_mels, n_frames)
        x = self.features(x)
        x = x.view(x.size(0), -1)  # flatten
        x = self.classifier(x)
        return x
Layer by Layer
Conv2d(1, 16, 3): Takes the single-channel spectrogram and applies 16 different 3×3 filters. Each filter learns to detect a different spectro-temporal pattern (e.g., "energy at 1 kHz decaying over 2 frames"). Output: 16 feature maps of the same size.
BatchNorm2d(16): Normalizes each feature map to zero mean and unit variance. This stabilizes training — without it, deeper layers see wildly varying input ranges and learn slowly.
ReLU: max(0, x) — zeroes out negative activations. This introduces non-linearity, allowing the network to learn complex patterns. Without it, stacking linear layers would be equivalent to a single linear layer.
MaxPool2d(2): Takes every 2×2 block and keeps only the maximum. This halves the spatial dimensions (32×50 → 16×25), making the network invariant to small shifts in onset timing or frequency.
Dropout(0.3): Randomly zeroes 30% of activations during training. Forces the network to not rely on any single neuron — a powerful regularizer that prevents overfitting, especially critical with small datasets.
Linear(flat_size, 64): Fully connected layer that combines all the learned features into 64 abstract representations.
Linear(64, n_classes): Final layer — 27 outputs (one per key). The highest output is the predicted class.
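The shape bookkeeping above can be verified without PyTorch: the 3×3 convs with padding=1 preserve the spatial size, and each MaxPool2d(2) floor-halves both axes. A minimal sketch:

```python
def flat_size(n_mels=32, n_frames=50):
    """Flattened feature size entering the classifier."""
    h, w = n_mels, n_frames
    h, w = h // 2, w // 2   # after block 1 pool → (16 ch, 16, 25)
    h, w = h // 2, w // 2   # after block 2 pool → (32 ch, 8, 12)
    return 32 * h * w       # 32 channels × 8 × 12

print(flat_size())  # 3072 — matches 32 * (n_mels // 4) * (n_frames // 4)
```

This is the same value the `flat_size` expression in `KeystrokeCNN.__init__` computes, which is why the first `Linear` layer takes 3072 inputs.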
Training Loop
def train_cnn(X, y, n_classes, epochs=50, lr=0.001, batch_size=32):
    """Train CNN on mel spectrogram data.

    X: numpy array (n_samples, n_mels, n_frames)
    y: numpy array of integer labels (0..n_classes-1)
    """
    # Convert to PyTorch tensors
    X_tensor = torch.FloatTensor(X).unsqueeze(1)  # add channel dim
    y_tensor = torch.LongTensor(y)
    dataset = torch.utils.data.TensorDataset(X_tensor, y_tensor)

    # 80/20 train/val split
    n_val = len(dataset) // 5
    n_train = len(dataset) - n_val
    train_set, val_set = torch.utils.data.random_split(
        dataset, [n_train, n_val])
    train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=batch_size, shuffle=True)
    val_loader = torch.utils.data.DataLoader(
        val_set, batch_size=batch_size)

    model = KeystrokeCNN(n_classes)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        # Training
        model.train()
        train_loss = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            output = model(X_batch)
            loss = criterion(output, y_batch)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        # Validation
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                output = model(X_batch)
                _, predicted = torch.max(output, 1)
                total += y_batch.size(0)
                correct += (predicted == y_batch).sum().item()
        val_acc = correct / total if total > 0 else 0

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs} "
                  f"loss={train_loss/len(train_loader):.3f} "
                  f"val_acc={val_acc:.1%}")
    return model
What Happens During Training
Each epoch passes through the entire training set once. In each step:
- Forward pass: input spectrogram → conv layers → predicted class probabilities
- Loss calculation: CrossEntropyLoss measures how far the prediction is from the true label. If the model is confident and correct → low loss. Confident and wrong → high loss.
- Backward pass (backpropagation): Compute the gradient of the loss with respect to every weight in the network. This tells each weight "which direction should I change to reduce the loss?"
- Optimizer step: Adam takes a small step against each weight's gradient. The learning rate (0.001) controls how big each step is.
Over 50 epochs, the weights gradually adjust until the network correctly classifies most training examples. The validation accuracy tells us if this generalizes to unseen data.
Signs of trouble:
- Training accuracy high, validation low → overfitting (need more data or more dropout)
- Both accuracies plateau early → underfitting (need more capacity or a lower learning rate)
- Loss oscillates wildly → learning rate too high (reduce by 10×)
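The learning-rate failure mode can be seen on a toy 1-D loss, f(w) = w² (a sketch, not the CNN's actual loss surface):

```python
def gd(lr, steps=20, w=1.0):
    """Run gradient descent on f(w) = w^2 and return the final |w|."""
    for _ in range(steps):
        w -= lr * 2 * w   # gradient of w^2 is 2w
    return abs(w)

print(gd(0.1))   # shrinks steadily toward the minimum at 0
print(gd(1.1))   # each step overshoots and flips sign: |w| grows, training diverges
```

Adam adapts its step size per weight, which softens but does not eliminate this effect; a base learning rate that is too large still makes the loss oscillate.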
11.5 SVM vs CNN: A Fair Comparison
Run both on the same dataset to see the difference:
# Train SVM (baseline)
python train_model.py --augment keystroke_data
# → "SVM accuracy: 82.3% (±3.1%)"
# Train CNN (compare)
python train_model.py --model cnn --augment keystroke_data
# → "CNN accuracy: 87.5% (±2.8%)" (with 100+ samples/key)
The training script handles both models through the --model flag. The CNN uses the raw mel spectrogram (32×50 matrix) while the SVM uses the flattened spectrogram + extra features (3100-element vector).
Note
Why the CNN might not win with small data: With 20 samples per key, the CNN has roughly 200,000 parameters to learn from ~540 examples. That's nearly 400 parameters per example; severe overfitting is guaranteed. The SVM, by contrast, has a mathematically principled regularization (the margin) that works even with very few samples.
The real lesson: Neither model is universally "better." The right choice depends on your data budget, latency requirements, and deployment constraints. This is true across all of ML — not just keystroke recognition.
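The overfitting arithmetic can be tallied directly from the KeystrokeCNN layer sizes. This sketch counts learned weights and biases only (BatchNorm running statistics are not trained by gradient descent and are excluded):

```python
def count_params(n_classes=27, n_mels=32, n_frames=50):
    """Rough learned-parameter count for the KeystrokeCNN architecture."""
    conv1 = 1 * 16 * 3 * 3 + 16           # Conv2d(1, 16, 3): weights + biases
    bn1   = 2 * 16                        # BatchNorm2d(16): scale + shift
    conv2 = 16 * 32 * 3 * 3 + 32          # Conv2d(16, 32, 3)
    bn2   = 2 * 32
    flat  = 32 * (n_mels // 4) * (n_frames // 4)   # 3072
    fc1   = flat * 64 + 64                # Linear(flat, 64) dominates the count
    fc2   = 64 * n_classes + n_classes    # Linear(64, n_classes)
    return conv1 + bn1 + conv2 + bn2 + fc1 + fc2

print(count_params())  # → 203323, i.e. ~200k parameters
```

Almost all of the parameters sit in the first fully connected layer, which is why shrinking the spectrogram with pooling before flattening matters so much for small datasets.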
11.6 Deploying the CNN on Raspberry Pi
The trained PyTorch model can run on the Pi, but the full PyTorch runtime is heavyweight for its ARM CPU. We export the model to ONNX (Open Neural Network Exchange) format and run inference with the lightweight ONNX Runtime:
Export
import torch

# Load trained model
model = KeystrokeCNN(n_classes)
model.load_state_dict(torch.load("keystroke_cnn.pt"))
model.eval()

# Export to ONNX
dummy = torch.randn(1, 1, 32, 50)  # batch=1, channels=1, mels=32, frames=50
torch.onnx.export(model, dummy, "keystroke_cnn.onnx",
                  input_names=["spectrogram"],
                  output_names=["logits"],
                  dynamic_axes={"spectrogram": {0: "batch"}})
Install ONNX Runtime on Pi
pip3 install onnxruntime
Inference
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("keystroke_cnn.onnx")

def predict_cnn(mel_spec):
    """mel_spec: numpy array (32, 50)"""
    input_data = mel_spec[np.newaxis, np.newaxis, :, :].astype(np.float32)
    logits = session.run(None, {"spectrogram": input_data})[0]
    predicted = np.argmax(logits, axis=1)[0]
    exp = np.exp(logits[0] - np.max(logits[0]))  # shift for numerical stability
    confidence = exp / np.sum(exp)               # softmax
    return predicted, confidence[predicted]
Latency Comparison
| | SVM (scikit-learn) | CNN (ONNX Runtime) |
|---|---|---|
| Model size | ~2 MB (.pkl) | ~0.5 MB (.onnx) |
| Inference time (Pi 4) | < 1 ms | ~3 ms |
| Inference time (Pi Zero) | ~5 ms | ~15 ms |
| Dependencies | scikit-learn (~50 MB) | onnxruntime (~5 MB) |
| Training | Host CPU, minutes | Host CPU, minutes (GPU: seconds) |
Both are well within the real-time budget: onset detection gives us 100 ms of collection time plus 400 ms cooldown, while even the slowest case in the table (CNN on a Pi Zero) needs only ~15 ms.
Why ONNX and Not TFLite?
Both are valid deployment formats. ONNX has better scikit-learn interoperability (via skl2onnx) and a simpler Python API. TFLite is better if you're using TensorFlow/Keras for training and want int8 quantization for MCU deployment. For Pi-class hardware, ONNX Runtime is the simpler path.
For MCU deployment (ESP32, STM32), TFLite Micro with int8 quantization would be the right choice — but that's beyond the scope of this tutorial.
12. Understanding What the CNN Learns
One criticism of deep learning is that it's a "black box." But we can peek inside.
12.1 Visualizing Filters
The first conv layer's 16 filters show what low-level patterns the network learned:
import matplotlib.pyplot as plt

# Extract first conv layer weights
weights = model.features[0].weight.detach().numpy()  # (16, 1, 3, 3)

# Plot as 16 small heatmaps
fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for i, ax in enumerate(axes.flat):
    ax.imshow(weights[i, 0], cmap='RdBu', vmin=-0.5, vmax=0.5)
    ax.set_title(f'F{i}')
    ax.axis('off')
Typical patterns you'll see:
- Horizontal edges: frequency band boundaries (specific resonances)
- Vertical edges: onset transients (the "click" of the keypress)
- Diagonal patterns: frequency sweeps (resonance decay)
12.2 Confusion Heatmap
Which keys does the CNN confuse?
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Run predictions on validation set
# (model_predict_all: helper that runs the CNN over all validation samples)
y_pred = model_predict_all(X_val)
cm = confusion_matrix(y_val, y_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d',
            xticklabels=key_names, yticklabels=key_names)
plt.xlabel('Predicted')
plt.ylabel('True')
Expected patterns:
- Adjacent keys (e.g., 'f'/'g') are most confused — similar position, similar resonance
- Keys pressed with the same finger (e.g., 'q'/'a'/'z') share activation patterns
- Spacebar and Enter are rarely confused with letter keys (very different mechanics)
12.3 t-SNE: Visualizing the Feature Space
t-SNE reduces the CNN's internal representation from 64 dimensions to 2D for visualization:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Get features from the layer before the final classifier
features = model.get_intermediate_features(X_all)  # (n_samples, 64)

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedded = tsne.fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=y_all, cmap='tab20', s=5)
If the model works well, you'll see distinct clusters — one per key. Keys that the model confuses will have overlapping clusters. This visualization immediately shows:
- Which keys are easy — well-separated clusters
- Which keys are hard — overlapping clusters
- Whether more data would help — if clusters have clear structure but thin boundaries, more data would sharpen them
13. Background Data Collection
Manual collection ("press A 50 times") is tedious and produces robotic, unnatural data. Real typing has variable speed, finger transitions, and rhythm that a classifier needs to learn. The solution: collect data passively in the background while the user types normally in any application.
13.1 The Idea
A background daemon runs continuously:
1. Listens to the I2S microphone for keystroke onsets
2. Simultaneously reads keyboard events from /dev/input/eventN
3. Matches each audio onset to the nearest key event by timestamp
4. Saves the labeled pair (audio segment + key identity) to disk
Over days of normal use, this accumulates thousands of naturally-labeled samples — enough for a CNN.
┌──────────────┐ ┌──────────────┐
│ I2S Mic │─── onset detect ──▶│ │
│ (always on) │ + 100ms audio │ Timestamp │
└──────────────┘ │ Matcher │──▶ keystroke_data/
│ │ key_a_00142.npy
┌──────────────┐ │ |audio_ts - │ key_e_00143.npy
│ /dev/input/ │─── key events ────▶│ key_ts| │ key_space_00144.npy
│ event0 │ + timestamps │ < 50ms? │ ...
│ (keyboard) │ │ │
└──────────────┘ └──────────────┘
13.2 Timestamp Matching
Both the audio onset and the keyboard event have kernel timestamps. A keystroke produces sound a few milliseconds after the key switch closes (mechanical travel time). The matching window is generous:
MAX_MATCH_MS = 50  # audio onset within 50 ms of key event

def match_onset_to_key(audio_ts, key_events):
    """Find the closest key event to an audio onset timestamp."""
    best_key = None
    best_dt = MAX_MATCH_MS / 1000.0
    for key, ts in key_events:
        dt = abs(audio_ts - ts)
        if dt < best_dt:
            best_dt = dt
            best_key = key
    return best_key  # None if no match within window
Why 50 ms? The key switch closes → sound propagates through chassis → reaches microphone. Total delay is 5–30 ms depending on keyboard type. 50 ms gives margin for timing jitter.
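To see the window behave, the matcher can be exercised with illustrative timestamps (a standalone copy of the function above; the event times are hypothetical):

```python
MAX_MATCH_MS = 50  # audio onset within 50 ms of key event

def match_onset_to_key(audio_ts, key_events):
    """Find the closest key event to an audio onset timestamp."""
    best_key = None
    best_dt = MAX_MATCH_MS / 1000.0
    for key, ts in key_events:
        dt = abs(audio_ts - ts)
        if dt < best_dt:
            best_dt = dt
            best_key = key
    return best_key  # None if no match within window

events = [('a', 10.000), ('s', 10.200)]
print(match_onset_to_key(10.015, events))  # 'a' (onset 15 ms after the key event)
print(match_onset_to_key(10.120, events))  # None (80 ms from both events)
```

Unmatched onsets like the second one are exactly what the privacy section below says to discard: they are probably door slams or speech, not keystrokes.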
13.3 Reading Keyboard Events
The Linux input subsystem (covered in the Input Events tutorial) provides raw key press/release events:
import struct, os

INPUT_EVENT_FORMAT = 'llHHi'  # struct input_event
EVENT_SIZE = struct.calcsize(INPUT_EVENT_FORMAT)
EV_KEY = 0x01
KEY_PRESS = 1

# Map Linux keycodes to characters
KEYCODE_MAP = {
    30: 'a', 48: 'b', 46: 'c', 32: 'd', 18: 'e', 33: 'f',
    34: 'g', 35: 'h', 23: 'i', 36: 'j', 37: 'k', 38: 'l',
    50: 'm', 49: 'n', 24: 'o', 25: 'p', 16: 'q', 19: 'r',
    31: 's', 20: 't', 22: 'u', 47: 'v', 17: 'w', 45: 'x',
    21: 'y', 44: 'z', 57: 'space',
}

def read_key_events(device_path):
    """Generator that yields (key_name, timestamp) for each key press."""
    fd = os.open(device_path, os.O_RDONLY)
    while True:
        data = os.read(fd, EVENT_SIZE)
        tv_sec, tv_usec, ev_type, code, value = struct.unpack(
            INPUT_EVENT_FORMAT, data)
        if ev_type == EV_KEY and value == KEY_PRESS:
            key = KEYCODE_MAP.get(code)
            if key:
                ts = tv_sec + tv_usec / 1e6
                yield key, ts
Note
Reading /dev/input/eventN requires the input group or root. The course setup_pi.sh adds the user to this group. See Input Events for details on the input subsystem.
13.4 The Background Daemon
The daemon combines audio onset detection with key event reading in two threads:
#!/usr/bin/env python3
"""background_collector.py — Passive keystroke data collection daemon.

Runs in the background while the user types normally. Matches audio
onsets from the I2S mic to keyboard events from /dev/input/eventN.
Saves labeled audio segments to keystroke_data/ over time.

Run:  python background_collector.py --gain 10 --input-device /dev/input/event0
Stop: Ctrl+C or kill the process
"""
import threading, queue, time, os, numpy as np, sounddevice as sd

RATE = 48000
BLOCK_MS = 10
BLOCK = int(RATE * BLOCK_MS / 1000)
COLLECT_MS = 100  # post-onset audio to capture
COLLECT_SAMPLES = int(RATE * COLLECT_MS / 1000)
COOLDOWN_S = 0.4
MAX_MATCH_S = 0.05  # 50 ms matching window

# Shared state
key_event_log = []  # [(key_name, timestamp), ...]
log_lock = threading.Lock()
save_dir = "keystroke_data/background"

def audio_callback(indata, frames, time_info, status):
    """Called every 10 ms with audio data."""
    # ... onset detection + segment collection ...
    # When onset detected + segment collected:
    #   match to nearest key event, save if matched
    pass

def key_listener(device_path):
    """Thread: read keyboard events, store with timestamps."""
    for key, ts in read_key_events(device_path):
        with log_lock:
            key_event_log.append((key, ts))
            # Keep only last 2 seconds of events
            cutoff = time.time() - 2.0
            key_event_log[:] = [(k, t) for k, t in key_event_log
                                if t > cutoff]
Running as a systemd service
For always-on collection, create a systemd user service:
# ~/.config/systemd/user/keystroke-collector.service
[Unit]
Description=Background keystroke audio collector
[Service]
ExecStart=/usr/bin/python3 /home/linux/embedded-linux/scripts/acoustic-keystroke/background_collector.py --gain 10
Restart=on-failure
[Install]
WantedBy=default.target
Enable it with systemctl --user enable --now keystroke-collector.service; labeled samples then accumulate in ~/keystroke_data/background/ while you work normally.
13.5 Dataset Growth and Quality Monitoring
Over time, the dataset grows:
Day 1: ~500 labeled samples (casual typing, emails)
Day 3: ~2,000 samples (coding sessions, documentation)
Day 7: ~5,000 samples (enough for CNN baseline)
Day 30: ~20,000+ samples (robust CNN with augmentation)
Monitor collection quality with a simple script:
# check_dataset.py — show collection statistics
import os
from collections import Counter

data_dir = "keystroke_data/background"
counts = Counter()
for f in os.listdir(data_dir):
    if f.endswith('.npy'):
        key = f.split('_')[1]  # key_a_00142.npy → 'a'
        counts[key] += 1

print(f"Total samples: {sum(counts.values())}")
print(f"Keys represented: {len(counts)}/27")
print(f"\nPer-key counts:")
for key, n in counts.most_common():
    bar = '█' * (n // 10)
    print(f"  {key:>6}: {n:5d} {bar}")
Class imbalance is expected — 'e' and space appear far more often than 'z' or 'q' in English text. Solutions:
- Oversampling: duplicate rare-key samples during training
- Data augmentation: generate synthetic variants of rare keys
- Class-weighted loss: tell the CNN to penalize rare-key errors more
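The class-weighted option can be sketched with inverse-frequency weights. This is a standalone toy; in a PyTorch training setup the resulting values, ordered by class index, would typically be passed as nn.CrossEntropyLoss(weight=...) — an assumption about your training loop, not code from the repo:

```python
from collections import Counter

def inverse_freq_weights(labels):
    """Weight each class inversely to its count; a balanced class gets 1.0."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {k: total / (n_classes * c) for k, c in counts.items()}

# Toy labels: 'e' appears 3x as often as 'z'
w = inverse_freq_weights(['e', 'e', 'e', 'z'])
print(w['z'] / w['e'])  # ≈ 3.0: a misclassified rare key is penalized 3x harder
```

Oversampling achieves a similar effect by repetition instead of weighting; weighted loss is usually the cheaper option because it leaves the dataset untouched.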
13.6 Privacy Considerations
Warning
Background keystroke collection raises serious privacy concerns:
- The audio may capture conversations, not just keystrokes
- The key log is literally a keylogger
- Passwords, private messages, and sensitive data pass through
For this course: only collect on your own device with your own typing. Never deploy on shared or public machines without explicit consent.
Good practices:
- Save only the 100 ms audio segments around detected onsets, not continuous audio
- Discard segments that don't match a key event (likely speech)
- Store data locally, never transmit over network
- Add a visible indicator (LED, tray icon) when collection is active
14. Language Model Post-Processing
Acoustic classification alone achieves 80-95% per-character accuracy. But humans don't type random characters — they type words. A language model can correct acoustic errors by finding the most likely word that matches the noisy predictions.
14.1 The Problem
The acoustic model predicts one key at a time. Some keys sound similar (adjacent on the keyboard → similar resonance). Typical confusions:
Acoustic prediction: "thw quicj bropn fox"
Actual typing: "the quick brown fox"
↑ ↑ ↑
'e'→'w' 'k'→'j' 'w'→'p' confused
Without language correction, 3 of the 4 words are wrong: a 75% word error rate. With language correction: "thw" → "the" (obvious), "quicj" → "quick" (one character off), "bropn" → "brown", giving 0% word error rate.
14.2 How Language Models Help
The acoustic model gives a probability distribution over keys for each keystroke. The language model gives a probability for each word given previous words. Combine them:
Score(word) = Acoustic_score × Language_score
For keystroke sequence [t, h, w]:
Acoustic: P(w|audio) = 0.4, P(e|audio) = 0.35, P(s|audio) = 0.15
Candidate words:
"the" → acoustic: P(t)×P(h)×P(e) = 0.9×0.8×0.35 = 0.252
language: P("the") = 0.07 (very common word)
combined: 0.252 × 0.07 = 0.0176
"thw" → acoustic: P(t)×P(h)×P(w) = 0.9×0.8×0.4 = 0.288
language: P("thw") = 0.00001 (not a word)
combined: 0.288 × 0.00001 = 0.0000029
Winner: "the" (6000× more likely than "thw")
The language model doesn't need to be complex. Even a simple word frequency table ("the" appears in 7% of English text) dramatically reduces errors.
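The arithmetic in the worked example above can be checked in a few lines (probabilities copied from the example):

```python
def combined_score(char_probs, word_prob):
    """Product of per-character acoustic probabilities times the word prior."""
    s = 1.0
    for p in char_probs:
        s *= p
    return s * word_prob

the = combined_score([0.9, 0.8, 0.35], 0.07)     # ≈ 0.01764
thw = combined_score([0.9, 0.8, 0.40], 0.00001)  # ≈ 0.00000288
print(the / thw)  # ≈ 6125: "the" wins despite the lower acoustic score
```

Note that the language prior differs by a factor of 7000 while the acoustic scores differ by barely 15%, which is why even a crude word-frequency table is so effective.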
14.3 Simple N-gram Language Model
An n-gram model estimates word probability from the previous n-1 words:
from collections import defaultdict, Counter

class BigramModel:
    """Simple bigram (2-gram) language model.
    P(word | previous_word) estimated from text corpus."""

    def __init__(self):
        self.bigrams = defaultdict(Counter)  # prev → {word: count}
        self.unigrams = Counter()            # word → count
        self.total = 0

    def train(self, text_file):
        """Train on a text file (one sentence per line)."""
        with open(text_file) as f:
            for line in f:
                words = ['<s>'] + line.lower().split() + ['</s>']
                for i in range(1, len(words)):
                    self.bigrams[words[i-1]][words[i]] += 1
                    self.unigrams[words[i]] += 1
                    self.total += 1

    def prob(self, word, prev_word='<s>'):
        """P(word | prev_word) with simple smoothing."""
        bigram_count = self.bigrams[prev_word][word]
        prev_total = sum(self.bigrams[prev_word].values())
        if prev_total > 0:
            # Interpolate bigram and unigram probabilities
            p_bi = bigram_count / prev_total
            p_uni = self.unigrams[word] / self.total
            return 0.7 * p_bi + 0.3 * p_uni  # weighted mix
        else:
            return self.unigrams.get(word, 1) / self.total
Training data: Any large English text works — Wikipedia dumps, Project Gutenberg books, or even /usr/share/dict/words. For domain-specific use (coding), train on source code.
14.4 Beam Search Decoder
Beam search efficiently finds the most likely word by exploring multiple candidates simultaneously:
def beam_decode(acoustic_probs, lm, beam_width=5, prev_word='<s>'):
    """Decode a sequence of acoustic probability distributions into a word.

    acoustic_probs: list of dicts [{char: probability}, ...]
    lm: language model with .prob(word, prev_word)
    beam_width: number of candidates to keep at each step
    """
    # Start with empty candidates
    beams = [('', 1.0)]  # (partial_word, cumulative_score)
    for probs in acoustic_probs:
        new_beams = []
        for partial, score in beams:
            for char, p_acoustic in probs.items():
                if p_acoustic < 0.05:  # prune unlikely characters
                    continue
                new_word = partial + char
                new_score = score * p_acoustic
                new_beams.append((new_word, new_score))
        # Keep only top beam_width candidates
        new_beams.sort(key=lambda x: -x[1])
        beams = new_beams[:beam_width]

    # Score final candidates with language model
    scored = []
    for word, acoustic_score in beams:
        lm_score = lm.prob(word, prev_word)
        combined = acoustic_score * lm_score
        scored.append((word, combined, acoustic_score, lm_score))
    scored.sort(key=lambda x: -x[1])
    return scored[0][0]  # best word
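The pruning loop at the heart of the decoder can be exercised standalone on the "thw"/"the" distributions from Section 14.2. With no language model in play yet, the acoustically best string wins:

```python
def beam_step(beams, probs, width):
    """Extend each beam by every candidate character, keep the top `width`."""
    new = [(word + c, score * p) for word, score in beams
           for c, p in probs.items()]
    new.sort(key=lambda x: -x[1])
    return new[:width]

beams = [('', 1.0)]
for probs in [{'t': 0.9, 'f': 0.1},
              {'h': 0.8, 'j': 0.2},
              {'w': 0.4, 'e': 0.35, 's': 0.25}]:
    beams = beam_step(beams, probs, width=3)

print(beams[0])  # best beam is 'thw' with score ≈ 0.288, before LM rescoring
```

Because "the" (score ≈ 0.252) survives in the beam, the final language-model rescoring in beam_decode can still promote it to the winner — that is the whole point of keeping beam_width candidates instead of greedily taking the top character at each step.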
14.5 Integration Pipeline
The complete system chains acoustic classification, beam search, and language correction:
Audio stream
│
▼
Onset detection ──► 100ms segment ──► Feature extraction
│ │
│ ▼
│ CNN / SVM prediction
│ (per-key probabilities)
│ │
│ ▼
│ Beam search decoder
│ (acoustic × language model)
│ │
│ ┌───────────────────┘
│ ▼
│ Corrected word
│ │
│ ▼
│ Autocomplete candidates
│ (optional: suggest completions
│ from language model)
│ │
▼ ▼
Display: "the quick brown f|ox"
↑ cursor + suggestions
Connection to Speech Recognition
This architecture — acoustic model + language model + beam search — is exactly how automatic speech recognition (ASR) systems like Whisper, DeepSpeech, and Kaldi work. The acoustic model converts sound to character probabilities, and the language model finds the most likely text. In ASR, the acoustic model is a large neural network; here, it's our simpler CNN. The language model and decoder are identical.
This is why this tutorial is a good stepping stone toward understanding speech recognition pipelines — the core architecture is the same, just at different scales.
14.6 Expected Improvement
| Approach | Per-character accuracy | Word accuracy |
|---|---|---|
| SVM only (20 samples/key) | 75% | ~50% |
| CNN only (200 samples/key) | 90% | ~75% |
| CNN + unigram LM | 90% → 94% corrected | ~88% |
| CNN + bigram LM | 90% → 96% corrected | ~93% |
| CNN + bigram LM + autocomplete | 90% → 98% effective | ~97% |
The language model contributes most when the acoustic model is uncertain. If the CNN is 99% accurate, the LM barely helps. If the CNN is 70% accurate, the LM can recover many words that are obvious from context.
Challenges (Extended)
Tip
Challenge 6: Data Budget Experiment Train the SVM with 10, 20, 50, and 100 samples per key. Plot accuracy vs. dataset size. At what point does accuracy plateau? This tells you the minimum data investment needed for your keyboard + mic setup.
Tip
Challenge 7: CNN vs SVM Learning Curves Plot training and validation accuracy vs. epoch for the CNN. Compare the final accuracy to SVM at the same dataset size. Create a table like Section 11.2 with your actual numbers.
Tip
Challenge 8: Feature Importance Analysis
Train a Random Forest and plot feature_importances_. Which mel bands are most important? Which time frames? Map these back to the spectrogram to understand what acoustic properties distinguish keys. Compare to what the CNN's first-layer filters learned.
Tip
Challenge 9: Cross-Keyboard Transfer Train on one keyboard, test on a different one. How much accuracy drops? Can you improve transfer by: - Using only spectral shape (normalized per-frame) instead of absolute magnitudes? - Fine-tuning the CNN on 5 samples per key from the new keyboard?
Tip
Challenge 10: Real-Time CNN Inference
Modify live_inference.py to use the ONNX model instead of scikit-learn. Measure the inference latency on the Pi. Is the CNN fast enough for real-time? What's the accuracy difference in practice?
Further Reading
References
Acoustic keystroke recognition: - Asonov & Agrawal (2004), Keyboard Acoustic Emanations — the original paper - Zhuang et al. (2009), Keyboard Acoustic Emanations Revisited — improved methods, 96% accuracy - Compagno et al. (2017), Don't Skype & Type! — attack over VoIP
Machine learning foundations: - Andrew Ng, Machine Learning Specialization — free course, excellent intuitive explanations - 3Blue1Brown, Neural Networks — visual explanations of backpropagation
Embedded ML deployment: - ONNX Runtime documentation — deployment on ARM/x86 - TFLite Micro — for MCU deployment (ESP32, STM32) - Pete Warden & Daniel Situnayake, TinyML (O'Reilly) — the reference book for ML on embedded devices
Audio signal processing: - Signal Processing Reference — sampling, FFT, filtering - Julius O. Smith III, Mathematics of the DFT — free online textbook
See also: ML and Signal Processing Reference | Signal Processing Reference | Audio Viz Challenges | I2S Audio Visualizer | Audio Pipeline Latency