SPI DMA: From Polling to Hardware Acceleration
Time estimate: ~60 minutes
Prerequisites: Jitter Measurement, BMI160 SPI Driver
Learning Objectives
By the end of this tutorial you will be able to:
- Understand polling vs interrupt vs DMA for SPI transfers
- Measure CPU load reduction from optimized SPI access
- Implement FIFO burst reads for batched sensor data
- See impact of reduced CPU load on display smoothness
DMA and SPI Throughput Optimization
Peripheral I/O competes with application logic and rendering for CPU time. When the CPU handles every byte of an SPI transfer (PIO — Programmed I/O), it cannot do anything else during the transfer. DMA (Direct Memory Access) solves this: the DMA controller moves data between memory and the SPI peripheral autonomously, freeing the CPU for rendering or computation.

The Linux SPI subsystem automatically uses DMA for transfers above a threshold (~96 bytes on BCM2835). By combining small individual register reads into a single bulk transfer, or by using the sensor's hardware FIFO to batch samples, you can exceed this threshold and unlock DMA. FIFO batching also reduces interrupt frequency — reading 20 samples at once means 10 reads/sec instead of 200, dramatically lowering per-sample overhead even if each read takes longer.
See also: Real-Time Graphics reference
1. Baseline: Individual Register Reads
The current BMI160 driver reads each accelerometer axis as a separate SPI transaction. Each transaction involves:
- CPU writes the register address to the SPI TX buffer
- CPU waits for the SPI controller to shift out the data
- CPU reads the received byte from the SPI RX buffer
- Repeat for each register
For 3 axes (X, Y, Z), each 2 bytes, that is 6 separate SPI transactions just for the accelerometer.
Measure the current CPU usage while the level display reads at 200 Hz:
# Terminal 1: run the level display
sudo SDL_VIDEODRIVER=kmsdrm ./level_sdl2 &
# Terminal 2: measure per-core CPU usage
mpstat -P ALL 1 10
Record the CPU percentage for the core running the sensor thread. Also measure with top:
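A one-shot batch invocation works well here (the process name `level_sdl2` comes from the launch command above):

```shell
# Single batch sample of top; filter for the app's process line
top -bn1 | grep level_sdl2
```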
Note down the CPU% — this is your baseline.
Tip
To identify which core the sensor thread runs on, use:
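For example (assuming the sensor thread lives inside the `level_sdl2` process; adjust the name to match your binary):

```shell
# List all threads with their current processor (psr) assignment
ps -eLo pid,tid,psr,comm | grep level_sdl2
```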
The `psr` column shows the processor (core) number.
Checkpoint
You have a baseline CPU usage measurement for the current individual-register-read implementation.
2. SPI DMA on BCM2835
The Raspberry Pi's BCM2835/BCM2711 SPI controller supports DMA (Direct Memory Access). The Linux SPI subsystem automatically uses DMA for transfers above a threshold — typically around 96 bytes.
Check the SPI DMA initialization in the kernel log:
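One way to check (the exact message text varies by kernel version; the controller may appear as `spi-bcm2835`):

```shell
# Look for SPI controller probe and DMA channel messages in the kernel log
dmesg | grep -iE 'spi|dma' | head -n 20
```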
You should see messages about DMA channel allocation for the SPI controller.
For transfers below the DMA threshold:
- The SPI driver uses PIO (Programmed I/O) — the CPU directly reads/writes SPI registers
- Each byte transferred requires CPU attention
- For our 2-byte register reads, DMA is never triggered
For transfers above the threshold:
- The SPI driver sets up a DMA descriptor pointing to the TX/RX buffers
- The DMA controller moves data between memory and the SPI peripheral
- The CPU is free to do other work during the transfer
- An interrupt signals completion
| Transfer Size | Method | CPU During Transfer |
|---|---|---|
| < 96 bytes | PIO (CPU-driven) | Busy |
| >= 96 bytes | DMA | Free |
The key insight: our individual 2-byte reads never benefit from DMA. We need to combine them into larger transfers.
3. Optimize: Bulk Transfer
The BMI160 sensor stores gyroscope and accelerometer data in a contiguous register block from 0x0C to 0x17 (12 bytes total: 3 axes of gyro + 3 axes of accel, each 16-bit).
Instead of 6 separate 2-byte reads, we can read all 12 bytes in a single SPI transaction.
Modify the driver's read function in bmi160_spi.c:
Before (individual reads):
/* Six separate SPI transactions */
static int bmi160_read_all(struct bmi160_dev *dev, struct bmi160_data *data)
{
data->gx = bmi160_read_reg16(dev, 0x0C);
data->gy = bmi160_read_reg16(dev, 0x0E);
data->gz = bmi160_read_reg16(dev, 0x10);
data->ax = bmi160_read_reg16(dev, 0x12);
data->ay = bmi160_read_reg16(dev, 0x14);
data->az = bmi160_read_reg16(dev, 0x16);
return 0;
}
After (single bulk read):
/* One SPI transaction for all 12 bytes */
static int bmi160_read_all(struct bmi160_dev *dev, struct bmi160_data *data)
{
u8 tx[14] = { 0x80 | 0x0C }; /* read bit | start register */
u8 rx[14] = { 0 };
struct spi_transfer xfer = {
.tx_buf = tx,
.rx_buf = rx,
.len = 14, /* 1 addr + 1 dummy + 12 data */
};
struct spi_message msg;
int ret;
spi_message_init(&msg);
spi_message_add_tail(&xfer, &msg);
ret = spi_sync(dev->spi, &msg);
if (ret)
return ret;
/* Skip first 2 bytes (address echo + dummy) */
data->gx = le16_to_cpup((__le16 *)&rx[2]);
data->gy = le16_to_cpup((__le16 *)&rx[4]);
data->gz = le16_to_cpup((__le16 *)&rx[6]);
data->ax = le16_to_cpup((__le16 *)&rx[8]);
data->ay = le16_to_cpup((__le16 *)&rx[10]);
data->az = le16_to_cpup((__le16 *)&rx[12]);
return 0;
}
The BMI160 supports auto-increment — when you start reading at register 0x0C, subsequent bytes come from 0x0D, 0x0E, and so on. This is standard for most SPI sensors.
Rebuild and reload the module:
Checkpoint
After reloading the module, the level display app still works correctly. Verify the sensor data matches (values should be the same as before).
Stuck?
- If data looks wrong after the change, check the byte offset — SPI read has a 1-byte dummy after the address byte on BMI160
- Verify endianness: BMI160 stores data as little-endian, which matches ARM natively
- If the module fails to load, check `dmesg | tail` for error messages
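A quick sanity check for the little-endian decoding, runnable anywhere in pure Python (independent of the driver):

```python
import struct

# BMI160 sends the LSB first: raw bytes 0x34, 0x12 decode to 0x1234
raw = bytes([0x34, 0x12])
(value,) = struct.unpack('<h', raw)  # '<h' = little-endian signed 16-bit
print(hex(value))  # 0x1234
```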
4. Measure Improvement
Repeat the CPU measurement with the bulk-read driver:
# Terminal 1: run the level display
sudo SDL_VIDEODRIVER=kmsdrm ./level_sdl2 &
# Terminal 2: measure CPU
mpstat -P ALL 1 10
top -bn1 | grep level_sdl2
Compare with your baseline measurement. The improvement comes from:
- Fewer SPI transactions: 1 instead of 6 (less per-transaction overhead)
- Less context switching: each `spi_sync()` call may involve scheduling
- Better SPI bus utilization: one continuous clock burst instead of 6 short bursts with gaps
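A back-of-the-envelope model makes the transaction-count saving concrete (the 10 µs per-transaction overhead below is an illustrative assumption, not a measured value):

```python
# Rough model: fixed per-transaction overhead dominates small SPI reads
ODR_HZ = 200       # sample rate
OVERHEAD_US = 10   # assumed fixed cost per spi_sync() transaction

individual = 6 * ODR_HZ * OVERHEAD_US  # 6 transactions per sample
bulk = 1 * ODR_HZ * OVERHEAD_US        # 1 transaction per sample

print(f"individual: {individual} us/s of overhead")  # 12000
print(f"bulk:       {bulk} us/s of overhead")        # 2000
print(f"saving:     {individual - bulk} us/s")       # 10000
```

The real numbers from section 6 will differ, but the 6:1 ratio in transaction count is what the bulk read buys you.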
Tip
The improvement may seem small at 200 Hz. The real benefit appears at higher sample rates (800-1600 Hz) where per-transaction overhead dominates.
5. FIFO Burst Read
The BMI160 has a built-in FIFO buffer that can store up to 1024 bytes of sensor data. Instead of reading every 5 ms at 200 Hz, we can let the sensor accumulate samples in its FIFO and read them all at once.
Concept:
- Configure the BMI160 FIFO to store accelerometer and gyroscope frames
- Set a watermark — when the FIFO reaches this many samples, trigger a read
- Read all buffered samples in one large SPI transfer (DMA-eligible!)
- Process all samples at once
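The arithmetic behind DMA eligibility is easy to verify with the numbers used in this tutorial (12-byte frames, watermark of 20, 200 Hz ODR):

```python
# FIFO watermark arithmetic: burst transfer size and read interval
FRAME_BYTES = 12     # 6 axes x 2 bytes
WATERMARK = 20       # frames per burst
ODR_HZ = 200         # output data rate
DMA_THRESHOLD = 96   # approximate BCM2835 SPI DMA threshold (bytes)

burst_bytes = 2 + WATERMARK * FRAME_BYTES  # 1 addr + 1 dummy + data
interval_ms = WATERMARK / ODR_HZ * 1000

print(burst_bytes, burst_bytes >= DMA_THRESHOLD)  # 242 True
print(f"read every {interval_ms:.0f} ms")         # read every 100 ms
```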
Configure the FIFO via sysfs (if the driver exposes it):
# Set watermark to 20 frames (20 * 12 bytes = 240 bytes — above DMA threshold!)
echo 20 > /sys/class/bmi160/bmi160/fifo_watermark
The FIFO burst read in the driver:
#define FIFO_FRAME_SIZE 12 /* 6 axes * 2 bytes */
#define FIFO_MAX_FRAMES 20
static int bmi160_read_fifo(struct bmi160_dev *dev,
                            struct bmi160_data *buf, int *count)
{
	/* 0x24 is the BMI160 FIFO_DATA register; set the read bit */
	u8 tx[2 + FIFO_MAX_FRAMES * FIFO_FRAME_SIZE] = { 0x80 | 0x24 };
	u8 rx[sizeof(tx)] = { 0 };
	struct spi_transfer xfer = { .tx_buf = tx, .rx_buf = rx };
	struct spi_message msg;
	/* FIFO length register (0x22/0x23) is an 11-bit byte count */
	u16 fifo_len = bmi160_read_reg16(dev, 0x22) & 0x07FF;
	int frames = fifo_len / FIFO_FRAME_SIZE;
	int i, ret;
	if (frames > FIFO_MAX_FRAMES)
		frames = FIFO_MAX_FRAMES;
	/* Bulk read: 1 addr + 1 dummy + (frames * 12) bytes — DMA-eligible */
	xfer.len = 2 + frames * FIFO_FRAME_SIZE;
	spi_message_init(&msg);
	spi_message_add_tail(&xfer, &msg);
	ret = spi_sync(dev->spi, &msg);
	if (ret)
		return ret;
	/* Unpack headerless frames: gyro X/Y/Z then accel X/Y/Z, little-endian */
	for (i = 0; i < frames; i++) {
		u8 *f = &rx[2 + i * FIFO_FRAME_SIZE];
		buf[i].gx = le16_to_cpup((__le16 *)&f[0]);
		buf[i].gy = le16_to_cpup((__le16 *)&f[2]);
		buf[i].gz = le16_to_cpup((__le16 *)&f[4]);
		buf[i].ax = le16_to_cpup((__le16 *)&f[6]);
		buf[i].ay = le16_to_cpup((__le16 *)&f[8]);
		buf[i].az = le16_to_cpup((__le16 *)&f[10]);
	}
	*count = frames;
	return 0;
}
With a watermark of 20 at 200 Hz ODR, the driver reads every 100 ms instead of every 5 ms — reducing interrupt frequency by 20x. And each read is 242 bytes, which is above the DMA threshold.
Checkpoint
FIFO mode produces valid data — verify by checking that individual samples within the burst match expected ranges and that the total sample count matches the watermark.
6. CPU Usage Comparison Table
Measure CPU usage for all three methods and fill in this table:
| Method | CPU per read (us) | Reads/sec | CPU total (%) |
|---|---|---|---|
| Individual register reads | | 200 | |
| Bulk 12-byte read | | 200 | |
| FIFO burst (20 samples) | | 10 | |
To measure CPU time per read accurately, use the timestamps from the driver or instrument the read function:
# Enable driver debug timing
echo 1 > /sys/class/bmi160/bmi160/debug_timing
cat /sys/class/bmi160/bmi160/read_time_us
Or measure from user space:
import time

t0 = time.monotonic_ns()
data = read_sensor()  # your user-space read path (sysfs, chardev, etc.)
t1 = time.monotonic_ns()
print(f"Read took {(t1-t0)/1000:.0f} us")
Tip
The FIFO method dramatically reduces reads/sec (200 down to 10), which is the primary source of CPU savings. Even if each FIFO read takes longer than an individual read, the total CPU time is much lower because you do it 20x less often.
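To see why fewer reads wins even when each read is slower, plug in illustrative numbers (the per-read costs below are assumptions for the model, not measurements — replace them with your table values):

```python
# Illustrative per-read CPU costs: (us per read, reads/sec) — assumed, not measured
methods = {
    "individual (6 x 2-byte)": (150, 200),
    "bulk 12-byte":            (60, 200),
    "FIFO burst (20 frames)":  (300, 10),
}

for name, (us_per_read, reads_per_sec) in methods.items():
    total_us = us_per_read * reads_per_sec  # CPU microseconds per second
    print(f"{name}: {total_us} us/s = {total_us / 1e6:.2%} of one core")
```

Even with the FIFO read assumed 2x slower than an individual read, doing it 20x less often cuts the total by an order of magnitude.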
7. Impact on Display
With reduced CPU load from the sensor thread, the render thread has more headroom. This matters most under stress conditions.
Re-run the jitter measurement from the previous tutorial with the FIFO-optimized driver:
# With FIFO optimization, under stress
stress-ng --cpu 3 &
sudo SDL_VIDEODRIVER=kmsdrm ./level_sdl2 -l fifo_stress.csv &
sleep 120
kill %2
kill %1
Compare with your earlier stress.csv (standard driver, standard kernel, under load):
python3 src/embedded-linux/scripts/jitter-measurement/analyze_jitter.py \
stress.csv fifo_stress.csv \
--labels "Baseline+Stress" "FIFO+Stress" \
--plot
Expected improvements:
- Fewer dropped frames under load
- Lower 99th percentile latency
- More consistent sensor dt (FIFO batching smooths out scheduling jitter)
Checkpoint
The FIFO-optimized driver shows measurably fewer dropped frames under stress compared to the individual-read driver.
What Just Happened?
DMA moves data without CPU intervention. For transfers above the SPI DMA threshold (~96 bytes), the DMA controller handles the byte-by-byte transfer between the SPI peripheral and memory. The CPU only needs to set up the transfer descriptor and handle the completion interrupt.
FIFO batching reduces interrupt frequency. Instead of the CPU servicing 200 interrupts per second (one per sensor sample), the FIFO accumulates samples and triggers only 10 reads per second. Each read is larger but the per-read overhead (context switch, SPI setup, interrupt handling) is paid far less often.
Both free CPU cycles for rendering. The render thread competes with the sensor thread for CPU time. By reducing the sensor thread's CPU footprint, more cycles are available for frame preparation, leading to fewer dropped frames under load.
This is the same optimization pattern used in production embedded systems — minimize per-sample overhead. Industrial IMUs, GPS receivers, and data acquisition systems all use FIFO buffering and DMA to achieve high data rates without proportional CPU load.
Challenges
Challenge: Maximum ODR
Push the BMI160 to its maximum output data rate of 1600 Hz. Configure the FIFO watermark appropriately and measure whether FIFO + DMA can sustain this rate without data loss. Monitor the FIFO overflow flag register (0x1B, bit 6) to detect if the FIFO fills faster than you can read it.
Deliverable
- [ ] CPU usage comparison table filled in for all three methods (individual, bulk, FIFO)
- [ ] Jitter comparison plot: standard driver vs FIFO-optimized driver under stress load
- [ ] Written explanation of why FIFO batching reduces CPU usage even though each individual read is larger