SPI DMA: From Polling to Hardware Acceleration
Time estimate: ~60 minutes
Prerequisites: Jitter Measurement, BMI160 SPI Driver
Learning Objectives
By the end of this tutorial you will be able to:
- Understand polling vs interrupt vs DMA for SPI transfers
- Measure CPU load reduction from optimized SPI access
- Implement FIFO burst reads for batched sensor data
- See impact of reduced CPU load on display smoothness
DMA and SPI Throughput Optimization
Peripheral I/O competes with application logic and rendering for CPU time. When the CPU handles every byte of an SPI transfer (PIO — Programmed I/O), it cannot do anything else during the transfer. DMA (Direct Memory Access) solves this: the DMA controller moves data between memory and the SPI peripheral autonomously, freeing the CPU for rendering or computation.

The Linux SPI subsystem automatically uses DMA for transfers above a threshold (~96 bytes on BCM2835). By combining small individual register reads into a single bulk transfer, or by using the sensor's hardware FIFO to batch samples, you can exceed this threshold and unlock DMA. FIFO batching also reduces interrupt frequency — reading 20 samples at once means 10 reads/sec instead of 200, dramatically lowering per-sample overhead even if each read takes longer.
See also: Real-Time Graphics reference
1. Baseline: Individual Register Reads
The current BMI160 driver reads each accelerometer axis as a separate SPI transaction. Each transaction involves:
- CPU writes the register address to the SPI TX buffer
- CPU waits for the SPI controller to shift out the data
- CPU reads the received byte from the SPI RX buffer
- Repeat for each register
For 3 axes (X, Y, Z), each 2 bytes, that is 6 separate SPI transactions just for the accelerometer.
Measure the current CPU usage while the level display reads at 200 Hz:
# Terminal 1: run the level display
sudo SDL_VIDEODRIVER=kmsdrm ./level_sdl2 &
# Terminal 2: measure per-core CPU usage
mpstat -P ALL 1 10
Record the CPU percentage for the core running the sensor thread. Also measure with top:
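A one-shot batch invocation works well here (the process name `level_sdl2` comes from the launch command above):

```shell
# Single batch sample of top; filter for the app's process line
top -bn1 | grep level_sdl2
```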
Note down the CPU% — this is your baseline.
Tip
To identify which core the sensor thread runs on, use:
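For example (assuming the sensor thread lives inside the `level_sdl2` process; adjust the name to match your binary):

```shell
# List all threads with their current processor (psr) assignment
ps -eLo pid,tid,psr,comm | grep level_sdl2
```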
The `psr` column shows the processor (core) number.
Checkpoint
You have a baseline CPU usage measurement for the current individual-register-read implementation.
2. SPI DMA on BCM2835
The Raspberry Pi's BCM2835/BCM2711 SPI controller supports DMA (Direct Memory Access). The Linux SPI subsystem automatically uses DMA for transfers above a threshold — typically around 96 bytes.
Check the SPI DMA initialization in the kernel log:
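One way to check (the exact message text varies by kernel version; the controller may appear as `spi-bcm2835`):

```shell
# Look for SPI controller probe and DMA channel messages in the kernel log
dmesg | grep -iE 'spi|dma' | head -n 20
```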
You should see messages about DMA channel allocation for the SPI controller.
For transfers below the DMA threshold:
- The SPI driver uses PIO (Programmed I/O) — the CPU directly reads/writes SPI registers
- Each byte transferred requires CPU attention
- For our 2-byte register reads, DMA is never triggered
For transfers above the threshold:
- The SPI driver sets up a DMA descriptor pointing to the TX/RX buffers
- The DMA controller moves data between memory and the SPI peripheral
- The CPU is free to do other work during the transfer
- An interrupt signals completion
| Transfer Size | Method | CPU During Transfer |
|---|---|---|
| < 96 bytes | PIO (CPU-driven) | Busy |
| >= 96 bytes | DMA | Free |
The key insight: our individual 2-byte reads never benefit from DMA. We need to combine them into larger transfers.
3. Optimize: Bulk Transfer
The BMI160 sensor stores gyroscope and accelerometer data in a contiguous register block from 0x0C to 0x17 (12 bytes total: 3 axes of gyro + 3 axes of accel, each 16-bit).
Instead of 6 separate 2-byte reads, we can read all 12 bytes in a single SPI transaction.
Modify the driver's read function in bmi160_spi.c:
Before (individual reads):
/* Six separate SPI transactions */
static int bmi160_read_all(struct bmi160_dev *dev, struct bmi160_data *data)
{
data->gx = bmi160_read_reg16(dev, 0x0C);
data->gy = bmi160_read_reg16(dev, 0x0E);
data->gz = bmi160_read_reg16(dev, 0x10);
data->ax = bmi160_read_reg16(dev, 0x12);
data->ay = bmi160_read_reg16(dev, 0x14);
data->az = bmi160_read_reg16(dev, 0x16);
return 0;
}
After (single bulk read):
/* One SPI transaction for all 12 bytes */
static int bmi160_read_all(struct bmi160_dev *dev, struct bmi160_data *data)
{
u8 tx[14] = { 0x80 | 0x0C }; /* read bit | start register */
u8 rx[14] = { 0 };
struct spi_transfer xfer = {
.tx_buf = tx,
.rx_buf = rx,
.len = 14, /* 1 addr + 1 dummy + 12 data */
};
struct spi_message msg;
int ret;
spi_message_init(&msg);
spi_message_add_tail(&xfer, &msg);
ret = spi_sync(dev->spi, &msg);
if (ret)
return ret;
/* Skip first 2 bytes (address echo + dummy) */
data->gx = le16_to_cpup((__le16 *)&rx[2]);
data->gy = le16_to_cpup((__le16 *)&rx[4]);
data->gz = le16_to_cpup((__le16 *)&rx[6]);
data->ax = le16_to_cpup((__le16 *)&rx[8]);
data->ay = le16_to_cpup((__le16 *)&rx[10]);
data->az = le16_to_cpup((__le16 *)&rx[12]);
return 0;
}
The BMI160 supports auto-increment — when you start reading at register 0x0C, subsequent bytes come from 0x0D, 0x0E, and so on. This is standard for most SPI sensors.
Rebuild and reload the module:
Checkpoint
After reloading the module, the level display app still works correctly. Verify the sensor data matches (values should be the same as before).
Stuck?
- If data looks wrong after the change, check the byte offset — SPI read has a 1-byte dummy after the address byte on BMI160
- Verify endianness: BMI160 stores data as little-endian, which matches ARM natively
- If the module fails to load, check `dmesg | tail` for error messages
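A quick sanity check for the little-endian decoding, runnable anywhere in pure Python (independent of the driver):

```python
import struct

# BMI160 sends the LSB first: raw bytes 0x34, 0x12 decode to 0x1234
raw = bytes([0x34, 0x12])
(value,) = struct.unpack('<h', raw)  # '<h' = little-endian signed 16-bit
print(hex(value))  # 0x1234
```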
4. Measure Improvement
Repeat the CPU measurement with the bulk-read driver:
# Terminal 1: run the level display
sudo SDL_VIDEODRIVER=kmsdrm ./level_sdl2 &
# Terminal 2: measure CPU
mpstat -P ALL 1 10
top -bn1 | grep level_sdl2
Compare with your baseline measurement. The improvement comes from:
- Fewer SPI transactions: 1 instead of 6 (less per-transaction overhead)
- Less context switching: each `spi_sync()` call may involve scheduling
- Better SPI bus utilization: one continuous clock burst instead of 6 short bursts with gaps
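A back-of-the-envelope model makes the transaction-count saving concrete (the 10 µs per-transaction overhead below is an illustrative assumption, not a measured value):

```python
# Rough model: fixed per-transaction overhead dominates small SPI reads
ODR_HZ = 200       # sample rate
OVERHEAD_US = 10   # assumed fixed cost per spi_sync() transaction

individual = 6 * ODR_HZ * OVERHEAD_US  # 6 transactions per sample
bulk = 1 * ODR_HZ * OVERHEAD_US        # 1 transaction per sample

print(f"individual: {individual} us/s of overhead")  # 12000
print(f"bulk:       {bulk} us/s of overhead")        # 2000
print(f"saving:     {individual - bulk} us/s")       # 10000
```

The real numbers from section 6 will differ, but the 6:1 ratio in transaction count is what the bulk read buys you.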
Tip
The improvement may seem small at 200 Hz. The real benefit appears at higher sample rates (800-1600 Hz) where per-transaction overhead dominates.
5. FIFO Burst Read
The BMI160 has a built-in FIFO buffer that can store up to 1024 bytes of sensor data. Instead of reading every 5 ms at 200 Hz, we can let the sensor accumulate samples in its FIFO and read them all at once.
Concept:
- Configure the BMI160 FIFO to store accelerometer and gyroscope frames
- Set a watermark — when the FIFO reaches this many samples, trigger a read
- Read all buffered samples in one large SPI transfer (DMA-eligible!)
- Process all samples at once
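The arithmetic behind DMA eligibility is easy to verify with the numbers used in this tutorial (12-byte frames, watermark of 20, 200 Hz ODR):

```python
# FIFO watermark arithmetic: burst transfer size and read interval
FRAME_BYTES = 12     # 6 axes x 2 bytes
WATERMARK = 20       # frames per burst
ODR_HZ = 200         # output data rate
DMA_THRESHOLD = 96   # approximate BCM2835 SPI DMA threshold (bytes)

burst_bytes = 2 + WATERMARK * FRAME_BYTES  # 1 addr + 1 dummy + data
interval_ms = WATERMARK / ODR_HZ * 1000

print(burst_bytes, burst_bytes >= DMA_THRESHOLD)  # 242 True
print(f"read every {interval_ms:.0f} ms")         # read every 100 ms
```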
Configure the FIFO via sysfs (if the driver exposes it):
# Set watermark to 20 frames (20 * 12 bytes = 240 bytes — above DMA threshold!)
echo 20 > /sys/class/bmi160/bmi160/fifo_watermark
The FIFO burst read in the driver:
#define FIFO_FRAME_SIZE 12 /* 6 axes * 2 bytes */
#define FIFO_MAX_FRAMES 20
static int bmi160_read_fifo(struct bmi160_dev *dev,
                            struct bmi160_data *buf, int *count)
{
	/* 0x24 is the BMI160 FIFO_DATA register; set the read bit */
	u8 tx[2 + FIFO_MAX_FRAMES * FIFO_FRAME_SIZE] = { 0x80 | 0x24 };
	u8 rx[sizeof(tx)] = { 0 };
	struct spi_transfer xfer = { .tx_buf = tx, .rx_buf = rx };
	struct spi_message msg;
	/* FIFO length register (0x22/0x23) is an 11-bit byte count */
	u16 fifo_len = bmi160_read_reg16(dev, 0x22) & 0x07FF;
	int frames = fifo_len / FIFO_FRAME_SIZE;
	int i, ret;
	if (frames > FIFO_MAX_FRAMES)
		frames = FIFO_MAX_FRAMES;
	/* Bulk read: 1 addr + 1 dummy + (frames * 12) bytes — DMA-eligible */
	xfer.len = 2 + frames * FIFO_FRAME_SIZE;
	spi_message_init(&msg);
	spi_message_add_tail(&xfer, &msg);
	ret = spi_sync(dev->spi, &msg);
	if (ret)
		return ret;
	/* Unpack headerless frames: gyro X/Y/Z then accel X/Y/Z, little-endian */
	for (i = 0; i < frames; i++) {
		u8 *f = &rx[2 + i * FIFO_FRAME_SIZE];
		buf[i].gx = le16_to_cpup((__le16 *)&f[0]);
		buf[i].gy = le16_to_cpup((__le16 *)&f[2]);
		buf[i].gz = le16_to_cpup((__le16 *)&f[4]);
		buf[i].ax = le16_to_cpup((__le16 *)&f[6]);
		buf[i].ay = le16_to_cpup((__le16 *)&f[8]);
		buf[i].az = le16_to_cpup((__le16 *)&f[10]);
	}
	*count = frames;
	return 0;
}
With a watermark of 20 at 200 Hz ODR, the driver reads every 100 ms instead of every 5 ms — reducing interrupt frequency by 20x. And each read is 242 bytes, which is above the DMA threshold.
Checkpoint
FIFO mode produces valid data — verify by checking that individual samples within the burst match expected ranges and that the total sample count matches the watermark.
6. CPU Usage Comparison Table
Measure CPU usage for all three methods and fill in this table:
| Method | CPU per read (us) | Reads/sec | CPU total (%) |
|---|---|---|---|
| Individual register reads | | 200 | |
| Bulk 12-byte read | | 200 | |
| FIFO burst (20 samples) | | 10 | |
To measure CPU time per read accurately, use the timestamps from the driver or instrument the read function:
# Enable driver debug timing
echo 1 > /sys/class/bmi160/bmi160/debug_timing
cat /sys/class/bmi160/bmi160/read_time_us
Or measure from user space:
import time

t0 = time.monotonic_ns()
data = read_sensor()  # your user-space read path (sysfs, chardev, etc.)
t1 = time.monotonic_ns()
print(f"Read took {(t1-t0)/1000:.0f} us")
Tip
The FIFO method dramatically reduces reads/sec (200 down to 10), which is the primary source of CPU savings. Even if each FIFO read takes longer than an individual read, the total CPU time is much lower because you do it 20x less often.
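To see why fewer reads wins even when each read is slower, plug in illustrative numbers (the per-read costs below are assumptions for the model, not measurements — replace them with your table values):

```python
# Illustrative per-read CPU costs: (us per read, reads/sec) — assumed, not measured
methods = {
    "individual (6 x 2-byte)": (150, 200),
    "bulk 12-byte":            (60, 200),
    "FIFO burst (20 frames)":  (300, 10),
}

for name, (us_per_read, reads_per_sec) in methods.items():
    total_us = us_per_read * reads_per_sec  # CPU microseconds per second
    print(f"{name}: {total_us} us/s = {total_us / 1e6:.2%} of one core")
```

Even with the FIFO read assumed 2x slower than an individual read, doing it 20x less often cuts the total by an order of magnitude.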
7. Impact on Display
With reduced CPU load from the sensor thread, the render thread has more headroom. This matters most under stress conditions.
Re-run the jitter measurement from the previous tutorial with the FIFO-optimized driver:
# With FIFO optimization, under stress
stress-ng --cpu 3 &
sudo SDL_VIDEODRIVER=kmsdrm ./level_sdl2 -l fifo_stress.csv &
sleep 120
kill %2
kill %1
Compare with your earlier stress.csv (standard driver, standard kernel, under load):
python3 src/embedded-linux/scripts/jitter-measurement/analyze_jitter.py \
stress.csv fifo_stress.csv \
--labels "Baseline+Stress" "FIFO+Stress" \
--plot
Expected improvements:
- Fewer dropped frames under load
- Lower 99th percentile latency
- More consistent sensor dt (FIFO batching smooths out scheduling jitter)
Checkpoint
The FIFO-optimized driver shows measurably fewer dropped frames under stress compared to the individual-read driver.
What Just Happened?
DMA moves data without CPU intervention. For transfers above the SPI DMA threshold (~96 bytes), the DMA controller handles the byte-by-byte transfer between the SPI peripheral and memory. The CPU only needs to set up the transfer descriptor and handle the completion interrupt.
FIFO batching reduces interrupt frequency. Instead of the CPU servicing 200 interrupts per second (one per sensor sample), the FIFO accumulates samples and triggers only 10 reads per second. Each read is larger but the per-read overhead (context switch, SPI setup, interrupt handling) is paid far less often.
Both free CPU cycles for rendering. The render thread competes with the sensor thread for CPU time. By reducing the sensor thread's CPU footprint, more cycles are available for frame preparation, leading to fewer dropped frames under load.
This is the same optimization pattern used in production embedded systems — minimize per-sample overhead. Industrial IMUs, GPS receivers, and data acquisition systems all use FIFO buffering and DMA to achieve high data rates without proportional CPU load.
Challenges
Challenge: Maximum ODR
Push the BMI160 to its maximum output data rate of 1600 Hz. Configure the FIFO watermark appropriately and measure whether FIFO + DMA can sustain this rate without data loss. Monitor the FIFO overflow flag register (0x1B, bit 6) to detect if the FIFO fills faster than you can read it.
Deliverable
- [ ] CPU usage comparison table filled in for all three methods (individual, bulk, FIFO)
- [ ] Jitter comparison plot: standard driver vs FIFO-optimized driver under stress load
- [ ] Written explanation of why FIFO batching reduces CPU usage even though each individual read is larger