DMA Fundamentals
Goal: Understand Direct Memory Access — how the DMA controller moves data between peripherals and memory without CPU involvement, when DMA helps, and how to measure its impact on your embedded Linux system.
Related Tutorials
For hands-on practice, see: SPI DMA Optimization | IIO Buffered Capture
You have an IMU streaming data at 1600 Hz over SPI, a display refreshing at 60 FPS, and a control loop that must run every 5 ms. If the CPU has to move every byte of sensor and display data by hand, it cannot keep up. DMA (Direct Memory Access) lets a dedicated hardware controller move data while the CPU runs your application code.
1. What Is DMA?
In a CPU-driven (PIO) transfer, the CPU reads one byte from a peripheral register, writes it to RAM, repeats — 100% busy for the entire transfer:
CPU-driven (PIO):
CPU ──read──► SPI DR ──store──► RAM ──read──► SPI DR ──store──► RAM ...
CPU is busy the entire time. No cycles left for your application.
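The PIO receive path is literally a CPU copy loop. A minimal userspace sketch — the "data register" here is a plain variable standing in for the real memory-mapped SPI register, so this only simulates the access pattern:

```c
#include <stdint.h>
#include <stddef.h>

/* Simulated SPI data register: real hardware would be a volatile
 * memory-mapped register that yields one new received byte per read. */
static uint8_t fake_spi_dr;

static uint8_t read_spi_dr(void)
{
    return fake_spi_dr++;   /* stand-in for "next byte from the wire" */
}

/* PIO receive: the CPU reads the register once per byte and stores it.
 * For a 4096-byte transfer, that is 4096 read+store round trips with
 * no cycles left over for application code. */
static void pio_receive(uint8_t *dst, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = read_spi_dr();   /* CPU busy for every single byte */
}
```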
With DMA, the CPU configures the DMA controller once (source, destination, length) and then is free while the DMA engine moves the data autonomously:
DMA-driven:
CPU: configure DMA (src=SPI, dst=RAM, len=4096) → done, go do other work
DMA: SPI DR ──► RAM ──► SPI DR ──► RAM ... (hardware handles it)
DMA: interrupt → "transfer complete"
CPU: process the buffer
graph LR
CPU[CPU] -->|configure| DMA[DMA Controller]
DMA -->|read| PERIPH[SPI / I2C<br>Peripheral]
DMA -->|write| MEM[System<br>Memory]
DMA -->|IRQ: done| CPU
CPU -->|free during<br>transfer| APP[Application<br>Code]
| | CPU-Driven (PIO) | DMA |
|---|---|---|
| CPU usage during transfer | 100% | Near 0% |
| Throughput | Limited by CPU speed | Limited by bus/memory speed |
| Latency | Low (immediate start) | Higher (DMA setup overhead) |
| Best for | Small transfers (<64 bytes) | Large transfers (>256 bytes) |
2. DMA on the Raspberry Pi
BCM2835/BCM2711 DMA Controllers
The Raspberry Pi's SoC has a multi-channel DMA controller:
| SoC | DMA channels | Channel types | Notes |
|---|---|---|---|
| BCM2835 (Pi 1/Zero) | 15 | 0–6 full, 7–14 lite | Lite channels: max 64 KB, no 2D stride |
| BCM2711 (Pi 4) | 15 + 4 | Standard + 4 DMA4 channels | DMA4 supports 40-bit addressing |
SPI DMA Threshold
The SPI controller driver (spi-bcm2835) automatically switches between PIO and DMA based on transfer size:
Transfer size:
< 96 bytes → PIO (CPU copies bytes directly)
≥ 96 bytes → DMA (DMA controller moves the data)
This threshold (~96 bytes on BCM2835) exists because DMA setup has overhead — for tiny transfers, PIO is actually faster.
Bus-Specific DMA Usage
| Bus | Typical DMA usage | Why |
|---|---|---|
| SPI | Automatic above threshold | Large transfers (display, IMU FIFO bursts) |
| I2C | Generally PIO | Transfers too small (sensor reads are 2–6 bytes) |
| HDMI | GPU DMA (separate path) | Continuous scan-out, dedicated DMA channel |
| SPI displays | SPI DMA for framebuffer writes | ~150 KB per frame at 240×320 RGB565 |
| SD/eMMC | Always DMA | Block transfers are 512+ bytes |
3. Linux DMA API (for Driver Authors)
The kernel provides three levels of DMA abstraction. Most driver authors use the first two; the third is used internally by bus subsystems.
Coherent DMA
void *buf;
dma_addr_t dma_handle;

buf = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
if (!buf)
        return -ENOMEM;
/* buf:        CPU virtual address
 * dma_handle: bus address the DMA controller uses */
- Kernel and device see the same memory — no cache flush needed
- Used for long-lived buffers: ring buffers, descriptor tables, command queues
- Memory is uncacheable → slightly slower CPU access, but always consistent
- Free with `dma_free_coherent(dev, size, buf, dma_handle)`
Streaming DMA
dma_addr_t dma_handle;
/* Map an existing buffer for DMA (one-shot) */
dma_handle = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);
/* ... DMA transfer happens ... */
/* Sync before CPU reads the data */
dma_sync_single_for_cpu(dev, dma_handle, size, DMA_FROM_DEVICE);
/* Unmap when done */
dma_unmap_single(dev, dma_handle, size, DMA_FROM_DEVICE);
- Maps existing buffers for one-shot or short-lived DMA transfers
- Requires explicit sync (`dma_sync_single_for_cpu`) before the CPU reads DMA'd data
- Used for per-transfer data: SPI message buffers, network packet buffers
- Direction: `DMA_TO_DEVICE` (CPU→peripheral), `DMA_FROM_DEVICE` (peripheral→CPU), `DMA_BIDIRECTIONAL` (both)
DMA Engine API
struct dma_chan *chan;
struct dma_async_tx_descriptor *desc;

chan = dma_request_chan(dev, "rx");
if (IS_ERR(chan))
        return PTR_ERR(chan);

desc = dmaengine_prep_slave_single(chan, dma_handle, size,
                                   DMA_DEV_TO_MEM, DMA_PREP_INTERRUPT);
desc->callback = my_dma_complete;  /* invoked when the transfer completes */
dmaengine_submit(desc);
dma_async_issue_pending(chan);
- Higher-level abstraction for peripheral DMA (memory-to-peripheral and peripheral-to-memory)
- Used internally by the SPI, I2C, and UART subsystems
- As a driver author using SPI/I2C, you get this for free — the bus subsystem handles it
Scatter-Gather
struct scatterlist sg[N_PAGES];
int i, n_mapped;

sg_init_table(sg, N_PAGES);
for (i = 0; i < N_PAGES; i++)
        sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);

n_mapped = dma_map_sg(dev, sg, N_PAGES, DMA_FROM_DEVICE);
if (!n_mapped)
        return -ENOMEM;  /* mapping failed */
- Maps non-contiguous physical pages as a single logical DMA transfer
- Essential when your buffer spans multiple pages (common for large allocations)
- The DMA controller handles the page boundaries transparently
4. DMA in Practice
SPI Subsystem: Automatic DMA
When you call spi_sync() or spi_async() in your driver, the SPI core automatically uses DMA for transfers above the threshold. Driver authors get DMA for free — no DMA API calls needed:
/* Your driver code — no DMA awareness needed */
struct spi_transfer t = {
.rx_buf = buf,
.len = 2400, /* Above threshold → SPI core uses DMA */
};
spi_sync_transfer(spi, &t, 1);
IIO Buffered Mode
The IIO subsystem leverages DMA through the bus subsystem:
Hardware trigger (data-ready IRQ or hrtimer)
│
▼
IIO trigger handler in driver
│
▼
spi_sync() to read sensor data
│
├── < threshold → PIO
└── ≥ threshold → DMA (automatic)
│
▼
iio_push_to_buffers_with_timestamp()
│
▼
kfifo ring buffer
│
▼
Userspace reads /dev/iio:deviceN
Framebuffer: DMA vs Deferred I/O
| Display type | Data path | DMA? |
|---|---|---|
| SPI TFT (ILI9341, ST7789) | SPI DMA for bulk pixel writes | Yes — large transfers |
| SPI OLED (SSD1306) | SPI DMA possible but small (1 KB) | Marginal benefit |
| I2C OLED (SSD1306) | PIO (I2C transfers are small) | No |
| HDMI / DSI | GPU DMA scanout | Yes — dedicated path |
For the SSD1306 OLED framebuffer driver: the fbdev deferred I/O mechanism tracks dirty pages via page faults, then flushes changed regions to the display. This is not DMA — it's CPU-driven I2C/SPI writes triggered by page faults. True DMA benefits appear with larger displays (SPI TFTs at 240×320 and up).
When DMA Helps vs When It's Overkill
| Scenario | Transfer size | DMA benefit |
|---|---|---|
| MCP9808 temperature read (I2C) | 2 bytes | None — PIO is faster |
| BMI160 single-axis read (SPI) | 2 bytes | None |
| BMI160 FIFO burst (SPI, 200 samples) | 2400 bytes | Significant CPU savings |
| SSD1306 OLED full-screen write (SPI) | 1024 bytes | Moderate |
| ILI9341 TFT full-screen write (SPI) | 153,600 bytes | Essential |
| SD card block write | 512+ bytes | Always used |
5. Measuring DMA Impact
CPU Load: mpstat
Compare CPU utilization with and without DMA-sized transfers:
# Monitor per-second CPU stats while your application runs
mpstat 1
# Example output during PIO sensor reads at 200 Hz:
# %usr %sys %idle
# 12.3 8.7 79.0
# Same workload with DMA (FIFO burst reads):
# %usr %sys %idle
# 12.1 1.2 86.7
The %sys column shows kernel CPU time — DMA reduces this significantly for large transfers.
Bus Cycle Counters: perf stat
# Count cycles, instructions, and cache misses for your application
# process while transfers run (whole-process counters, not SPI-specific)
sudo perf stat -e cycles,instructions,cache-misses \
-p $(pidof your_app) -- sleep 5
DMA Channel Tracing: ftrace
# Trace DMA mapping events (if your kernel exposes the "dma" trace group)
echo 1 > /sys/kernel/debug/tracing/events/dma/enable
cat /sys/kernel/debug/tracing/trace_pipe
Verifying DMA Is Active
# Check for SPI DMA channel allocation at boot
dmesg | grep -i dma
# Look for: "spi-bcm2835 ... DMA channel ... allocated"
# Check DMA channel usage
cat /sys/class/dma/dma*chan*/in_use
For hands-on measurement comparing PIO vs DMA with the BMI160 IMU, see the SPI DMA Optimization tutorial.