DMA Fundamentals
Goal: Understand Direct Memory Access — how the DMA controller moves data between peripherals and memory without CPU involvement, when DMA helps, and how to measure its impact on your embedded Linux system.
Related Tutorials
For hands-on practice, see: SPI DMA Optimization | IIO Buffered Capture
You have an IMU streaming data at 1600 Hz over SPI, a display refreshing at 60 FPS, and a control loop that must run every 5 ms. If the CPU has to move every byte of sensor and display data by hand, it cannot keep up. DMA (Direct Memory Access) lets a dedicated hardware controller move data while the CPU runs your application code.
1. What Is DMA?
In a CPU-driven (PIO) transfer, the CPU reads one byte from a peripheral register, writes it to RAM, repeats — 100% busy for the entire transfer:
CPU-driven (PIO):
CPU ──read──► SPI DR ──store──► RAM ──read──► SPI DR ──store──► RAM ...
CPU is busy the entire time. No cycles left for your application.
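The PIO receive path is literally a CPU copy loop. A minimal userspace sketch — the "data register" here is a plain variable standing in for the real memory-mapped SPI register, so this only simulates the access pattern:

```c
#include <stdint.h>
#include <stddef.h>

/* Simulated SPI data register: real hardware would be a volatile
 * memory-mapped register that yields one new received byte per read. */
static uint8_t fake_spi_dr;

static uint8_t read_spi_dr(void)
{
    return fake_spi_dr++;   /* stand-in for "next byte from the wire" */
}

/* PIO receive: the CPU reads the register once per byte and stores it.
 * For a 4096-byte transfer, that is 4096 read+store round trips with
 * no cycles left over for application code. */
static void pio_receive(uint8_t *dst, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = read_spi_dr();   /* CPU busy for every single byte */
}
```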
With DMA, the CPU configures the DMA controller once (source, destination, length) and then is free while the DMA engine moves the data autonomously:
DMA-driven:
CPU: configure DMA (src=SPI, dst=RAM, len=4096) → done, go do other work
DMA: SPI DR ──► RAM ──► SPI DR ──► RAM ... (hardware handles it)
DMA: interrupt → "transfer complete"
CPU: process the buffer
graph LR
CPU[CPU] -->|configure| DMA[DMA Controller]
DMA -->|read| PERIPH[SPI / I2C<br>Peripheral]
DMA -->|write| MEM[System<br>Memory]
DMA -->|IRQ: done| CPU
CPU -->|free during<br>transfer| APP[Application<br>Code]
| | CPU-Driven (PIO) | DMA |
|---|---|---|
| CPU usage during transfer | 100% | Near 0% |
| Throughput | Limited by CPU speed | Limited by bus/memory speed |
| Latency | Low (immediate start) | Higher (DMA setup overhead) |
| Best for | Small transfers (<64 bytes) | Large transfers (>256 bytes) |
2. DMA on the Raspberry Pi
BCM2835/BCM2711 DMA Controllers
The Raspberry Pi's SoC has a multi-channel DMA controller:
| SoC | DMA channels | Channel types | Notes |
|---|---|---|---|
| BCM2835 (Pi 1/Zero) | 15 | 0–6 full, 7–14 lite | Lite channels: max 64 KB, no 2D stride |
| BCM2711 (Pi 4) | 15 + 4 | Standard + 4 DMA4 channels | DMA4 supports 40-bit addressing |
SPI DMA Threshold
The SPI controller driver (spi-bcm2835) automatically switches between PIO and DMA based on transfer size:
Transfer size:
< 96 bytes → PIO (CPU copies bytes directly)
≥ 96 bytes → DMA (DMA controller moves the data)
This threshold (~96 bytes on BCM2835) exists because DMA setup has overhead — for tiny transfers, PIO is actually faster.
Bus-Specific DMA Usage
| Bus | Typical DMA usage | Why |
|---|---|---|
| SPI | Automatic above threshold | Large transfers (display, IMU FIFO bursts) |
| I2C | Generally PIO | Transfers too small (sensor reads are 2–6 bytes) |
| HDMI | GPU DMA (separate path) | Continuous scan-out, dedicated DMA channel |
| SPI displays | SPI DMA for framebuffer writes | ~150 KB per frame at 240×320 RGB565 |
| SD/eMMC | Always DMA | Block transfers are 512+ bytes |
3. Linux DMA API (for Driver Authors)
The kernel provides three levels of DMA abstraction. Most driver authors use the first two; the third is used internally by bus subsystems.
Coherent DMA
void *buf;
dma_addr_t dma_handle;

buf = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
if (!buf)
        return -ENOMEM;
/* buf:        CPU virtual address
 * dma_handle: bus address the DMA controller uses */
- Kernel and device see the same memory — no cache flush needed
- Used for long-lived buffers: ring buffers, descriptor tables, command queues
- Memory is uncacheable → slightly slower CPU access, but always consistent
- Free with `dma_free_coherent(dev, size, buf, dma_handle)`
Streaming DMA
dma_addr_t dma_handle;
/* Map an existing buffer for DMA (one-shot) */
dma_handle = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);
/* ... DMA transfer happens ... */
/* Sync before CPU reads the data */
dma_sync_single_for_cpu(dev, dma_handle, size, DMA_FROM_DEVICE);
/* Unmap when done */
dma_unmap_single(dev, dma_handle, size, DMA_FROM_DEVICE);
- Maps existing buffers for one-shot or short-lived DMA transfers
- Requires explicit sync (`dma_sync_single_for_cpu`) before the CPU reads DMA'd data
- Used for per-transfer data: SPI message buffers, network packet buffers
- Direction: `DMA_TO_DEVICE` (CPU→peripheral), `DMA_FROM_DEVICE` (peripheral→CPU), `DMA_BIDIRECTIONAL` (both)
DMA Engine API
struct dma_chan *chan;
struct dma_async_tx_descriptor *desc;

chan = dma_request_chan(dev, "rx");
if (IS_ERR(chan))
        return PTR_ERR(chan);

desc = dmaengine_prep_slave_single(chan, dma_handle, size,
                                   DMA_DEV_TO_MEM, DMA_PREP_INTERRUPT);
desc->callback = my_dma_complete;  /* invoked when the transfer completes */
dmaengine_submit(desc);
dma_async_issue_pending(chan);
- Higher-level abstraction for peripheral DMA (memory-to-peripheral and peripheral-to-memory)
- Used internally by the SPI, I2C, and UART subsystems
- As a driver author using SPI/I2C, you get this for free — the bus subsystem handles it
Scatter-Gather
struct scatterlist sg[N_PAGES];
int i, n_mapped;

sg_init_table(sg, N_PAGES);
for (i = 0; i < N_PAGES; i++)
        sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);

n_mapped = dma_map_sg(dev, sg, N_PAGES, DMA_FROM_DEVICE);
if (!n_mapped)
        return -ENOMEM;  /* mapping failed */
- Maps non-contiguous physical pages as a single logical DMA transfer
- Essential when your buffer spans multiple pages (common for large allocations)
- The DMA controller handles the page boundaries transparently
4. DMA in Practice
SPI Subsystem: Automatic DMA
When you call spi_sync() or spi_async() in your driver, the SPI core automatically uses DMA for transfers above the threshold. Driver authors get DMA for free — no DMA API calls needed:
/* Your driver code — no DMA awareness needed */
struct spi_transfer t = {
.rx_buf = buf,
.len = 2400, /* Above threshold → SPI core uses DMA */
};
spi_sync_transfer(spi, &t, 1);
IIO Buffered Mode
The IIO subsystem leverages DMA through the bus subsystem:
Hardware trigger (data-ready IRQ or hrtimer)
│
▼
IIO trigger handler in driver
│
▼
spi_sync() to read sensor data
│
├── < threshold → PIO
└── ≥ threshold → DMA (automatic)
│
▼
iio_push_to_buffers_with_timestamp()
│
▼
kfifo ring buffer
│
▼
Userspace reads /dev/iio:deviceN
Framebuffer: DMA vs Deferred I/O
| Display type | Data path | DMA? |
|---|---|---|
| SPI TFT (ILI9341, ST7789) | SPI DMA for bulk pixel writes | Yes — large transfers |
| SPI OLED (SSD1306) | SPI DMA possible but small (1 KB) | Marginal benefit |
| I2C OLED (SSD1306) | PIO (I2C transfers are small) | No |
| HDMI / DSI | GPU DMA scanout | Yes — dedicated path |
For the SSD1306 OLED framebuffer driver: the fbdev deferred I/O mechanism tracks dirty pages via page faults, then flushes changed regions to the display. This is not DMA — it's CPU-driven I2C/SPI writes triggered by page faults. True DMA benefits appear with larger displays (SPI TFTs at 240×320 and up).
When DMA Helps vs When It's Overkill
| Scenario | Transfer size | DMA benefit |
|---|---|---|
| MCP9808 temperature read (I2C) | 2 bytes | None — PIO is faster |
| BMI160 single-axis read (SPI) | 2 bytes | None |
| BMI160 FIFO burst (SPI, 200 samples) | 2400 bytes | Significant CPU savings |
| SSD1306 OLED full-screen write (SPI) | 1024 bytes | Moderate |
| ILI9341 TFT full-screen write (SPI) | 153,600 bytes | Essential |
| SD card block write | 512+ bytes | Always used |
5. Measuring DMA Impact
CPU Load: mpstat
Compare CPU utilization with and without DMA-sized transfers:
# Monitor per-second CPU stats while your application runs
mpstat 1
# Example output during PIO sensor reads at 200 Hz:
# %usr %sys %idle
# 12.3 8.7 79.0
# Same workload with DMA (FIFO burst reads):
# %usr %sys %idle
# 12.1 1.2 86.7
The %sys column shows kernel CPU time — DMA reduces this significantly for large transfers.
Bus Cycle Counters: perf stat
# Count cycles, instructions, and cache misses for your application
# process while transfers run (whole-process counters, not SPI-specific)
sudo perf stat -e cycles,instructions,cache-misses \
-p $(pidof your_app) -- sleep 5
DMA Channel Tracing: ftrace
# Trace DMA mapping events (if your kernel exposes the "dma" trace group)
echo 1 > /sys/kernel/debug/tracing/events/dma/enable
cat /sys/kernel/debug/tracing/trace_pipe
Verifying DMA Is Active
# Check for SPI DMA channel allocation at boot
dmesg | grep -i dma
# Look for: "spi-bcm2835 ... DMA channel ... allocated"
# Check DMA channel usage
cat /sys/class/dma/dma*chan*/in_use
For hands-on measurement comparing PIO vs DMA with the BMI160 IMU, see the SPI DMA Optimization tutorial.