
DMA Fundamentals

Goal: Understand Direct Memory Access — how the DMA controller moves data between peripherals and memory without CPU involvement, when DMA helps, and how to measure its impact on your embedded Linux system.

Related Tutorials

For hands-on practice, see: SPI DMA Optimization | IIO Buffered Capture


You have an IMU streaming data at 1600 Hz over SPI, a display refreshing at 60 FPS, and a control loop that must run every 5 ms. If the CPU has to move every byte of sensor and display data by hand, it cannot keep up. DMA (Direct Memory Access) lets a dedicated hardware controller move data while the CPU runs your application code.


1. What Is DMA?

In a CPU-driven (PIO) transfer, the CPU reads one byte from a peripheral register, writes it to RAM, repeats — 100% busy for the entire transfer:

CPU-driven (PIO):
  CPU ──read──► SPI DR ──store──► RAM ──read──► SPI DR ──store──► RAM ...
  CPU is busy the entire time. No cycles left for your application.

With DMA, the CPU configures the DMA controller once (source, destination, length) and then is free while the DMA engine moves the data autonomously:

DMA-driven:
  CPU: configure DMA (src=SPI, dst=RAM, len=4096) → done, go do other work
  DMA: SPI DR ──► RAM ──► SPI DR ──► RAM ... (hardware handles it)
  DMA: interrupt → "transfer complete"
  CPU: process the buffer

```mermaid
graph LR
    CPU[CPU] -->|configure| DMA[DMA Controller]
    DMA -->|read| PERIPH[SPI / I2C<br>Peripheral]
    DMA -->|write| MEM[System<br>Memory]
    DMA -->|IRQ: done| CPU
    CPU -->|free during<br>transfer| APP[Application<br>Code]
```

|                           | CPU-Driven (PIO)            | DMA                          |
| ------------------------- | --------------------------- | ---------------------------- |
| CPU usage during transfer | 100%                        | Near 0%                      |
| Throughput                | Limited by CPU speed        | Limited by bus/memory speed  |
| Latency                   | Low (immediate start)       | Higher (DMA setup overhead)  |
| Best for                  | Small transfers (<64 bytes) | Large transfers (>256 bytes) |

2. DMA on the Raspberry Pi

BCM2835/BCM2711 DMA Controllers

The Raspberry Pi's SoC has a multi-channel DMA controller:

| SoC                 | DMA channels | Channel layout                    | Notes                                   |
| ------------------- | ------------ | --------------------------------- | --------------------------------------- |
| BCM2835 (Pi 1/Zero) | 15           | Channels 0–6 (full), 7–14 (lite)  | Lite channels: max 64 KB, no 2D stride  |
| BCM2711 (Pi 4)      | 15 + 4       | Standard + 4 DMA4 channels        | DMA4 supports 40-bit addressing         |

SPI DMA Threshold

The SPI controller driver (spi-bcm2835) automatically switches between PIO and DMA based on transfer size:

Transfer size:
  < 96 bytes  →  PIO (CPU copies bytes directly)
  ≥ 96 bytes  →  DMA (DMA controller moves the data)

This threshold (~96 bytes on BCM2835) exists because DMA setup has overhead — for tiny transfers, PIO is actually faster.

Bus-Specific DMA Usage

| Bus          | Typical DMA usage              | Why                                              |
| ------------ | ------------------------------ | ------------------------------------------------ |
| SPI          | Automatic above threshold      | Large transfers (display, IMU FIFO bursts)       |
| I2C          | Generally PIO                  | Transfers too small (sensor reads are 2–6 bytes) |
| HDMI         | GPU DMA (separate path)        | Continuous scan-out, dedicated DMA channel       |
| SPI displays | SPI DMA for framebuffer writes | ~150 KB per frame at 320×240 RGB565              |
| SD/eMMC      | Always DMA                     | Block transfers are 512+ bytes                   |

3. Linux DMA API (for Driver Authors)

The kernel provides three levels of DMA abstraction (coherent mappings, streaming mappings, and the DMA engine API), plus scatter-gather variants of the streaming calls for non-contiguous buffers. Most driver authors use the first two; the third is used internally by bus subsystems.

Coherent DMA

void *buf;
dma_addr_t dma_handle;

buf = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
/* buf: CPU virtual address
   dma_handle: bus address the DMA controller uses */
  • Kernel and device see the same memory — no cache flush needed
  • Used for long-lived buffers: ring buffers, descriptor tables, command queues
  • Memory is uncacheable → slightly slower CPU access, but always consistent
  • Free with dma_free_coherent(dev, size, buf, dma_handle)

Streaming DMA

dma_addr_t dma_handle;

/* Map an existing buffer for DMA (one-shot) */
dma_handle = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);

/* ... DMA transfer happens ... */

/* Sync before CPU reads the data */
dma_sync_single_for_cpu(dev, dma_handle, size, DMA_FROM_DEVICE);

/* Unmap when done */
dma_unmap_single(dev, dma_handle, size, DMA_FROM_DEVICE);
  • Maps existing buffers for one-shot or short-lived DMA transfers
  • Requires explicit sync (dma_sync_single_for_cpu) before CPU reads DMA'd data
  • Used for per-transfer data: SPI message buffers, network packet buffers
  • Direction: DMA_TO_DEVICE (CPU→peripheral), DMA_FROM_DEVICE (peripheral→CPU), DMA_BIDIRECTIONAL

DMA Engine API

struct dma_chan *chan;
struct dma_async_tx_descriptor *desc;

chan = dma_request_chan(dev, "rx");
desc = dmaengine_prep_slave_single(chan, dma_handle, size,
                                    DMA_DEV_TO_MEM, DMA_PREP_INTERRUPT);
desc->callback = my_dma_complete;
dmaengine_submit(desc);
dma_async_issue_pending(chan);
  • Higher-level abstraction for peripheral DMA (memory-to-peripheral and peripheral-to-memory)
  • Used internally by the SPI, I2C, and UART subsystems
  • As a driver author using SPI/I2C, you get this for free — the bus subsystem handles it

Scatter-Gather

struct scatterlist sg[N_PAGES];
struct page *pages[N_PAGES];   /* filled in earlier, e.g. by alloc_page() */
int i, n_mapped;

sg_init_table(sg, N_PAGES);
for (i = 0; i < N_PAGES; i++)
    sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);

/* Entries may be coalesced; use the returned count, not N_PAGES, afterwards */
n_mapped = dma_map_sg(dev, sg, N_PAGES, DMA_FROM_DEVICE);
  • Maps non-contiguous physical pages as a single logical DMA transfer
  • Essential when your buffer spans multiple pages (common for large allocations)
  • The DMA controller handles the page boundaries transparently

4. DMA in Practice

SPI Subsystem: Automatic DMA

When you call spi_sync() or spi_async() in your driver, the SPI core automatically uses DMA for transfers above the threshold. Driver authors get DMA for free — no DMA API calls needed:

/* Your driver code — no DMA awareness needed */
struct spi_transfer t = {
    .rx_buf = buf,
    .len = 2400,    /* Above threshold → SPI core uses DMA */
};
spi_sync_transfer(spi, &t, 1);

IIO Buffered Mode

The IIO subsystem leverages DMA through the bus subsystem:

Hardware trigger (data-ready IRQ or hrtimer)
  → IIO trigger handler in driver
  → spi_sync() to read sensor data
        ├── < threshold → PIO
        └── ≥ threshold → DMA (automatic)
  → iio_push_to_buffers_with_timestamp()
  → kfifo ring buffer
  → userspace reads /dev/iio:deviceN

Framebuffer: DMA vs Deferred I/O

| Display type              | Data path                        | DMA?                  |
| ------------------------- | -------------------------------- | --------------------- |
| SPI TFT (ILI9341, ST7789) | SPI DMA for bulk pixel writes    | Yes (large transfers) |
| SPI OLED (SSD1306)        | SPI DMA possible but small (1 KB)| Marginal benefit      |
| I2C OLED (SSD1306)        | PIO (I2C transfers are small)    | No                    |
| HDMI / DSI                | GPU DMA scanout                  | Yes (dedicated path)  |

For the SSD1306 OLED framebuffer driver, the fbdev deferred I/O mechanism tracks dirty pages via page faults, then flushes the changed regions to the display. This is not DMA: it is CPU-driven I2C/SPI writes triggered by page faults. True DMA benefits appear with larger displays (SPI TFTs at 320×240 and up).

When DMA Helps vs When It's Overkill

| Scenario                              | Transfer size | DMA benefit             |
| ------------------------------------- | ------------- | ----------------------- |
| MCP9808 temperature read (I2C)        | 2 bytes       | None (PIO is faster)    |
| BMI160 single-axis read (SPI)         | 2 bytes       | None                    |
| BMI160 FIFO burst (SPI, 200 samples)  | 2400 bytes    | Significant CPU savings |
| SSD1306 OLED full-screen write (SPI)  | 1024 bytes    | Moderate                |
| ILI9341 TFT full-screen write (SPI)   | 153,600 bytes | Essential               |
| SD card block write                   | 512+ bytes    | Always used             |

5. Measuring DMA Impact

CPU Load: mpstat

Compare CPU utilization with and without DMA-sized transfers:

# Monitor per-second CPU stats while your application runs
mpstat 1

# Example output during PIO sensor reads at 200 Hz:
#   %usr   %sys   %idle
#   12.3   8.7    79.0

# Same workload with DMA (FIFO burst reads):
#   %usr   %sys   %idle
#   12.1   1.2    86.7

The %sys column shows kernel CPU time — DMA reduces this significantly for large transfers.

Bus Cycle Counters: perf stat

# Count cycles, instructions, and cache misses for your app over 5 seconds
sudo perf stat -e cycles,instructions,cache-misses \
    -p $(pidof your_app) -- sleep 5

DMA Channel Tracing: ftrace

# Trace DMA channel setup and completion
echo 1 > /sys/kernel/debug/tracing/events/dma/enable
cat /sys/kernel/debug/tracing/trace_pipe

Verifying DMA Is Active

# Check for SPI DMA channel allocation at boot
dmesg | grep -i dma
# Look for: "spi-bcm2835 ... DMA channel ... allocated"

# Check DMA channel usage
cat /sys/class/dma/dma*chan*/in_use

For hands-on measurement comparing PIO vs DMA with the BMI160 IMU, see the SPI DMA Optimization tutorial.


Course Overview | Reference Index