CPU Limits and DMA: When Software Isn't Fast Enough
Advanced Reference | For students who want to understand hardware data transfer
This document explains when the CPU becomes a bottleneck for data transfer and how DMA (Direct Memory Access) solves this problem.
The Problem: CPU as Data Mover
In simple embedded programs, the CPU handles everything:
# CPU reads sensor, stores value, repeats
while True:
    value = sensor.read()   # CPU fetches data
    buffer.append(value)    # CPU stores data
    # ... CPU does this thousands of times
This works fine at low speeds. But what happens when you need to:
- Sample an ADC at 44.1 kHz (audio)?
- Read IMU data at 1000 Hz continuously?
- Stream data to/from an SD card?
The CPU becomes the bottleneck.
Calculating CPU Load for Data Transfer
Example 1: IMU Data Collection
Your robot's BMI160 IMU communicates over I2C. Let's calculate the limits.
I2C Protocol Overhead:
I2C transaction to read 6 bytes (accel X, Y, Z):
┌─────────────────────────────────────────────────────────────┐
│ START │ ADDR+W │ ACK │ REG │ ACK │ RESTART │ ADDR+R │ ACK │
│ 1 │ 8 │ 1 │ 8 │ 1 │ 1 │ 8 │ 1 │
├─────────────────────────────────────────────────────────────┤
│ DATA0 │ ACK │ DATA1 │ ACK │ ... │ DATA5 │ NACK │ STOP │
│ 8 │ 1 │ 8 │ 1 │ ... │ 8 │ 1 │ 1 │
└─────────────────────────────────────────────────────────────┘
Total bits: ~90 bits per 6-byte read
At 400 kHz I2C: 90 / 400,000 = 225 µs hardware time
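The estimate above can be reproduced with a short calculation. This is a minimal sketch that counts the fields in the diagram one by one; the function names are illustrative, and it assumes 7-bit addressing, one bit per clock, and no clock stretching. Counted exactly, the diagram gives 84 bits (210 µs), which the text rounds up to ~90 bits.

```python
def i2c_read_bits(n_data_bytes):
    """Bits on the wire for a register read of n_data_bytes (7-bit addressing)."""
    start = 1
    addr_write = 8 + 1              # address + W bit, then ACK
    register = 8 + 1                # register index, then ACK
    restart = 1
    addr_read = 8 + 1               # address + R bit, then ACK
    data = n_data_bytes * (8 + 1)   # each byte is followed by ACK (NACK on the last)
    stop = 1
    return start + addr_write + register + restart + addr_read + data + stop


def i2c_read_time_us(n_data_bytes, bus_hz):
    """Minimum wire time in microseconds, ignoring clock stretching."""
    return i2c_read_bits(n_data_bytes) * 1_000_000 / bus_hz


print(i2c_read_bits(6))               # 84
print(i2c_read_time_us(6, 400_000))   # 210.0
```

Changing `n_data_bytes` to 12 (accelerometer plus gyroscope) shows how quickly the wire time grows with payload size.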
MicroPython Overhead:
# Measured on RP2350 with MicroPython
import time
from machine import I2C, Pin
# GP14 (SDA) and GP15 (SCL) are routed to the I2C1 block, so the bus id is 1
i2c = I2C(1, scl=Pin(15), sda=Pin(14), freq=400000)
# Measure single read
start = time.ticks_us()
for _ in range(1000):
    data = i2c.readfrom_mem(0x68, 0x12, 6)
elapsed = time.ticks_diff(time.ticks_us(), start)
print(f"Per read: {elapsed/1000:.1f} µs")
# Result: ~450-550 µs per read (interpreter overhead!)
CPU Load Calculation (assuming ~500 µs per read):
| Sample Rate | CPU Time per Second | CPU Load | Feasible? |
|---|---|---|---|
| 100 Hz | 50 ms | 5% | ✅ Easy |
| 200 Hz | 100 ms | 10% | ✅ Good |
| 500 Hz | 250 ms | 25% | ⚠️ Tight |
| 1000 Hz | 500 ms | 50% | ⚠️ Marginal |
| 2000 Hz | 1000 ms | 100% | ❌ Impossible |
Key insight: At 1000 Hz, the CPU spends about 50% of its time just moving IMU data, leaving only half of each second for control calculations and everything else. At 2000 Hz there is no time left at all.
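The table's numbers follow directly from multiplying the sample rate by the time per read. Here is a small sketch of that arithmetic, assuming 500 µs per read (the midpoint of the measured range); `cpu_load` and `TIME_PER_READ_US` are illustrative names, not part of any library.

```python
def cpu_load(sample_rate_hz, time_per_sample_us):
    """Fraction of each second spent on transfers (1.0 means fully busy)."""
    return sample_rate_hz * time_per_sample_us / 1_000_000


TIME_PER_READ_US = 500  # assumed midpoint of the measured 450-550 µs

for rate in (100, 200, 500, 1000, 2000):
    busy_ms = rate * TIME_PER_READ_US / 1000
    load = cpu_load(rate, TIME_PER_READ_US)
    print(f"{rate:>5} Hz  {busy_ms:>5.0f} ms  {load:.0%}")

# Highest rate at which the CPU can still keep up (100% load)
max_rate_hz = 1_000_000 / TIME_PER_READ_US
print(max_rate_hz)  # 2000.0
```

Plugging in your own measured time per read gives a quick feasibility check before you commit to a sample rate.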
Example 2: ADC Sampling
The RP2350's ADC can sample at 500 kHz in hardware. But MicroPython can't keep up:
from machine import ADC
import time
adc = ADC(26)
# Measure ADC read speed
start = time.ticks_us()
for _ in range(10000):
    val = adc.read_u16()
elapsed = time.ticks_diff(time.ticks_us(), start)
print(f"Per read: {elapsed/10000:.1f} µs")
# Result: ~25-50 µs per read
| Application | Required Rate | Python Achievable | Gap |
|---|---|---|---|
| Line sensor | 100 Hz | ✅ 20,000 Hz | OK |
| Battery monitor | 10 Hz | ✅ 20,000 Hz | OK |
| Audio (mono) | 44,100 Hz | ❌ 20,000 Hz | 2× short |
| Audio (stereo) | 88,200 Hz | ❌ 20,000 Hz | 4× short |
| Oscilloscope | 1 MHz | ❌ 20,000 Hz | 50× short |
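The "Gap" column can be derived from the measured read time. The sketch below assumes the worst case of 50 µs per read from the measurement above; the helper names and the `apps` dictionary are made up for illustration.

```python
def achievable_rate_hz(time_per_read_us):
    """Highest polling rate if the CPU does nothing but read."""
    return 1_000_000 / time_per_read_us


def shortfall(required_hz, achievable_hz):
    """Ratio of required to achievable rate; values above 1 mean too slow."""
    return required_hz / achievable_hz


PYTHON_RATE = achievable_rate_hz(50)  # worst case of the measured 25-50 µs

apps = {
    "Line sensor": 100,
    "Battery monitor": 10,
    "Audio (mono)": 44_100,
    "Audio (stereo)": 88_200,
    "Oscilloscope": 1_000_000,
}

for name, required in apps.items():
    ratio = shortfall(required, PYTHON_RATE)
    verdict = "OK" if ratio <= 1 else f"{ratio:.1f}x short"
    print(f"{name:<16} {required:>9} Hz  {verdict}")
```

The table rounds these ratios down to whole numbers (2.2 becomes 2×, 4.4 becomes 4×).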
The Solution: DMA (Direct Memory Access)
DMA is a hardware peripheral that moves data without CPU involvement.
┌─────────────────────────────────────────────────────────────┐
│ WITHOUT DMA │
├─────────────────────────────────────────────────────────────┤
│ │
│ ADC ──► CPU ──► RAM │
│ │ │
│ └── CPU busy for EVERY sample │
│ │
├─────────────────────────────────────────────────────────────┤
│ WITH DMA │
├─────────────────────────────────────────────────────────────┤
│ │
│ ADC ──────────► DMA ──────────► RAM │
│ │ │
│ CPU: [other work] │ [other work] │ [process batch] │
│ │ │
│ └── DMA handles transfer, CPU is FREE │
│ │
└─────────────────────────────────────────────────────────────┘
How DMA Works
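- Configure: Tell the DMA channel where to read from (peripheral), where to write (RAM), the size of each transfer (byte, halfword, or word), and how many transfers to make
- Start: Trigger the DMA channel
- Transfer: DMA moves one element each time the peripheral signals that data is ready, with no CPU instructions involved
- Interrupt: DMA signals the CPU when done (optional)
- Process: CPU processes the collected batch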
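// Conceptual C code using the Pico SDK (not MicroPython).
// ADC setup (adc_init, adc_fifo_setup with DREQ enabled, adc_run) is omitted.
#include "hardware/adc.h"
#include "hardware/dma.h"

uint16_t buffer[1000];

// Claim a free channel and describe the transfer
int dma_chan = dma_claim_unused_channel(true);
dma_channel_config cfg = dma_channel_get_default_config(dma_chan);
channel_config_set_transfer_data_size(&cfg, DMA_SIZE_16);  // 16-bit FIFO entries
channel_config_set_read_increment(&cfg, false);  // always read the FIFO register
channel_config_set_write_increment(&cfg, true);  // advance through buffer
channel_config_set_dreq(&cfg, DREQ_ADC);         // one transfer per ADC sample

// Configure DMA to read 1000 ADC samples into buffer
dma_channel_configure(
    dma_chan,
    &cfg,
    buffer,          // Write to this RAM address
    &adc_hw->fifo,   // Read from ADC FIFO
    1000,            // Transfer 1000 samples
    true             // Start immediately
);
// CPU is now FREE while DMA collects samples
do_other_work();
// Wait for DMA to finish
dma_channel_wait_for_finish_blocking(dma_chan);
// Now process the 1000 samples
process_buffer(buffer, 1000);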
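The same five steps can also be modeled in plain Python to make the sequence concrete. This is only a toy simulation: `ToyDMAChannel` and `make_fake_adc` are hypothetical names invented here, nothing runs in parallel, and real DMA is done by hardware rather than by calling `tick()` in a loop.

```python
class ToyDMAChannel:
    """Pure-Python stand-in for a DMA channel (illustration only)."""

    def __init__(self):
        self.source = None
        self.dest = None
        self.remaining = 0
        self.index = 0
        self.active = False
        self.done = False

    def configure(self, source, dest, count):
        # Step 1: where to read, where to write, how many transfers
        self.source = source
        self.dest = dest
        self.remaining = count
        self.index = 0
        self.done = False

    def start(self):
        # Step 2: trigger the channel
        self.active = True

    def tick(self):
        # Step 3: one transfer per call; hardware does this on each data request
        if not self.active or self.remaining == 0:
            return
        self.dest[self.index] = self.source()
        self.index += 1
        self.remaining -= 1
        if self.remaining == 0:
            # Step 4: completion flag (an interrupt on real hardware)
            self.active = False
            self.done = True


def make_fake_adc(step=7):
    """Return a callable that pretends to be an ADC FIFO register."""
    value = 0

    def read():
        nonlocal value
        value = (value + step) % 4096
        return value

    return read


buffer = [0] * 8
chan = ToyDMAChannel()
chan.configure(make_fake_adc(), buffer, len(buffer))
chan.start()

other_work_done = 0
while not chan.done:
    chan.tick()            # on real hardware this happens in the background
    other_work_done += 1   # stands in for useful CPU work

# Step 5: process the collected batch
average = sum(buffer) / len(buffer)
print(buffer, average)
```

The point of the model is the division of labor: the main loop never touches individual samples, it only sees the finished buffer.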
DMA Advantages
| Aspect | CPU Polling | DMA |
|---|---|---|
| CPU load during transfer | 100% | ~0% |
| Maximum throughput | Limited by software | Limited by hardware |
| Timing jitter | High (interrupts, GC) | Near zero (paced by hardware) |
| Power consumption | Higher | Lower |
| Complexity | Simple code | Requires configuration |
When to Use DMA
Decision Framework
┌─────────────────────────────────────────────────────────────┐
│ DO I NEED DMA? │
├─────────────────────────────────────────────────────────────┤
│ │
│ Sample rate × Transfer time > 20% CPU? │
│ │ │
│ ┌────────┴────────┐ │
│ │ │ │
│ YES NO │
│ │ │ │
│ ▼ ▼ │
│ Consider DMA CPU polling OK │
│ │
│ Also consider DMA if: │
│ • Timing jitter is unacceptable │
│ • Need deterministic latency │
│ • Power consumption matters │
│ • Continuous streaming required │
│ │
└─────────────────────────────────────────────────────────────┘
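The flowchart can be written as a small helper. This is a sketch under the document's own rule of thumb (20% load threshold plus the four extra criteria); the function name, parameter names, and return shape are assumptions made for illustration.

```python
def should_consider_dma(
    sample_rate_hz,
    time_per_sample_us,
    *,
    jitter_sensitive=False,
    needs_deterministic_latency=False,
    power_constrained=False,
    continuous_streaming=False,
    load_threshold=0.20,
):
    """Return (decision, reasons) following the decision framework."""
    load = sample_rate_hz * time_per_sample_us / 1_000_000
    reasons = []
    if load > load_threshold:
        reasons.append(f"CPU load {load:.0%} exceeds {load_threshold:.0%}")
    if jitter_sensitive:
        reasons.append("timing jitter is unacceptable")
    if needs_deterministic_latency:
        reasons.append("deterministic latency is needed")
    if power_constrained:
        reasons.append("power consumption matters")
    if continuous_streaming:
        reasons.append("continuous streaming is required")
    return bool(reasons), reasons


# IMU at 100 Hz with ~500 µs per read: polling is fine
print(should_consider_dma(100, 500))

# IMU at 1000 Hz: half the CPU is gone, so DMA is worth considering
print(should_consider_dma(1000, 500))

# Slow sensor, but the data must stream without gaps
print(should_consider_dma(10, 50, continuous_streaming=True))
```

Returning the reasons alongside the decision makes it easier to see which criterion tipped the balance.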
Common DMA Use Cases
| Application | Without DMA | With DMA |
|---|---|---|
| Audio playback | Glitchy, drops samples | Smooth, continuous |
| High-speed ADC | Limited to ~20 kHz | Up to 500 kHz |
| SD card access | Slow, blocks CPU | Fast, background |
| Display update | Slow refresh, tearing | Fast, tear-free |
| Motor encoder | May miss pulses | Captures all edges (PIO + DMA) |
DMA on the RP2350
The RP2350 has 16 DMA channels. Each channel can:
- Read from any peripheral or memory
- Write to any peripheral or memory
- Transfer up to 4 GB (practically limited by RAM)
- Chain to other channels for complex transfers
- Trigger interrupts on completion
RP2350 DMA-Capable Peripherals
| Peripheral | DMA Read | DMA Write | Common Use |
|---|---|---|---|
| ADC | ✅ | — | High-speed sampling |
| SPI | ✅ | ✅ | SD cards, displays |
| I2C | ✅ | ✅ | Bulk sensor reads |
| UART | ✅ | ✅ | Serial streaming |
| PIO | ✅ | ✅ | Custom protocols |
| PWM | — | ✅ | Waveform generation |
MicroPython and DMA
MicroPython uses DMA internally for some operations:
| Operation | Uses DMA? | Notes |
|---|---|---|
| neopixel.write() | ✅ Yes | PIO + DMA |
| spi.write() | ✅ Yes | For large transfers |
| i2c.readfrom() | ❌ No | CPU polling |
| adc.read_u16() | ❌ No | Single sample |
| machine.I2S | ✅ Yes | Audio streaming |
To use DMA directly, you need one of:
- C/C++ with the Pico SDK
- Custom MicroPython modules
- PIO programs (which can use DMA)
- The rp2.DMA class that recent MicroPython releases provide for this port
Practical Limits for This Course
For ES101, we stay within Python's limits:
| Task | Our Rate | Python Limit | Margin |
|---|---|---|---|
| Line sensors | 100 Hz | 2000 Hz | 20× OK |
| IMU reading | 100 Hz | 1000 Hz | 10× OK |
| Ultrasonic | 10 Hz | 100 Hz | 10× OK |
| Battery voltage | 1 Hz | 1000 Hz | 1000× OK |
| Control loop | 100 Hz | 500 Hz | 5× OK |
We don't need DMA because our sample rates are well within Python's capabilities.
But understanding DMA helps you:
1. Know when Python won't be enough
2. Understand what the C SDK provides
3. Make informed decisions in future projects
4. Debug "impossible" timing problems
The Complete Picture
┌─────────────────────────────────────────────────────────────┐
│ DATA TRANSFER HIERARCHY │
├─────────────────────────────────────────────────────────────┤
│ │
│ Speed Method CPU Load Use When │
│ ───────────────────────────────────────────────────────── │
│ Slowest Python polling High Prototyping │
│ │
│ ↓ C polling Medium Simple apps │
│ │
│ ↓ Interrupts Low* Events │
│ (*short ISRs) │
│ │
│ ↓ DMA ~Zero Streaming │
│ │
│ Fastest DMA + PIO ~Zero Custom HW │
│ │
│ Each level trades complexity for performance. │
│ │
└─────────────────────────────────────────────────────────────┘
Summary
CPU load = Sample rate × Time per sample
When this exceeds ~20-30%, consider hardware assistance (DMA).
| Concept | Key Point |
|---|---|
| CPU bottleneck | Software can't keep up with hardware speeds |
| DMA purpose | Move data without CPU involvement |
| When to use | High sample rates, continuous streaming, low jitter |
| In this course | Not needed, but understanding helps design decisions |
Related Content
- Why MicroPython is Slow — Interpreter overhead explained
- Execution Models — Polling, interrupts, and architectural patterns
- Tutorial: Robot Unboxing — Embedded system fundamentals