CPU Limits and DMA: When Software Isn't Fast Enough

Advanced Reference | For students who want to understand hardware data transfer

This document explains when the CPU becomes a bottleneck for data transfer and how DMA (Direct Memory Access) solves this problem.

The Problem: CPU as Data Mover

In simple embedded programs, the CPU handles everything:

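# CPU reads sensor, stores value, repeats
from machine import ADC

sensor = ADC(26)               # e.g. an analog sensor on GP26
buffer = []
for _ in range(10000):         # ... CPU does this thousands of times
    value = sensor.read_u16()  # CPU fetches data
    buffer.append(value)       # CPU stores data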

This works fine at low speeds. But what happens when you need to:
- Sample an ADC at 44.1 kHz (audio)?
- Read IMU data at 1000 Hz continuously?
- Stream data to/from an SD card?

The CPU becomes the bottleneck.

Calculating CPU Load for Data Transfer

Example 1: IMU Data Collection

Your robot's BMI160 IMU communicates over I2C. Let's calculate the limits.

I2C Protocol Overhead:

I2C transaction to read 6 bytes (accel X, Y, Z):
┌─────────────────────────────────────────────────────────────┐
│  START │ ADDR+W │ ACK │ REG │ ACK │ RESTART │ ADDR+R │ ACK │
│   1    │   8    │  1  │  8  │  1  │    1    │   8    │  1  │
├─────────────────────────────────────────────────────────────┤
│  DATA0 │ ACK │ DATA1 │ ACK │ ... │ DATA5 │ NACK │ STOP    │
│   8    │  1  │   8   │  1  │ ... │   8   │   1  │   1     │
└─────────────────────────────────────────────────────────────┘

Total bits: 84 bit periods from the fields above (~90 with some margin) per 6-byte read
At 400 kHz I2C: 84-90 bits / 400,000 bit/s ≈ 210-225 µs of bus time
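As a check on the arithmetic, this short Python sketch tallies the fields from the diagram and converts bit periods into bus time. Like the diagram, it counts START, RESTART, and STOP as one bit period each. The function names are illustrative, not from any library, and the code runs on desktop Python.

```python
def i2c_register_read_bits(n_bytes):
    """Bit periods for a 'write register address, then read n_bytes' I2C transaction."""
    # Address phase: START, ADDR+W, ACK, REG, ACK, RESTART, ADDR+R, ACK
    header = 1 + 8 + 1 + 8 + 1 + 1 + 8 + 1
    # Data phase: each byte is 8 bits plus an ACK, except the last one gets a NACK
    data = n_bytes * (8 + 1)
    stop = 1
    return header + data + stop


def bus_time_us(bits, clock_hz):
    """Time on the wire in microseconds at a given I2C clock rate."""
    return bits / clock_hz * 1e6


bits = i2c_register_read_bits(6)
print(bits)
print(round(bus_time_us(bits, 400_000)))
print(round(bus_time_us(90, 400_000)))
```

The diagram's fields come to 84 bit periods, so ~90 is a slightly generous round-up; either way the hardware needs roughly 210-225 µs per read.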

MicroPython Overhead:

# Measured on RP2350 with MicroPython
import time
from machine import I2C, Pin

i2c = I2C(0, scl=Pin(15), sda=Pin(14), freq=400000)

# Measure single read
start = time.ticks_us()
for _ in range(1000):
    data = i2c.readfrom_mem(0x68, 0x12, 6)  # BMI160 at 0x68; accel X/Y/Z data starts at 0x12
elapsed = time.ticks_diff(time.ticks_us(), start)

print(f"Per read: {elapsed/1000:.1f} µs")
# Result: ~450-550 µs per read, roughly twice the ~210-225 µs bus time (interpreter overhead!)

CPU Load Calculation:

Sample Rate	I2C Time per Second (at ~500 µs/read)	CPU Load	Feasible?
100 Hz	50 ms	5%	✅ Easy
200 Hz	100 ms	10%	✅ Good
500 Hz	250 ms	25%	⚠️ Tight
1000 Hz	500 ms	50%	⚠️ Marginal
2000 Hz	1000 ms	100%	❌ Impossible

Key insight: At 1000 Hz, the CPU spends 50% of its time just moving IMU data, leaving only half of each second for control calculations, communication, and everything else the robot has to do.
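The table follows directly from CPU load = sample rate × time per sample. A minimal desktop-Python sketch that regenerates it, assuming the ~500 µs midpoint of the measured 450-550 µs per read (the helper names are hypothetical):

```python
READ_TIME_US = 500  # midpoint of the measured ~450-550 us per readfrom_mem() call


def cpu_load(sample_rate_hz, time_per_sample_us):
    """Fraction of each second the CPU spends moving data."""
    return sample_rate_hz * time_per_sample_us / 1_000_000


def max_rate_hz(budget_fraction, time_per_sample_us):
    """Highest sample rate that keeps the load within the budget."""
    return budget_fraction * 1_000_000 / time_per_sample_us


for rate in (100, 200, 500, 1000, 2000):
    busy_ms = rate * READ_TIME_US / 1000   # microseconds per second -> milliseconds
    print(f"{rate:>5} Hz: {busy_ms:5.0f} ms/s busy, {cpu_load(rate, READ_TIME_US):4.0%} load")
```

Turning the formula around, `max_rate_hz(0.20, 500)` gives 400 Hz: the highest IMU rate that keeps this loop within a 20% budget.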

Example 2: ADC Sampling

The RP2350's ADC can sample at 500 kHz in hardware. But MicroPython can't keep up:

from machine import ADC
import time

adc = ADC(26)

# Measure ADC read speed
start = time.ticks_us()
for _ in range(10000):
    val = adc.read_u16()
elapsed = time.ticks_diff(time.ticks_us(), start)

print(f"Per read: {elapsed/10000:.1f} µs")
# Result: ~25-50 µs per read

Application	Required Rate	Python Achievable (at ~50 µs/read)	Gap
Line sensor	100 Hz	✅ 20,000 Hz	OK
Battery monitor	10 Hz	✅ 20,000 Hz	OK
Audio (mono)	44,100 Hz	❌ 20,000 Hz	~2× short
Audio (stereo)	88,200 Hz	❌ 20,000 Hz	~4× short
Oscilloscope	1 MHz	❌ 20,000 Hz	50× short (also above the ADC's 500 kHz hardware limit)
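The per-sample time budget is 1 / sample rate, which makes the gap easy to see: mono audio allows about 22.7 µs per sample, less than even the best-case 25 µs read. This sketch assumes the worst-case 50 µs read from the measurement above and the 500 kHz ADC hardware limit; `analyse` is a hypothetical helper name.

```python
ADC_READ_US = 50            # worst case of the measured ~25-50 us per read_u16() call
ADC_HW_MAX_HZ = 500_000     # RP2350 ADC hardware sample rate limit
PYTHON_MAX_HZ = 1_000_000 / ADC_READ_US   # 20,000 Hz if the CPU does nothing else


def analyse(rate_hz):
    """Return (time budget per sample in us, shortfall factor, ADC hardware can keep up)."""
    budget_us = 1_000_000 / rate_hz
    gap = rate_hz / PYTHON_MAX_HZ
    hw_ok = rate_hz <= ADC_HW_MAX_HZ
    return budget_us, gap, hw_ok


for name, rate in [("Audio (mono)", 44_100),
                   ("Audio (stereo)", 88_200),
                   ("Oscilloscope", 1_000_000)]:
    budget_us, gap, hw_ok = analyse(rate)
    print(f"{name}: {budget_us:.1f} us per sample, {gap:.1f}x short, ADC can keep up: {hw_ok}")
```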

The Solution: DMA (Direct Memory Access)

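DMA is a hardware peripheral that moves data between peripherals and memory without the CPU handling each item: the CPU sets up the transfer once, then the DMA engine does the copying.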

┌─────────────────────────────────────────────────────────────┐
│  WITHOUT DMA                                                 │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   ADC ──► CPU ──► RAM                                        │
│           │                                                  │
│           └── CPU busy for EVERY sample                      │
│                                                              │
├─────────────────────────────────────────────────────────────┤
│  WITH DMA                                                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   ADC ──────────► DMA ──────────► RAM                        │
│                    │                                         │
│   CPU: [other work] │ [other work] │ [process batch]        │
│                    │                                         │
│                    └── DMA handles transfer, CPU is FREE     │
│                                                              │
└─────────────────────────────────────────────────────────────┘

How DMA Works

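1. Configure: Tell DMA where to read from (e.g. a peripheral's data register), where to write (a RAM buffer), how many transfers to make, and how wide each transfer is (8, 16, or 32 bits)
2. Start: Trigger the DMA channel
3. Transfer: DMA moves one item per transfer automatically, paced by the peripheral's data-request (DREQ) signal
4. Interrupt: DMA signals the CPU when done (optional)
5. Process: CPU processes the collected batch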

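// Conceptual C code using the Pico SDK (not MicroPython)
// Needs hardware/adc.h and hardware/dma.h; runs inside a function such as main()
// Configure DMA to read 1000 ADC samples into buffer
static uint16_t buffer[1000];

adc_init();
adc_gpio_init(26);                              // GP26 = ADC input 0
adc_select_input(0);
adc_fifo_setup(true, true, 1, false, false);    // FIFO on, DREQ on, request per sample

int dma_chan = dma_claim_unused_channel(true);
dma_channel_config cfg = dma_channel_get_default_config(dma_chan);
channel_config_set_transfer_data_size(&cfg, DMA_SIZE_16);  // 16-bit samples
channel_config_set_read_increment(&cfg, false);  // always read the same FIFO register
channel_config_set_write_increment(&cfg, true);  // step through buffer
channel_config_set_dreq(&cfg, DREQ_ADC);         // ADC paces each transfer

dma_channel_configure(
    dma_chan,
    &cfg,
    buffer,          // Write to this RAM address
    &adc_hw->fifo,   // Read from ADC FIFO
    1000,            // Transfer 1000 samples
    true             // Start immediately
);
adc_run(true);       // Start free-running conversions

// CPU is now FREE while DMA collects samples
do_other_work();     // placeholder for your own code

// Wait for DMA to finish, then stop the ADC
dma_channel_wait_for_finish_blocking(dma_chan);
adc_run(false);

// Now process the 1000 samples
process_buffer(buffer, 1000);   // placeholder for your own code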
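To make the configure, start, transfer, and interrupt steps concrete without hardware, here is a toy Python model of one channel. It is a teaching sketch, not the RP2350's register interface: `ToyDmaChannel`, the dict standing in for memory, and the addresses are all hypothetical, and addresses step by one slot rather than by the transfer size in bytes.

```python
class ToyDmaChannel:
    """Software model of one DMA channel (teaching model, not real hardware)."""

    def __init__(self, memory):
        self.memory = memory   # dict: address -> value, stands in for the system bus
        self.busy = False
        self.remaining = 0

    def configure(self, read_addr, write_addr, count, inc_read, inc_write):
        # Step 1 (configure): source, destination, number of transfers, address stepping
        self.read_addr = read_addr
        self.write_addr = write_addr
        self.remaining = count
        self.inc_read = inc_read
        self.inc_write = inc_write

    def start(self):
        # Step 2 (start): arm the channel
        self.busy = self.remaining > 0

    def on_dreq(self):
        # Step 3 (transfer): the peripheral signals "data ready"; move exactly one item
        if not self.busy:
            return
        self.memory[self.write_addr] = self.memory[self.read_addr]
        if self.inc_read:
            self.read_addr += 1
        if self.inc_write:
            self.write_addr += 1
        self.remaining -= 1
        if self.remaining == 0:
            # Step 4 (interrupt): real hardware would raise a completion IRQ here
            self.busy = False


FIFO = 0x100   # hypothetical address of an ADC FIFO register
BUF = 0x200    # hypothetical start of a RAM buffer
memory = {}
dma = ToyDmaChannel(memory)
dma.configure(read_addr=FIFO, write_addr=BUF, count=4, inc_read=False, inc_write=True)
dma.start()

for sample in (10, 20, 30, 40, 50):
    memory[FIFO] = sample   # the "ADC" finishes a new conversion
    dma.on_dreq()           # hardware pacing: one transfer per conversion

print([memory[BUF + i] for i in range(4)])   # Step 5: the CPU processes the batch
```

The fifth sample is never copied: once the count reaches zero the channel stops, which is why a channel has to be re-armed (or chained to another) to keep streaming.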

DMA Advantages

Aspect	CPU Polling	DMA
CPU load during transfer	100%	~0%
Maximum throughput	Limited by software	Limited by hardware
Timing jitter	High (interrupts, GC)	Very low (sample timing set by hardware)
Power consumption	Higher	Lower
Complexity	Simple code	Requires configuration

When to Use DMA

Decision Framework

┌─────────────────────────────────────────────────────────────┐
│  DO I NEED DMA?                                              │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   Sample rate × Transfer time > 20% CPU?                     │
│                    │                                         │
│          ┌────────┴────────┐                                │
│          │                 │                                 │
│         YES               NO                                 │
│          │                 │                                 │
│          ▼                 ▼                                 │
│   Consider DMA       CPU polling OK                          │
│                                                              │
│   Also consider DMA if:                                      │
│   • Timing jitter is unacceptable                           │
│   • Need deterministic latency                              │
│   • Power consumption matters                               │
│   • Continuous streaming required                           │
│                                                              │
└─────────────────────────────────────────────────────────────┘
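The flowchart can be written as a small function. This is a sketch of the decision rule above, not a standard tool: the 20% threshold comes from the flowchart, the flag names are hypothetical labels for the four extra criteria, and the 500 µs figure in the usage lines is the measured IMU read cost from earlier.

```python
def dma_advice(sample_rate_hz, time_per_sample_us, threshold=0.20,
               needs_low_jitter=False, needs_deterministic_latency=False,
               power_sensitive=False, continuous_stream=False):
    """Apply the 'Do I need DMA?' flowchart; returns (verdict, reasons)."""
    load = sample_rate_hz * time_per_sample_us / 1_000_000
    reasons = []
    if load > threshold:
        reasons.append(f"CPU load {load:.0%} exceeds {threshold:.0%}")
    # The "Also consider DMA if" criteria, each as a yes/no flag
    extra = {
        "timing jitter is unacceptable": needs_low_jitter,
        "deterministic latency needed": needs_deterministic_latency,
        "power consumption matters": power_sensitive,
        "continuous streaming required": continuous_stream,
    }
    reasons.extend(reason for reason, flag in extra.items() if flag)
    verdict = "Consider DMA" if reasons else "CPU polling OK"
    return verdict, reasons


print(dma_advice(100, 500))     # IMU at 100 Hz with ~500 us reads
print(dma_advice(1000, 500))    # IMU at 1000 Hz
print(dma_advice(44_100, 50, continuous_stream=True))  # audio via Python ADC reads
```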

Common DMA Use Cases

Application	Without DMA	With DMA
Audio playback	Glitchy, drops samples	Smooth, continuous
High-speed ADC	Limited to ~20 kHz	Up to 500 kHz
SD card access	Slow, blocks CPU	Fast, background
Display update	Slow refresh, blocks CPU	Fast refresh in the background
Motor encoder	May miss pulses	Captures all edges (with PIO)

DMA on the RP2350

The RP2350 has 16 DMA channels. Each channel can:
- Read from any peripheral or memory
- Write to any peripheral or memory
- Move far more data in one configuration than the chip's RAM can hold (so in practice RAM is the limit)
- Chain to other channels for complex transfers
- Trigger interrupts on completion
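Chaining is what makes continuous streaming possible. One common pattern, often called ping-pong or double buffering, uses two channels that chain to each other, so DMA fills one buffer while the CPU processes the other. The CPU's deadline is the time it takes to fill one buffer. This sketch computes that deadline and the RAM cost, assuming 16-bit samples; the function names are hypothetical.

```python
def buffer_fill_ms(samples_per_buffer, sample_rate_hz):
    """Time for DMA to fill one buffer; the CPU must finish the other buffer by then."""
    return samples_per_buffer / sample_rate_hz * 1000


def ping_pong_bytes(samples_per_buffer, bytes_per_sample=2):
    """RAM for two buffers: one being filled by DMA, one being processed by the CPU."""
    return 2 * samples_per_buffer * bytes_per_sample


for rate in (44_100, 500_000):
    for n in (256, 1024):
        print(f"{rate} Hz, {n}-sample buffers: "
              f"{buffer_fill_ms(n, rate):.2f} ms deadline, "
              f"{ping_pong_bytes(n)} bytes of RAM")
```

At 500 kHz a 1024-sample buffer fills in about 2 ms, so whatever processing the CPU does per buffer has to finish inside that window.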

RP2350 DMA-Capable Peripherals

Peripheral	DMA Read	DMA Write	Common Use
ADC	✅	—	High-speed sampling
SPI	✅	✅	SD cards, displays
I2C	✅	✅	Bulk sensor reads
UART	✅	✅	Serial streaming
PIO	✅	✅	Custom protocols
PWM	—	✅	Waveform generation

MicroPython and DMA

MicroPython uses DMA internally for some operations:

Operation	Uses DMA?	Notes
`neopixel.write()`	❌ No	CPU-timed `machine.bitstream`
`spi.write()`	✅ Yes	For large transfers
`i2c.readfrom()`	❌ No	CPU polling
`adc.read_u16()`	❌ No	Single sample
`machine.I2S`	✅ Yes	Audio streaming

To use DMA directly, you need one of:
- C/C++ with the Pico SDK
- The `rp2.DMA` class in recent MicroPython firmware
- Custom MicroPython modules written in C
- PIO programs whose FIFOs a DMA channel feeds or drains

Practical Limits for This Course

For ES101, we stay within Python's limits:

Task	Our Rate	Python Limit	Margin
Line sensors	100 Hz	2000 Hz	20× OK
IMU reading	100 Hz	1000 Hz	10× OK
Ultrasonic	10 Hz	100 Hz	10× OK
Battery voltage	1 Hz	1000 Hz	1000× OK
Control loop	100 Hz	500 Hz	5× OK

We don't need DMA: every task above runs at least 5× below its Python limit.

But understanding DMA helps you:
1. Know when Python won't be enough
2. Understand what the C SDK provides
3. Make informed decisions in future projects
4. Debug "impossible" timing problems

The Complete Picture

┌─────────────────────────────────────────────────────────────┐
│  DATA TRANSFER HIERARCHY                                     │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   Speed        Method              CPU Load    Use When      │
│   ─────────────────────────────────────────────────────────  │
│   Slowest      Python polling      High        Prototyping   │
│                                                              │
│   ↓            C polling           Medium      Simple apps   │
│                                                              │
│   ↓            Interrupts          Low*        Events        │
│                                               (*short ISRs)  │
│                                                              │
│   ↓            DMA                 ~Zero       Streaming     │
│                                                              │
│   Fastest      DMA + PIO           ~Zero       Custom HW     │
│                                                              │
│   Each level trades complexity for performance.              │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Summary

CPU load = Sample rate × Time per sample

When this exceeds ~20-30%, consider hardware assistance (DMA).

Concept	Key Point
CPU bottleneck	Software can't keep up with hardware speeds
DMA purpose	Move data without CPU involvement
When to use	High sample rates, continuous streaming, low jitter
In this course	Not needed, but understanding helps design decisions

Why MicroPython is Slow — Interpreter overhead explained
Execution Models — Polling, interrupts, and architectural patterns
Tutorial: Robot Unboxing — Embedded system fundamentals