Why MicroPython is Slow: From Silicon to Interpreter
Advanced Reference | For students who want to understand the full technical picture
This document traces a simple operation (x += 1) from Python source code down to ARM assembly, explaining exactly why interpreted Python is ~200× slower than compiled C.
The Question
"I know C. I've used Arduino. Why does MicroPython feel so sluggish?"
The answer is architectural, not "bad design." Let's trace what actually happens.
Part 1: C on Cortex-M33 (What Really Happens)
We'll use the ARM Cortex-M33 (the CPU in the RP2350/Pico 2) as our reference.
Example: Increment a Global Variable
Case A: Variable in Register (Best Case)
If the compiler keeps x in a register (e.g., r0):

adds r0, r0, #1 ; Increment r0 in place

ONE instruction. On Cortex-M33:
- Executes in 1 CPU cycle
- No memory access
- Deterministic timing
Case B: Variable in RAM (Realistic Case)
For global or volatile variables (the common embedded case):
ldr r0, [r1] ; Load x from memory into r0
adds r0, r0, #1 ; Increment r0
str r0, [r1] ; Store r0 back to memory
3 instructions. Timing (approximate):
| Instruction | Cycles |
|---|---|
| ldr | 1-2 |
| adds | 1 |
| str | 1-2 |
| Total | 3-5 cycles |
At 150 MHz: ~20-30 nanoseconds
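The arithmetic behind that estimate, as a quick sanity check (cycle counts are the approximations from the table above):

```python
# Clock period at 150 MHz
period_ns = 1e9 / 150e6        # ≈ 6.67 ns per cycle

# Best and worst case from the table: 3-5 cycles
best_ns = 3 * period_ns        # ≈ 20 ns
worst_ns = 5 * period_ns       # ≈ 33 ns
print(f"{best_ns:.0f}-{worst_ns:.0f} ns")
```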
Why Is C So Fast?
The compiler resolved everything at compile time:
| Question | Answered When? |
|---|---|
| What type is x? | Compile time |
| Where is x stored? | Compile time |
| How many bytes is x? | Compile time |
| What instruction increments it? | Compile time |
The CPU executes pure intent. No decisions at runtime.
Part 2: MicroPython Execution Pipeline
When you write x += 1 in Python, it goes through multiple stages:
┌─────────────────────────────────────────────────────────────┐
│ MICROPYTHON EXECUTION PIPELINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ Python source code │
│ ↓ │
│ Parser (tokenize, build AST) │
│ ↓ │
│ Compiler (generate bytecode) │
│ ↓ │
│ Bytecode (stored in memory) │
│ ↓ │
│ Interpreter loop (C code) ←── THIS IS WHERE TIME GOES │
│ ↓ │
│ Runtime object system │
│ ↓ │
│ ARM machine code │
│ ↓ │
│ CPU execution │
│ │
└─────────────────────────────────────────────────────────────┘
Part 3: MicroPython Bytecode
MicroPython compiles x += 1 into bytecode similar to:
LOAD_NAME x ; Look up variable 'x'
LOAD_CONST 1 ; Load the constant 1
BINARY_OP ADD ; Add the two values
STORE_NAME x ; Store result back to 'x'
Important: This bytecode is NOT executed by hardware. It's interpreted by a C program running on the CPU.
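Desktop CPython structures its bytecode the same way, and its standard dis module makes the four-step pattern visible (opcode names vary slightly between Python versions — e.g. BINARY_OP was INPLACE_ADD before 3.11):

```python
import dis

# Compile the statement as module-level code, then list its opcodes
code = compile("x += 1", "<example>", "exec")
for ins in dis.get_instructions(code):
    print(ins.opname, ins.argrepr)
# Typical output (Python 3.11+), plus some housekeeping opcodes:
#   LOAD_NAME  x
#   LOAD_CONST 1
#   BINARY_OP  +=
#   STORE_NAME x
```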
Part 4: What Each Bytecode Actually Does
LOAD_NAME x (~50-100 instructions)
This single bytecode operation involves:
- Get the name "x" - It's a string pointer
- Hash the string - Compute hash("x") for dictionary lookup
- Search the local namespace - Dictionary lookup
- If not found, search global namespace - Another dictionary lookup
- Follow the pointer - Retrieve the object reference
- Push to VM stack - Store for next operation
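The lookup chain above can be sketched in plain Python (a hypothetical helper, using ordinary dicts to stand in for MicroPython's scope dictionaries):

```python
def lookup_name(name, local_scope, global_scope):
    """Sketch of LOAD_NAME: hash the name, probe each scope dict in order."""
    if name in local_scope:        # dictionary lookup #1 (hash + probe)
        return local_scope[name]
    if name in global_scope:       # dictionary lookup #2 (hash + probe)
        return global_scope[name]
    raise NameError(f"name '{name}' is not defined")

# Usage: 'x' only exists in the global scope, so two lookups are needed
print(lookup_name("x", {}, {"x": 41}))   # → 41
```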
// Simplified pseudocode of what the interpreter does:
mp_obj_t lookup_name(qstr name) {
    // Try local scope
    mp_obj_t *local = mp_locals_get(name);
    if (local != NULL) return *local;

    // Try global scope
    mp_obj_t *global = mp_globals_get(name);
    if (global != NULL) return *global;

    // Name error!
    mp_raise_NameError(name);
}
Each step involves:
- Function calls
- Pointer chasing
- Conditional branches
- Possible cache misses
LOAD_CONST 1 (~20-40 instructions)
Even the constant 1 is not a raw number—it's a Python object:
// MicroPython integer object structure
typedef struct _mp_obj_int_t {
    mp_obj_base_t base;   // Type information
    mp_int_t value;       // The actual integer value
} mp_obj_int_t;
The interpreter must:
1. Look up the constant in the constants table
2. Get or create the integer object
3. Push the object pointer to the VM stack
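That even the number 1 is a full object, not a raw machine word, is easy to see in desktop CPython, whose object model is similar in principle; sys.getsizeof reports the whole object (header plus value), not just the value bits. (MicroPython is cheaper here — it encodes small ints directly in the tagged pointer — but the generic-object model is the same.)

```python
import sys

# In C, an int is typically 4 bytes. In CPython, even the number 1
# carries a type pointer and bookkeeping fields on top of its value.
print(sys.getsizeof(1))   # 28 bytes on a typical 64-bit CPython
print(type(1))            # <class 'int'> — the "base" type field in disguise
```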
BINARY_OP ADD (~100-300 instructions)
This is the most expensive operation. The interpreter must:
- Pop two operands from VM stack
- Check type of left operand - Is it an int? float? string? list?
- Check type of right operand - Same question
- Resolve the correct add function - Integer add? Float add? String concatenation?
- Call the add function
- Produce a NEW integer object for the result (integers are immutable!)
- Initialize the new object - Set type and value (MicroPython encodes small ints directly in the tagged pointer; only values that overflow that range allocate on the heap)
- Push result to VM stack
// Simplified pseudocode
mp_obj_t binary_op_add(mp_obj_t lhs, mp_obj_t rhs) {
    // Type dispatch - many branches!
    if (mp_obj_is_small_int(lhs) && mp_obj_is_small_int(rhs)) {
        mp_int_t lval = MP_OBJ_SMALL_INT_VALUE(lhs);
        mp_int_t rval = MP_OBJ_SMALL_INT_VALUE(rhs);
        return mp_obj_new_int(lval + rval); // Tagged small int, or heap allocation on overflow
    } else if (mp_obj_is_float(lhs) || mp_obj_is_float(rhs)) {
        // Float addition path...
    } else if (mp_obj_is_str(lhs) && mp_obj_is_str(rhs)) {
        // String concatenation path...
    }
    // ... many more type combinations
}
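The same dispatch, written as plain Python (a toy model — real MicroPython dispatches through per-type function tables, but the branch-before-arithmetic shape is the point):

```python
def binary_op_add(lhs, rhs):
    """Toy model of BINARY_OP ADD: inspect both types before any arithmetic."""
    if isinstance(lhs, int) and isinstance(rhs, int):
        return lhs + rhs                 # integer add path
    if isinstance(lhs, float) or isinstance(rhs, float):
        return float(lhs) + float(rhs)   # float add path
    if isinstance(lhs, str) and isinstance(rhs, str):
        return lhs + rhs                 # string concatenation path
    raise TypeError(f"unsupported types: {type(lhs).__name__}, {type(rhs).__name__}")

print(binary_op_add(2, 3))        # → 5
print(binary_op_add("a", "b"))    # → ab
```

Every call pays for those checks, even when both operands are integers every single time.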
STORE_NAME x (~50-100 instructions)
- Pop result from VM stack
- Hash "x" again (or use the cached hash)
- Find the dictionary entry
- Store the new value pointer
- The old value is now unreachable; MicroPython's mark-and-sweep garbage collector reclaims it later (there are no reference counts to update)
Part 5: The Interpreter Loop Overhead
Every bytecode runs inside the interpreter main loop:
// Simplified MicroPython interpreter loop
void mp_execute_bytecode(mp_code_state_t *code_state) {
    const byte *ip = code_state->ip;   // Instruction pointer
    while (1) {
        byte opcode = *ip++;           // Fetch next opcode
        switch (opcode) {
            case MP_BC_LOAD_NAME:
                // ... 50-100 instructions ...
                break;
            case MP_BC_LOAD_CONST:
                // ... 20-40 instructions ...
                break;
            case MP_BC_BINARY_OP:
                // ... 100-300 instructions ...
                break;
            case MP_BC_STORE_NAME:
                // ... 50-100 instructions ...
                break;
            // ... hundreds of other opcodes ...
        }
    }
}
This loop adds overhead for every single bytecode:
| Overhead Source | Instructions |
|---|---|
| Fetch opcode from memory | 2-4 |
| Switch/dispatch table lookup | 5-15 |
| Branch to handler | 2-5 |
| Loop iteration | 3-5 |
| Total per opcode | ~10-30 |
With 4 bytecodes for x += 1, that's ~40-120 extra instructions just for dispatching.
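A miniature version of that loop in Python makes the fetch/dispatch cost concrete (made-up opcode numbers and a four-case dispatch; a real VM has hundreds of cases):

```python
# Toy stack VM with four opcodes, mirroring the C switch above
LOAD_NAME, LOAD_CONST, BINARY_ADD, STORE_NAME = range(4)

def execute(bytecode, names):
    stack = []
    ip = 0
    while ip < len(bytecode):
        opcode, arg = bytecode[ip]      # fetch (memory read)
        ip += 1                         # loop bookkeeping
        if opcode == LOAD_NAME:         # dispatch (compare + branch)
            stack.append(names[arg])
        elif opcode == LOAD_CONST:
            stack.append(arg)
        elif opcode == BINARY_ADD:
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
        elif opcode == STORE_NAME:
            names[arg] = stack.pop()
    return names

# The four bytecodes for x += 1
program = [(LOAD_NAME, "x"), (LOAD_CONST, 1),
           (BINARY_ADD, None), (STORE_NAME, "x")]
print(execute(program, {"x": 41}))   # → {'x': 42}
```

Note that the fetch, increment, and dispatch lines run once per opcode regardless of how trivial the opcode's actual work is.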
Part 6: Total Instruction Count
C: x++
| Level | Count |
|---|---|
| ARM instructions | 3 |
| Time at 150 MHz | ~20-30 ns |
MicroPython: x += 1
| Stage | Instructions |
|---|---|
| LOAD_NAME | ~80 |
| LOAD_CONST | ~30 |
| BINARY_OP ADD | ~200 |
| STORE_NAME | ~80 |
| Interpreter dispatch (×4) | ~80 |
| Total | ~500-700 |
| Time at 150 MHz | ~4-6 µs |
The Ratio
┌─────────────────────────────────────────────────────────────┐
│ │
│ C: x++; → 3 instructions → ~25 ns │
│ Python: x += 1 → 600 instructions → ~5 µs │
│ │
│ RATIO: ~200× SLOWER │
│ │
│ This is NORMAL for interpreted languages. │
│ │
└─────────────────────────────────────────────────────────────┘
Part 7: Why Python MUST Do This
Python allows:
x = 1 # x is an integer
x = "hello" # now x is a string
x = [1, 2, 3] # now x is a list
x += 1 # What does += mean now? Depends on type!
At any point, any variable can hold any type.
So the interpreter must:
- Check types at every operation
- Use generic object representations
- Resolve operations dynamically
- Handle type errors at runtime
Flexibility is paid for on every single operation.
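The same line of source can mean entirely different machine work depending on what the variable holds at that moment:

```python
def double(x):
    """What does `x + x` mean? Python can only decide at call time."""
    return x + x

print(double(21))        # → 42              (integer add)
print(double("ab"))      # → abab            (string concatenation)
print(double([1, 2]))    # → [1, 2, 1, 2]    (list concatenation)
```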
Part 8: The Compiler vs Runtime Trade-off
| Question | C (Compiler) | Python (Runtime) |
|---|---|---|
| What type is x? | Known at compile time | Checked every operation |
| Where is x? | Fixed address | Dictionary lookup |
| What does + mean? | Fixed instruction | Type-dependent dispatch |
| Memory allocation? | Stack/static | Heap + GC |
| Error handling? | Compile-time (mostly) | Runtime exceptions |
C expresses intent to the compiler. Python expresses intent to the runtime.
Work the compiler does once at build time costs nothing at runtime; work deferred to the runtime is paid for again on every single operation.
Part 9: When MicroPython Can Be "Fast Enough"
MicroPython is efficient when you:
1. Avoid Tight Loops
# SLOW: Python loop with many iterations
total = 0
for i in range(10000):
total += i
# FASTER: Use built-in (implemented in C)
total = sum(range(10000))
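On a desktop, the same pattern can be measured with the standard timeit module (absolute numbers differ from a Pico, but the ratio demonstrates the same effect — the built-in runs the loop in C):

```python
import timeit

loop_version = "total = 0\nfor i in range(10000):\n    total += i"
builtin_version = "total = sum(range(10000))"

t_loop = timeit.timeit(loop_version, number=200)
t_builtin = timeit.timeit(builtin_version, number=200)
print(f"loop: {t_loop:.3f}s  sum(): {t_builtin:.3f}s  "
      f"speedup: {t_loop / t_builtin:.1f}x")
```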
2. Use Native Modules
# These are implemented in C, not Python:
from machine import Pin, PWM
from neopixel import NeoPixel
# The Python call is slow, but the actual work is fast
led.on() # Python overhead, then C does the work
3. Push Timing-Critical Code to Hardware
# DON'T: Bit-bang in Python
for bit in data:
pin.value(bit)
time.sleep_us(1) # Way too slow!
# DO: Let PIO handle it
leds.write() # Python just triggers; PIO does timing
4. Design Around Milliseconds, Not Microseconds
| Time Scale | Python? | Example |
|---|---|---|
| Nanoseconds | ✗ No | WS2812 protocol |
| Microseconds | △ Marginal | Ultrasonic echo |
| Milliseconds | ✓ Yes | Button debounce, display update |
| Seconds | ✓ Yes | State machine transitions |
Part 10: The Complete Picture
┌────────────────────────────────────────────────────────────────┐
│ EMBEDDED SYSTEM ARCHITECTURE │
├────────────────────────────────────────────────────────────────┤
│ │
│ Layer │ Language │ Speed │ Use For │
│ ───────────────────────────────────────────────────────── │
│ Application │ Python │ ~1 ms │ Logic, state, │
│ Logic │ │ │ decisions │
│ ───────────────────────────────────────────────────────── │
│ Hardware │ C / SDK │ ~1 µs │ Drivers, │
│ Drivers │ │ │ protocols │
│ ───────────────────────────────────────────────────────── │
│ Hardware │ PIO / PWM │ ~100 ns │ Precise timing, │
│ Peripherals │ / DMA │ │ waveforms │
│ │
└────────────────────────────────────────────────────────────────┘
"Software for thinking, hardware for timing."
Summary: One-Sentence Takeaway
C tells the compiler what to do; Python tells the runtime what to do. Compilers run once at build time; runtimes run continuously during execution. That's the 200× difference.
Related Content
- Tutorial: Line Following - How timing affects control loops
- Introduction: How Slow is Python - Practical timing measurements
- Execution Models - Polling, interrupts, and RTOS patterns
Further Reading
- MicroPython Internals - How MicroPython works
- ARM Cortex-M33 Technical Reference - CPU architecture details
- CPython Bytecode - Similar concepts in standard Python