
Why MicroPython is Slow: From Silicon to Interpreter

Advanced Reference | For students who want to understand the full technical picture

This document traces a simple operation (x += 1) from Python source code down to ARM assembly, explaining exactly why interpreted Python is ~200× slower than compiled C.


The Question

"I know C. I've used Arduino. Why does MicroPython feel so sluggish?"

The answer is architectural, not "bad design." Let's trace what actually happens.


Part 1: C on Cortex-M33 (What Really Happens)

We'll use the ARM Cortex-M33 (the CPU in the RP2350/Pico 2) as our reference.

Example: Increment a Global Variable

int x;
void foo(void) {
    x++;
}

Case A: Variable in Register (Best Case)

If the compiler keeps x in a register (e.g., r0):

adds r0, r0, #1

ONE instruction. On the Cortex-M33 it:

  - Executes in 1 CPU cycle
  - Makes no memory access
  - Has deterministic timing

Case B: Variable in RAM (Realistic Case)

For global or volatile variables (the common embedded case):

ldr  r0, [r1]     ; Load x from memory into r0
adds r0, r0, #1   ; Increment r0
str  r0, [r1]     ; Store r0 back to memory

3 instructions. Timing (approximate):

  Instruction   Cycles
  ldr           1-2
  adds          1
  str           1-2
  Total         3-5

At 150 MHz: ~20-30 nanoseconds

Why Is C So Fast?

The compiler resolved everything at compile time:

  Question                          Answered when?
  What type is x?                   Compile time
  Where is x stored?                Compile time
  How many bytes is x?              Compile time
  What instruction increments it?   Compile time

The CPU executes pure intent. No decisions at runtime.


Part 2: MicroPython Execution Pipeline

When you write Python:

x += 1

It goes through multiple stages:

┌─────────────────────────────────────────────────────────────┐
│  MICROPYTHON EXECUTION PIPELINE                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Python source code                                        │
│         ↓                                                   │
│   Parser (tokenize, build AST)                              │
│         ↓                                                   │
│   Compiler (generate bytecode)                              │
│         ↓                                                   │
│   Bytecode (stored in memory)                               │
│         ↓                                                   │
│   Interpreter loop (C code)  ←── THIS IS WHERE TIME GOES    │
│         ↓                                                   │
│   Runtime object system                                     │
│         ↓                                                   │
│   ARM machine code                                          │
│         ↓                                                   │
│   CPU execution                                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Part 3: MicroPython Bytecode

MicroPython compiles x += 1 into bytecode similar to:

LOAD_NAME    x       ; Look up variable 'x'
LOAD_CONST   1       ; Load the constant 1
BINARY_OP    ADD     ; Add the two values
STORE_NAME   x       ; Store result back to 'x'

Important: This bytecode is NOT executed by hardware. It's interpreted by a C program running on the CPU.
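On desktop CPython you can inspect the analogous bytecode with the `dis` module (not available in MicroPython itself; opcode names vary between versions, but the four-step load/load/add/store shape is the same):

```python
import dis

# Compile "x += 1" as module-level code and list its opcodes.
code = compile("x += 1", "<demo>", "exec")
for ins in dis.get_instructions(code):
    print(ins.opname, ins.argrepr)
```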


Part 4: What Each Bytecode Actually Does

LOAD_NAME x (~50-100 instructions)

This single bytecode operation involves:

  1. Get the name "x" - It's a string pointer
  2. Hash the string - Compute hash("x") for the dictionary lookup (MicroPython interns names as qstrs, so the hash is often precomputed, but the lookup still runs)
  3. Search the local namespace - Dictionary lookup
  4. If not found, search global namespace - Another dictionary lookup
  5. Follow the pointer - Retrieve the object reference
  6. Push to VM stack - Store for next operation
// Simplified pseudocode of what the interpreter does:
mp_obj_t lookup_name(qstr name) {
    // Try local scope
    mp_obj_t *local = mp_locals_get(name);
    if (local != NULL) return *local;

    // Try global scope
    mp_obj_t *global = mp_globals_get(name);
    if (global != NULL) return *global;

    // Name error!
    mp_raise_NameError(name);
}

Each step involves:

  - Function calls
  - Pointer chasing
  - Conditional branches
  - Possible cache misses
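The scope search in the pseudocode above can be modelled directly in Python, since a namespace really is a dictionary (a sketch; `lookup_name`, `local_scope`, and `global_scope` are illustrative names, not MicroPython APIs):

```python
# Model LOAD_NAME: a dictionary lookup keyed by the name string,
# trying the local scope first, then the global scope.
local_scope = {}
global_scope = {"x": 41}

def lookup_name(name):
    if name in local_scope:       # try locals first
        return local_scope[name]
    if name in global_scope:      # then globals
        return global_scope[name]
    raise NameError(name)         # neither scope has it

print(lookup_name("x"))  # prints 41
```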

LOAD_CONST 1 (~20-40 instructions)

Even the constant 1 is not a raw number—it's a Python object:

// MicroPython integer object structure
typedef struct _mp_obj_int_t {
    mp_obj_base_t base;    // Type information
    mp_int_t value;        // The actual integer value
} mp_obj_int_t;

The interpreter must:

  1. Look up the constant in the constants table
  2. Get or create the integer object
  3. Push the object pointer to the VM stack
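You can see on desktop CPython that even 1 is a full object, not a bare machine word (exact sizes differ in MicroPython, where small ints live in tagged pointers, but the principle is the same):

```python
import sys

# A Python int carries a type header and bookkeeping, not just raw bits:
print(type(1))           # <class 'int'>
print(sys.getsizeof(1))  # tens of bytes on desktop CPython, vs 4 for a C int
```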

BINARY_OP ADD (~100-300 instructions)

This is the most expensive operation. The interpreter must:

  1. Pop two operands from VM stack
  2. Check type of left operand - Is it an int? float? string? list?
  3. Check type of right operand - Same question
  4. Resolve the correct add function - Integer add? Float add? String concatenation?
  5. Call the add function
  6. Produce a NEW integer object for the result (integers are immutable!) - small results fit in a tagged pointer; larger ones are heap-allocated
  7. Initialize the new object - Set its type and value
  8. Push result to VM stack
// Simplified pseudocode
mp_obj_t binary_op_add(mp_obj_t lhs, mp_obj_t rhs) {
    // Type dispatch - many branches!
    if (mp_obj_is_small_int(lhs) && mp_obj_is_small_int(rhs)) {
        mp_int_t lval = MP_OBJ_SMALL_INT_VALUE(lhs);
        mp_int_t rval = MP_OBJ_SMALL_INT_VALUE(rhs);
        return mp_obj_new_int(lval + rval);  // Allocates if the result overflows the small-int range
    } else if (mp_obj_is_float(lhs) || mp_obj_is_float(rhs)) {
        // Float addition path...
    } else if (mp_obj_is_str(lhs) && mp_obj_is_str(rhs)) {
        // String concatenation path...
    }
    // ... many more type combinations
}
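The dispatch is easy to observe from Python itself: the same add bytecode resolves to a different C routine for each operand type (plain Python; runs on CPython or MicroPython):

```python
# One BINARY_OP ADD bytecode, four different C code paths:
print(1 + 2)        # integer add
print(1.5 + 2)      # float add
print("a" + "b")    # string concatenation
print([1] + [2])    # list concatenation
```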

STORE_NAME x (~50-100 instructions)

  1. Pop the result from the VM stack
  2. Hash "x" again (or reuse the interned hash)
  3. Find the dictionary entry
  4. Overwrite the old value pointer with the new one
  5. The old object becomes unreachable; MicroPython's mark-and-sweep garbage collector reclaims it later (MicroPython does not use reference counts)
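The whole load/add/store round trip through a namespace dictionary can be watched from Python using `exec()` with an explicit dict (a sketch; `ns` is an illustrative name):

```python
# STORE_NAME is a dictionary write. Give exec() a dict to use as the
# global namespace and watch the binding change:
ns = {"x": 41}
exec("x += 1", ns)   # LOAD_NAME, LOAD_CONST, add, STORE_NAME
print(ns["x"])       # prints 42
```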

Part 5: The Interpreter Loop Overhead

Every bytecode runs inside the interpreter main loop:

// Simplified MicroPython interpreter loop
void mp_execute_bytecode(mp_code_state_t *code_state) {
    const byte *ip = code_state->ip;  // Instruction pointer

    while (1) {
        byte opcode = *ip++;  // Fetch next opcode

        switch (opcode) {
            case MP_BC_LOAD_NAME:
                // ... 50-100 instructions ...
                break;

            case MP_BC_LOAD_CONST:
                // ... 20-40 instructions ...
                break;

            case MP_BC_BINARY_OP:
                // ... 100-300 instructions ...
                break;

            case MP_BC_STORE_NAME:
                // ... 50-100 instructions ...
                break;

            // ... hundreds of other opcodes ...
        }
    }
}

This loop adds overhead for every single bytecode:

  Overhead source                  Instructions
  Fetch opcode from memory         2-4
  Switch/dispatch table lookup     5-15
  Branch to handler                2-5
  Loop iteration                   3-5
  Total per opcode                 ~10-30

With 4 bytecodes for x += 1, that's ~40-120 extra instructions just for dispatching.
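The loop above can be modelled in a few lines of Python. This toy VM is only a sketch (the opcode tuples are made up, not MicroPython's real bytecode format), but it shows where the per-opcode dispatch cost lives:

```python
# A toy dispatch loop: every opcode pays for the fetch, the comparison
# chain, and the branch before any useful work happens.
def run(bytecode, env):
    stack = []
    for op, arg in bytecode:            # fetch
        if op == "LOAD_NAME":           # dispatch...
            stack.append(env[arg])
        elif op == "LOAD_CONST":
            stack.append(arg)
        elif op == "BINARY_ADD":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
        elif op == "STORE_NAME":
            env[arg] = stack.pop()
        else:
            raise ValueError(op)
    return env

program = [("LOAD_NAME", "x"), ("LOAD_CONST", 1),
           ("BINARY_ADD", None), ("STORE_NAME", "x")]
print(run(program, {"x": 41})["x"])  # prints 42
```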


Part 6: Total Instruction Count

C: x++

  Level              Count
  ARM instructions   3
  Time at 150 MHz    ~20-30 ns

MicroPython: x += 1

  Stage                       Instructions
  LOAD_NAME                   ~80
  LOAD_CONST                  ~30
  BINARY_OP ADD               ~200
  STORE_NAME                  ~80
  Interpreter dispatch (×4)   ~80
  Total                       ~500-700
  Time at 150 MHz             ~4-6 µs

The Ratio

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   C:       x++;        →       3 instructions → ~25 ns     │
│   Python:  x += 1      →     600 instructions → ~5 µs      │
│                                                             │
│   RATIO: ~200× SLOWER                                       │
│                                                             │
│   This is NORMAL for interpreted languages.                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Part 7: Why Python MUST Do This

Python allows:

x = 1           # x is an integer
x = "hello"     # now x is a string
x = [1, 2, 3]   # now x is a list
x += 1          # What does += mean now? Depends on type!

At any point, any variable can hold any type.

So the interpreter must:

  - Check types at every operation
  - Use generic object representations
  - Resolve operations dynamically
  - Handle type errors at runtime

Flexibility is paid for on every single operation.
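Because the type check happens at runtime, so do the failures; a short sketch:

```python
# A type error surfaces only when the offending bytecode executes,
# not when the program is compiled:
x = "hello"
try:
    x += 1       # BINARY_OP finds str + int: no valid path
except TypeError:
    print("caught at runtime, not at compile time")
```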


Part 8: The Compiler vs Runtime Trade-off

  Question             C (compiler)            Python (runtime)
  What type is x?      Known at compile time   Checked at every operation
  Where is x?          Fixed address           Dictionary lookup
  What does + mean?    Fixed instruction       Type-dependent dispatch
  Memory allocation?   Stack/static            Heap + GC
  Error handling?      Compile-time (mostly)   Runtime exceptions

C expresses intent to the compiler. Python expresses intent to the runtime.

Resolving those questions once at build time is orders of magnitude cheaper than re-resolving them on every operation at runtime.


Part 9: When MicroPython Can Be "Fast Enough"

MicroPython is efficient when you:

1. Avoid Tight Loops

# SLOW: Python loop with many iterations
total = 0
for i in range(10000):
    total += i

# FASTER: Use built-in (implemented in C)
total = sum(range(10000))
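You can measure the gap directly. This sketch uses `time.perf_counter()` for portability; on MicroPython you would use `time.ticks_us()` and `time.ticks_diff()` instead:

```python
import time

N = 100_000

t0 = time.perf_counter()
total = 0
for i in range(N):        # every iteration runs the full bytecode pipeline
    total += i
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
fast = sum(range(N))      # the whole loop runs inside one C function
t_builtin = time.perf_counter() - t0

assert total == fast == N * (N - 1) // 2
print(f"loop: {t_loop*1e3:.1f} ms   builtin: {t_builtin*1e3:.1f} ms")
```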

2. Use Native Modules

# These are implemented in C, not Python:
from machine import Pin, PWM
from neopixel import NeoPixel

# The Python call is slow, but the actual work is fast
led.on()  # Python overhead, then C does the work

3. Push Timing-Critical Code to Hardware

# DON'T: Bit-bang a fast protocol in Python
for bit in data:
    pin.value(bit)
    time.sleep_us(1)  # interpreter overhead adds far more than 1 µs of jitter

# DO: Let PIO handle it
leds.write()  # Python just triggers; PIO does timing

4. Design Around Milliseconds, Not Microseconds

  Time scale     Python?      Example
  Nanoseconds    ✗ No         WS2812 protocol
  Microseconds   ~ Marginal   Ultrasonic echo
  Milliseconds   ✓ Yes        Button debounce, display update
  Seconds        ✓ Yes        State machine transitions

Part 10: The Complete Picture

┌────────────────────────────────────────────────────────────────┐
│                    EMBEDDED SYSTEM ARCHITECTURE                 │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Layer           │ Language   │ Speed     │ Use For           │
│  ─────────────────────────────────────────────────────────     │
│   Application     │ Python     │ ~1 ms     │ Logic, state,     │
│   Logic           │            │           │ decisions         │
│  ─────────────────────────────────────────────────────────     │
│   Hardware        │ C / SDK    │ ~1 µs     │ Drivers,          │
│   Drivers         │            │           │ protocols         │
│  ─────────────────────────────────────────────────────────     │
│   Hardware        │ PIO / PWM  │ ~100 ns   │ Precise timing,   │
│   Peripherals     │ / DMA      │           │ waveforms         │
│                                                                 │
└────────────────────────────────────────────────────────────────┘

"Software for thinking, hardware for timing."

Summary: One-Sentence Takeaway

C tells the compiler what to do; Python tells the runtime what to do. Compilers run once at build time; runtimes run continuously during execution. That's the 200× difference.



Further Reading