
Why MicroPython is Slow: From Silicon to Interpreter

Advanced Reference | For students who want to understand the full technical picture

This document traces a simple operation (x += 1) from Python source code down to ARM assembly, explaining exactly why interpreted Python is ~200× slower than compiled C.


The Question

"I know C. I've used Arduino. Why does MicroPython feel so sluggish?"

The answer is architectural, not "bad design." Let's trace what actually happens.


Part 1: C on Cortex-M33 (What Really Happens)

We'll use the ARM Cortex-M33 (the CPU in the RP2350/Pico 2) as our reference.

Example: Increment a Global Variable

int x;
void foo(void) {
    x++;
}

Case A: Variable in Register (Best Case)

If the compiler keeps x in a register (e.g., r0):

adds r0, r0, #1

ONE instruction. On the Cortex-M33 it:

  - Executes in 1 CPU cycle
  - Makes no memory access
  - Has deterministic timing

Case B: Variable in RAM (Realistic Case)

For global or volatile variables (the common embedded case):

ldr  r0, [r1]     ; Load x from memory into r0
adds r0, r0, #1   ; Increment r0
str  r0, [r1]     ; Store r0 back to memory

3 instructions. Timing (approximate):

  Instruction   Cycles
  ldr           1-2
  adds          1
  str           1-2
  Total         3-5

At 150 MHz: ~20-30 nanoseconds

Why Is C So Fast?

The compiler resolved everything at compile time:

  Question                          Answered when?
  What type is x?                   Compile time
  Where is x stored?                Compile time
  How many bytes is x?              Compile time
  What instruction increments it?   Compile time

The CPU executes pure intent. No decisions at runtime.


Part 2: MicroPython Execution Pipeline

When you write Python:

x += 1

It goes through multiple stages:

┌─────────────────────────────────────────────────────────────┐
│  MICROPYTHON EXECUTION PIPELINE                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Python source code                                        │
│         ↓                                                   │
│   Parser (tokenize, build AST)                              │
│         ↓                                                   │
│   Compiler (generate bytecode)                              │
│         ↓                                                   │
│   Bytecode (stored in memory)                               │
│         ↓                                                   │
│   Interpreter loop (C code)  ←── THIS IS WHERE TIME GOES    │
│         ↓                                                   │
│   Runtime object system                                     │
│         ↓                                                   │
│   ARM machine code                                          │
│         ↓                                                   │
│   CPU execution                                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Part 3: MicroPython Bytecode

MicroPython compiles x += 1 into bytecode similar to:

LOAD_NAME    x       ; Look up variable 'x'
LOAD_CONST   1       ; Load the constant 1
BINARY_OP    ADD     ; Add the two values
STORE_NAME   x       ; Store result back to 'x'

Important: This bytecode is NOT executed by hardware. It's interpreted by a C program running on the CPU.
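On desktop CPython you can inspect the analogous bytecode with the `dis` module (not available in MicroPython itself; opcode names vary between versions, but the four-step load/load/add/store shape is the same):

```python
import dis

# Compile "x += 1" as module-level code and list its opcodes.
code = compile("x += 1", "<demo>", "exec")
for ins in dis.get_instructions(code):
    print(ins.opname, ins.argrepr)
```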


Part 4: What Each Bytecode Actually Does

LOAD_NAME x (~50-100 instructions)

This single bytecode operation involves:

  1. Get the name "x" - It's a string pointer
  2. Hash the string - Compute hash("x") for the dictionary lookup (MicroPython interns names as qstrs, so the hash is often precomputed, but the lookup still runs)
  3. Search the local namespace - Dictionary lookup
  4. If not found, search global namespace - Another dictionary lookup
  5. Follow the pointer - Retrieve the object reference
  6. Push to VM stack - Store for next operation
// Simplified pseudocode of what the interpreter does:
mp_obj_t lookup_name(qstr name) {
    // Try local scope
    mp_obj_t *local = mp_locals_get(name);
    if (local != NULL) return *local;

    // Try global scope
    mp_obj_t *global = mp_globals_get(name);
    if (global != NULL) return *global;

    // Name error!
    mp_raise_NameError(name);
}

Each step involves:

  - Function calls
  - Pointer chasing
  - Conditional branches
  - Possible cache misses
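The scope search in the pseudocode above can be modelled directly in Python, since a namespace really is a dictionary (a sketch; `lookup_name`, `local_scope`, and `global_scope` are illustrative names, not MicroPython APIs):

```python
# Model LOAD_NAME: a dictionary lookup keyed by the name string,
# trying the local scope first, then the global scope.
local_scope = {}
global_scope = {"x": 41}

def lookup_name(name):
    if name in local_scope:       # try locals first
        return local_scope[name]
    if name in global_scope:      # then globals
        return global_scope[name]
    raise NameError(name)         # neither scope has it

print(lookup_name("x"))  # prints 41
```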

LOAD_CONST 1 (~20-40 instructions)

Even the constant 1 is not a raw number—it's a Python object:

// MicroPython integer object structure
typedef struct _mp_obj_int_t {
    mp_obj_base_t base;    // Type information
    mp_int_t value;        // The actual integer value
} mp_obj_int_t;

The interpreter must:

  1. Look up the constant in the constants table
  2. Get or create the integer object
  3. Push the object pointer to the VM stack
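You can see on desktop CPython that even 1 is a full object, not a bare machine word (exact sizes differ in MicroPython, where small ints live in tagged pointers, but the principle is the same):

```python
import sys

# A Python int carries a type header and bookkeeping, not just raw bits:
print(type(1))           # <class 'int'>
print(sys.getsizeof(1))  # tens of bytes on desktop CPython, vs 4 for a C int
```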

BINARY_OP ADD (~100-300 instructions)

This is the most expensive operation. The interpreter must:

  1. Pop two operands from VM stack
  2. Check type of left operand - Is it an int? float? string? list?
  3. Check type of right operand - Same question
  4. Resolve the correct add function - Integer add? Float add? String concatenation?
  5. Call the add function
  6. Produce a NEW integer object for the result (integers are immutable!) - small results fit in a tagged pointer; larger ones are heap-allocated
  7. Initialize the new object - Set its type and value
  8. Push result to VM stack
// Simplified pseudocode
mp_obj_t binary_op_add(mp_obj_t lhs, mp_obj_t rhs) {
    // Type dispatch - many branches!
    if (mp_obj_is_small_int(lhs) && mp_obj_is_small_int(rhs)) {
        mp_int_t lval = MP_OBJ_SMALL_INT_VALUE(lhs);
        mp_int_t rval = MP_OBJ_SMALL_INT_VALUE(rhs);
        return mp_obj_new_int(lval + rval);  // Allocates if the result overflows the small-int range
    } else if (mp_obj_is_float(lhs) || mp_obj_is_float(rhs)) {
        // Float addition path...
    } else if (mp_obj_is_str(lhs) && mp_obj_is_str(rhs)) {
        // String concatenation path...
    }
    // ... many more type combinations
}
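The dispatch is easy to observe from Python itself: the same add bytecode resolves to a different C routine for each operand type (plain Python; runs on CPython or MicroPython):

```python
# One BINARY_OP ADD bytecode, four different C code paths:
print(1 + 2)        # integer add
print(1.5 + 2)      # float add
print("a" + "b")    # string concatenation
print([1] + [2])    # list concatenation
```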

STORE_NAME x (~50-100 instructions)

  1. Pop the result from the VM stack
  2. Hash "x" again (or reuse the interned hash)
  3. Find the dictionary entry
  4. Overwrite the old value pointer with the new one
  5. The old object becomes unreachable; MicroPython's mark-and-sweep garbage collector reclaims it later (MicroPython does not use reference counts)
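The whole load/add/store round trip through a namespace dictionary can be watched from Python using `exec()` with an explicit dict (a sketch; `ns` is an illustrative name):

```python
# STORE_NAME is a dictionary write. Give exec() a dict to use as the
# global namespace and watch the binding change:
ns = {"x": 41}
exec("x += 1", ns)   # LOAD_NAME, LOAD_CONST, add, STORE_NAME
print(ns["x"])       # prints 42
```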

Part 5: The Interpreter Loop Overhead

Every bytecode runs inside the interpreter main loop:

// Simplified MicroPython interpreter loop
void mp_execute_bytecode(mp_code_state_t *code_state) {
    const byte *ip = code_state->ip;  // Instruction pointer

    while (1) {
        byte opcode = *ip++;  // Fetch next opcode

        switch (opcode) {
            case MP_BC_LOAD_NAME:
                // ... 50-100 instructions ...
                break;

            case MP_BC_LOAD_CONST:
                // ... 20-40 instructions ...
                break;

            case MP_BC_BINARY_OP:
                // ... 100-300 instructions ...
                break;

            case MP_BC_STORE_NAME:
                // ... 50-100 instructions ...
                break;

            // ... hundreds of other opcodes ...
        }
    }
}

This loop adds overhead for every single bytecode:

  Overhead source                  Instructions
  Fetch opcode from memory         2-4
  Switch/dispatch table lookup     5-15
  Branch to handler                2-5
  Loop iteration                   3-5
  Total per opcode                 ~10-30

With 4 bytecodes for x += 1, that's ~40-120 extra instructions just for dispatching.
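The loop above can be modelled in a few lines of Python. This toy VM is only a sketch (the opcode tuples are made up, not MicroPython's real bytecode format), but it shows where the per-opcode dispatch cost lives:

```python
# A toy dispatch loop: every opcode pays for the fetch, the comparison
# chain, and the branch before any useful work happens.
def run(bytecode, env):
    stack = []
    for op, arg in bytecode:            # fetch
        if op == "LOAD_NAME":           # dispatch...
            stack.append(env[arg])
        elif op == "LOAD_CONST":
            stack.append(arg)
        elif op == "BINARY_ADD":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
        elif op == "STORE_NAME":
            env[arg] = stack.pop()
        else:
            raise ValueError(op)
    return env

program = [("LOAD_NAME", "x"), ("LOAD_CONST", 1),
           ("BINARY_ADD", None), ("STORE_NAME", "x")]
print(run(program, {"x": 41})["x"])  # prints 42
```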


Part 6: Total Instruction Count

C: x++

  Level              Count
  ARM instructions   3
  Time at 150 MHz    ~20-30 ns

MicroPython: x += 1

  Stage                       Instructions
  LOAD_NAME                   ~80
  LOAD_CONST                  ~30
  BINARY_OP ADD               ~200
  STORE_NAME                  ~80
  Interpreter dispatch (×4)   ~80
  Total                       ~500-700
  Time at 150 MHz             ~4-6 µs

The Ratio

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   C:       x++;        →       3 instructions → ~25 ns     │
│   Python:  x += 1      →     600 instructions → ~5 µs      │
│                                                             │
│   RATIO: ~200× SLOWER                                       │
│                                                             │
│   This is NORMAL for interpreted languages.                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Part 7: Why Python MUST Do This

Python allows:

x = 1           # x is an integer
x = "hello"     # now x is a string
x = [1, 2, 3]   # now x is a list
x += 1          # What does += mean now? Depends on type!

At any point, any variable can hold any type.

So the interpreter must:

  - Check types at every operation
  - Use generic object representations
  - Resolve operations dynamically
  - Handle type errors at runtime

Flexibility is paid for on every single operation.
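Because the type check happens at runtime, so do the failures; a short sketch:

```python
# A type error surfaces only when the offending bytecode executes,
# not when the program is compiled:
x = "hello"
try:
    x += 1       # BINARY_OP finds str + int: no valid path
except TypeError:
    print("caught at runtime, not at compile time")
```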


Part 8: The Compiler vs Runtime Trade-off

  Question             C (compiler)            Python (runtime)
  What type is x?      Known at compile time   Checked at every operation
  Where is x?          Fixed address           Dictionary lookup
  What does + mean?    Fixed instruction       Type-dependent dispatch
  Memory allocation?   Stack/static            Heap + GC
  Error handling?      Compile-time (mostly)   Runtime exceptions

C expresses intent to the compiler. Python expresses intent to the runtime.

Resolving those questions once at build time is orders of magnitude cheaper than re-resolving them on every operation at runtime.


Part 9: When MicroPython Can Be "Fast Enough"

MicroPython is efficient when you:

1. Avoid Tight Loops

# SLOW: Python loop with many iterations
total = 0
for i in range(10000):
    total += i

# FASTER: Use built-in (implemented in C)
total = sum(range(10000))
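You can measure the gap directly. This sketch uses `time.perf_counter()` for portability; on MicroPython you would use `time.ticks_us()` and `time.ticks_diff()` instead:

```python
import time

N = 100_000

t0 = time.perf_counter()
total = 0
for i in range(N):        # every iteration runs the full bytecode pipeline
    total += i
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
fast = sum(range(N))      # the whole loop runs inside one C function
t_builtin = time.perf_counter() - t0

assert total == fast == N * (N - 1) // 2
print(f"loop: {t_loop*1e3:.1f} ms   builtin: {t_builtin*1e3:.1f} ms")
```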

2. Use Native Modules

# These are implemented in C, not Python:
from machine import Pin, PWM
from neopixel import NeoPixel

# The Python call is slow, but the actual work is fast
led.on()  # Python overhead, then C does the work

3. Push Timing-Critical Code to Hardware

# DON'T: Bit-bang a fast protocol in Python
for bit in data:
    pin.value(bit)
    time.sleep_us(1)  # interpreter overhead adds far more than 1 µs of jitter

# DO: Let PIO handle it
leds.write()  # Python just triggers; PIO does timing

4. Design Around Milliseconds, Not Microseconds

  Time scale     Python?      Example
  Nanoseconds    ✗ No         WS2812 protocol
  Microseconds   ~ Marginal   Ultrasonic echo
  Milliseconds   ✓ Yes        Button debounce, display update
  Seconds        ✓ Yes        State machine transitions

Part 10: The Complete Picture

┌────────────────────────────────────────────────────────────────┐
│                    EMBEDDED SYSTEM ARCHITECTURE                 │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Layer           │ Language   │ Speed     │ Use For           │
│  ─────────────────────────────────────────────────────────     │
│   Application     │ Python     │ ~1 ms     │ Logic, state,     │
│   Logic           │            │           │ decisions         │
│  ─────────────────────────────────────────────────────────     │
│   Hardware        │ C / SDK    │ ~1 µs     │ Drivers,          │
│   Drivers         │            │           │ protocols         │
│  ─────────────────────────────────────────────────────────     │
│   Hardware        │ PIO / PWM  │ ~100 ns   │ Precise timing,   │
│   Peripherals     │ / DMA      │           │ waveforms         │
│                                                                 │
└────────────────────────────────────────────────────────────────┘

"Software for thinking, hardware for timing."

Summary: One-Sentence Takeaway

C tells the compiler what to do; Python tells the runtime what to do. Compilers run once at build time; runtimes run continuously during execution. That's the 200× difference.



Further Reading