Boot Flow and Architectures
Goal: Understand how an embedded Linux system goes from power-on to a running application, how boot flow differs across platforms, and where to optimize and debug each stage.
Related Tutorials
For hands-on practice, see: Boot Timing Lab | Exploring Linux | Buildroot
Your device boots in 35 seconds and occasionally hangs before the app starts. You need to answer two questions:
- which stage is slow?
- which stage failed?
Without a staged boot model, this is mostly guesswork.
Meanwhile, an STM32 runs main() in 50 milliseconds. A colleague's STM32MP1 takes 8 seconds but runs Linux with full networking.
Why does each platform boot so differently? The answer is not just "more software." It is a fundamentally different boot architecture driven by the presence (or absence) of an MMU, a filesystem, and a trust chain.
1. Boot Stages (High-Level)
An embedded Linux system does not start all at once — it boots in stages, each handing off to the next. Understanding these stages is essential because each one has its own failure modes, its own debugging tools, and its own optimization opportunities. When a device "hangs during boot," the first question is always: which stage?
- ROM/SoC boot code — hardwired in silicon, runs first, finds the bootloader on storage
- Bootloader (often U-Boot) — initializes DRAM, clocks, and storage, then loads the kernel into memory
- Linux kernel — probes hardware via device tree, loads drivers, mounts the root filesystem
- Init system (systemd) — starts services in dependency order, manages logging and supervision
- User application — your product code begins running
Each stage has different logs, tools, and failure modes. A bootloader failure gives you nothing but a serial console (or silence). A kernel failure gives you dmesg. A systemd failure gives you journalctl. Knowing which tool to reach for depends on knowing which stage you are in.
graph LR
A[Power On] -->|~0.5s| B[ROM Boot]
B -->|~1s| C[Bootloader<br>U-Boot]
C -->|~2s| D[Kernel<br>Init]
D -->|~3-10s| E[systemd<br>Services]
E -->|~1-5s| F[Application<br>Ready]
style A fill:#607D8B,color:#fff
style B fill:#9C27B0,color:#fff
style C fill:#FF9800,color:#fff
style D fill:#2196F3,color:#fff
style E fill:#4CAF50,color:#fff
style F fill:#00BCD4,color:#fff
Approximate timing for Raspberry Pi 4 with stock OS. Total: ~15-35 seconds. A minimal Buildroot image can boot in 3-10 seconds by removing unnecessary services.
2. Where Hardware Is Initialized
Hardware initialization is spread across three stages, and each stage has a specific responsibility. Putting initialization work in the wrong stage is a common source of startup delays and fragile behavior. For example, if a kernel driver busy-waits for a sensor that takes 2 seconds to power up, every boot pays that 2-second penalty — even though the sensor could have been initialized asynchronously while other services start.
- Bootloader: clocks, DRAM, storage basics — only what is needed to load the kernel
- Kernel: drivers, device tree probing, module init — hardware abstraction for user space
- User space: service config, app-level hardware policy — what to do with the hardware
The principle is: each stage initializes only what the next stage needs. The bootloader does not configure I2C sensors (that is the kernel's job). The kernel does not decide which sensor readings to log (that is the application's job). When these responsibilities leak across layers, boot time suffers and debugging becomes harder because the failure could be in any stage.
3. Boot Time Optimization Strategy
Boot time optimization is one of the most impactful engineering tasks in embedded Linux. A stock Raspberry Pi OS boots in 15–35 seconds; a tuned Buildroot image can boot in 3–10 seconds. The difference is not magic — it is the result of removing unnecessary work and parallelizing what remains. The key principle is to measure first, then optimize. Engineers who skip measurement often spend days optimizing the wrong stage.
The main strategies are:
- Remove unused services — a disabled service does no work at boot, though only services on the critical path shorten total boot time
- Parallelize non-dependent services — systemd can start independent services simultaneously
- Defer non-critical initialization — start the application first, initialize optional hardware later
- Avoid blocking app start on optional devices — if the display driver takes 3 seconds to probe, do not hold up the data logger
Optimize with measurements, not assumptions. Use systemd-analyze blame to find the slow services, and dmesg timestamps to find slow kernel probes.
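One common shape for the "defer" and "don't block" strategies above is a dependency declaration in the unit file itself. Below is a hypothetical sketch for the data-logger example: it starts as soon as local storage is mounted and deliberately does not wait for the network (unit name and binary path are illustrative, not from a real system):

```ini
# data-logger.service — hypothetical unit that avoids blocking on the network.
[Unit]
Description=Data logger (starts early, tolerates late network)
# Deliberately no After=network-online.target: waiting for it can add
# seconds of boot time. The application must handle the network
# appearing later.
After=local-fs.target

[Service]
ExecStart=/usr/bin/data-logger
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

The inverse pattern, a unit that waits for network-online.target, is exactly why NetworkManager-wait-online.service so often tops the blame output.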
Concrete Optimizations and Their Impact
| Optimization | Typical Time Saved | Effort |
|---|---|---|
| Disable Bluetooth service | ~0.5 s | Low |
| Disable avahi-daemon (mDNS) | ~1.0 s | Low |
| Remove desktop/GUI packages | ~5-15 s | Medium |
| Use Buildroot instead of stock OS | ~10-25 s | High |
| Kernel: disable unused drivers | ~1-3 s | High |
| Use systemd-analyze blame to find slow services | Varies | Low |
| Stock Pi OS → Tuned Buildroot | ~10-25 s total | High |
The biggest wins come from removing the hundreds of packages and services that a desktop distribution ships by default; that, not any single trick, is what turns a 15–35 s stock boot into a 3–10 s Buildroot boot.
4. Debugging by Stage
When a boot problem occurs, the most important first step is identifying which stage failed. Each stage has its own debugging tools, and reaching for the wrong tool wastes time. If the bootloader never hands off to the kernel, dmesg will show you nothing — because the kernel never ran. If systemd cannot start your service, the bootloader and kernel logs will look perfectly healthy.
Bootloader stage — the hardest to debug because the OS is not yet running. Your primary tool is the serial console; if it shows nothing, the ROM or bootloader failed before it could bring up the UART. Check:
- serial console output
- bootloader environment and image paths
Kernel stage — the kernel logs everything it does during initialization. A missing driver probe, a device tree binding error, or a failed filesystem mount all appear here:
- dmesg
- missing driver probes
- device tree binding errors
User-space stage — once systemd is running, it provides rich diagnostic tools. Most boot delays live here, in slow or failing services:
- systemd-analyze
- journalctl -b
- failed service dependencies
Annotated systemd-analyze Output
$ systemd-analyze time
Startup finished in 1.512s (kernel) + 12.345s (userspace) = 13.857s
graphical.target reached after 12.100s in userspace
This tells you the kernel initialized in 1.5 seconds, but userspace services took 12.3 seconds. Focus optimization effort on userspace.
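If you track boot time across builds, it helps to extract that userspace number programmatically. A minimal sketch, assuming the output format shown above (check your systemd version):

```shell
# Extract the userspace startup time in seconds from `systemd-analyze time`
# output, e.g. for logging boot-time regressions across firmware builds.
userspace_secs() {
  sed -n 's/.* + \([0-9.]*\)s (userspace).*/\1/p'
}

# Typical use on the target:
#   systemd-analyze time | userspace_secs
```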
Annotated dmesg Snippet (First Boot Messages)
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd083] <- kernel starts
[ 0.000000] Machine model: Raspberry Pi 4 Model B <- DT identified board
[ 0.524173] spi-bcm2835 fe204000.spi: chipselect 0 already in use <- SPI probe warning (not fatal)
[ 1.023456] i2c_dev: i2c /dev entries driver <- I2C subsystem ready
[ 1.245678] EXT4-fs (mmcblk0p2): mounted filesystem <- rootfs mounted
Each [timestamp] shows seconds since kernel start. Gaps between timestamps reveal where time is spent.
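Scanning for those gaps by eye gets tedious on a long log. A small sketch that flags any gap above a threshold (the 1-second default is an arbitrary starting point):

```shell
# Print dmesg timestamp gaps larger than a threshold (in seconds).
# Usage: dmesg | dmesg_gaps          (default threshold 1.0 s)
#        dmesg | dmesg_gaps 0.25
dmesg_gaps() {
  awk -v min="${1:-1.0}" -F'[][]' '
    NF >= 3 {
      t = $2 + 0                        # timestamp in seconds
      if (seen && t - prev > min) {
        msg = $3; sub(/^ +/, "", msg)   # trim leading spaces from message
        printf "gap %.3fs before: %s\n", t - prev, msg
      }
      prev = t; seen = 1
    }'
}
```

Lines flagged here tell you which driver or subsystem to investigate first.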
How to Read dmesg Output
When scanning dmesg output, look for these patterns:
| Pattern | Meaning | Action |
|---|---|---|
| Large timestamp gap (e.g., 0.5 s → 3.2 s) | A slow subsystem or driver probe | Investigate what module loaded in that window |
| error or timeout | A driver or subsystem failed | Check wiring, device tree, or module dependencies |
| deferred | A driver postponed initialization | It depends on another driver not yet ready — usually resolves later |
| probe | A driver is binding to a device | Normal — this is how the kernel discovers hardware |
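A quick way to apply these patterns to a saved boot log is a pattern grep; a minimal sketch:

```shell
# Flag suspicious lines (with line numbers) in a saved boot log file.
boot_triage() {
  grep -inE 'error|timeout|deferred' "$1"
}
```

The probe pattern is left out because probing is normal; add it back if you want to watch driver binding order.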
Annotated systemd-analyze blame Output
blame lists services sorted by individual startup duration (slowest first):
$ systemd-analyze blame
8.123s NetworkManager-wait-online.service
3.456s apt-daily.service
2.789s bluetooth.service
1.234s avahi-daemon.service
0.987s ssh.service
The top entries are your primary optimization targets. Note that blame shows wall-clock time per service, not necessarily critical-path impact — a 3-second service that runs in parallel with a 5-second service does not add to total boot time. Run systemd-analyze critical-chain to see which services actually gate the boot target.
Info
The boot target matters: graphical.target (desktop) includes display manager startup and is always slower than multi-user.target (headless server/appliance). If your embedded system has no desktop, switch to multi-user target: sudo systemctl set-default multi-user.target.
Quick Checks (Boot Flow)
- time spent before kernel starts
- kernel probe failures
- slowest services in init phase
- app start dependency chain
5. Generic Boot Stage Model
All platforms follow a staged boot, but the number of stages varies. A unified naming convention helps compare them:
| Stage | Name | Role |
|---|---|---|
| BL0 | Boot ROM | Hardcoded in silicon, loads first external code |
| BL1 | FSBL (First Stage Boot Loader) | Initializes DRAM, clocks, loads next stage |
| BL2 | Secure Firmware | TF-A / TF-M / UEFI Secure Phase |
| BL3 | SSBL (Second Stage Boot Loader) | U-Boot / GRUB / UEFI Boot Manager |
| BL4 | Kernel | Linux / RTOS / bare-metal main() |
| BL5 | Init System | systemd / BusyBox init / none |
| BL6 | Application | Product code |
graph TD
BL0[BL0: Boot ROM<br>Silicon-fixed] --> BL1[BL1: FSBL<br>DRAM + clocks]
BL1 --> BL2[BL2: Secure FW<br>TF-A / TF-M / UEFI]
BL2 --> BL3[BL3: SSBL<br>U-Boot / GRUB]
BL3 --> BL4[BL4: Kernel<br>Linux / RTOS]
BL4 --> BL5[BL5: Init<br>systemd / init]
BL5 --> BL6[BL6: Application<br>Product code]
style BL0 fill:#607D8B,color:#fff
style BL1 fill:#9C27B0,color:#fff
style BL2 fill:#E91E63,color:#fff
style BL3 fill:#FF9800,color:#fff
style BL4 fill:#2196F3,color:#fff
style BL5 fill:#4CAF50,color:#fff
style BL6 fill:#00BCD4,color:#fff
Not every platform uses all stages. An MCU may jump from BL0 directly to BL4 (bare-metal main). The key insight is that more stages exist because more hardware needs initialization and more trust decisions must be made.
6. Platform Comparison Table
| Feature | STM32 (Cortex-M) | STM32MP1 (A7 + M4) | Raspberry Pi 4 (A72) | PC (x86-64) |
|---|---|---|---|---|
| CPU type | Cortex-M4/M7 | Cortex-A7 + Cortex-M4 | Cortex-A72 (quad) | x86-64 |
| MMU | No | Yes (A7), No (M4) | Yes | Yes |
| OS | Bare-metal / RTOS | Linux (A7) + RTOS (M4) | Linux | Linux / Windows |
| Boot ROM | Internal flash boot | ROM + OTP fuses | GPU-based ROM | CPU microcode + UEFI ROM |
| Bootloader | None (direct flash) | TF-A (BL2) + U-Boot (BL3) | start4.elf (GPU) | UEFI + GRUB |
| Filesystem | None (flash directly) | ext4 / squashfs | ext4 | ext4 / NTFS |
| Drivers | HAL / register access | Linux kernel drivers | Linux kernel drivers | Linux kernel drivers |
| Typical boot time | < 100 ms | 5-10 s | 15-35 s | 10-30 s |
| Real-time capable | Yes (inherent) | Yes (M4 core) | With PREEMPT_RT | With PREEMPT_RT |
7. Boot Flow per Platform
STM32 (Cortex-M): ROM to main()
graph TD
A[Power On] -->|~1 ms| B[Boot ROM<br>Check BOOT pins]
B -->|~10 ms| C[Internal Flash<br>Vector table]
C -->|~20 ms| D[SystemInit<br>Clocks + PLL]
D -->|~50 ms| E["main()<br>Application running"]
style A fill:#607D8B,color:#fff
style E fill:#4CAF50,color:#fff
No OS, no filesystem, no bootloader. The CPU fetches the reset vector from flash and runs. Total: under 100 ms.
Raspberry Pi 4: GPU boots the CPU
graph TD
A[Power On] -->|~1 s| B[GPU ROM<br>Loads SPI EEPROM bootloader]
B -->|~1 s| C[start4.elf<br>GPU firmware]
C -->|~0.5 s| D[config.txt<br>+ device tree]
D -->|~2 s| E[Linux Kernel<br>Decompresses + probes]
E -->|~5-20 s| F[systemd<br>Services start]
F -->|~1-5 s| G[Application<br>Ready]
style A fill:#607D8B,color:#fff
style B fill:#9C27B0,color:#fff
style C fill:#FF9800,color:#fff
style E fill:#2196F3,color:#fff
style F fill:#4CAF50,color:#fff
style G fill:#00BCD4,color:#fff
Unique: the GPU boots first and initializes the ARM CPU, and there is no traditional U-Boot stage. On the Pi 4, the ROM loads a second-stage bootloader from on-board SPI EEPROM, which in turn loads the start4.elf GPU firmware from the SD card; the firmware reads config.txt for configuration.
STM32MP1: ARM Trusted Firmware chain
graph TD
A[Power On] -->|~0.5 s| B[Boot ROM<br>OTP fuses + boot pins]
B -->|~1 s| C[TF-A / FSBL<br>DDR init + clocks]
C -->|~0.5 s| D[OP-TEE<br>Secure World]
D -->|~2 s| E[U-Boot / SSBL<br>Load kernel + DT]
E -->|~2 s| F[Linux Kernel]
F -->|~3-5 s| G[systemd<br>Services]
G -->|~1 s| H[Application]
style A fill:#607D8B,color:#fff
style B fill:#9C27B0,color:#fff
style C fill:#E91E63,color:#fff
style D fill:#E91E63,color:#fff
style E fill:#FF9800,color:#fff
style F fill:#2196F3,color:#fff
style G fill:#4CAF50,color:#fff
style H fill:#00BCD4,color:#fff
This is the most complete embedded boot chain: ROM, FSBL (TF-A), secure world (OP-TEE), SSBL (U-Boot), kernel, init, application.
PC (x86): UEFI firmware chain
graph TD
A[Power On] -->|~1 s| B[CPU ROM<br>Microcode + UEFI SEC]
B -->|~2 s| C[UEFI PEI + DXE<br>DRAM + PCIe + USB]
C -->|~1 s| D[GRUB / Boot Manager<br>Select kernel]
D -->|~2 s| E[Linux Kernel]
E -->|~5-15 s| F[systemd<br>Services]
F -->|~2-5 s| G[Desktop / Application]
style A fill:#607D8B,color:#fff
style C fill:#FF9800,color:#fff
style D fill:#FF9800,color:#fff
style E fill:#2196F3,color:#fff
style F fill:#4CAF50,color:#fff
style G fill:#00BCD4,color:#fff
PCs have the most complex hardware enumeration (PCIe, USB, SATA). UEFI replaces the old BIOS and provides Secure Boot via signed bootloaders.
8. Secure Boot Comparison
Secure boot ensures that every piece of software running on the device — from the first bootloader to the application — has been cryptographically verified. Without it, an attacker who gains physical access (or intercepts a firmware update) can replace the bootloader or kernel with malicious code, and the device will happily execute it. The strength of secure boot depends on having a hardware root of trust — a key stored in one-time-programmable fuses or a TPM that cannot be modified after manufacturing.
The chain of trust works by having each boot stage cryptographically verify the next stage before handing over execution. The first link is anchored in that immutable hardware root, and if any verification fails, the boot halts; an attacker cannot insert malicious code at any point without breaking the chain. Each platform implements this differently, reflecting its target market and security requirements.
| Feature | STM32 (Cortex-M) | STM32MP1 (A7 + M4) | Raspberry Pi 4 | PC (x86) |
|---|---|---|---|---|
| Root of trust | RDP + option bytes | OTP fuses (ROM) | Limited (no HW root) | UEFI Secure Boot + TPM |
| Signed bootloader | Optional (custom) | TF-A signed (BL2) | Not enforced | UEFI + shim |
| Trusted firmware | TF-M (optional) | TF-A + OP-TEE | None standard | UEFI + measured boot |
| Chain of trust | Flash lock only | ROM -> TF-A -> U-Boot -> Kernel | Partial (config.txt) | ROM -> UEFI -> GRUB -> Kernel |
| Secure storage | Option bytes / RDP | OTP + OP-TEE secure storage | None hardware-backed | TPM 2.0 |
STM32MP1 Chain of Trust
graph LR
A[OTP Fuses<br>Root of Trust] -->|Verifies| B[TF-A<br>FSBL signed]
B -->|Verifies| C[OP-TEE<br>Secure World]
B -->|Verifies| D[U-Boot<br>SSBL signed]
D -->|Verifies| E[Kernel<br>FIT image signed]
E -->|Verifies| F[Root FS<br>dm-verity]
style A fill:#E91E63,color:#fff
style B fill:#E91E63,color:#fff
style C fill:#9C27B0,color:#fff
style F fill:#4CAF50,color:#fff
Each stage verifies the next before handing over execution. If any verification fails, the boot halts. This is the meaning of "chain of trust."
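That verify-then-execute step is conceptually just a signature check over the next stage's image. A toy sketch with openssl — real chains use dedicated formats such as signed FIT images or UEFI Authenticode, and the trusted public key would be hash-locked in OTP fuses; the file names here are placeholders:

```shell
# Toy model of one chain-of-trust link: verify the next stage's image
# against a trusted public key before executing it; halt on failure.
verify_next_stage() {
  image=$1 sig=$2 pubkey=$3
  if openssl dgst -sha256 -verify "$pubkey" -signature "$sig" "$image" >/dev/null 2>&1
  then
    echo "verified: $image"
  else
    echo "HALT: $image failed verification" >&2
    return 1
  fi
}
```

The important property is the else branch: a failed check stops the boot instead of falling back to unverified code.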
9. FSBL / SSBL / TF-A / TF-M / OP-TEE Naming
If the boot stage names feel confusing, you are not alone. The industry uses several names for the same concept: a first-stage bootloader is called "FSBL" in some contexts, "BL2" in Arm's terminology, "SPL" in U-Boot's terminology, and "TF-A" when it is specifically the Arm Trusted Firmware implementation. Worse, the same label can shift meaning between documents; "BL2" is a secure firmware stage in Arm's TF-A numbering, but some vendor docs use "BL2" for the second-stage bootloader (U-Boot), so always check which convention a document follows. The table below maps the generic boot levels to the vendor-specific names for each platform covered in this course, so you can translate between sources:
| Generic Level | STM32 (Cortex-M) | STM32MP1 | Raspberry Pi 4 | PC (x86) |
|---|---|---|---|---|
| BL0: Boot ROM | Internal ROM | Boot ROM + OTP | GPU ROM | CPU microcode |
| BL1: FSBL | N/A (direct flash) | TF-A (BL2) | start4.elf | UEFI PEI |
| BL2: Secure FW | TF-M (optional) | OP-TEE (BL32) | N/A | UEFI DXE |
| BL3: SSBL | N/A | U-Boot (BL33) | N/A | GRUB |
| BL4: Kernel | main() | Linux / FreeRTOS | Linux | Linux |
Key rules to remember:
- Cortex-M secure firmware is called TF-M (Trusted Firmware for M-profile)
- Cortex-A secure firmware is called TF-A (Trusted Firmware for A-profile)
- The Linux secure world runtime on Cortex-A is OP-TEE (Open Portable Trusted Execution Environment)
- "FSBL" and "SSBL" are generic terms; the vendor-specific name depends on the platform
10. Heterogeneous SoCs
The STM32MP1 is a "bridge" architecture: it combines a Linux-capable Cortex-A7 with a real-time Cortex-M4 on the same die. This matters for architecture decisions:
| Concern | Cortex-A7 (Linux) | Cortex-M4 (RTOS) |
|---|---|---|
| Role | Networking, UI, storage, logging | Motor control, sensor sampling, safety |
| OS | Linux + systemd | FreeRTOS / bare-metal |
| Boot time | 5-10 seconds | < 50 ms |
| Latency | ms-level (non-deterministic) | us-level (deterministic) |
| Communication | RPMsg / shared memory | RPMsg / shared memory |
This split lets you run a rich OS for connectivity and management while keeping hard real-time guarantees on the M4 core. The alternative -- running everything on Linux with PREEMPT_RT -- works for soft real-time but cannot match the determinism of a dedicated MCU core.
Design guideline: Put safety-critical and time-critical loops on the M4. Put networking, storage, UI, and OTA updates on the A7. Define the interface between them (shared memory + RPMsg) early in the project.
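That shared-memory interface ultimately shows up as a reserved region in the Linux device tree. An illustrative fragment — the node name, address, and size are assumptions modeled on ST's published examples, not values to copy into a real board file:

```dts
/* Carve out DDR for the RPMsg buffers shared with the M4 core. */
reserved-memory {
    #address-cells = <1>;
    #size-cells = <1>;
    ranges;

    vdev0buffer: vdev0buffer@10042000 {
        compatible = "shared-dma-pool";
        reg = <0x10042000 0x4000>;   /* 16 KiB shared buffer pool */
        no-map;
    };
};
```

Defining this region early forces the team to agree on the inter-core message layout before either side's firmware hardens around it.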
Quick Checks (Boot Architectures)
- Can you identify all boot stages between power-on and your application on your target platform?
- Which stage would you instrument first to debug a slow boot?
- Does your platform support a hardware root of trust, and is it enabled?
- If you need sub-100 ms boot, which platform class is the only viable option?
Solution Sketch: Smart Greenhouse Controller (for Mini Exercise 2 below)
Cortex-M4 (FreeRTOS):
- Soil moisture + temperature sampling every 100 ms (hard RT)
- Fan/valve actuator control loop
- Boots from internal flash in < 50 ms → sensor loop active within 200 ms
- Measurable requirement: worst-case sensor read jitter < 1 ms
Cortex-A7 (Linux):
- 7-inch LCD status display (DRM/KMS, single fullscreen app)
- Web dashboard server (lighttpd or Flask)
- WiFi firmware update client (OTA via HTTPS)
- Boots in 5–8 s; sensor data already flowing from M4 before Linux is ready
- Measurable requirement: OTA update completes in < 60 s over WiFi
Communication: RPMsg over shared memory — M4 publishes sensor readings at 10 Hz, A7 subscribes for display and logging. Shared memory region defined in device tree.
Boot sequence:
- M4: ROM → internal flash → FreeRTOS → sensor loop (< 200 ms)
- A7: ROM → TF-A → U-Boot → Linux → systemd → dashboard app (~6 s)
Mini Exercise 1: Boot Log Analysis
Here is a partial boot log. Label each line by stage (bootloader / kernel / init / app) and suggest one measurable optimization:
[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.821432] i2c_dev: i2c /dev entries driver
[ 1.245678] EXT4-fs (mmcblk0p2): mounted filesystem
[ 12.345678] systemd[1]: Started Avahi mDNS/DNS-SD Stack
[ 13.456789] systemd[1]: Started data-logger.service
[ 15.000000] data-logger: first measurement recorded
Answer
- [0.000000] — Kernel stage (kernel starts booting)
- [0.821432] — Kernel stage (I2C driver initialization)
- [1.245678] — Kernel stage (root filesystem mounted — kernel is nearly done)
- [12.345678] — Init stage (systemd starting services — note the 11-second gap from the kernel stage)
- [13.456789] — Init stage (your application service started by systemd)
- [15.000000] — App stage (application code running, first output)
Optimization suggestion: Avahi (mDNS) started at 12.3 s and is likely on the critical path. If the data logger does not need network discovery, disabling avahi-daemon.service would save ~1 second and simplify the boot dependency chain. Measurable target: reduce time-to-first-measurement from 15 s to under 14 s.
Mini Exercise 2: Smart Greenhouse Design
You are designing a smart greenhouse controller that must:
- Read soil moisture and temperature every 100 ms (hard real-time)
- Display status on a 7-inch LCD with a web dashboard
- Receive firmware updates over Wi-Fi
- Boot the sensor loop within 200 ms of power-on
Design the system using a heterogeneous SoC (e.g., STM32MP1). Specify:
- Which tasks run on the Cortex-A core and which on the Cortex-M core
- The communication mechanism between the two cores
- The boot sequence for each core (list the stages)
- One measurable requirement per core with a specific target number
Key Takeaways
- Boot is a pipeline, not a single step.
- Each stage has different failure modes and debugging tools.
- Embedded systems often optimize boot time and reliability by removing unnecessary work and parallelizing what remains.
- Boot architecture is determined by hardware capabilities (MMU, secure elements, core count), not just software choices.
- More boot stages exist to initialize more complex hardware and enforce more trust decisions.
- Heterogeneous SoCs let you combine Linux flexibility with MCU-level real-time guarantees -- but require explicit architectural decisions about task placement and inter-core communication.
Hands-On
Try this in practice: Tutorial: Buildroot — build a minimal image and measure boot time improvement. Use systemd-analyze on Linux targets and a logic analyzer or serial timestamps on MCU targets to identify which boot stage dominates.