Boot Flow and Architectures
Goal: Understand how an embedded Linux system goes from power-on to a running application, how boot flow differs across platforms, and where to optimize and debug each stage.
Related Tutorials
For hands-on practice, see: Boot Timing Lab | Exploring Linux | Buildroot
Your device boots in 35 seconds and occasionally hangs before the app starts. You need to answer two questions:
- which stage is slow?
- which stage failed?
Without a staged boot model, this is mostly guesswork.
Meanwhile, an STM32 runs main() in 50 milliseconds. A colleague's STM32MP1 takes 8 seconds but runs Linux with full networking.
Why does each platform boot so differently? The answer is not just "more software." It is a fundamentally different boot architecture driven by the presence (or absence) of an MMU, a filesystem, and a trust chain.
1. Boot Stages (High-Level)
An embedded Linux system does not start all at once — it boots in stages, each handing off to the next. Understanding these stages is essential because each one has its own failure modes, its own debugging tools, and its own optimization opportunities. When a device "hangs during boot," the first question is always: which stage?
- ROM/SoC boot code — hardwired in silicon, runs first, finds the bootloader on storage
- Bootloader (often U-Boot) — initializes DRAM, clocks, and storage, then loads the kernel into memory
- Linux kernel — probes hardware via device tree, loads drivers, mounts the root filesystem
- Init system (systemd) — starts services in dependency order, manages logging and supervision
- User application — your product code begins running
Each stage has different logs, tools, and failure modes. A bootloader failure gives you nothing but a serial console (or silence). A kernel failure gives you dmesg. A systemd failure gives you journalctl. Knowing which tool to reach for depends on knowing which stage you are in.
graph LR
A[Power On] -->|~0.5s| B[ROM Boot]
B -->|~1s| C[Bootloader<br>U-Boot]
C -->|~2s| D[Kernel<br>Init]
D -->|~3-10s| E[systemd<br>Services]
E -->|~1-5s| F[Application<br>Ready]
style A fill:#607D8B,color:#fff
style B fill:#9C27B0,color:#fff
style C fill:#FF9800,color:#fff
style D fill:#2196F3,color:#fff
style E fill:#4CAF50,color:#fff
style F fill:#00BCD4,color:#fff
Approximate timing for Raspberry Pi 4 with stock OS. Total: ~15-35 seconds. A minimal Buildroot image can boot in 3-10 seconds by removing unnecessary services.
2. Where Hardware Is Initialized
Hardware initialization is spread across three stages, and each stage has a specific responsibility. Putting initialization work in the wrong stage is a common source of startup delays and fragile behavior. For example, if a kernel driver busy-waits for a sensor that takes 2 seconds to power up, every boot pays that 2-second penalty — even though the sensor could have been initialized asynchronously while other services start.
- Bootloader: clocks, DRAM, storage basics — only what is needed to load the kernel
- Kernel: drivers, device tree probing, module init — hardware abstraction for user space
- User space: service config, app-level hardware policy — what to do with the hardware
The principle is: each stage initializes only what the next stage needs. The bootloader does not configure I2C sensors (that is the kernel's job). The kernel does not decide which sensor readings to log (that is the application's job). When these responsibilities leak across layers, boot time suffers and debugging becomes harder because the failure could be in any stage.
3. Boot Time Optimization Strategy
Boot time optimization is one of the most impactful engineering tasks in embedded Linux. A stock Raspberry Pi OS boots in 15–35 seconds; a tuned Buildroot image can boot in 3–10 seconds. The difference is not magic — it is the result of removing unnecessary work and parallelizing what remains. The key principle is to measure first, then optimize. Engineers who skip measurement often spend days optimizing the wrong stage.
The main strategies are:
- Remove unused services — a disabled service does no work at boot, though only services on the critical path shorten total boot time
- Parallelize non-dependent services — systemd can start independent services simultaneously
- Defer non-critical initialization — start the application first, initialize optional hardware later
- Avoid blocking app start on optional devices — if the display driver takes 3 seconds to probe, do not hold up the data logger
Optimize with measurements, not assumptions. Use systemd-analyze blame to find the slow services, and dmesg timestamps to find slow kernel probes.
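One common shape for the "defer" and "don't block" strategies above is a dependency declaration in the unit file itself. Below is a hypothetical sketch for the data-logger example: it starts as soon as local storage is mounted and deliberately does not wait for the network (unit name and binary path are illustrative, not from a real system):

```ini
# data-logger.service — hypothetical unit that avoids blocking on the network.
[Unit]
Description=Data logger (starts early, tolerates late network)
# Deliberately no After=network-online.target: waiting for it can add
# seconds of boot time. The application must handle the network
# appearing later.
After=local-fs.target

[Service]
ExecStart=/usr/bin/data-logger
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

The inverse pattern, a unit that waits for network-online.target, is exactly why NetworkManager-wait-online.service so often tops the blame output.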
Concrete Optimizations and Their Impact
| Optimization | Typical Time Saved | Effort |
|---|---|---|
| Disable Bluetooth service | ~0.5 s | Low |
| Disable avahi-daemon (mDNS) | ~1.0 s | Low |
| Remove desktop/GUI packages | ~5-15 s | Medium |
| Use Buildroot instead of stock OS | ~10-25 s | High |
| Kernel: disable unused drivers | ~1-3 s | High |
| Use systemd-analyze blame to find slow services | Varies | Low |
| Stock Pi OS → Tuned Buildroot | ~10-25 s total | High |
The biggest wins come from removing the hundreds of packages and services that a desktop distribution ships by default; that, not any single trick, is what turns a 15–35 s stock boot into a 3–10 s Buildroot boot.
4. Debugging by Stage
When a boot problem occurs, the most important first step is identifying which stage failed. Each stage has its own debugging tools, and reaching for the wrong tool wastes time. If the bootloader never hands off to the kernel, dmesg will show you nothing — because the kernel never ran. If systemd cannot start your service, the bootloader and kernel logs will look perfectly healthy.
Bootloader stage — the hardest to debug because the OS is not yet running. Your primary tool is the serial console; if it shows nothing, the ROM or bootloader failed before it could bring up the UART. Check:
- serial console output
- bootloader environment and image paths
Kernel stage — the kernel logs everything it does during initialization. A missing driver probe, a device tree binding error, or a failed filesystem mount all appear here:
- dmesg
- missing driver probes
- device tree binding errors
User-space stage — once systemd is running, it provides rich diagnostic tools. Most boot delays live here, in slow or failing services:
- systemd-analyze
- journalctl -b
- failed service dependencies
Annotated systemd-analyze Output
$ systemd-analyze time
Startup finished in 1.512s (kernel) + 12.345s (userspace) = 13.857s
graphical.target reached after 12.100s in userspace
This tells you the kernel initialized in 1.5 seconds, but userspace services took 12.3 seconds. Focus optimization effort on userspace.
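If you track boot time across builds, it helps to extract that userspace number programmatically. A minimal sketch, assuming the output format shown above (check your systemd version):

```shell
# Extract the userspace startup time in seconds from `systemd-analyze time`
# output, e.g. for logging boot-time regressions across firmware builds.
userspace_secs() {
  sed -n 's/.* + \([0-9.]*\)s (userspace).*/\1/p'
}

# Typical use on the target:
#   systemd-analyze time | userspace_secs
```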
Annotated dmesg Snippet (First Boot Messages)
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd083] <- kernel starts
[ 0.000000] Machine model: Raspberry Pi 4 Model B <- DT identified board
[ 0.524173] spi-bcm2835 fe204000.spi: chipselect 0 already in use <- SPI probe warning (not fatal)
[ 1.023456] i2c_dev: i2c /dev entries driver <- I2C subsystem ready
[ 1.245678] EXT4-fs (mmcblk0p2): mounted filesystem <- rootfs mounted
Each [timestamp] shows seconds since kernel start. Gaps between timestamps reveal where time is spent.
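Scanning for those gaps by eye gets tedious on a long log. A small sketch that flags any gap above a threshold (the 1-second default is an arbitrary starting point):

```shell
# Print dmesg timestamp gaps larger than a threshold (in seconds).
# Usage: dmesg | dmesg_gaps          (default threshold 1.0 s)
#        dmesg | dmesg_gaps 0.25
dmesg_gaps() {
  awk -v min="${1:-1.0}" -F'[][]' '
    NF >= 3 {
      t = $2 + 0                        # timestamp in seconds
      if (seen && t - prev > min) {
        msg = $3; sub(/^ +/, "", msg)   # trim leading spaces from message
        printf "gap %.3fs before: %s\n", t - prev, msg
      }
      prev = t; seen = 1
    }'
}
```

Lines flagged here tell you which driver or subsystem to investigate first.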
How to Read dmesg Output
When scanning dmesg output, look for these patterns:
| Pattern | Meaning | Action |
|---|---|---|
| Large timestamp gap (e.g., 0.5 s → 3.2 s) | A slow subsystem or driver probe | Investigate what module loaded in that window |
| error or timeout | A driver or subsystem failed | Check wiring, device tree, or module dependencies |
| deferred | A driver postponed initialization | It depends on another driver not yet ready — usually resolves later |
| probe | A driver is binding to a device | Normal — this is how the kernel discovers hardware |
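A quick way to apply these patterns to a saved boot log is a pattern grep; a minimal sketch:

```shell
# Flag suspicious lines (with line numbers) in a saved boot log file.
boot_triage() {
  grep -inE 'error|timeout|deferred' "$1"
}
```

The probe pattern is left out because probing is normal; add it back if you want to watch driver binding order.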
Annotated systemd-analyze blame Output
blame lists services sorted by individual startup duration (slowest first):
$ systemd-analyze blame
8.123s NetworkManager-wait-online.service
3.456s apt-daily.service
2.789s bluetooth.service
1.234s avahi-daemon.service
0.987s ssh.service
The top entries are your primary optimization targets. Note that blame shows wall-clock time per service, not necessarily critical-path impact — a 3-second service that runs in parallel with a 5-second service does not add to total boot time. Run systemd-analyze critical-chain to see which services actually gate the boot target.
Info
The boot target matters: graphical.target (desktop) includes display manager startup and is always slower than multi-user.target (headless server/appliance). If your embedded system has no desktop, switch to multi-user target: sudo systemctl set-default multi-user.target.
Quick Checks (Boot Flow)
- time spent before kernel starts
- kernel probe failures
- slowest services in init phase
- app start dependency chain
5. Generic Boot Stage Model
All platforms follow a staged boot, but the number of stages varies. A unified naming convention helps compare them:
| Stage | Name | Role |
|---|---|---|
| BL0 | Boot ROM | Hardcoded in silicon, loads first external code |
| BL1 | FSBL (First Stage Boot Loader) | Initializes DRAM, clocks, loads next stage |
| BL2 | Secure Firmware | TF-A / TF-M / UEFI Secure Phase |
| BL3 | SSBL (Second Stage Boot Loader) | U-Boot / GRUB / UEFI Boot Manager |
| BL4 | Kernel | Linux / RTOS / bare-metal main() |
| BL5 | Init System | systemd / BusyBox init / none |
| BL6 | Application | Product code |
graph TD
BL0[BL0: Boot ROM<br>Silicon-fixed] --> BL1[BL1: FSBL<br>DRAM + clocks]
BL1 --> BL2[BL2: Secure FW<br>TF-A / TF-M / UEFI]
BL2 --> BL3[BL3: SSBL<br>U-Boot / GRUB]
BL3 --> BL4[BL4: Kernel<br>Linux / RTOS]
BL4 --> BL5[BL5: Init<br>systemd / init]
BL5 --> BL6[BL6: Application<br>Product code]
style BL0 fill:#607D8B,color:#fff
style BL1 fill:#9C27B0,color:#fff
style BL2 fill:#E91E63,color:#fff
style BL3 fill:#FF9800,color:#fff
style BL4 fill:#2196F3,color:#fff
style BL5 fill:#4CAF50,color:#fff
style BL6 fill:#00BCD4,color:#fff
Not every platform uses all stages. An MCU may jump from BL0 directly to BL4 (bare-metal main). The key insight is that more stages exist because more hardware needs initialization and more trust decisions must be made.
6. Platform Comparison Table
| Feature | STM32 (Cortex-M) | STM32MP1 (A7 + M4) | Raspberry Pi 4 (A72) | PC (x86-64) |
|---|---|---|---|---|
| CPU type | Cortex-M4/M7 | Cortex-A7 + Cortex-M4 | Cortex-A72 (quad) | x86-64 |
| MMU | No | Yes (A7), No (M4) | Yes | Yes |
| OS | Bare-metal / RTOS | Linux (A7) + RTOS (M4) | Linux | Linux / Windows |
| Boot ROM | Internal flash boot | ROM + OTP fuses | GPU-based ROM | CPU microcode + UEFI ROM |
| Bootloader | None (direct flash) | TF-A (BL2) + U-Boot (BL3) | start4.elf (GPU) | UEFI + GRUB |
| Filesystem | None (flash directly) | ext4 / squashfs | ext4 | ext4 / NTFS |
| Drivers | HAL / register access | Linux kernel drivers | Linux kernel drivers | Linux kernel drivers |
| Typical boot time | < 100 ms | 5-10 s | 15-35 s | 10-30 s |
| Real-time capable | Yes (inherent) | Yes (M4 core) | With PREEMPT_RT | With PREEMPT_RT |
7. Boot Flow per Platform
STM32 (Cortex-M): ROM to main()
graph TD
A[Power On] -->|~1 ms| B[Boot ROM<br>Check BOOT pins]
B -->|~10 ms| C[Internal Flash<br>Vector table]
C -->|~20 ms| D[SystemInit<br>Clocks + PLL]
D -->|~50 ms| E["main()<br>Application running"]
style A fill:#607D8B,color:#fff
style E fill:#4CAF50,color:#fff
No OS, no filesystem, no bootloader. The CPU fetches the reset vector from flash and runs. Total: under 100 ms.
Raspberry Pi 4: GPU boots the CPU
graph TD
A[Power On] -->|~1 s| B[GPU ROM<br>Loads SPI EEPROM bootloader]
B -->|~1 s| C[start4.elf<br>GPU firmware]
C -->|~0.5 s| D[config.txt<br>+ device tree]
D -->|~2 s| E[Linux Kernel<br>Decompresses + probes]
E -->|~5-20 s| F[systemd<br>Services start]
F -->|~1-5 s| G[Application<br>Ready]
style A fill:#607D8B,color:#fff
style B fill:#9C27B0,color:#fff
style C fill:#FF9800,color:#fff
style E fill:#2196F3,color:#fff
style F fill:#4CAF50,color:#fff
style G fill:#00BCD4,color:#fff
Unique: the GPU boots first and initializes the ARM CPU, and there is no traditional U-Boot stage. On the Pi 4, the ROM loads a second-stage bootloader from on-board SPI EEPROM, which in turn loads the start4.elf GPU firmware from the SD card; the firmware reads config.txt for configuration.
STM32MP1: ARM Trusted Firmware chain
graph TD
A[Power On] -->|~0.5 s| B[Boot ROM<br>OTP fuses + boot pins]
B -->|~1 s| C[TF-A / FSBL<br>DDR init + clocks]
C -->|~0.5 s| D[OP-TEE<br>Secure World]
D -->|~2 s| E[U-Boot / SSBL<br>Load kernel + DT]
E -->|~2 s| F[Linux Kernel]
F -->|~3-5 s| G[systemd<br>Services]
G -->|~1 s| H[Application]
style A fill:#607D8B,color:#fff
style B fill:#9C27B0,color:#fff
style C fill:#E91E63,color:#fff
style D fill:#E91E63,color:#fff
style E fill:#FF9800,color:#fff
style F fill:#2196F3,color:#fff
style G fill:#4CAF50,color:#fff
style H fill:#00BCD4,color:#fff
This is the most complete embedded boot chain: ROM, FSBL (TF-A), secure world (OP-TEE), SSBL (U-Boot), kernel, init, application.
PC (x86): UEFI firmware chain
graph TD
A[Power On] -->|~1 s| B[CPU ROM<br>Microcode + UEFI SEC]
B -->|~2 s| C[UEFI PEI + DXE<br>DRAM + PCIe + USB]
C -->|~1 s| D[GRUB / Boot Manager<br>Select kernel]
D -->|~2 s| E[Linux Kernel]
E -->|~5-15 s| F[systemd<br>Services]
F -->|~2-5 s| G[Desktop / Application]
style A fill:#607D8B,color:#fff
style C fill:#FF9800,color:#fff
style D fill:#FF9800,color:#fff
style E fill:#2196F3,color:#fff
style F fill:#4CAF50,color:#fff
style G fill:#00BCD4,color:#fff
PCs have the most complex hardware enumeration (PCIe, USB, SATA). UEFI replaces the old BIOS and provides Secure Boot via signed bootloaders.
8. Secure Boot Comparison
Secure boot ensures that every piece of software running on the device — from the first bootloader to the application — has been cryptographically verified. Without it, an attacker who gains physical access (or intercepts a firmware update) can replace the bootloader or kernel with malicious code, and the device will happily execute it. The strength of secure boot depends on having a hardware root of trust — a key stored in one-time-programmable fuses or a TPM that cannot be modified after manufacturing.
The chain of trust works by having each boot stage cryptographically verify the next stage before handing over execution. The first link is anchored in that immutable hardware root, and if any verification fails, the boot halts; an attacker cannot insert malicious code at any point without breaking the chain. Each platform implements this differently, reflecting its target market and security requirements.
| Feature | STM32 (Cortex-M) | STM32MP1 (A7 + M4) | Raspberry Pi 4 | PC (x86) |
|---|---|---|---|---|
| Root of trust | RDP + option bytes | OTP fuses (ROM) | Limited (no HW root) | UEFI Secure Boot + TPM |
| Signed bootloader | Optional (custom) | TF-A signed (BL2) | Not enforced | UEFI + shim |
| Trusted firmware | TF-M (optional) | TF-A + OP-TEE | None standard | UEFI + measured boot |
| Chain of trust | Flash lock only | ROM -> TF-A -> U-Boot -> Kernel | Partial (config.txt) | ROM -> UEFI -> GRUB -> Kernel |
| Secure storage | Option bytes / RDP | OTP + OP-TEE secure storage | None hardware-backed | TPM 2.0 |
STM32MP1 Chain of Trust
graph LR
A[OTP Fuses<br>Root of Trust] -->|Verifies| B[TF-A<br>FSBL signed]
B -->|Verifies| C[OP-TEE<br>Secure World]
B -->|Verifies| D[U-Boot<br>SSBL signed]
D -->|Verifies| E[Kernel<br>FIT image signed]
E -->|Verifies| F[Root FS<br>dm-verity]
style A fill:#E91E63,color:#fff
style B fill:#E91E63,color:#fff
style C fill:#9C27B0,color:#fff
style F fill:#4CAF50,color:#fff
Each stage verifies the next before handing over execution. If any verification fails, the boot halts. This is the meaning of "chain of trust."
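That verify-then-execute step is conceptually just a signature check over the next stage's image. A toy sketch with openssl — real chains use dedicated formats such as signed FIT images or UEFI Authenticode, and the trusted public key would be hash-locked in OTP fuses; the file names here are placeholders:

```shell
# Toy model of one chain-of-trust link: verify the next stage's image
# against a trusted public key before executing it; halt on failure.
verify_next_stage() {
  image=$1 sig=$2 pubkey=$3
  if openssl dgst -sha256 -verify "$pubkey" -signature "$sig" "$image" >/dev/null 2>&1
  then
    echo "verified: $image"
  else
    echo "HALT: $image failed verification" >&2
    return 1
  fi
}
```

The important property is the else branch: a failed check stops the boot instead of falling back to unverified code.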
9. FSBL / SSBL / TF-A / TF-M / OP-TEE Naming
If the boot stage names feel confusing, you are not alone. The industry uses several names for the same concept: a first-stage bootloader is called "FSBL" in some contexts, "BL2" in Arm's terminology, "SPL" in U-Boot's terminology, and "TF-A" when it is specifically the Arm Trusted Firmware implementation. Worse, the same label can shift meaning between documents; "BL2" is a secure firmware stage in Arm's TF-A numbering, but some vendor docs use "BL2" for the second-stage bootloader (U-Boot), so always check which convention a document follows. The table below maps the generic boot levels to the vendor-specific names for each platform covered in this course, so you can translate between sources:
| Generic Level | STM32 (Cortex-M) | STM32MP1 | Raspberry Pi 4 | PC (x86) |
|---|---|---|---|---|
| BL0: Boot ROM | Internal ROM | Boot ROM + OTP | GPU ROM | CPU microcode |
| BL1: FSBL | N/A (direct flash) | TF-A (BL2) | start4.elf | UEFI PEI |
| BL2: Secure FW | TF-M (optional) | OP-TEE (BL32) | N/A | UEFI DXE |
| BL3: SSBL | N/A | U-Boot (BL33) | N/A | GRUB |
| BL4: Kernel | main() | Linux / FreeRTOS | Linux | Linux |
Key rules to remember:
- Cortex-M secure firmware is called TF-M (Trusted Firmware for M-profile)
- Cortex-A secure firmware is called TF-A (Trusted Firmware for A-profile)
- The Linux secure world runtime on Cortex-A is OP-TEE (Open Portable Trusted Execution Environment)
- "FSBL" and "SSBL" are generic terms; the vendor-specific name depends on the platform
10. Heterogeneous SoCs
The STM32MP1 is a "bridge" architecture: it combines a Linux-capable Cortex-A7 with a real-time Cortex-M4 on the same die. This matters for architecture decisions:
| Concern | Cortex-A7 (Linux) | Cortex-M4 (RTOS) |
|---|---|---|
| Role | Networking, UI, storage, logging | Motor control, sensor sampling, safety |
| OS | Linux + systemd | FreeRTOS / bare-metal |
| Boot time | 5-10 seconds | < 50 ms |
| Latency | ms-level (non-deterministic) | us-level (deterministic) |
| Communication | RPMsg / shared memory | RPMsg / shared memory |
This split lets you run a rich OS for connectivity and management while keeping hard real-time guarantees on the M4 core. The alternative -- running everything on Linux with PREEMPT_RT -- works for soft real-time but cannot match the determinism of a dedicated MCU core.
Design guideline: Put safety-critical and time-critical loops on the M4. Put networking, storage, UI, and OTA updates on the A7. Define the interface between them (shared memory + RPMsg) early in the project.
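That shared-memory interface ultimately shows up as a reserved region in the Linux device tree. An illustrative fragment — the node name, address, and size are assumptions modeled on ST's published examples, not values to copy into a real board file:

```dts
/* Carve out DDR for the RPMsg buffers shared with the M4 core. */
reserved-memory {
    #address-cells = <1>;
    #size-cells = <1>;
    ranges;

    vdev0buffer: vdev0buffer@10042000 {
        compatible = "shared-dma-pool";
        reg = <0x10042000 0x4000>;   /* 16 KiB shared buffer pool */
        no-map;
    };
};
```

Defining this region early forces the team to agree on the inter-core message layout before either side's firmware hardens around it.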
Quick Checks (Boot Architectures)
- Can you identify all boot stages between power-on and your application on your target platform?
- Which stage would you instrument first to debug a slow boot?
- Does your platform support a hardware root of trust, and is it enabled?
- If you need sub-100 ms boot, which platform class is the only viable option?
Solution Sketch: Smart Greenhouse Controller (for Mini Exercise 2 below)
Cortex-M4 (FreeRTOS):
- Soil moisture + temperature sampling every 100 ms (hard RT)
- Fan/valve actuator control loop
- Boots from internal flash in < 50 ms → sensor loop active within 200 ms
- Measurable requirement: worst-case sensor read jitter < 1 ms
Cortex-A7 (Linux):
- 7-inch LCD status display (DRM/KMS, single fullscreen app)
- Web dashboard server (lighttpd or Flask)
- WiFi firmware update client (OTA via HTTPS)
- Boots in 5–8 s; sensor data already flowing from M4 before Linux is ready
- Measurable requirement: OTA update completes in < 60 s over WiFi
Communication: RPMsg over shared memory — M4 publishes sensor readings at 10 Hz, A7 subscribes for display and logging. Shared memory region defined in device tree.
Boot sequence:
- M4: ROM → internal flash → FreeRTOS → sensor loop (< 200 ms)
- A7: ROM → TF-A → U-Boot → Linux → systemd → dashboard app (~6 s)
Mini Exercise 1: Boot Log Analysis
Here is a partial boot log. Label each line by stage (bootloader / kernel / init / app) and suggest one measurable optimization:
[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.821432] i2c_dev: i2c /dev entries driver
[ 1.245678] EXT4-fs (mmcblk0p2): mounted filesystem
[ 12.345678] systemd[1]: Started Avahi mDNS/DNS-SD Stack
[ 13.456789] systemd[1]: Started data-logger.service
[ 15.000000] data-logger: first measurement recorded
Answer
- [0.000000] — Kernel stage (kernel starts booting)
- [0.821432] — Kernel stage (I2C driver initialization)
- [1.245678] — Kernel stage (root filesystem mounted — kernel is nearly done)
- [12.345678] — Init stage (systemd starting services — note the 11-second gap from the kernel stage)
- [13.456789] — Init stage (your application service started by systemd)
- [15.000000] — App stage (application code running, first output)
Optimization suggestion: Avahi (mDNS) started at 12.3 s and is likely on the critical path. If the data logger does not need network discovery, disabling avahi-daemon.service would save ~1 second and simplify the boot dependency chain. Measurable target: reduce time-to-first-measurement from 15 s to under 14 s.
Mini Exercise 2: Smart Greenhouse Design
You are designing a smart greenhouse controller that must:
- Read soil moisture and temperature every 100 ms (hard real-time)
- Display status on a 7-inch LCD with a web dashboard
- Receive firmware updates over Wi-Fi
- Boot the sensor loop within 200 ms of power-on
Design the system using a heterogeneous SoC (e.g., STM32MP1). Specify:
- Which tasks run on the Cortex-A core and which on the Cortex-M core
- The communication mechanism between the two cores
- The boot sequence for each core (list the stages)
- One measurable requirement per core with a specific target number
Key Takeaways
- Boot is a pipeline, not a single step.
- Each stage has different failure modes and debugging tools.
- Embedded systems often optimize boot time and reliability by removing unnecessary work and parallelizing what remains.
- Boot architecture is determined by hardware capabilities (MMU, secure elements, core count), not just software choices.
- More boot stages exist to initialize more complex hardware and enforce more trust decisions.
- Heterogeneous SoCs let you combine Linux flexibility with MCU-level real-time guarantees -- but require explicit architectural decisions about task placement and inter-core communication.
Hands-On
Try this in practice: Tutorial: Buildroot — build a minimal image and measure boot time improvement. Use systemd-analyze on Linux targets and a logic analyzer or serial timestamps on MCU targets to identify which boot stage dominates.