Reliability, Updates, and Watchdogs
Goal: Understand why embedded Linux systems require special practices for reliability and long‑term maintenance.
For Hands-On Practice
Related tutorials: Data Logger Appliance
A device that works in the lab fails in the field after sudden power cuts and partial updates. The SD card filesystem is corrupted, the half-written firmware image does not boot, and the watchdog was never enabled, so the device just sits there — frozen, unrecoverable, and 200 kilometers from the nearest technician.
This is not a rare scenario. It is the default outcome for any embedded device that ships without explicit reliability engineering. Lab conditions — stable power, controlled reboots, developer access — mask the problems that field conditions expose. Reliability must be part of the architecture from the start, not a post-fix applied after the first batch of field returns.
1. Read‑Only Root Filesystems
Embedded devices often use a read‑only rootfs to:
- Prevent filesystem corruption from unclean shutdowns
- Survive power loss without manual repair
How Overlayfs Works
```mermaid
graph TB
    subgraph "What the system sees (merged view)"
        M[Merged Filesystem<br>/]
    end
    subgraph "Layers"
        RW[Read-Write Layer<br>RAM or separate partition<br>Runtime changes go here]
        RO[Read-Only Layer<br>SD card rootfs<br>Cannot be corrupted]
    end
    RW --> M
    RO --> M
```
Overlayfs merges two layers into a single view:
- Lower layer (read-only): The root filesystem on the SD card. Never modified during operation.
- Upper layer (read-write): A RAM-backed tmpfs or a separate partition. All runtime writes (temp files, logs, package installs) land here.
- Merged view: Applications see a normal writable filesystem, but the base image stays pristine.
On reboot, the upper layer is discarded (if RAM-backed), restoring the system to its known-good state. This is why consumer routers and set-top boxes survive power cuts — their firmware uses this exact pattern.
For persistent data (sensor logs, configuration), write to a dedicated writable partition with explicit fsync() calls.
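The layering can be demonstrated with the canonical overlayfs mount invocation. This is a sketch: all paths are illustrative, and the mount step needs root plus an overlayfs-capable kernel, so the script skips it when either is missing.

```shell
#!/bin/sh
# overlay_demo.sh: show that writes land in the upper layer while the
# lower layer stays pristine. All paths are illustrative.
set -e
d=$(mktemp -d)
mkdir -p "$d/lower" "$d/upper" "$d/work" "$d/merged"
echo "baseline" > "$d/lower/config.txt"   # content of the read-only layer

if [ "$(id -u)" -eq 0 ] && mount -t overlay overlay \
     -o lowerdir="$d/lower",upperdir="$d/upper",workdir="$d/work" \
     "$d/merged" 2>/dev/null; then
    echo "runtime change" > "$d/merged/config.txt"  # copied up into upper
    cat "$d/lower/config.txt"   # still "baseline": lower layer untouched
    cat "$d/upper/config.txt"   # "runtime change": captured by the upper layer
    umount "$d/merged"
else
    echo "skipping mount: need root and overlayfs support"
fi
```

On a real device the lower directory is the SD card rootfs and the upper directory is a tmpfs, which is why discarding the upper layer on reboot restores the known-good state.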
2. Updates and Rollback
Deploying a firmware update to a device in the field is one of the most dangerous operations in embedded engineering. If the update is interrupted — by a power cut, a network drop, or a corrupted download — the device must not end up in an unbootable state. This is why embedded updates follow three principles:
- Atomic — the update either fully succeeds or fully fails; there is no "half-updated" state. You write the new image to an inactive partition and switch only after verification. The running system is never modified during the update.
- Verifiable — before switching to the new image, you check its cryptographic signature and checksum. If verification fails, you discard the image and keep running the current version.
- Recoverable — if the new image boots but fails (service crash, hardware incompatibility, configuration error), the system automatically falls back to the previous working image. The device is never bricked by a bad update.
The most common strategy to achieve all three properties is A/B partition layout with automatic rollback.
A/B Partition Update Flow
```mermaid
sequenceDiagram
    participant S as Update Server
    participant D as Device
    participant A as Partition A (active)
    participant B as Partition B (inactive)
    D->>S: Check for update
    S->>D: New image available
    D->>B: Write new image to inactive partition
    D->>D: Verify checksum
    D->>D: Set boot flag → B
    D->>D: Reboot into B
    Note over D,B: If B boots successfully → mark B as good
    Note over D,A: If B fails → watchdog triggers → reboot into A (rollback)
```
The key principle: never modify the running partition. Write the update to the inactive partition, verify it, then switch. If the new image fails to boot, the watchdog triggers a rollback to the previous known-good partition.
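The flow can be sketched with plain files standing in for the two partitions. On real hardware the slots are block devices and the boot flag lives in the bootloader environment (for example via U-Boot's fw_setenv), so every name here is illustrative.

```shell
#!/bin/sh
# ab_update_sim.sh: A/B update flow simulated with plain files.
set -e
WORK=$(mktemp -d)
echo "firmware-v1" > "$WORK/slot_a"     # currently active image
echo "slot_a"      > "$WORK/boot_flag"  # what the bootloader would read

# 1. Write the new image to the INACTIVE slot; slot_a is never touched
echo "firmware-v2" > "$WORK/slot_b.tmp"

# 2. Verify before committing; on mismatch, keep running the old image
EXPECTED=$(printf 'firmware-v2\n' | sha256sum | cut -d' ' -f1)
ACTUAL=$(sha256sum "$WORK/slot_b.tmp" | cut -d' ' -f1)
[ "$EXPECTED" = "$ACTUAL" ] || { echo "checksum mismatch, abort"; exit 1; }
mv "$WORK/slot_b.tmp" "$WORK/slot_b"

# 3. Flip the boot flag atomically: a power cut before the rename leaves
#    the device booting the old, known-good slot_a
echo "slot_b" > "$WORK/boot_flag.tmp"
mv "$WORK/boot_flag.tmp" "$WORK/boot_flag"  # rename(2) is atomic

echo "active slot: $(cat "$WORK/boot_flag")"
```

The script ends with slot B active and slot A untouched; interrupting it at any point before the final rename leaves the flag pointing at slot A.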
3. Watchdogs
Even with a reliable filesystem and a robust update mechanism, software can still hang. A memory leak that grows over days, a deadlock triggered by a rare race condition, or a driver that enters an infinite retry loop — any of these can leave your device unresponsive. On a microcontroller, a hung program is often caught by the operator who notices the frozen display. On an embedded Linux device deployed in a remote location, nobody is watching.
Watchdogs exist to recover from these situations automatically. The idea is simple: a timer counts down, and if software does not reset it before it reaches zero, the hardware forces a reboot. As long as the software is healthy and running, it periodically "kicks" the watchdog to prevent the reboot. If the software freezes, the kick stops, the timer expires, and the system reboots into a known-good state.
Two levels of watchdogs provide defense in depth:
- Hardware watchdogs — a timer built into the SoC that forces a hard reboot if not kicked
- Software watchdogs — systemd monitors individual services and restarts them if they stop responding
Hardware Watchdog in Practice
The hardware watchdog is a timer that resets the system if software stops responding. Your application must periodically "kick" (reset) the watchdog timer. If it fails to kick within the timeout, the hardware forces a reboot.
On Raspberry Pi:
```shell
# Enable hardware watchdog
sudo modprobe bcm2835_wdt

# Test: read from watchdog device (opening it starts the timer)
cat /dev/watchdog
# System will reboot in ~15 seconds if nothing kicks the timer!
```
Watchdog Kicking in Practice
Opening /dev/watchdog starts the countdown timer. Your application must write to it periodically. If it stops writing, the hardware reboots the system:
```shell
#!/bin/bash
# watchdog_kick.sh — minimal watchdog kicker
# WARNING: once started, the watchdog WILL reboot if this script stops

exec 3>/dev/watchdog      # open watchdog — timer starts now

while true; do
    echo "V" >&3          # kick the watchdog (any write resets the timer)
    sleep 5               # must be shorter than the watchdog timeout
done
# If this loop stops (crash, hang), reboot in ~15 seconds
```
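To stop a kicker like this without triggering a reboot, most drivers support the Linux watchdog API's "magic close" feature: writing the character V immediately before closing the device disarms the timer (unless the driver was built with nowayout, in which case the watchdog cannot be stopped once armed). A guarded sketch:

```shell
# watchdog_stop.sh: disarm the watchdog before exiting.
# CAUTION: if the driver does not support magic close, opening the
# device arms the timer and the system reboots at the next timeout.
if [ -c /dev/watchdog ]; then
    exec 3>/dev/watchdog   # timer starts
    printf 'V' >&3         # magic character: request disarm on close
    exec 3>&-              # close the device; the watchdog stops
else
    echo "no /dev/watchdog on this machine"
fi
```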
systemd Watchdog Integration
systemd can monitor services and kick the hardware watchdog automatically. This is the recommended approach — you do not write to /dev/watchdog yourself:
```ini
[Unit]
Description=My Sensor Application

[Service]
Type=notify
ExecStart=/usr/bin/my-sensor-app
WatchdogSec=30
Restart=on-watchdog
RestartSec=5

[Install]
WantedBy=multi-user.target
```
The application must call sd_notify("WATCHDOG=1") periodically:
```python
import time

import sdnotify  # third-party package: pip install sdnotify

n = sdnotify.SystemdNotifier()
n.notify("READY=1")  # tell systemd we started OK (required for Type=notify)

while True:
    # ... do sensor work ...
    n.notify("WATCHDOG=1")  # "I'm alive" — must arrive within WatchdogSec
    time.sleep(10)          # well under WatchdogSec=30
```
Enable systemd to kick the hardware watchdog on behalf of all services by adding to /etc/systemd/system.conf:
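A minimal fragment (the 30-second value is an example, not a recommendation; `RuntimeWatchdogSec` is the relevant option):

```ini
# /etc/systemd/system.conf
[Manager]
# systemd opens /dev/watchdog and kicks it automatically; if systemd
# itself hangs, the hardware reboots the board after this timeout
RuntimeWatchdogSec=30
```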
This creates a two-level recovery system:
- Service-level: systemd restarts the hung service
- System-level: hardware watchdog reboots if systemd itself hangs
4. Reliability Design Rules
These rules distill the lessons from thousands of field deployments. Each one addresses a specific failure mode that teams discover the hard way when they skip it:
- Treat storage as failure-prone — SD cards wear out, eMMC has write limits, and any storage can corrupt during a power cut. Design your filesystem layout assuming writes can fail at any point. Use `fsync()` for critical data and checksums for integrity verification.
- Keep mutable data separate from the system image — the system image should be read-only so it cannot be corrupted during operation. Sensor logs, configuration changes, and runtime state go on a dedicated writable partition. This separation means a corrupted data partition does not prevent the device from booting.
- Design update rollback before first release — if the first firmware update you push to the field bricks a device, you lose customer trust permanently. A/B partition layout with watchdog-triggered rollback should be in place before the first device ships, not added after the first incident.
- Make failures observable in logs and status endpoints — a device that fails silently is worse than one that fails loudly. Expose health metrics (uptime, last successful sensor read, watchdog kick count) via a status endpoint or log file so you can detect degraded devices remotely before they fail completely.
4A. Failure Rate Mathematics
The reliability design rules above are qualitative. This section provides the quantitative foundation — the mathematics that lets you predict how long a device will survive and which component will fail first.
Failure Rate and MTBF
The failure rate \(\lambda\) (failures per unit time) and Mean Time Between Failures (MTBF) are related by:

\[ \text{MTBF} = \frac{1}{\lambda} \]

For electronic components during their useful life (constant failure rate), the probability of surviving to time \(t\) is:

\[ R(t) = e^{-\lambda t} \]
Example: A component with \(\lambda = 10^{-5}\text{ /hr}\) (one failure per 100,000 hours):
- MTBF = \(1/10^{-5}\) = 100,000 hr ≈ 11.4 years
- Probability of surviving 1 year (8,760 hr): \(R(8760) = e^{-0.0876} \approx 0.916\) (91.6%)
- Probability of surviving 5 years: \(R(43800) = e^{-0.438} \approx 0.645\) (64.5%)
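The survival figures above can be reproduced with a quick awk check (same \(\lambda\) as the example):

```shell
# Reproduce R(t) = e^(-lambda*t) for lambda = 1e-5 failures/hr
awk 'BEGIN {
    lambda = 1e-5
    printf "R(1 year)  = %.3f\n", exp(-lambda * 8760)   # 0.916
    printf "R(5 years) = %.3f\n", exp(-lambda * 43800)  # 0.645
}'
```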
Warning
MTBF is not the expected lifetime. A component with MTBF of 11.4 years has a 37% chance of failing within the first 11.4 years (because \(R(\text{MTBF}) = e^{-1} \approx 0.368\)). MTBF describes a rate, not a guarantee.
The Bathtub Curve
Component failure rates follow a characteristic pattern over their lifetime:
```
 λ (failure rate)
   │
   │\                                        /
   │ \                                      /
   │  \         Useful Life                /
   │   \   (constant λ — exponential)     /
   │    \________________________________/
   │  Infant          ↑                    Wear-out
   │  mortality   MTBF applies here        (increasing λ)
   │  (decreasing λ)
   └─────────────────────────────────────────── Time
```
- Infant mortality (early failures): Manufacturing defects. Mitigated by burn-in testing.
- Useful life (constant \(\lambda\)): Random failures. The exponential model \(R(t) = e^{-\lambda t}\) applies here.
- Wear-out (increasing \(\lambda\)): Physical degradation. Flash memory, electrolytic capacitors, and mechanical parts enter this phase.
Series Reliability
When components are in series (all must work for the system to function), the system reliability is the product of the individual reliabilities:

\[ R_{system} = \prod_{i=1}^{n} R_i \]
Example: A system with 5 independent components, each with \(R_i = 0.99\) (99% reliability over the mission period):

\[ R_{system} = 0.99^5 \approx 0.951 \]
The system reliability (95.1%) is lower than any individual component. With 20 components at 0.99 each: \(R_{system} = 0.99^{20} = 0.818\) — an 18% failure probability despite each part being "99% reliable." This is why component count matters and why minimizing the BOM (bill of materials) improves reliability.
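A one-liner to check the arithmetic:

```shell
# Series reliability: product of per-component reliabilities
awk 'BEGIN {
    printf "5 components:  %.3f\n", 0.99^5    # 0.951
    printf "20 components: %.3f\n", 0.99^20   # 0.818
}'
```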
Flash/SD Card Endurance
SD cards and eMMC have a finite number of program/erase (P/E) cycles per cell. The endurance calculation:

\[ \text{Life (days)} = \frac{\text{P/E cycles} \times \text{Capacity (GB)}}{\text{WAF} \times \text{Host writes per day (GB)}} \]
where WAF (Write Amplification Factor) accounts for the flash translation layer writing more data than the host requests (due to garbage collection, wear leveling, and metadata updates).
Example: 32 GB industrial SD card, 3,000 P/E cycles, 1 GB/day of application writes:
| Scenario | WAF | Life |
|---|---|---|
| Sequential writes (best case) | 1.1 | \(\frac{3000 \times 32}{1.1 \times 1} = 87,273\text{ days} \approx 239\text{ years}\) |
| Mixed sequential + random | 2.0 | \(\frac{3000 \times 32}{2.0 \times 1} = 48,000\text{ days} \approx 131\text{ years}\) |
| Small random writes (worst case) | 10–20 | \(\frac{3000 \times 32}{15 \times 1} = 6,400\text{ days} \approx 17.5\text{ years}\) |
The WAF for small random writes (database transactions, frequent log flushes) can be 10–20×, dramatically reducing card life. This is why the Data Logger Appliance tutorial uses buffered sequential writes and avoids journaling filesystems on SD cards.
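The table can be reproduced with a small calculator (same assumed card parameters as above; adjust for your workload):

```shell
# SD card endurance: days = (P/E cycles * capacity) / (WAF * GB written/day)
pe=3000; cap_gb=32; daily_gb=1
for waf in 1.1 2.0 15; do
    awk -v pe="$pe" -v cap="$cap_gb" -v waf="$waf" -v d="$daily_gb" \
        'BEGIN { printf "WAF %-4s: %.0f days\n", waf, (pe * cap) / (waf * d) }'
done
```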
Weibull Distribution (Brief)
The exponential model assumes a constant failure rate, which only holds during the useful-life phase. The Weibull distribution generalizes this:

\[ R(t) = e^{-(t/\eta)^{\beta}} \]
where \(\eta\) is the characteristic life and \(\beta\) is the shape parameter:
- \(\beta = 1\): Constant failure rate (reduces to exponential) — random failures
- \(\beta < 1\): Decreasing failure rate — infant mortality
- \(\beta > 1\): Increasing failure rate — wear-out
For flash memory wear-out analysis, \(\beta > 1\) is appropriate. Manufacturers use Weibull analysis to set endurance ratings. For most embedded engineering, the exponential model (\(\beta = 1\)) is sufficient for back-of-envelope reliability estimates.
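As an illustration (the \(\eta\) and \(\beta\) values here are made up, not datasheet numbers), the Weibull survival probability for a wear-out-dominated part:

```shell
# Weibull survival R(t) = exp(-(t/eta)^beta) with illustrative values:
# characteristic life eta = 10000 hr, wear-out shape beta = 2
awk -v t=5000 -v eta=10000 -v beta=2 \
    'BEGIN { printf "R(%d hr) = %.3f\n", t, exp(-(t / eta)^beta) }'
```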
5. Testing Reliability (Not Just Function)
Functional testing verifies that the device works correctly under normal conditions. Reliability testing verifies that the device recovers correctly when things go wrong. These are fundamentally different activities, and most teams under-invest in the latter. A device that passes all functional tests but has never been power-cut-tested will fail in the field — it is only a matter of time.
Reliability tests should be automated and repeated. A single successful power-cut test proves nothing; five consecutive successful tests build confidence. Run explicit tests for each failure mode:
- Power cut during write — verify the filesystem is intact after unclean shutdown
- Update interruption — pull power mid-update and verify the device boots the previous image
- Boot loop recovery — simulate a failing service and verify the watchdog triggers rollback
- Watchdog trigger and restart verification — kill the watchdog-kicking process and verify the system reboots
Power-Cut Test Procedure
1. Start the system and verify normal operation
2. Let it run for at least 60 seconds (ensures write buffers have data)
3. Pull the power cable (do not use `shutdown`)
4. Reconnect power and verify:
   - System boots to login prompt (rootfs intact)
   - Application starts normally (systemd service)
   - Persistent data from before the last `fsync()` is present
   - Data after the last `fsync()` may be lost (expected)
5. Repeat 5 times to build confidence
Quick Checks (In Practice)
- Can the device boot after an unclean shutdown?
- Does a failed update roll back automatically?
- Can the application crash without bricking the system?
- Is the watchdog configured and tested?
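A sketch of how the first and last checks could be automated on the device itself (common default paths; the update-rollback and crash checks need image-specific tooling):

```shell
# quick_checks.sh: automatable subset of the checklist
if [ -c /dev/watchdog ]; then
    echo "watchdog: device present"
else
    echo "watchdog: MISSING, no hardware recovery"
fi

# Field 2 of /proc/mounts is the mount point, field 4 the options
if awk '$2 == "/" && $4 ~ /^ro(,|$)/ { found = 1 } END { exit !found }' /proc/mounts; then
    echo "rootfs: mounted read-only"
else
    echo "rootfs: writable, corruption possible on power cut"
fi
```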
Mini Exercise
Design an update strategy for a weather station deployed in a remote field location (no physical access for 6 months):
- Image layout: How many partitions? What goes on each?
- Rollback trigger: What condition triggers automatic rollback?
- Health check: What does the device verify after a successful boot?
- Data protection: How do you protect sensor logs from corruption?
Key Takeaways
- Reliability is designed, not assumed.
- Updates are an engineering discipline.
- Watchdogs are a standard embedded safety tool.
Hands-On
Try this in practice: Tutorial: Data Logger Appliance — set up read-only root with overlayfs and test power-loss resilience.