Hardware · Jun 2026

OTA firmware updates that don't brick the fleet

Cristina Dumitru · Firmware engineering · 6 min read

Shipping firmware over the air is the feature that turns a product into a fleet you can improve after it leaves the factory. It is also the feature most likely to turn that fleet into a field full of bricks. A bad update pushed to ten thousand devices you cannot physically reach is not a bug; it is a recall. The whole discipline is making that outcome impossible by construction.

Two banks, one rollback

The foundation is an A/B partition scheme. Every device carries two firmware slots; the running image lives in one while an update is written to the other, untouched and inert. Only after the new image is fully received and its integrity verified does the bootloader switch the active slot — and even then, the new firmware must check in and confirm it is healthy within a watchdog window, or the bootloader reverts to the known-good image on the next reset. A device that fails to boot the new firmware falls back automatically to the old one. There is no state in which a half-written or broken image becomes the only thing the device can run.
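The slot-selection logic described above can be sketched as a small state machine. This is a minimal illustration in Python rather than real bootloader code, and every name in it (BootState, stage_update, select_slot, confirm_healthy) is hypothetical. It assumes a single trial boot per update and that a missed watchdog check-in simply shows up as another reset with the image still unconfirmed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootState:
    """Persistent boot metadata (hypothetical layout, for illustration only)."""
    active: str = "A"              # slot holding the known-good image
    pending: Optional[str] = None  # slot on trial, not yet confirmed healthy
    trial_boots_left: int = 0      # trial boots the pending image may still use


def other_slot(slot: str) -> str:
    return "B" if slot == "A" else "A"


def stage_update(state: BootState, image_verified: bool) -> bool:
    """Mark the inactive slot for a trial boot, but only if the image verified."""
    if not image_verified:
        return False               # nothing changes; the old image keeps running
    state.pending = other_slot(state.active)
    state.trial_boots_left = 1     # assumption: one trial boot per update
    return True


def select_slot(state: BootState) -> str:
    """Bootloader decision at every reset."""
    if state.pending is None:
        return state.active
    if state.trial_boots_left > 0:
        state.trial_boots_left -= 1
        return state.pending       # try the new image once
    # The trial boot was used and never confirmed: revert to known-good.
    state.pending = None
    return state.active


def confirm_healthy(state: BootState) -> None:
    """Called by the new firmware once it checks in within the watchdog window."""
    if state.pending is not None:
        state.active = state.pending
        state.pending = None
        state.trial_boots_left = 0
```

Note that the active slot only changes inside confirm_healthy. A crash, hang, or power loss at any earlier point leaves the known-good slot recorded as active, which is what makes the fallback automatic.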

Sign everything, trust nothing

An update channel is an attack surface, and a device that flashes whatever it is handed is a device waiting to be turned into someone else's botnet. Every image is cryptographically signed, and the bootloader verifies that signature against a key burned into the hardware before it will execute a single instruction. The transport can be compromised, the update server can be impersonated, the bytes can be tampered with in flight — none of it matters if the signature does not check out, because an unsigned or altered image is simply refused and the old one keeps running.
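The refuse-or-run decision can be sketched as follows. One substitution to be aware of: a production bootloader would normally verify an asymmetric signature (for example Ed25519 or ECDSA) so that the device holds only a public key. Python's standard library has no asymmetric signatures, so this sketch stands in HMAC-SHA256 with a shared key to show the control flow only. The key value and function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical stand-in for the key burned into hardware. A real device would
# hold only a public verification key, never a secret capable of signing.
DEVICE_KEY = b"hypothetical-key-burned-into-hardware"


def sign_image(image: bytes, key: bytes) -> bytes:
    """Build-server side: produce the authentication tag for an image."""
    return hmac.new(key, image, hashlib.sha256).digest()


def verify_image(image: bytes, tag: bytes, key: bytes = DEVICE_KEY) -> bool:
    """Bootloader side: recompute the tag and compare in constant time."""
    expected = hmac.new(key, image, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)


def choose_image(current: bytes, candidate: bytes, tag: bytes) -> bytes:
    """Run the candidate only if it verifies; otherwise keep the old image."""
    return candidate if verify_image(candidate, tag) else current
```

The check depends only on the image bytes and the tag, not on where they came from. That is why a compromised transport or an impersonated server does not change the outcome.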

The bootloader's job is not to install the new firmware. It is to guarantee the device always has firmware it can fall back to.
— Protocore · Firmware engineering

Roll out in waves, watch the metrics

Even a perfect update mechanism does not justify shipping to the whole fleet at once. We stage rollouts: a small canary cohort first, then widening rings, with health telemetry — boot success, crash rates, battery and connectivity regressions — gating each expansion. If the canary's numbers wobble, the rollout pauses automatically and nothing past that ring ever receives the build. This is how a latent bug that slipped past testing reaches a hundred devices instead of a hundred thousand, and gets caught while a fix is still cheap.
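A gate of this kind can be sketched as a function that either advances to the next ring or holds. The ring sizes, thresholds, and names below are illustrative assumptions, not values from a real rollout system, and only two of the health signals mentioned above are modelled.

```python
from dataclasses import dataclass


@dataclass
class RingHealth:
    """Telemetry aggregated for the ring currently running the new build."""
    devices: int        # devices in the ring that have reported in
    boot_failures: int  # devices that failed to boot the new image
    crashes: int        # devices that crashed after booting it


# Hypothetical cumulative ring sizes and gating thresholds.
RINGS = [100, 1_000, 10_000, 100_000]
MAX_BOOT_FAILURE_RATE = 0.01
MAX_CRASH_RATE = 0.02


def ring_is_healthy(h: RingHealth) -> bool:
    """A ring with no reports is treated as unhealthy: no data, no expansion."""
    if h.devices == 0:
        return False
    return (h.boot_failures / h.devices <= MAX_BOOT_FAILURE_RATE
            and h.crashes / h.devices <= MAX_CRASH_RATE)


def advance(current_ring: int, health: RingHealth) -> int:
    """Return the ring index to roll out to next; the same index means hold."""
    if not ring_is_healthy(health):
        return current_ring                     # pause: nothing past this ring
    return min(current_ring + 1, len(RINGS) - 1)
```

Treating missing telemetry as a failure is a deliberate choice in this sketch: a canary that has gone silent is at least as worrying as one reporting crashes.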

Field devices are unforgiving in a way servers never are: there is no console to SSH into, no quick redeploy, sometimes no second chance at all. So the safety lives below the application, in a bootloader and a rollout system you can trust precisely because they are dumb, signed, and reversible. Get that layer right once and firmware updates stop being the scariest button in the building. Get it wrong and you learn why 'brick' became a verb.
