CONCEPT Cited by 2 sources
Contagious deadlock¶
What it is¶
A contagious deadlock is a concurrency bug that propagates via the shared data plane: when the deadlock-triggering input reaches a new node, that node also deadlocks. In a fleet running the same buggy binary and consuming the same shared state feed, one bad message can deadlock every node in the time it takes gossip (or pub-sub) to converge.
The contagion is the asymmetry that makes this class of bug catastrophic. Most production concurrency bugs are local-and-sporadic — one process at a time, different processes on different days. Contagious deadlocks are global-and-synchronised — the blast radius is the entire fleet, at once.
Mechanism¶
Two preconditions:
- Shared data plane — every node consumes updates from the same source, whether gossip, pub-sub, control plane broadcast, or config distribution.
- Deterministic trigger — the deadlock fires on a specific record, message, or input pattern. Any node that processes that input deadlocks; no node that hasn't yet is safe.
With both preconditions, the update-propagation mechanism becomes a blast amplifier:
"Distributed systems are blast amplifiers. By propagating data across a network, they also propagate bugs in the systems that depend on that data." (sources/2025-10-22-flyio-corrosion)
Canonical wiki instance — Fly.io 2024-09-01 Anycast outage¶
On 2024-09-01 15:30 EST, a Fly Machine came up with a new "virtual service" configuration option. Within seconds, every proxy in Fly's global fleet had locked up hard. No end-user requests reached any customer app for the duration.
The consumer bug was an
if let expression over an
RwLock in fly-proxy's Catalog reader — a notorious Rust
concurrency footgun where the else branch wrongly assumes
the read lock has been released. Any Corrosion update carrying
the new virtual-service configuration hit this code path;
Corrosion gossiped the update to
every proxy in milliseconds; every proxy deadlocked in
seconds.
Corrosion itself was "just a bystander" — the bug was in fly-proxy's consumer code. But the gossip layer is what made the blast amplifier perfect.
Mitigations¶
The 2024-09-01 incident drove four structural responses at Fly.io (all from sources/2025-10-22-flyio-corrosion):
- Fleet-wide Tokio watchdogs — turn any deadlock into a seconds-long hiccup per node, not an indefinite fleet-wide wedge.
- Antithesis / multiverse debugging — systematic search for similar bugs via deterministic replay.
- Checkpoint backups on object storage — ability to reboot the gossip cluster from a known-good state.
- Regionalization — reduce the blast radius of any future contagious bug by bounding most state changes to a single region.
Seen in¶
- sources/2025-05-28-flyio-parking-lot-ffffffffffffffff — canonical retrospective of the 2024-09-01 contagious-deadlock at the bug-level detail (the parking_lot RwLock + if-let + bitwise-double-free chain).
- sources/2025-10-22-flyio-corrosion — canonical framing of the class: "blast amplifiers", the regionalization response, the Tokio-watchdog mitigation.
Related¶
- concepts/blast-radius — the failure-scope concept contagious deadlock detonates.
- concepts/if-let-lock-scope-bug — the specific bug shape in the 2024-09-01 instance.
- systems/fly-proxy — the affected consumer.
- systems/corrosion-swim — the propagation vehicle.
- patterns/watchdog-bounce-on-deadlock — the safety-net mitigation.
- patterns/two-level-regional-global-state — the structural mitigation.