Skip to content

CONCEPT Cited by 2 sources

Contagious deadlock

What it is

A contagious deadlock is a concurrency bug that propagates via the shared data plane: when the deadlock-triggering input reaches a new node, that node also deadlocks. In a fleet running the same buggy binary and consuming the same shared state feed, one bad message can deadlock every node in the time it takes gossip (or pub-sub) to converge.

The contagion is the asymmetry that makes this class of bug catastrophic. Most production concurrency bugs are local-and-sporadic — one process at a time, different processes on different days. Contagious deadlocks are global-and-synchronised — the blast radius is the entire fleet, at once.

Mechanism

Two preconditions:

  1. Shared data plane — every node consumes updates from the same source, whether gossip, pub-sub, control plane broadcast, or config distribution.
  2. Deterministic trigger — the deadlock fires on a specific record, message, or input pattern. Any node that processes that input deadlocks; no node that hasn't yet is safe.

With both preconditions, the update-propagation mechanism becomes a blast amplifier:

"Distributed systems are blast amplifiers. By propagating data across a network, they also propagate bugs in the systems that depend on that data." (sources/2025-10-22-flyio-corrosion)

Canonical wiki instance — Fly.io 2024-09-01 Anycast outage

On 2024-09-01 15:30 EST, a Fly Machine came up with a new "virtual service" configuration option. Within seconds, every proxy in Fly's global fleet had locked up hard. No end-user requests reached any customer app for the duration.

The consumer bug was an if let expression over an RwLock in fly-proxy's Catalog reader — a notorious Rust concurrency footgun where the else branch wrongly assumes the read lock has been released. Any Corrosion update carrying the new virtual-service configuration hit this code path; Corrosion gossiped the update to every proxy in milliseconds; every proxy deadlocked in seconds.

Corrosion itself was "just a bystander" — the bug was in fly-proxy's consumer code. But the gossip layer is what made the blast amplifier perfect.

Mitigations

The 2024-09-01 incident drove four structural responses at Fly.io (all from sources/2025-10-22-flyio-corrosion):

  1. Fleet-wide Tokio watchdogs — turn any deadlock into a seconds-long hiccup per node, not an indefinite fleet-wide wedge.
  2. Antithesis / multiverse debugging — systematic search for similar bugs via deterministic replay.
  3. Checkpoint backups on object storage — ability to reboot the gossip cluster from a known-good state.
  4. Regionalization — reduce the blast radius of any future contagious bug by bounding most state changes to a single region.

Seen in

Last updated · 200 distilled / 1,178 read