CONCEPT Cited by 1 source
Deadlock vs lock contention¶
Deadlock and lock contention can look identical to an outside observer watching a process hang, but they are different failure modes with different fixes:
- Deadlock: threads are waiting for locks in a way that forms a cycle — no thread can make progress without another releasing, and none will. Permanent.
- Contention: threads are waiting for a lock that is held, but the holder will eventually release. Transient (though can be arbitrarily long).
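The distinction above can be sketched as a check on a "waits on" relation. This is a toy model, not anything from the Fly.io write-up: the function name `is_deadlocked` and the idea of mapping each blocked thread id to the thread id holding the lock it wants are assumptions made for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical model: each blocked thread waits on exactly one other thread
/// (the current holder of the lock it wants). Keys and values are thread ids.
/// Returns true if following the chain from `start` never reaches a runnable thread.
pub fn is_deadlocked(waits_on: &HashMap<u32, u32>, start: u32) -> bool {
    let mut seen = HashSet::new();
    let mut current = start;
    // Follow the chain of "waits on" edges starting at `start`.
    while let Some(&holder) = waits_on.get(&current) {
        if !seen.insert(current) {
            // We came back to a thread we already visited: the chain loops,
            // so nobody in it can ever release. This is deadlock.
            return true;
        }
        current = holder;
    }
    // We reached a thread that is not waiting on anyone. It can run and
    // eventually release its lock, so the wait is contention, not deadlock.
    false
}
```

With thread 1 waiting on 2 and thread 2 waiting on 1, the function reports deadlock. With threads 1 and 3 both waiting on a running thread 2, it reports none, however long thread 2 takes.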
Why it matters¶
A process-level liveness monitor (e.g. a watchdog on an internal REPL channel) sees both the same way: the process stops responding. Fly.io's 2025-05-28 Catalog lazy-loading rollout triggered watchdog bounces fleet-wide in Europe, and Fly.io had to work out which pathology it was seeing:
"From the information we have, we've narrowed things down to two suspects. First, lazy-loading changes the read/write patterns and thus the pressure on the RWLocks the Catalog uses; it could just be lock contention. Second, we spot a suspicious if let." (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)
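To illustrate why the monitor cannot tell the two apart, here is a minimal sketch of a liveness watchdog. The `Watchdog` type and its methods are hypothetical and are not Fly.io's implementation. The point is only that the bounce decision depends on how long the process has been silent, never on why.

```rust
use std::time::{Duration, Instant};

/// Hypothetical liveness monitor: all it knows is when the process last replied.
pub struct Watchdog {
    last_reply: Instant,
    limit: Duration,
}

impl Watchdog {
    pub fn new(started: Instant, limit: Duration) -> Self {
        Watchdog { last_reply: started, limit }
    }

    /// Called whenever the monitored process answers a probe.
    pub fn record_reply(&mut self, now: Instant) {
        self.last_reply = now;
    }

    /// The verdict uses only the length of the silence. A deadlocked process
    /// and a badly contended one produce exactly the same input here.
    pub fn should_bounce(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_reply) > self.limit
    }
}
```

Any discrimination between the two pathologies therefore has to come from inside the process, which is what bounded lock acquisition provides.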
Discrimination technique: bounded lock acquisition¶
The load-bearing tool for telling them apart is try_write_for(Duration) (or equivalent timeout-bounded acquisition). See patterns/lock-timeout-for-contention-telemetry.
- Pure contention: the timeout fires, but the watchdog does not bounce the process. The holder eventually releases and the timed-out write succeeds on retry. You see a spike in slow-write telemetry without a bounce.
- Pure deadlock: the timeout fires and retries never make progress, because the holder never releases. The watchdog bounces the process anyway.
- Lock-word corruption (the Fly.io bug): the timeout fires, the telemetry logs fill with timeouts, and the watchdog bounces. It looks like deadlock, but every stack trace shows threads waiting and no thread holding the lock. (See concepts/descent-into-madness-debugging.)
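The cases above can be sketched with a bounded write plus a timeout counter. In parking_lot, `RwLock::try_write_for(Duration)` returns an `Option` guard. To stay dependency-free, the sketch below approximates it on `std::sync::RwLock` by polling `try_write` until a deadline. The helper names `try_write_within` and `write_with_telemetry`, the `max_attempts` parameter, and the `u64` payload are all assumptions for illustration, not Fly.io's code.

```rust
use std::sync::{RwLock, RwLockWriteGuard};
use std::thread;
use std::time::{Duration, Instant};

/// Std-only stand-in for a timeout-bounded write acquisition: poll `try_write`
/// until the deadline passes. Returns None on timeout. A poisoned lock is
/// treated like a held lock here, which a real system would handle separately.
pub fn try_write_within<T>(
    lock: &RwLock<T>,
    timeout: Duration,
) -> Option<RwLockWriteGuard<'_, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(guard) = lock.try_write() {
            return Some(guard);
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(Duration::from_millis(1));
    }
}

/// Retry a bounded write up to `max_attempts` times, counting timeouts.
/// Returns (write succeeded, number of timeouts observed).
pub fn write_with_telemetry(
    lock: &RwLock<u64>,
    timeout: Duration,
    max_attempts: u32,
) -> (bool, u32) {
    let mut timeouts = 0;
    for _ in 0..max_attempts {
        match try_write_within(lock, timeout) {
            Some(mut guard) => {
                *guard += 1; // the actual write, done under the lock
                return (true, timeouts);
            }
            // A real system would emit a slow-write log line here.
            None => timeouts += 1,
        }
    }
    (false, timeouts)
}
```

Under these assumptions, a result of `(true, n)` with `n > 0` is the contention signature: timeouts were logged, then the holder released. A result of `(false, max_attempts)` means the lock never became available, which is what both a true deadlock and a corrupted lock word look like from this vantage point. Separating those two still requires looking at who, if anyone, holds the lock.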
Why "it contended at scale" is the right intermediate hypothesis¶
The Round-1 Fly.io investigation landed on contention first, a good prior when a refactor changes the read/write pattern on a lock. The fix for contention (shorter critical sections, finer-grained locking, moving work off the hot path) is non-destructive and often useful regardless of whether contention is the root cause. The discriminating evidence came from the telemetry path: slow-write logs appeared only just before the lockup, and in quiet, benign applications. Contention should track load, so that pattern was inconsistent with contention and consistent with lock-word corruption.
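As an illustration of the "shorter critical sections" fix, the sketch below contrasts doing expensive work while holding a write lock with doing it before taking the lock. The `Catalog` alias, the function names, and `expensive_parse` are hypothetical stand-ins and do not reflect Fly.io's actual Catalog.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical stand-in for a shared catalog; not Fly.io's real type.
pub type Catalog = HashMap<String, String>;

/// Long critical section: the expensive work runs while the write lock is held,
/// so every reader and writer is blocked for its full duration.
pub fn update_holding_lock(catalog: &RwLock<Catalog>, key: &str, raw: &str) {
    let mut guard = catalog.write().unwrap();
    let value = expensive_parse(raw);
    guard.insert(key.to_string(), value);
}

/// Shorter critical section: do the expensive work first with no lock held,
/// then take the write lock only long enough to publish the result.
pub fn update_short_section(catalog: &RwLock<Catalog>, key: &str, raw: &str) {
    let value = expensive_parse(raw);
    catalog.write().unwrap().insert(key.to_string(), value);
}

/// Placeholder for work that is slow in a real system.
fn expensive_parse(raw: &str) -> String {
    raw.trim().to_uppercase()
}
```

Both functions leave the catalog in the same state. The second simply holds the write lock for less time, which is why this kind of change is safe to ship even before the root cause is known.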
Seen in¶
- sources/2025-05-28-flyio-parking-lot-ffffffffffffffff — Fly.io's 2025-05-28 debugging arc where Round 2's try_write_for refactor + lock-timeout telemetry was explicitly motivated by the need to separate these two pathologies. The refactor stood on its own (Fly.io keeps it post-fix) even when it turned out the true bug was neither pure deadlock nor pure contention but a bitwise double-free corrupting the lock word.
Related¶
- systems/parking-lot-rust — try_write_for is a parking_lot-specific feature used here.
- systems/fly-proxy — The system where this discrimination was applied.
- patterns/lock-timeout-for-contention-telemetry — The pattern that operationalises the discrimination.
- patterns/watchdog-bounce-on-deadlock — Why both pathologies look the same to the monitor.
- concepts/if-let-lock-scope-bug — The 2024 Fly.io outage was pure deadlock; 2025-05-28 is neither pure deadlock nor pure contention.
- companies/flyio — The company exemplifying this discrimination discipline.