Deadlock vs lock contention¶

Deadlock and lock contention can look identical to an outside observer watching a process hang, but they are different failure modes with different fixes:

Deadlock: threads wait for locks in a way that forms a cycle, so no thread can make progress until another releases, and none ever will. Permanent.
Contention: threads wait for a lock that is currently held, but the holder will eventually release it. Transient, though the wait can be arbitrarily long.
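
The two definitions can be restated over a wait-for graph. The following is a minimal sketch, not taken from the source: it assumes exclusive locks with a single holder each, and the `Verdict` enum, the `classify` function, and the numeric thread ids are hypothetical names introduced for illustration.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
enum Verdict {
    Deadlock,   // the wait chain loops back on itself
    Contention, // the wait chain ends at a thread that is still running
    NotWaiting, // the thread is not blocked at all
}

/// `waits_for[t] = h` means thread `t` is blocked on a lock held by thread `h`.
/// Assumption: every lock has exactly one holder (no shared readers).
fn classify(waits_for: &HashMap<u32, u32>, start: u32) -> Verdict {
    if !waits_for.contains_key(&start) {
        return Verdict::NotWaiting;
    }
    let mut seen = HashSet::new();
    let mut current = start;
    // Follow the chain of holders until it ends or repeats.
    while let Some(&holder) = waits_for.get(&current) {
        if !seen.insert(current) {
            // We have visited this thread before: the chain is a cycle.
            return Verdict::Deadlock;
        }
        current = holder;
    }
    // The last holder is not waiting on anyone, so it can still release.
    Verdict::Contention
}
```

A thread that is blocked behind a cycle without being part of it is also reported as `Deadlock` here, since it can never be released either.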

Why it matters¶

A process-level liveness monitor (e.g. a watchdog on an internal REPL channel) cannot tell the two apart: in both cases the process simply stops responding. Fly.io's 2025-05-28 Catalog lazy-loading rollout triggered watchdog bounces fleet-wide in Europe, and Fly.io had to work out which pathology it was seeing:

"From the information we have, we've narrowed things down to two suspects. First, lazy-loading changes the read/write patterns and thus the pressure on the RWLocks the Catalog uses; it could just be lock contention. Second, we spot a suspicious if let." (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)

Discrimination technique: bounded lock acquisition¶

The load-bearing tool for telling them apart is try_write_for(Duration) (or equivalent timeout-bounded acquisition). See patterns/lock-timeout-for-contention-telemetry.

Pure contention: the timeout fires but the watchdog does not bounce the process. The holder eventually releases, and the write that timed out is retried successfully. You see a heavy telemetry spike without a process bounce.
Pure deadlock: the timeout fires and every retry times out as well, because no holder will ever release. The watchdog bounces the process anyway.
Lock-word corruption (the Fly.io bug): the timeout fires, the telemetry logs fill with timeout spam, and the watchdog bounces the process. It looks like deadlock, but every stack trace shows threads waiting and no thread holding the lock. (See concepts/descent-into-madness-debugging.)
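
The first two outcomes above can be reproduced in a small sketch. It is not Fly.io's code and rests on several assumptions: it uses only the standard library, whose `RwLock` has no timed acquisition, so the local `try_write_for` helper polls `try_write` as a stand-in for the `parking_lot` method of the same name (which returns `None` on timeout). The `probe` function, its channels, and the attempt counts are hypothetical, and the "deadlock" case is simulated by a holder that never releases during the observation window.

```rust
use std::sync::{mpsc, Arc, RwLock, RwLockWriteGuard, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

/// Std-only stand-in for a timeout-bounded write acquisition (assumption:
/// polling `try_write`; parking_lot offers `try_write_for` natively).
fn try_write_for<T>(lock: &RwLock<T>, timeout: Duration) -> Option<RwLockWriteGuard<'_, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        match lock.try_write() {
            Ok(guard) => return Some(guard),
            // A poisoned lock is still acquired; hand the guard back.
            Err(TryLockError::Poisoned(p)) => return Some(p.into_inner()),
            // WouldBlock: another thread holds the lock right now.
            Err(TryLockError::WouldBlock) => {}
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(Duration::from_millis(1));
    }
}

/// Hypothetical probe: returns (timeouts logged, whether the write ever succeeded).
fn probe(holder_releases: bool) -> (u32, bool) {
    let lock = Arc::new(RwLock::new(0u64));
    let (held_tx, held_rx) = mpsc::channel::<()>();
    let (release_tx, release_rx) = mpsc::channel::<()>();

    let holder = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            let _guard = lock.write().unwrap();
            held_tx.send(()).unwrap();
            let _ = release_rx.recv(); // keep holding until told to release
        })
    };
    held_rx.recv().unwrap(); // the holder now owns the lock

    let (mut timeouts, mut acquired) = (0u32, false);
    for _ in 0..10 {
        match try_write_for(&lock, Duration::from_millis(20)) {
            Some(mut guard) => {
                *guard += 1;
                acquired = true;
                break;
            }
            None => {
                timeouts += 1; // stands in for a slow-write telemetry log line
                if holder_releases {
                    let _ = release_tx.send(()); // contention: the holder lets go
                }
            }
        }
    }
    // A watchdog would bounce the process here if `acquired` were still false.
    let _ = release_tx.send(()); // let the holder exit so the sketch terminates
    holder.join().unwrap();
    (timeouts, acquired)
}
```

Under these assumptions, `probe(true)` logs at least one timeout and then succeeds (telemetry spike, no bounce), while `probe(false)` times out on every attempt and never acquires the lock (the case a watchdog would end by bouncing the process). The lock-word corruption case is not modelled here.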

Why "it contended at scale" is the right intermediate hypothesis¶

The Round-1 Fly.io investigation landed on contention first, which is a good prior when a refactor changes the read/write pattern on a lock. The fixes for contention (shorter critical sections, finer-grained locking, moving work off the hot path) are non-destructive and often useful whether or not contention is the root cause. The discriminating evidence came from the telemetry path: slow-write logs that appeared only just before the lockup, in otherwise quiet, benign applications, were inconsistent with contention and consistent with lock-word corruption.
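
As an illustration of the first contention fix mentioned above (shorter critical sections), here is a minimal sketch. The names `expensive`, `update_holding_lock`, and `update_short_section` are hypothetical, and the `Vec<u64>` behind the lock is only a placeholder for whatever shared state is being protected.

```rust
use std::sync::RwLock;

/// Hypothetical expensive step that does not need the shared state.
fn expensive(n: u64) -> u64 {
    (0..n).sum()
}

/// Long critical section: the write lock is held while `expensive` runs,
/// so every reader and writer waits for the computation to finish.
fn update_holding_lock(shared: &RwLock<Vec<u64>>, n: u64) {
    let mut guard = shared.write().unwrap();
    let value = expensive(n);
    guard.push(value);
}

/// Shorter critical section: compute first, then take the write lock only
/// for the push. The guard is dropped at the end of that statement.
fn update_short_section(shared: &RwLock<Vec<u64>>, n: u64) {
    let value = expensive(n);
    shared.write().unwrap().push(value);
}
```

Both functions leave the shared state in the same condition; only the time spent holding the write lock differs.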

Seen in¶

sources/2025-05-28-flyio-parking-lot-ffffffffffffffff: Fly.io's 2025-05-28 debugging arc, where Round 2's try_write_for refactor and lock-timeout telemetry were explicitly motivated by the need to separate these two pathologies. The refactor stood on its own (Fly.io keeps it post-fix) even though the true bug turned out to be neither pure deadlock nor pure contention but a bitwise double-free corrupting the lock word.

Related¶

systems/parking-lot-rust — try_write_for is a parking_lot-specific feature used here.
systems/fly-proxy — The system where this discrimination was applied.
patterns/lock-timeout-for-contention-telemetry — The pattern that operationalises the discrimination.
patterns/watchdog-bounce-on-deadlock — Why both pathologies look the same to the monitor.
concepts/if-let-lock-scope-bug — The 2024 Fly.io outage was pure deadlock; 2025-05-28 is neither pure deadlock nor pure contention.
companies/flyio — The company exemplifying this discrimination discipline.
