CONCEPT Cited by 2 sources
Deadlock vs lock contention¶
Deadlock and lock contention can look identical to an outside observer watching a process hang, but they are different failure modes with different fixes:
- Deadlock: threads are waiting for locks in a way that forms a cycle — no thread can make progress without another releasing, and none will. Permanent.
- Contention: threads are waiting for a lock that is held, but the holder will eventually release. Transient (though can be arbitrarily long).
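The cycle-versus-chain distinction above can be sketched as a walk over a wait-for graph. This is a toy model rather than a detector for real locks: the waits_for map, the Wait enum, and classify are hypothetical names introduced for illustration, and each blocked thread is assumed to wait on exactly one holder.

```rust
use std::collections::{HashMap, HashSet};

/// Outcome of following one thread's wait chain (hypothetical names).
#[derive(Debug, PartialEq)]
enum Wait {
    Running,    // not blocked on anything
    Contention, // chain ends at a running holder, so the wait is transient
    Deadlock,   // chain revisits a thread, so nobody on it can ever release
}

/// `waits_for[t] = h` means thread `t` is blocked on a lock held by thread `h`.
/// Assumption: a thread with no entry is running and will eventually release.
fn classify(waits_for: &HashMap<u32, u32>, start: u32) -> Wait {
    if !waits_for.contains_key(&start) {
        return Wait::Running;
    }
    let mut seen = HashSet::from([start]);
    let mut cur = start;
    while let Some(&holder) = waits_for.get(&cur) {
        if !seen.insert(holder) {
            // Reached a thread that is already on the chain: a cycle.
            return Wait::Deadlock;
        }
        cur = holder;
    }
    // The chain ended at a thread that is not waiting on anyone.
    Wait::Contention
}
```

A thread whose chain merely leads into a cycle is classified as deadlocked too, since the holder it depends on can never release. From the outside, both non-running cases look like "thread is blocked", which is why the external observer cannot tell them apart.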
Why it matters¶
A process-level liveness monitor (e.g. a watchdog on an internal REPL channel) sees both the same way: the process stops responding. Fly.io's 2025-05-28 Catalog lazy-loading rollout triggered a watchdog-bounce fleet-wide in Europe and Fly had to work out which pathology they were seeing:
"From the information we have, we've narrowed things down to two suspects. First, lazy-loading changes the read/write patterns and thus the pressure on the RWLocks the Catalog uses; it could just be lock contention. Second, we spot a suspicious if let." (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)
Discrimination technique: bounded lock acquisition¶
The load-bearing tool for telling them apart is try_write_for(Duration) (or equivalent timeout-bounded acquisition). See patterns/lock-timeout-for-contention-telemetry.
- Pure contention: the timeout fires but the watchdog does not bounce the process. The holder eventually releases and the timed-out write succeeds on retry. You see a heavy telemetry spike without a process bounce.
- Pure deadlock: the timeout fires, every retry times out as well, and there is still no progress. The watchdog bounces the process anyway.
- Lock-word corruption (the Fly.io bug): the timeout fires, the telemetry logs fill with spam, and the watchdog bounces the process. It looks like deadlock, but every stack trace shows threads waiting with no thread holding. (See concepts/descent-into-madness-debugging.)
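A minimal sketch of bounded acquisition with timeout telemetry, under stated assumptions. To stay dependency-free it uses only the standard library and emulates the timeout by polling try_write against a deadline; it does not call try_write_for and says nothing about how that method is implemented. The helper names write_with_timeout, write_logged, and demo_contention, the 1 ms poll interval, and the eprintln "telemetry" are all hypothetical.

```rust
use std::sync::{mpsc, Arc, RwLock, RwLockWriteGuard, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: poll `try_write` until `timeout` elapses.
/// A std-only stand-in for a native timed acquisition.
fn write_with_timeout<T>(lock: &RwLock<T>, timeout: Duration) -> Option<RwLockWriteGuard<'_, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        match lock.try_write() {
            Ok(guard) => return Some(guard),
            // A poisoned lock was still acquired; recover the guard.
            Err(TryLockError::Poisoned(p)) => return Some(p.into_inner()),
            Err(TryLockError::WouldBlock) => {
                if Instant::now() >= deadline {
                    return None;
                }
                thread::sleep(Duration::from_millis(1));
            }
        }
    }
}

/// Retry bounded acquisition up to `max_tries` times, logging each timeout.
/// Returns the closure's result (if the lock was ever acquired) and the
/// number of timeouts observed along the way.
fn write_logged<T, R>(
    lock: &RwLock<T>,
    per_try: Duration,
    max_tries: u32,
    f: impl FnOnce(&mut T) -> R,
) -> (Option<R>, u32) {
    let mut timeouts = 0;
    for _ in 0..max_tries {
        if let Some(mut guard) = write_with_timeout(lock, per_try) {
            return (Some(f(&mut *guard)), timeouts);
        }
        timeouts += 1;
        eprintln!("slow write: timed out after {:?} (attempt {})", per_try, timeouts);
    }
    (None, timeouts)
}

/// Contention case: the holder releases after 200 ms, so the writer logs
/// some timeouts and then succeeds.
fn demo_contention() -> (Option<u32>, u32) {
    let lock = Arc::new(RwLock::new(0u32));
    let (tx, rx) = mpsc::channel();
    let holder = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            let mut guard = lock.write().unwrap();
            tx.send(()).unwrap(); // signal that the lock is now held
            thread::sleep(Duration::from_millis(200));
            *guard += 1;
        })
    };
    rx.recv().unwrap();
    let out = write_logged(&*lock, Duration::from_millis(20), 100, |v| {
        *v += 1;
        *v
    });
    holder.join().unwrap();
    out
}
```

Read against the list above: a result of (Some(_), n) with n > 0 is the contention signature (timeouts logged, then progress), while (None, max_tries) means the retries never succeeded. That second outcome fits deadlock, but as the lock-word corruption case shows, it does not by itself prove that any thread is holding the lock.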
Why "it contended at scale" is the right intermediate hypothesis¶
The Round-1 Fly.io investigation landed on contention first — a good prior when a refactor changes the read/write pattern on a lock. The fix for contention (shorter critical sections, finer-grained locking, moving work off the hot path) is non-destructive and often useful regardless of whether it's the root cause. The discriminating evidence came from the telemetry path — slow-write logs that appeared only just before the lockup, in benign quiet applications, were inconsistent with contention and consistent with lock-word corruption.
Seen in¶
- sources/2025-05-28-flyio-parking-lot-ffffffffffffffff: Fly.io's 2025-05-28 debugging arc where Round 2's try_write_for refactor + lock-timeout telemetry was explicitly motivated by the need to separate these two pathologies. The refactor stood on its own (Fly.io keeps it post-fix) even when it turned out the true bug was neither pure deadlock nor pure contention but a bitwise double-free corrupting the lock word.
- sources/2024-07-29-netflix-java-21-virtual-threads-dude-wheres-my-lock: third variant canonicalised, namely starvation deadlock via carrier-thread exhaustion. Netflix's Java 21 VT-pinning incident is neither pure deadlock (no cycle) nor pure contention (not transient; it genuinely cannot resolve). Four virtual threads pinned inside synchronized blocks exhausted all 4 carrier threads on a 4-vCPU host; the lock owner (AsyncReporter flusher) released via Condition.awaitNanos, timed out, and was re-queued by AQS's FIFO protocol behind the pinned VTs. The pinned VTs are holding carriers, so nothing can run on those carriers, and the flusher can't jump the queue to reacquire the lock. Starvation is structural, not probabilistic.
Third variant: starvation deadlock via carrier-thread exhaustion¶
Netflix's 2024-07-29 virtual-thread pinning incident surfaces a third pathology class beyond pure deadlock and pure contention:
- Pure deadlock: cycle among N lock holders.
- Pure contention: lock contested, holder will eventually release.
- Starvation deadlock: one lock, N waiters, no current owner. The most recent owner released (e.g. via Condition.awaitNanos) and is queued behind other waiters. The queueing discipline plus a secondary bug (waiters pinning carrier threads) mean the waiters at the front of the queue can never acquire: they need to run before they can take the lock, but the resource that running requires is held by waiters behind them.
VT pinning is one instance of this class (waiters hold carriers they can't release). Similar shapes exist whenever lock-acquisition ordering is decoupled from resource-availability ordering.
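The decoupling of lock-acquisition order from resource-availability order can be sketched with a toy model. Everything here is an assumption made for illustration: the Waiter struct and head_can_acquire are hypothetical, strict FIFO handoff is assumed, and the queue contents used with it are made up rather than reconstructed from the Netflix incident. It is not a model of the JVM scheduler or of AQS internals.

```rust
/// Toy model of one waiter in a FIFO lock queue (hypothetical type).
#[derive(Clone, Copy)]
struct Waiter {
    /// Assumption: virtual threads need a carrier to run; platform threads do not.
    needs_carrier: bool,
    /// A pinned waiter keeps its carrier occupied while it is parked.
    holds_carrier: bool,
}

/// Under strict FIFO handoff only the head of the queue may take the free
/// lock, and it must be able to run to do so. Returns whether it can.
fn head_can_acquire(carriers: usize, queue: &[Waiter]) -> bool {
    let head = match queue.first() {
        Some(w) => w,
        None => return false, // nobody is waiting
    };
    if !head.needs_carrier || head.holds_carrier {
        return true; // the head can already run
    }
    // The head needs a free carrier; pinned waiters anywhere in the queue
    // occupy carriers without being able to make progress themselves.
    let pinned = queue.iter().filter(|w| w.holds_carrier).count();
    pinned < carriers
}
```

With 4 carriers and a queue of one unpinned waiter followed by four pinned ones, head_can_acquire returns false and nothing in the model ever changes that: the lock is free, yet no waiter can take it. Adding a fifth carrier, or letting a pinned waiter be the head, makes it return true, which is the sense in which the starvation is structural rather than a matter of timing.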
Diagnostic: the thread dump shows N waiters and no owner. Both watchdog bounce and lock-timeout telemetry are blind to this case at the thread-dump level (the Java 21 jcmd dump literally drops the AQS state). The fix is heap-dump introspection to read the AQS state directly and reconstruct the queue.
Related¶
- systems/parking-lot-rust: try_write_for is a parking_lot-specific feature used here.
- systems/fly-proxy: The system where this discrimination was applied.
- patterns/lock-timeout-for-contention-telemetry: The pattern that operationalises the discrimination.
- patterns/watchdog-bounce-on-deadlock: Why both pathologies look the same to the monitor.
- patterns/diagnose-via-heap-dump-lock-introspection: The diagnostic technique that distinguishes starvation deadlock from pure deadlock / pure contention when the thread dump is silent.
- concepts/if-let-lock-scope-bug: The 2024 Fly.io outage was pure deadlock; 2025-05-28 is neither pure deadlock nor pure contention.
- concepts/virtual-thread-pinning: The Java 21 failure mode that produces starvation deadlocks.
- systems/java-21-virtual-threads: The runtime whose pinning mechanism produces the third-variant case.
- companies/flyio: The company exemplifying this discrimination discipline.
- companies/netflix: Canonical starvation-deadlock instance.