CONCEPT Cited by 2 sources

Deadlock vs lock contention

Deadlock and lock contention can look identical to an outside observer watching a process hang, but they are different failure modes with different fixes:

  • Deadlock: threads are waiting for locks in a way that forms a cycle — no thread can make progress without another releasing, and none will. Permanent.
  • Contention: threads are waiting for a lock that is held, but the holder will eventually release. Transient (though can be arbitrarily long).
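The distinction above can be sketched as a check on a wait-for graph: deadlock is a cycle, contention is a chain that ends at a thread that is not itself blocked. This is a minimal illustration, not taken from either source; the function name `find_wait_cycle` and the thread names are hypothetical, and it assumes each blocked thread waits on exactly one holder.

```python
from typing import Dict, List, Optional


def find_wait_cycle(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return the threads forming a wait cycle, or None if there is none.

    `waits_for` maps a blocked thread to the thread holding the lock it
    wants. Threads that are not blocked do not appear as keys.
    """
    for start in waits_for:
        path: List[str] = []
        seen_at: Dict[str, int] = {}
        node = start
        # Follow the chain of holders until it ends or revisits a thread.
        while node in waits_for and node not in seen_at:
            seen_at[node] = len(path)
            path.append(node)
            node = waits_for[node]
        if node in seen_at:
            # The chain looped back on itself: that loop is the deadlock.
            return path[seen_at[node]:]
    return None


# Deadlock: A waits on B and B waits on A. Nobody will ever release.
deadlocked = {"A": "B", "B": "A"}

# Contention: A and B both wait on C, and C is not blocked, so it will release.
contended = {"A": "C", "B": "C"}

print(find_wait_cycle(deadlocked))
print(find_wait_cycle(contended))
```

From outside the process both inputs look like "A and B are stuck"; only the shape of the graph tells them apart.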

Why it matters

A process-level liveness monitor (e.g. a watchdog on an internal REPL channel) sees both the same way — the process stops responding. Fly.io's 2025-05-28 Catalog lazy-loading rollout triggered a watchdog-bounce fleet-wide in Europe and Fly had to work out which pathology they were seeing:

"From the information we have, we've narrowed things down to two suspects. First, lazy-loading changes the read/write patterns and thus the pressure on the RWLocks the Catalog uses; it could just be lock contention. Second, we spot a suspicious if let." (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)

Discrimination technique: bounded lock acquisition

The load-bearing tool for telling them apart is try_write_for(Duration) (or equivalent timeout-bounded acquisition). See patterns/lock-timeout-for-contention-telemetry.

  • Pure contention: the timeout fires but the watchdog does not bounce the process. The holder eventually releases, and the timed-out write succeeds on a retry. You see a heavy telemetry spike without a process bounce.
  • Pure deadlock: the timeout fires, retries make no progress, and the watchdog bounces the process anyway.
  • Lock-word corruption (the Fly.io bug): timeout fires, telemetry logs spam, watchdog bounces. Looks like deadlock but every stack trace shows threads waiting with no thread holding. (See concepts/descent-into-madness-debugging.)
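A rough sketch of the bounded-acquisition idea in Python follows. It is an analogue, not the Fly.io code: `threading.Lock.acquire(timeout=...)` stands in for try_write_for(Duration), a plain mutex stands in for the RWLock (the Python standard library has no reader-writer lock), and `bounded_write`, `contention_demo`, `stuck_demo` and the in-memory `slow_log` are hypothetical names used only for illustration.

```python
import threading
import time
from typing import Callable, List, Tuple


def bounded_write(lock: threading.Lock, work: Callable[[], None],
                  timeout_s: float, max_attempts: int,
                  slow_log: List[float]) -> bool:
    """Acquire `lock` with a timeout, recording every timed-out attempt.

    Returns True if `work` ran, False if every attempt timed out.
    """
    start = time.monotonic()
    for _ in range(max_attempts):
        if lock.acquire(timeout=timeout_s):
            try:
                work()
            finally:
                lock.release()
            return True
        # Telemetry: how long this writer has been waiting so far.
        slow_log.append(time.monotonic() - start)
    return False


def contention_demo() -> Tuple[bool, int]:
    """The holder releases after 150 ms: timeouts are logged, then success."""
    lock = threading.Lock()
    held = threading.Event()

    def holder() -> None:
        with lock:
            held.set()
            time.sleep(0.15)

    t = threading.Thread(target=holder)
    t.start()
    held.wait()  # make sure the holder owns the lock before we try
    log: List[float] = []
    ok = bounded_write(lock, lambda: None, 0.05, 20, log)
    t.join()
    return ok, len(log)


def stuck_demo() -> Tuple[bool, int]:
    """The lock is never released: every attempt times out."""
    lock = threading.Lock()
    lock.acquire()  # deliberately never released
    log: List[float] = []
    ok = bounded_write(lock, lambda: None, 0.02, 3, log)
    return ok, len(log)
```

In the contention case the log fills and the write still lands. In the stuck case the log fills and nothing lands. From the acquirer's side a true deadlock and a corrupted lock word produce the same stuck signature, which is why the stack traces are needed to separate those two.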

Why "it contended at scale" is the right intermediate hypothesis

The Round-1 Fly.io investigation landed on contention first — a good prior when a refactor changes the read/write pattern on a lock. The fix for contention (shorter critical sections, finer-grained locking, moving work off the hot path) is non-destructive and often useful regardless of whether it's the root cause. The discriminating evidence came from the telemetry path — slow-write logs that appeared only just before the lockup, in benign quiet applications, were inconsistent with contention and consistent with lock-word corruption.

Seen in

  • sources/2025-05-28-flyio-parking-lot-ffffffffffffffff — Fly.io's 2025-05-28 debugging arc where Round 2's try_write_for refactor + lock-timeout telemetry was explicitly motivated by the need to separate these two pathologies. The refactor stood on its own (Fly.io keeps it post-fix) even when it turned out the true bug was neither pure deadlock nor pure contention but a bitwise double-free corrupting the lock word.
  • sources/2024-07-29-netflix-java-21-virtual-threads-dude-wheres-my-lock: third variant canonicalised, starvation deadlock via carrier-thread exhaustion. Netflix's Java 21 VT-pinning incident is neither pure deadlock (no cycle) nor pure contention (not transient; it genuinely cannot resolve). Four virtual threads pinned inside synchronized blocks exhausted all 4 carrier threads on a 4-vCPU host; the lock owner (AsyncReporter flusher) released via Condition.awaitNanos, timed out, and was re-queued by AQS's FIFO protocol behind the pinned VTs. The pinned VTs are holding carriers, so nothing can run on those carriers, and the flusher can't jump the queue to reacquire the lock. Starvation is structural, not probabilistic.

Third variant: starvation deadlock via carrier-thread exhaustion

Netflix's 2024-07-29 virtual-thread pinning incident surfaces a third pathology class beyond pure deadlock and pure contention:

  • Pure deadlock: cycle among N lock holders.
  • Pure contention: lock contested, holder will eventually release.
  • Starvation deadlock: one lock, N waiters, no current owner. The recent owner released (e.g. via Condition.awaitNanos) and is queued behind other waiters. The queueing discipline plus a secondary bug (waiters pinning carrier threads) mean the waiters can never acquire: progress requires something to run, but the resource that running requires (a carrier thread) is held by threads that are themselves stuck in the queue.

VT pinning is one instance of this class (waiters hold carriers they can't release). Similar shapes exist whenever lock-acquisition ordering is decoupled from resource-availability ordering.
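The same shape can be reproduced in miniature with an ordinary thread pool. The sketch below is an analogy under stated assumptions, not Netflix's code: Python has no virtual threads, so pool workers stand in for carrier threads, blocked tasks stand in for pinned VTs, and the function name `waiters_see_release` is hypothetical. The waiters use a timeout only so the demo terminates; in the real incident nothing times out and the hang is permanent.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def waiters_see_release(pool_size: int, waiters: int,
                        wait_s: float = 0.3) -> bool:
    """Return True if every waiter saw the release before its timeout."""
    released = threading.Event()

    def waiter() -> bool:
        # Blocks while occupying a pool worker, like a pinned virtual
        # thread occupying its carrier.
        return released.wait(timeout=wait_s)

    def releaser() -> None:
        # The only task that can unblock the waiters. It needs a worker.
        released.set()

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(waiter) for _ in range(waiters)]
        pool.submit(releaser)  # queued behind the waiters
        return all(f.result() for f in futures)


# Four waiters on four workers: the releaser never gets a worker in time.
print(waiters_see_release(pool_size=4, waiters=4))

# One spare worker is enough for the releaser to run and unblock everyone.
print(waiters_see_release(pool_size=5, waiters=4))
```

No lock cycle exists in either run. The difference is only whether the task that would unblock everyone can obtain an execution resource, which is the decoupling of acquisition order from resource-availability order described above.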

Diagnostic: the thread dump shows N waiters and no owner — both watchdog bounce and lock-timeout telemetry are blind to this case at the thread-dump level (the Java 21 jcmd dump literally drops the AQS state). The fix is heap-dump introspection to read the AQS state directly and reconstruct the queue.
