CONCEPT Cited by 1 source

Deadlock vs lock contention

Deadlock and lock contention can look identical to an outside observer watching a process hang, but they are different failure modes with different fixes:

  • Deadlock: threads are waiting for locks in a way that forms a cycle — no thread can make progress without another releasing, and none will. Permanent.
  • Contention: threads are waiting for a lock that is held, but the holder will eventually release it. Transient (though the wait can be arbitrarily long).
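
The two definitions above can be made concrete with a small standard-library sketch. It is illustrative only and not taken from the Fly.io code: `cycle_probe` and `contention_probe` are hypothetical names, and the deadlock case uses non-blocking `try_lock` probes so the example terminates instead of hanging.

```rust
use std::sync::{Arc, Barrier, Mutex};
use std::thread;
use std::time::Duration;

/// Cycle: each thread holds one lock and probes the other with `try_lock`.
/// Returns (thread 1 got `b`, thread 2 got `a`). With blocking `lock()` calls
/// in the same positions, both threads would wait forever.
fn cycle_probe() -> (bool, bool) {
    let a = Arc::new(Mutex::new(()));
    let b = Arc::new(Mutex::new(()));
    let barrier = Arc::new(Barrier::new(2));

    let t1 = {
        let (a, b, barrier) = (Arc::clone(&a), Arc::clone(&b), Arc::clone(&barrier));
        thread::spawn(move || {
            let _held = a.lock().unwrap();
            barrier.wait(); // both first locks are now held
            let got = b.try_lock().is_ok();
            barrier.wait(); // keep holding until the other side has probed
            got
        })
    };
    let t2 = {
        let (a, b, barrier) = (Arc::clone(&a), Arc::clone(&b), Arc::clone(&barrier));
        thread::spawn(move || {
            let _held = b.lock().unwrap();
            barrier.wait();
            let got = a.try_lock().is_ok();
            barrier.wait();
            got
        })
    };
    (t1.join().unwrap(), t2.join().unwrap())
}

/// Contention: the holder keeps the lock for `hold_for`, then releases it.
/// Returns (waiter saw the lock as held, value read after the blocking wait).
fn contention_probe(hold_for: Duration) -> (bool, u32) {
    let lock = Arc::new(Mutex::new(0u32));
    let barrier = Arc::new(Barrier::new(2));
    let holder = {
        let (lock, barrier) = (Arc::clone(&lock), Arc::clone(&barrier));
        thread::spawn(move || {
            let mut guard = lock.lock().unwrap();
            barrier.wait(); // the lock is now held
            barrier.wait(); // the waiter has observed it as held
            thread::sleep(hold_for);
            *guard += 1;
        })
    };
    barrier.wait();
    let was_blocked = lock.try_lock().is_err();
    barrier.wait();
    let value = *lock.lock().unwrap(); // blocks, then succeeds once the holder releases
    holder.join().unwrap();
    (was_blocked, value)
}
```

In the cycle case neither probe can ever succeed while the other thread keeps its lock. In the contention case the waiter is blocked at first but always gets through after the holder finishes.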

Why it matters

A process-level liveness monitor (e.g. a watchdog on an internal REPL channel) sees both the same way: the process stops responding. Fly.io's 2025-05-28 Catalog lazy-loading rollout triggered fleet-wide watchdog bounces in Europe, and Fly.io had to work out which pathology it was seeing:

"From the information we have, we've narrowed things down to two suspects. First, lazy-loading changes the read/write patterns and thus the pressure on the RWLocks the Catalog uses; it could just be lock contention. Second, we spot a suspicious if let." (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)

Discrimination technique: bounded lock acquisition

The load-bearing tool for telling them apart is try_write_for(Duration) (or equivalent timeout-bounded acquisition). See patterns/lock-timeout-for-contention-telemetry.

  • Pure contention: the timeout fires but the watchdog does not bounce the process. The holder eventually releases, and the timed-out write succeeds on a retry. You see a heavy telemetry spike without a process bounce.
  • Pure deadlock: the timeout fires and retries never make progress. The watchdog bounces the process anyway.
  • Lock-word corruption (the Fly.io bug): the timeout fires, telemetry logs spam, and the watchdog bounces the process. It looks like deadlock, but every stack trace shows threads waiting with no thread holding the lock. (See concepts/descent-into-madness-debugging.)
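
A minimal sketch of how bounded acquisition produces the signals listed above. It is not Fly.io's code. The `parking_lot` crate's `RwLock::try_write_for(Duration)` returns `Option<guard>`; to stay standard-library only, this sketch swaps in a polling loop over `std::sync::RwLock::try_write`, which approximates the timeout but is not how `parking_lot` parks threads. The names `SLOW_WRITES` and `write_with_telemetry`, and the 1 ms poll interval, are assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{RwLock, RwLockWriteGuard, TryLockError};
use std::time::{Duration, Instant};

/// Hypothetical telemetry counter: write acquisitions that timed out.
static SLOW_WRITES: AtomicU64 = AtomicU64::new(0);

/// Std-only stand-in for a timeout-bounded write acquisition: poll `try_write`
/// until `timeout` elapses. Returns `None` on timeout.
fn try_write_for<T>(lock: &RwLock<T>, timeout: Duration) -> Option<RwLockWriteGuard<'_, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        match lock.try_write() {
            Ok(guard) => return Some(guard),
            // A poisoned lock was still acquired; this sketch recovers the guard.
            Err(TryLockError::Poisoned(p)) => return Some(p.into_inner()),
            Err(TryLockError::WouldBlock) if Instant::now() >= deadline => return None,
            Err(TryLockError::WouldBlock) => std::thread::sleep(Duration::from_millis(1)),
        }
    }
}

/// Retry a bounded write up to `max_attempts` times, counting every timeout.
/// `Ok(n)`: the write went through after `n` timeouts (n > 0 suggests contention).
/// `Err(n)`: every attempt timed out, which is consistent with deadlock or a
/// corrupted lock word; stack traces are then needed to see whether any thread
/// actually holds the lock.
fn write_with_telemetry<T>(
    lock: &RwLock<T>,
    timeout: Duration,
    max_attempts: u32,
    f: impl FnOnce(&mut T),
) -> Result<u32, u32> {
    let mut timeouts = 0;
    for _ in 0..max_attempts {
        if let Some(mut guard) = try_write_for(lock, timeout) {
            f(&mut *guard);
            return Ok(timeouts);
        }
        timeouts += 1;
        SLOW_WRITES.fetch_add(1, Ordering::Relaxed);
    }
    Err(timeouts)
}
```

Called as `write_with_telemetry(&lock, Duration::from_millis(5), 3, |v| *v += 1)`, the return value carries the distinction: a telemetry spike with eventual `Ok` results points at contention, while a run of `Err` results only says that nothing released the lock within the retry budget.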

Why "it contended at scale" is the right intermediate hypothesis

The Round-1 Fly.io investigation landed on contention first, which is a good prior when a refactor changes the read/write pattern on a lock. The fix for contention (shorter critical sections, finer-grained locking, moving work off the hot path) is non-destructive and often useful regardless of whether contention is the root cause. The discriminating evidence came from the telemetry path: slow-write logs appeared only just before the lockup, and in quiet, benign applications. Contention would be expected to show up where lock pressure is high, so this pattern was inconsistent with contention and consistent with lock-word corruption.

Seen in

  • sources/2025-05-28-flyio-parking-lot-ffffffffffffffff: Fly.io's 2025-05-28 debugging arc, in which Round 2's try_write_for refactor and lock-timeout telemetry were explicitly motivated by the need to separate these two pathologies. The refactor stood on its own (Fly.io keeps it post-fix) even though the true bug turned out to be neither pure deadlock nor pure contention but a bitwise double-free corrupting the lock word.