PATTERN Cited by 1 source
Lock timeout for contention telemetry¶
Problem¶
A lock that's held too long by a writer looks identical in production to a deadlock:
- Requests hang.
- Process CPU stays normal.
- Process-level liveness probes (see patterns/watchdog-bounce-on-deadlock) trip and bounce the process.
- You have no direct signal for "we bounced because of a writer holding the Catalog for 4.2 seconds" vs "we bounced because of an unrecoverable deadlock".
Without discrimination, you can't tell if you should be optimising your critical sections (contention) or auditing for cycles / re-entrance (deadlock). See concepts/deadlock-vs-lock-contention.
Pattern¶
Use bounded-wait lock acquisition (e.g. parking_lot's try_write_for(Duration)) that gives up on timeout instead of blocking forever. try_write_for returns None rather than a guard when the wait expires, and the caller maps that to a timeout error. Wire the timeout error into:
- Telemetry: emit a labeled log / metric for each timeout, including call-site, lock name, holder (if tracked), and any context (e.g. app ID, request ID).
- Failure-recovery path: the caller catches the timeout, returns a retryable error to its caller, or falls back to a degraded path.
- Dashboards: aggregate lock-timeout counts per lock-name over time. Spikes flag contention hot spots.
Combined with patterns/raii-to-explicit-closure-for-lock-visibility, the closure helper encapsulates the timeout + telemetry so call sites just specify the body.
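A minimal sketch of such a closure helper follows. It is std-only so it runs without extra crates: the polling `try_write_for` function is a stand-in written for this example, and with parking_lot that loop would be a single `lock.try_write_for(timeout)` call. The names `with_write`, `LockTimeout`, and `WRITE_TIMEOUTS` are hypothetical, and the `eprintln!` plus atomic counter stand in for whatever logging and metrics client is actually in use.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{RwLock, RwLockWriteGuard, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical error handed back to a call site when the write lock stays busy.
#[derive(Debug, PartialEq)]
pub struct LockTimeout {
    pub lock_name: &'static str,
    pub call_site: &'static str,
    pub waited: Duration,
}

/// Hypothetical counter standing in for a real metrics client.
pub static WRITE_TIMEOUTS: AtomicU64 = AtomicU64::new(0);

/// Std-only stand-in for bounded-wait acquisition: poll `try_write` until the
/// deadline passes. This is an assumption made for self-containment, not how
/// parking_lot implements its timed lock.
fn try_write_for<T>(lock: &RwLock<T>, timeout: Duration) -> Option<RwLockWriteGuard<'_, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        match lock.try_write() {
            Ok(guard) => return Some(guard),
            // A poisoned std lock was still acquired; recover the guard.
            Err(TryLockError::Poisoned(poisoned)) => return Some(poisoned.into_inner()),
            Err(TryLockError::WouldBlock) => {
                if Instant::now() >= deadline {
                    return None;
                }
                thread::sleep(Duration::from_millis(1));
            }
        }
    }
}

/// Run `body` under the write lock, or report a labeled timeout.
/// Call sites supply only the labels and the body.
pub fn with_write<T, R>(
    lock: &RwLock<T>,
    lock_name: &'static str,
    call_site: &'static str,
    timeout: Duration,
    body: impl FnOnce(&mut T) -> R,
) -> Result<R, LockTimeout> {
    let start = Instant::now();
    match try_write_for(lock, timeout) {
        Some(mut guard) => Ok(body(&mut *guard)),
        None => {
            let err = LockTimeout { lock_name, call_site, waited: start.elapsed() };
            // Telemetry: one counter bump and one labeled log line per timeout.
            WRITE_TIMEOUTS.fetch_add(1, Ordering::Relaxed);
            eprintln!(
                "lock timeout: lock={} call_site={} waited={:?}",
                err.lock_name, err.call_site, err.waited
            );
            Err(err)
        }
    }
}
```

A call site would then look like `with_write(&catalog, "catalog", "register_app", Duration::from_millis(50), |c| c.push(entry))`, with the `Err(LockTimeout)` branch feeding the failure-recovery path. Holder tracking and request context are left out of the sketch; they would be extra fields on the error and the log line.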
Why it works¶
- Bounded failures beat unbounded hangs for every post-mortem metric. You get call-site context, app-ID context, and timing — none of which a watchdog-bounce core dump gives you.
- Forces the question of what the caller should do when the lock is congested. For request-scoped code the answer is usually "fail the request and let the client retry". For background code it's often "skip this cycle and try again later". Either is better than "hang until watchdog".
- Discriminates deadlock from contention. Under contention, timeouts fire, the logs fill up, but the process keeps making progress as the holder eventually releases. Under deadlock, timeouts fire and the process can't recover; only the watchdog bounce clears it. Under lock-word corruption, timeouts fire and every thread reports the same "no holder" shape — a third, diagnostic signature.
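The two caller policies named above (fail the request as retryable, or skip the background cycle) can be sketched as follows. Everything here is illustrative: `Catalog` as a `Vec<String>`, `register_app`, `compact_tick`, and the 50 ms budget are assumptions made for this example, and the polling `try_write_for` is again a std-only stand-in for parking_lot's method of the same name.

```rust
use std::sync::{RwLock, RwLockWriteGuard};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical catalog type, used only for this sketch.
type Catalog = Vec<String>;

/// Assumed wait budget; the right value depends on the workload.
const WRITE_BUDGET: Duration = Duration::from_millis(50);

/// Std-only stand-in for a timed write acquisition: poll until the deadline.
/// For brevity a poisoned lock is treated the same as a busy one.
fn try_write_for<T>(lock: &RwLock<T>, timeout: Duration) -> Option<RwLockWriteGuard<'_, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(guard) = lock.try_write() {
            return Some(guard);
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(Duration::from_millis(1));
    }
}

#[derive(Debug, PartialEq)]
pub enum RequestError {
    /// The client may retry; nothing was modified.
    Retryable,
}

/// Request-scoped caller: on timeout, fail the request and let the client retry.
pub fn register_app(catalog: &RwLock<Catalog>, app_id: &str) -> Result<usize, RequestError> {
    match try_write_for(catalog, WRITE_BUDGET) {
        Some(mut c) => {
            c.push(app_id.to_string());
            Ok(c.len())
        }
        None => Err(RequestError::Retryable),
    }
}

#[derive(Debug, PartialEq)]
pub enum Tick {
    /// The cycle ran and removed this many adjacent duplicate entries.
    Ran(usize),
    /// The lock was congested; try again on the next cycle.
    Skipped,
}

/// Background caller: on timeout, skip this cycle instead of hanging.
pub fn compact_tick(catalog: &RwLock<Catalog>) -> Tick {
    match try_write_for(catalog, WRITE_BUDGET) {
        Some(mut c) => {
            let before = c.len();
            c.dedup();
            Tick::Ran(before - c.len())
        }
        None => Tick::Skipped,
    }
}
```

Both paths return promptly whether or not the lock is available, so a congested lock shows up as retryable errors and skipped cycles rather than as a hang that only the watchdog can end.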
Costs¶
- Choosing the duration. Too short → spurious failures under legitimate load spikes. Too long → watchdog still fires first on severe cases.
- Error-handling sprawl. Every critical section now has a failure path the caller must handle.
- Not a fix — the telemetry is diagnostic; a hot lock still needs a redesign (finer-grained locking, work off the hot path, or a lock-free data structure).
- Writer starvation: if read locks are held long enough that writers systematically time out, your system may effectively stop writing. Under parking_lot's writer-preference this is less of an issue, but under a reader-preference RwLock it's a real risk.
Canonical instance — Fly.io's Round-2 refactor¶
From the 2025-05-28 parking_lot post:
"Before rolling out a new lazy-loading fly-proxy, we do some refactoring: - our Catalog write locks all time out, so we'll get telemetry and a failure recovery path if that's what's choking the proxy to death…" (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)
The timeout refactor was how Fly planned to discriminate contention from deadlock. When the lockups recurred and the timeout logs spammed without ever showing the holder at fault, that evidence was what ruled out both theories, forcing the descent into madness that eventually surfaced the bitwise double-free root cause.
Seen in¶
- sources/2025-05-28-flyio-parking-lot-ffffffffffffffff — Canonical wiki instance. Also the canonical example of "instrumentation produces unexpectedly misleading evidence" — timeout logs fired just before lockups even though no thread actually held the lock.
Related¶
- systems/parking-lot-rust — Provides try_write_for as the primitive.
- systems/fly-proxy — Applied here.
- concepts/deadlock-vs-lock-contention — The discrimination this pattern enables.
- patterns/raii-to-explicit-closure-for-lock-visibility — Natural pairing; the helper function wraps the timeout + telemetry.
- patterns/watchdog-bounce-on-deadlock — The safety net that catches what this pattern can't recover.
- companies/flyio — Fly.io.