PATTERN

Lock timeout for contention telemetry

Problem

A lock that's held too long by a writer looks identical in production to a deadlock:

  • Requests hang.
  • Process CPU stays normal.
  • Process-level liveness probes (see patterns/watchdog-bounce-on-deadlock) trip and bounce the process.
  • You have no direct signal for "we bounced because of a writer holding the Catalog for 4.2 seconds" vs "we bounced because of an unrecoverable deadlock".

Without discrimination, you can't tell if you should be optimising your critical sections (contention) or auditing for cycles / re-entrance (deadlock). See concepts/deadlock-vs-lock-contention.

Pattern

Use bounded-wait lock acquisition — e.g. parking_lot's try_write_for(Duration) — that fails and returns an error on timeout instead of blocking forever. Wire the timeout error into:

  1. Telemetry: emit a labeled log / metric for each timeout, including call-site, lock name, holder (if tracked), and any context (e.g. app ID, request ID).
  2. Failure-recovery path: the caller catches the timeout, returns a retryable error to its caller, or falls back to a degraded path.
  3. Dashboards: aggregate lock-timeout counts per lock-name over time. Spikes flag contention hot spots.

Combined with patterns/raii-to-explicit-closure-for-lock-visibility, the closure helper encapsulates the timeout + telemetry so call sites just specify the body.

Why it works

  • Bounded failures beat unbounded hangs for every post-mortem metric. You get call-site context, app-ID context, and timing — none of which a watchdog-bounce core dump gives you.
  • Forces the question of what the caller should do when the lock is congested. For request-scoped code the answer is usually "fail the request and let the client retry". For background code it's often "skip this cycle and try again later". Either is better than "hang until watchdog".
  • Discriminates deadlock from contention. Under contention, timeouts fire, the logs fill up, but the process keeps making progress as the holder eventually releases. Under deadlock, timeouts fire and the process can't recover; only the watchdog bounce clears it. Under lock-word corruption, timeouts fire and every thread reports the same "no holder" shape — a third, diagnostic signature.

Costs

  • Choosing the duration. Too short → spurious failures under legitimate load spikes. Too long → watchdog still fires first on severe cases.
  • Error-handling sprawl. Every critical section now has a failure path the caller must handle.
  • Not a fix — the telemetry is diagnostic; a hot lock still needs a redesign (finer-grained locking, work off the hot path, or a lock-free data structure).
  • Writer starvation: if readers are held long enough that writers systematically time out, your system may effectively stop writing. Under parking_lot's writer-preference this is less of an issue, but under a reader-preference RWLock it's a real risk.

Canonical instance — Fly.io's Round-2 refactor

From the 2025-05-28 parking_lot post:

"Before rolling out a new lazy-loading fly-proxy, we do some refactoring: - our Catalog write locks all time out, so we'll get telemetry and a failure recovery path if that's what's choking the proxy to death…" (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)

The timeout refactor was how Fly planned to discriminate contention from deadlock. When the lockups recurred and the timeout logs spammed without ever showing the holder at fault, that evidence was what ruled out both theories, forcing the descent into madness that eventually surfaced the bitwise double-free root cause.
