# Lock timeout hedging
## Definition
Lock timeout hedging is the policy choice, in a blocking-lock
protocol, of "wait for the lock up to a bounded window; if it
doesn't come, give up waiting and do the work independently."
The waiter thereby caps its tail latency at timeout + local_work
rather than lock_holder_work + queueing.
It's a direct application of the hedging idea (tail-latency at scale) to lock acquisition: if the first attempt (wait for the lock) is taking longer than expected, start a parallel attempt (invoke independently) rather than continue waiting indefinitely.
In a request-collapsing implementation, this is the failure policy that prevents a slow or hung upstream invocation from blocking arbitrarily many waiters behind it.
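The policy can be sketched in a few lines of TypeScript. This is an illustrative single-process model, not the Vercel implementation; every name (`collapsedFetch`, `inFlight`) is made up for the example. Waiters share the holder's in-flight promise but race it against a timeout, and on timeout they hedge by invoking independently.

```typescript
// Illustrative single-process sketch; all names are hypothetical.
const inFlight = new Map<string, Promise<string>>();

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function collapsedFetch(
  key: string,
  work: () => Promise<string>,
  timeoutMs: number,
): Promise<string> {
  const existing = inFlight.get(key);
  if (existing) {
    // Wait for the current lock holder, but only up to the timeout window.
    const timedOut = Symbol("timedOut");
    const winner = await Promise.race<string | symbol>([
      existing,
      sleep(timeoutMs).then(() => timedOut),
    ]);
    if (typeof winner === "string") return winner;
    // Hedge: give up waiting and do the work independently.
    return work();
  }
  // No holder yet: become the holder and publish the in-flight promise.
  const holder = work().finally(() => inFlight.delete(key));
  inFlight.set(key, holder);
  return holder;
}
```

In the common case every concurrent caller for a key gets the holder's result from one invocation; only a holder slower than the timeout triggers extra invocations.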
## Why it's necessary
A pure lock-wait implementation has a catastrophic failure mode: if the lock holder's work hangs or is slow, every subsequent arrival joins the wait queue and contributes to an ever-growing blocked fleet. The Vercel post names this precisely:
"Timeouts are more dangerous. If the lock holder takes a long time (or never completes) the other requests waiting on the lock could be stuck indefinitely. During a traffic spike, this can mean dozens or hundreds of requests pile up behind a single slow invocation. […] To prevent this, locks are created with explicit timeouts." (Source: sources/2026-04-21-vercel-preventing-the-stampede-request-collapsing-in-the-vercel-cdn)
The hedge is the escape valve: past the timeout, each waiter becomes its own independent worker. Collapsing is lost for that key for that window, but nothing is blocked — the system falls back to no-collapsing behaviour, which is merely wasteful, not broken.
## The tradeoff
Lock timeout hedging is an explicit choice to prioritise bounded latency over bounded work:
| Property | Without timeout | With timeout |
|---|---|---|
| Max wait for waiters | ∞ (stuck behind hung holder) | ~timeout window |
| Max concurrent invocations | 1 (good path) or ∞ stuck waiters (bad path) | up to N (every timed-out waiter invokes) |
| Common-case work | 1 invocation | 1 invocation |
| Slow-path work | 1 invocation (but everyone is blocked) | up to N invocations (everyone hedges) |
| Cascading failure risk | High (one slow key → dead region) | Low (slow key → extra work but bounded waits) |
The post frames this as "optimize for the common case while still remaining resilient to errors and long-tail latencies".
## Parameter choice
The Vercel implementation uses 3 seconds as the lock timeout on both the node-level and regional-level locks (concepts/two-level-distributed-lock):
```ts
const nodeLock = createNodeLock(cacheKey, { timeout: 3000 });
const regionalLock = createRegionalLock(cacheKey, { timeout: 3000 });
```
The choice of 3 seconds is not justified in detail in the post, but the reasoning shape is:
- Long enough that typical ISR regenerations (sub-second) complete under the timeout, so collapsing succeeds in the common case.
- Short enough that a hung invocation can't block waiters past a TTFB budget the CDN is willing to absorb.
- Aligned across levels. Both locks have the same 3 s; a waiter that spends its full budget on the node lock won't then get a second full budget on the regional lock. (The post isn't explicit about how budget is shared, but the same value on both strongly implies "whichever comes first".)
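The "whichever comes first" reading can be made concrete with a small sketch. None of these names appear in the post; `TryAcquire` stands in for a lock wait that gives up after `timeoutMs`. The point is that one deadline is derived up front and each lock level only gets the remaining slice, so the two windows can't stack.

```typescript
// Hypothetical sketch of one shared wait budget across both lock levels.
type TryAcquire = (timeoutMs: number) => Promise<boolean>;

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function waitBothOrHedge(
  nodeLock: TryAcquire,
  regionalLock: TryAcquire,
  budgetMs: number,
): Promise<"collapsed" | "hedged"> {
  const deadline = Date.now() + budgetMs;
  // The node-level wait consumes part of the shared budget...
  if (!(await nodeLock(Math.max(0, deadline - Date.now())))) return "hedged";
  // ...and the regional wait only gets whatever is left, so the two
  // windows can never stack into budgetMs + budgetMs.
  if (!(await regionalLock(Math.max(0, deadline - Date.now())))) return "hedged";
  return "collapsed";
}
```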
## Failure modes of the hedge itself
- Timeout too short. Waiters give up before the lock holder finishes → collapsing rarely happens in practice → effective behaviour converges to no-collapsing.
- Timeout too long. Slow invocations block enough requests for long enough to degrade the CDN's TTFB p99 materially.
- Systemic slow upstream. If every invocation is slow, every waiter times out and hedges, so collapsing delivers zero benefit under exactly the conditions it was designed for. The post acknowledges this only in passing ("Even in the worst case, users continue to get responses, though at the cost of multiple invocations"); this is the point where observability (the invocation-count vs collapsed-count ratio) becomes the key signal that something upstream is wrong.
- Timeout amplification. Without careful budget sharing, a waiter that spends 3 s on the node lock and then 3 s on the regional lock has burned 6 s of TTFB before even invoking; the stacked budgets undermine the bounded-latency goal the timeouts exist to serve.
## Compared to other hedging applications
- Hedged requests — same idiom applied to RPC calls. Send a backup request after a latency threshold; use whichever response arrives first.
- Request hedging in Envoy / Istio — upstream retry as hedge once p95 threshold is exceeded.
- Hedged observability stack — dual-writing to primary and backup telemetry paths with latency-based failover.
Lock timeout hedging is the pessimistic-wait-then-abandon shape: don't start a parallel attempt preemptively; wait up to the timeout, and only then hedge. This works because the "original attempt" is passive (waiting for a lock) rather than active (doing RPC work), so there's no resource wasted during the wait itself.
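For contrast, the speculative shape (hedged RPC) in the same few lines, with illustrative names: here the backup attempt runs concurrently with the primary once the threshold passes, so real work happens on both paths, unlike the passive wait above.

```typescript
// Sketch of a speculative hedged request, for contrast with the
// wait-then-abandon shape. All names are illustrative.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function hedgedCall<T>(
  attempt: () => Promise<T>,
  thresholdMs: number,
): Promise<T> {
  const primary = attempt();
  // Fire a backup attempt once the primary exceeds the latency
  // threshold; whichever settles first wins the race.
  const backup = sleep(thresholdMs).then(() => attempt());
  return Promise.race([primary, backup]);
}
```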
## Seen in
- sources/2026-04-21-vercel-preventing-the-stampede-request-collapsing-in-the-vercel-cdn — canonical distributed-cache application. Explicit 3 s timeouts on both node and regional locks.
## Related
- concepts/request-collapsing — the primary primitive whose failure mode lock timeout hedging bounds.
- concepts/cache-stampede — the failure class collapsing prevents; hedging makes the prevention robust to slow regenerations.
- concepts/double-checked-locking — correctness protocol that coexists with timeouts (the second check still runs after timeout-induced hedging; it's just an unlucky check).
- concepts/two-level-distributed-lock — the lock hierarchy whose timeouts hedging applies to.
- concepts/tail-latency-at-scale — the broader hedging family this is an instance of.
- concepts/thundering-herd — the class of failure hedged locks refuse to amplify.
- systems/vercel-cdn — canonical production implementation.