CONCEPT Cited by 3 sources
Thundering herd¶
Definition¶
A thundering herd is a failure mode where a resource is overwhelmed by too many simultaneous requests, typically because many clients were previously blocked, disconnected, or idle and are now all released at the same instant. The resource has no escape route (there is nowhere for load to spill), so it either tips over or adds enough latency to cascade into downstream failures.
The metaphor, per Figma's LiveGraph post (sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale): "The name is derived from the vision of a huge herd of bulls coming at you; there's nowhere to run."
Classic shapes¶
- Cache-cold-after-deploy / cache-restart stampede. A tier-2 cache is wiped (deploy, crash, flush). All clients that previously hit the cache now simultaneously hit the backing store. If the backing store's capacity was sized for cached-traffic-plus-slack, it tips over.
- Reconnection stampede. A WebSocket tier drops many connections; reconnect retries are synchronised (all clients see the outage at the same instant) and hit the connection-establishment path simultaneously. Figma's FigCache post (sources/2026-04-21-figma-figcache-next-generation-data-caching-platform) names this as a structural Redis-connection problem pre-FigCache: "Thundering-herd connection establishment whenever client services scaled out quickly — bottlenecking I/O, degrading availability."
- Lock contention on a single popular key. Many readers arrive for the same expired key → all miss the cache → all start the same DB fetch → duplicate work and serialized back-pressure.
- Cron-second alignment. Every client runs the same job at `0 * * * *`. Server load spikes at :00 regardless of the hourly average.
Why it's a structural problem (not an operational one)¶
Thundering herd is not a capacity problem in the usual sense — average capacity is fine, concurrency peak is what breaks. It's a synchronization failure: something aligned all clients' "release" moments. That synchronizer is typically:
- A deploy — everyone reconnects/retries at once.
- A shared cache boundary — everyone expires / cold-starts the cache at once.
- A shared schedule — everyone's cron fires at :00.
- A shared outage recovery: every client saw the outage begin at the same time T and uses the same maximum retry delay, so all retry timers fire together at T + max.
Solutions therefore don't increase capacity; they de-synchronize the clients, or remove the boundary that aligns them.
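The gap between average and peak load can be made concrete with a small simulation. This is a minimal sketch under stated assumptions: the fleet size `N_CLIENTS` and the spread window `WINDOW_S` are hypothetical numbers chosen for illustration, not figures from any of the cited sources.

```python
import random
from collections import Counter


def peak_per_second(release_times):
    """Return the largest number of requests landing in any one-second bucket."""
    buckets = Counter(int(t) for t in release_times)
    return max(buckets.values())


N_CLIENTS = 10_000  # hypothetical fleet size
WINDOW_S = 60       # hypothetical spread window, in seconds

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synchronized: every client is released at t = 0.
aligned = [0.0] * N_CLIENTS
# De-synchronized: each client picks a uniform random delay inside the window.
spread = [rng.uniform(0, WINDOW_S) for _ in range(N_CLIENTS)]

aligned_peak = peak_per_second(aligned)
spread_peak = peak_per_second(spread)

print(f"aligned peak: {aligned_peak} req/s, spread peak: {spread_peak} req/s")
```

Both runs issue the same total number of requests; only the alignment of the release moments differs, which is why adding average capacity does not address the problem.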
Mitigations (by failure shape)¶
Cache wipe on deploy¶
- Deploy the cache separately from the front-end. Figma's LiveGraph 100x fix: the old in-server cache stampeded on every LiveGraph deploy; the new architecture puts the cache in a separate tier so the edge can redeploy without wiping caches (systems/livegraph, patterns/independent-scaling-tiers).
- Hot replicas on standby. Figma's new LiveGraph cache keeps warm replicas ready; during deploys, traffic flips to replicas without cold-starting the primary.
- Request coalescing ("singleflight"). If N clients miss the same key simultaneously, only one fetch runs; the other N-1 coalesce onto it. Figma's LiveGraph rendezvous layer makes this explicit.
- Warm-up scripts before cutover.
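The request-coalescing item above can be sketched in a few lines. This is an illustrative, thread-based version, not Figma's implementation; the names `SingleFlight`, `_Call`, and `do` are hypothetical and chosen for this example only.

```python
import threading


class _Call:
    """State shared between the leader and the followers of one in-flight fetch."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Coalesce concurrent calls for the same key into a single execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> _Call

    def do(self, key, fn):
        # Decide under the lock whether this caller leads or follows.
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call

        if leader:
            try:
                call.result = fn()  # the only real fetch for this key
            except BaseException as exc:
                call.error = exc
            finally:
                # Remove the entry first so later callers start a fresh fetch,
                # then wake every follower waiting on this one.
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()  # followers block until the leader finishes

        if call.error is not None:
            raise call.error
        return call.result
```

A caller would wrap its backing-store read, for example `sf.do("user:42", lambda: load_from_db(42))`, where `load_from_db` stands in for whatever fetch the service performs. Note that this sketch only deduplicates fetches that overlap in time; it does not cache results after the leader returns.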
Reconnection stampede¶
- Exponential backoff with jitter: the canonical reference is the AWS Architecture Blog post "Exponential Backoff And Jitter".
- Connection multiplexing — a shared proxy tier holds few persistent upstream connections on behalf of many client connections, so client-fleet fan-in doesn't map 1:1 to upstream connection establishment. systems/figcache is the canonical wiki instance; order-of-magnitude drop in Redis cluster connection counts post-rollout.
- Client-scale-out rate limits — slow how fast a fleet can scale out, so connection establishment doesn't saturate.
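A minimal sketch of exponential backoff with "full jitter", where each retry sleeps for a uniform random time between zero and a capped exponential ceiling. The function names, the default `base` and `cap` values, and the choice to retry only on `ConnectionError` are assumptions made for this example.

```python
import random
import time


def full_jitter_delay(attempt, base=0.1, cap=30.0, rng=random):
    """Delay before retry number `attempt` (0-based).

    Uniform in [0, min(cap, base * 2**attempt)], so clients that failed at the
    same instant do not retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def connect_with_retry(connect, max_attempts=6, base=0.1, cap=30.0,
                       sleep=time.sleep, rng=random):
    """Call `connect()` until it succeeds or `max_attempts` is exhausted."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(full_jitter_delay(attempt, base, cap, rng))
```

The `sleep` and `rng` parameters are injectable only so the behaviour can be exercised without real waiting; production callers would normally leave the defaults.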
Hot-key expiry stampede¶
- Probabilistic early expiration: let a small percentage of requests re-fetch the value shortly before the TTL expires, so refresh work is amortised across requests instead of concentrated at the instant the TTL hits.
- Stale-while-revalidate — serve stale data while background refetch runs; only the first miss pays latency.
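One known way to implement probabilistic early expiration is to weight the early-refresh decision by how long a recompute takes, a formulation sometimes called XFetch. The sketch below uses that formulation as an assumption; none of the cited sources states which variant they use, and the function and parameter names are hypothetical.

```python
import math
import random


def should_refresh_early(now, expires_at, recompute_cost_s, beta=1.0, rng=random):
    """Decide whether this request should refresh the entry before it expires.

    The chance of returning True rises as `now` approaches `expires_at`.
    `recompute_cost_s` is how long a refresh usually takes; `beta` > 1 makes
    early refreshes more eager, `beta` < 1 makes them rarer.
    """
    u = 1.0 - rng.random()  # uniform in (0, 1], which avoids log(0)
    # -log(u) is a non-negative random offset; add it, scaled, to the clock.
    return now - recompute_cost_s * beta * math.log(u) >= expires_at
```

With this rule, a request arriving one recompute-time before expiry refreshes with probability of roughly 1/e, while requests far from expiry almost never do, so a single early request usually refreshes the key before the TTL boundary is reached.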
Cron alignment¶
- Jitter the schedule per client (`0 * * * *` → `random(0-59) * * * *`).
- Batch at the server rather than having N clients each call home.
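Per-client schedule jitter can be sketched as follows. Instead of a fresh random minute on every install, this version derives the minute from a stable hash of the client ID so the schedule survives restarts; that substitution, and the function name `jittered_hourly_cron`, are assumptions of this example rather than anything the sources prescribe.

```python
import hashlib


def jittered_hourly_cron(client_id: str) -> str:
    """Return an hourly cron expression whose minute is derived from client_id.

    SHA-256 is used only as a stable, well-mixed hash: Python's built-in
    hash() is salted per process for strings, so it would change on restart.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    minute = int.from_bytes(digest[:8], "big") % 60
    return f"{minute} * * * *"


if __name__ == "__main__":
    for cid in ("client-a", "client-b", "client-c"):
        print(cid, "->", jittered_hourly_cron(cid))
```

Across a large fleet the minutes spread roughly evenly over the hour, so the server sees about 1/60 of the clients per minute instead of all of them at :00.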
Named production incidents¶
- Figma LiveGraph pre-100x — each deploy blew away the in-server cache; all clients reconnected at the same instant; DB took the cold fetches at the concurrency peak. "Creating a thundering herd that was getting bigger by the day." (sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale)
- Figma Redis pre-FigCache — client service scale-outs triggered simultaneous Redis connection establishments at Redis nodes' connection-count hard limits. Bespoke client-side connection pool was a localized patch; FigCache is the structural fix. (sources/2026-04-21-figma-figcache-next-generation-data-caching-platform)
Structural defence: don't let a shared boundary align clients¶
The strongest mitigation is to remove the synchronizer:
- Decouple cache tier from front-end deploy — Figma LiveGraph's move.
- Hold the expensive upstream connection in a separate tier that scales on its own axis, not with the client fleet — Figma FigCache's move.
- Isolate per-tenant cadence so one tenant's outage doesn't sync the other tenants' retries.
- Pre-warm before traffic shifts, so the cutover isn't the first traffic event.
This is the general shape of patterns/independent-scaling-tiers for caches.
Seen in¶
- Canonical database-proxy-tier instance. Jarod Reyes (PlanetScale, 2021-09-30) names the specific thundering-herd shape where a slow hot-row `SELECT` causes a cascading database outage: "Often, the outages we see from customers who were on NoSQL or RDS databases are cascading outages due to an initial spike in query response times." Each arriving caller on the slow hot row occupies a separate upstream connection for its own full execution, so upstream pool occupancy scales with total callers, not unique queries. Reyes canonicalises the Vitess query-consolidation primitive + consolidate-identical-in-flight-queries pattern as the proxy-tier structural fix: merge identical simultaneously-arriving queries into one upstream execution, fan the result back to all waiting callers, and cap upstream pool pressure at `O(unique queries in flight)` rather than `O(total callers)`. Sister-primitive to concepts/connection-multiplexing at the cache tier (Figma FigCache) and concepts/read-invalidation-rendezvous at the LiveGraph altitude; all three address the same structural class at different altitudes.
- sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale: explicit definition (the bull-herd metaphor) and canonical cache-wipe-on-deploy incident shape; structural fix = separate cache tier + hot standbys.
- sources/2026-04-21-figma-figcache-next-generation-data-caching-platform — reconnection-stampede shape eliminated by connection multiplexing + drop-in RESP proxy (systems/figcache); named as a pre-FigCache scaling limit.
- Canonical async-job-framework instance. Mike Coutermarsh (PlanetScale, 2022-02-17) names the thundering-herd shape where a paired scheduler bulk-enqueues 10,000 jobs that each call the same external API: without jitter, workers drain the queue as fast as they can and all 10,000 requests arrive at the downstream within seconds. Structural fix = jittered scheduling via `CleanUpJob.perform_with_jitter(id, max_wait: 30.minutes)`, which attaches a random `rand(0..max_wait)` delay per job so execution spreads over the window. Same structural shape as cache-wipe stampedes (many clients released simultaneously) at a different altitude (outbound-to-downstream instead of inbound-to-cache).
- sources/2026-04-21-vercel-preventing-the-stampede-request-collapsing-in-the-vercel-cdn: canonical CDN-altitude instance. Vercel frames the ISR cache-expiry stampede explicitly: "Picture a page that just recently expired, or a new route getting hit for the first time. Multiple users request it simultaneously. Each request sees an empty cache and triggers a function invocation. […] For a popular route, this can mean dozens of simultaneous invocations, all regenerating the same page." Canonicalised as the child concept concepts/cache-stampede. Structural fix = per-region request collapsing with a two-level (node + regional) lock; the node lock is explicitly there to prevent the regional-lock acquisition itself from becoming a thundering herd: "Without the node-level grouping, hundreds of concurrent requests could all compete for the regional lock simultaneously. This would create a thundering herd problem where the lock coordination itself becomes a bottleneck." I.e. Vercel's design names TH as the failure mode both at the cache-miss layer (the problem) and at the naive-lock layer (the failure mode of a careless fix). Production numbers: 3M+/day collapsed on cache miss + 90M+/day on background revalidation, 100% of ISR projects auto-enrolled via framework-inferred cache policy. See systems/vercel-cdn for the full system.
- Canonical benchmark-workload-design instance. Liz van Dijk (PlanetScale, 2022-09-08) names thundering herd as an explicit stressor that TAOBench is designed to simulate, via its `objects+edges` schema (concepts/social-graph-objects-and-edges): "Think of what happens when something goes viral: a thundering herd of users comes through to interact with a specific piece of content posted somewhere. On the database level, beyond a sudden surge in connections, this can also translate into various types of locks centered around the backing rows for that piece, which can have rippling effects that ultimately translate to slower content access times for the users on the platform." TAOBench is the first benchmark on this wiki that measures substrate thundering-herd response by design, distinct from `sysbench-tpcc`'s shard-key-aligned workload where no row attracts disproportionate concurrent traffic. The load-bearing framing pairs thundering herd with concepts/hot-row-problem as the two stressors viral content creates: the row-level contention (hot row) and the connection/lock fanout (thundering herd) are distinct failure modes that the social-graph workload exercises together.
Related¶
- concepts/cache-stampede — the cache-boundary sub-shape of the TH family; Vercel CDN's request-collapsing fix addresses exactly this sub-shape.
- concepts/request-collapsing — CDN-altitude structural fix for cache stampedes (cousin to query-consolidation at the SQL-proxy altitude).
- concepts/double-checked-locking — correctness protocol used by request-collapsing implementations.
- concepts/two-level-distributed-lock — scalability technique for request-collapsing at CDN scale; prevents the lock protocol itself from becoming a TH bottleneck.
- concepts/lock-timeout-hedging — bounded-wait failure policy that prevents slow invocations from cascading.
- concepts/connection-multiplexing — the proxy-tier structural fix for reconnection-stampede shape.
- concepts/query-consolidation — the proxy-tier structural fix for the hot-row-query-stampede shape at the SQL wire-protocol altitude.
- patterns/consolidate-identical-inflight-queries — the pattern that carries the query-consolidation primitive.
- patterns/independent-scaling-tiers — decouple deploy/scaling boundaries so there's no single "release" moment.
- concepts/read-invalidation-rendezvous — request-coalescing (a same-type-op dedup) mitigation for hot-key stampedes.
- patterns/caching-proxy-tier — the separate-tier deployment shape that prevents cache-deploy stampedes.