CONCEPT
Thundering herd¶
Definition¶
A thundering herd is a failure mode where a resource is overwhelmed by too many simultaneous requests, typically because many clients were previously blocked, disconnected, or idle and are all released at the same instant. The resource has no way to shed or spill the load, so it either tips over or adds enough latency to cascade into downstream failures.
The metaphor, per Figma's LiveGraph post (sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale): "The name is derived from the vision of a huge herd of bulls coming at you; there's nowhere to run."
Classic shapes¶
- Cache-cold-after-deploy / cache-restart stampede. A tier-2 cache is wiped (deploy, crash, flush). All clients that previously hit the cache now simultaneously hit the backing store. If the backing store's capacity was sized for cached-traffic-plus-slack, it tips over.
- Reconnection stampede. A WebSocket tier drops many connections; reconnect retries are synchronised (all clients see the outage at the same instant) and hit the connection-establishment path simultaneously. Figma's FigCache post (sources/2026-04-21-figma-figcache-next-generation-data-caching-platform) names this as a structural Redis-connection problem pre-FigCache: "Thundering-herd connection establishment whenever client services scaled out quickly — bottlenecking I/O, degrading availability."
- Lock contention on a single popular key. Many readers arrive for the same expired key → all miss the cache → all start the same DB fetch → duplicate work and serialized back-pressure.
- Cron-second alignment. Every client runs the same job at `0 * * * *`. Server load spikes at :00 regardless of the hourly average.
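The hot-key shape is easy to reproduce. In this sketch (illustrative, not from any source above), the barriers stand in for "all clients released at the same instant": every reader observes the miss before any of them finishes the fetch, so the backing store does N units of work for one key.

```python
import threading

cache = {}
fetch_log = []                  # one entry per backing-store fetch
N = 8
start = threading.Barrier(N)    # releases all readers at the same instant
checked = threading.Barrier(N)  # every reader sees the miss before anyone writes

def read(key):
    start.wait()
    hit = key in cache
    checked.wait()
    if not hit:
        fetch_log.append(key)   # duplicate work: every reader hits the DB
        cache[key] = "value"

threads = [threading.Thread(target=read, args=("hot-key",)) for _ in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(len(fetch_log))  # 8: one fetch per reader, not one per key
```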
Why it's a structural problem (not an operational one)¶
Thundering herd is not a capacity problem in the usual sense: average capacity is fine; the concurrency peak is what breaks. It is a synchronization failure, meaning something aligned all clients' "release" moments. That synchronizer is typically:
- A deploy — everyone reconnects/retries at once.
- A shared cache boundary — everyone expires / cold-starts the cache at once.
- A shared schedule — everyone's cron fires at :00.
- A shared outage recovery — everyone's retry timer hits T + max at the same moment.
Solutions therefore don't increase capacity; they de-synchronize the clients, or remove the boundary that aligns them.
Mitigations (by failure shape)¶
Cache wipe on deploy¶
- Deploy the cache separately from the front-end. Figma's LiveGraph 100x fix: the old in-server cache stampeded on every LiveGraph deploy; the new architecture puts the cache in a separate tier so the edge can redeploy without wiping caches (systems/livegraph, patterns/independent-scaling-tiers).
- Hot replicas on standby. Figma's new LiveGraph cache keeps warm replicas ready; during deploys, traffic flips to replicas without cold-starting the primary.
- Request coalescing ("singleflight"). If N clients miss the same key simultaneously, only one fetch runs; the other N-1 coalesce onto it. Figma's LiveGraph rendezvous layer makes this explicit.
- Warm-up scripts before cutover.
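A minimal singleflight sketch, assuming nothing about Figma's actual rendezvous implementation (the `SingleFlight` name and API here are illustrative): the first caller for a key becomes the leader and runs the fetch; concurrent callers block on the leader's result instead of issuing their own.

```python
import threading, time

class SingleFlight:
    """Coalesce concurrent fetches of one key into a single upstream call."""
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> Event; .result holds the fetched value

    def do(self, key, fetch):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = threading.Event()
                self._calls[key] = call
        if leader:
            call.result = fetch(key)      # only the leader hits the backing store
            with self._lock:
                del self._calls[key]
            call.set()
        else:
            call.wait()                   # followers block until the leader finishes
        return call.result

fetches = []
def slow_fetch(key):
    fetches.append(key)                   # count upstream calls
    time.sleep(0.2)                       # hold the window open so callers overlap
    return f"value-for-{key}"

sf = SingleFlight()
results = []
threads = [threading.Thread(target=lambda: results.append(sf.do("hot", slow_fetch)))
           for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(len(fetches), len(results))  # 1 8: one upstream fetch served all 8 callers
```

Go's `golang.org/x/sync/singleflight` package is the canonical library form of this pattern.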
Reconnection stampede¶
- Exponential backoff with jitter; the canonical reference is the AWS Architecture Blog post "Exponential Backoff And Jitter".
- Connection multiplexing — a shared proxy tier holds few persistent upstream connections on behalf of many client connections, so client-fleet fan-in doesn't map 1:1 to upstream connection establishment. systems/figcache is the canonical wiki instance; order-of-magnitude drop in Redis cluster connection counts post-rollout.
- Client-scale-out rate limits — slow how fast a fleet can scale out, so connection establishment doesn't saturate.
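The "full jitter" variant from that AWS post fits in one function; the parameter names here are illustrative:

```python
import random

def full_jitter_delay(attempt, base=0.1, cap=30.0):
    """AWS-style 'full jitter': sleep a uniform random time in
    [0, min(cap, base * 2**attempt)]. The randomness is the point:
    it de-synchronizes clients whose outage started at the same instant."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# Delays grow exponentially but never exceed the cap, and two clients at
# the same attempt number almost never pick the same delay.
delays = [full_jitter_delay(a) for a in range(10)]
```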
Hot-key expiry stampede¶
- Probabilistic early expiration — re-fetch a small % of requests before TTL expires, so expiry is amortised across requests, not concentrated at the instant TTL hits.
- Stale-while-revalidate — serve stale data while background refetch runs; only the first miss pays latency.
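Probabilistic early expiration has a known closed form, sometimes called "XFetch": refresh when `now - delta * beta * ln(rand)` reaches the expiry time, where `delta` approximates the recompute cost and `beta` tunes eagerness. A sketch (function name and signature assumed):

```python
import math, random

def should_refresh(now, expiry, delta, beta=1.0):
    """XFetch-style early expiration. 1 - random() lies in (0, 1], so
    -log(...) is non-negative and occasionally large: some requests refresh
    before the TTL hits, amortising the expiry across callers instead of
    concentrating it at one instant. Larger beta (or delta) refreshes earlier."""
    return now - delta * beta * math.log(1.0 - random.random()) >= expiry
```

At or past the expiry time the function always fires; well before it, it fires with vanishing probability.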
Cron alignment¶
- Jitter the schedule per client (`0 * * * *` → `random(0-59) * * * *`).
- Batch at the server rather than having N clients each call home.
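Per-client jitter needs no coordination: a stable hash of the client's identity maps each machine to its own minute. A sketch, assuming client identity is available as a string:

```python
import hashlib

def jittered_cron_minute(client_id: str) -> int:
    """Map a client id to a stable minute in 0-59, spreading the fleet's
    hourly job across the hour instead of stacking it at :00."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 60

# e.g. emit a per-host crontab line (path is illustrative):
line = f"{jittered_cron_minute('host-042')} * * * * /usr/local/bin/sync-job"
```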
Named production incidents¶
- Figma LiveGraph pre-100x — each deploy blew away the in-server cache; all clients reconnected at the same instant; DB took the cold fetches at the concurrency peak. "Creating a thundering herd that was getting bigger by the day." (sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale)
- Figma Redis pre-FigCache — client service scale-outs triggered simultaneous Redis connection establishments at Redis nodes' connection-count hard limits. Bespoke client-side connection pool was a localized patch; FigCache is the structural fix. (sources/2026-04-21-figma-figcache-next-generation-data-caching-platform)
Structural defence: don't let a shared boundary align clients¶
The strongest mitigation is to remove the synchronizer:
- Decouple cache tier from front-end deploy — Figma LiveGraph's move.
- Hold the expensive upstream connection in a separate tier that scales on its own axis, not with the client fleet — Figma FigCache's move.
- Isolate per-tenant cadence so one tenant's outage doesn't sync the other tenants' retries.
- Pre-warm before traffic shifts, so the cutover isn't the first traffic event.
This is the general shape of patterns/independent-scaling-tiers for caches.
Seen in¶
- sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale — explicit definition (the bull-herd metaphor) and canonical cache-wipe-on-deploy incident shape; structural fix = separate cache tier + hot standbys.
- sources/2026-04-21-figma-figcache-next-generation-data-caching-platform — reconnection-stampede shape eliminated by connection multiplexing + drop-in RESP proxy (systems/figcache); named as a pre-FigCache scaling limit.
Related¶
- concepts/connection-multiplexing — the proxy-tier structural fix for reconnection-stampede shape.
- patterns/independent-scaling-tiers — decouple deploy/scaling boundaries so there's no single "release" moment.
- concepts/read-invalidation-rendezvous — request-coalescing (a same-type-op dedup) mitigation for hot-key stampedes.
- patterns/caching-proxy-tier — the separate-tier deployment shape that prevents cache-deploy stampedes.