
Thundering herd

Definition

A thundering herd is a failure mode where a resource is overwhelmed by too many simultaneous requests, typically because many clients were previously blocked, disconnected, or idle and are all released at the same instant. The resource has no partial-degradation path — there is nowhere for load to spill — so it either tips over or adds enough latency to cascade into downstream failures.

The metaphor, per Figma's LiveGraph post (sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale): "The name is derived from the vision of a huge herd of bulls coming at you; there's nowhere to run."

Classic shapes

  • Cache-cold-after-deploy / cache-restart stampede. A tier-2 cache is wiped (deploy, crash, flush). All clients that previously hit the cache now simultaneously hit the backing store. If the backing store's capacity was sized for cached-traffic-plus-slack, it tips over.
  • Reconnection stampede. A WebSocket tier drops many connections; reconnect retries are synchronised (all clients see the outage at the same instant) and hit the connection-establishment path simultaneously. Figma's FigCache post (sources/2026-04-21-figma-figcache-next-generation-data-caching-platform) names this as a structural Redis-connection problem pre-FigCache: "Thundering-herd connection establishment whenever client services scaled out quickly — bottlenecking I/O, degrading availability."
  • Lock contention on a single popular key. Many readers arrive for the same expired key → all miss the cache → all start the same DB fetch → duplicate work and serialized back-pressure.
  • Cron-second alignment. Every client runs the same job at 0 * * * *. Server load spikes at :00 regardless of hourly average.

Why it's a structural problem (not an operational one)

Thundering herd is not a capacity problem in the usual sense — average load is well within capacity; the instantaneous concurrency peak is what breaks. It's a synchronization failure: something aligned all clients' "release" moments. That synchronizer is typically:

  • A deploy — everyone reconnects/retries at once.
  • A shared cache boundary — everyone expires / cold-starts the cache at once.
  • A shared schedule — everyone's cron fires at :00.
  • A shared outage recovery — everyone's retry timer hits T + max at the same moment.

Solutions therefore don't increase capacity; they de-synchronize the clients, or remove the boundary that aligns them.
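A toy sketch of that point (all numbers illustrative): the same 1,000 retries produce wildly different peaks depending on whether their timers are aligned or jittered, even though total load is identical.

```python
import random

def peak_concurrency(arrival_times, window=0.1):
    """Max number of arrivals falling inside any `window`-second span."""
    arrivals = sorted(arrival_times)
    best = 0
    j = 0
    for i, t in enumerate(arrivals):
        # Slide the left edge of the window forward past stale arrivals.
        while arrivals[j] < t - window:
            j += 1
        best = max(best, i - j + 1)
    return best

n = 1000
synced = [5.0] * n                                    # every retry timer fires at T+max
jittered = [random.uniform(0, 10) for _ in range(n)]  # same total load, spread over 10 s

print(peak_concurrency(synced))    # 1000 — the herd, all in one window
print(peak_concurrency(jittered))  # an order of magnitude smaller
```

Capacity didn't change between the two runs; only the alignment did.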

Mitigations (by failure shape)

Cache wipe on deploy

  • Deploy the cache separately from the front-end. Figma's LiveGraph 100x fix: the old in-server cache stampeded on every LiveGraph deploy; the new architecture puts the cache in a separate tier so the edge can redeploy without wiping caches (systems/livegraph, patterns/independent-scaling-tiers).
  • Hot replicas on standby. Figma's new LiveGraph cache keeps warm replicas ready; during deploys, traffic flips to replicas without cold-starting the primary.
  • Request coalescing ("singleflight"). If N clients miss the same key simultaneously, only one fetch runs; the other N-1 coalesce onto it. Figma's LiveGraph rendezvous layer makes this explicit.
  • Warm-up scripts before cutover.
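A minimal sketch of request coalescing using Python threads — this is the generic "singleflight" idea, not Figma's actual rendezvous implementation: the first caller for a key becomes the leader and runs the fetch; concurrent callers for the same key block on its result instead of issuing duplicates.

```python
import threading

class SingleFlight:
    """Coalesce concurrent fetches of the same key onto one in-flight call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_holder)

    def do(self, key, fetch):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True
            else:
                leader = False
        done, holder = entry
        if leader:
            try:
                holder["value"] = fetch()
            except BaseException as e:
                holder["error"] = e
                raise
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
            return holder["value"]
        done.wait()  # follower: ride on the leader's fetch
        if "error" in holder:
            raise holder["error"]
        return holder["value"]
```

If N clients miss the same key at once, one `fetch` runs and N-1 callers wait on it — duplicate work collapses to one backing-store hit.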

Reconnection stampede

  • Exponential backoff with jitter — canonical AWS Architecture Blog / "Exponential Backoff And Jitter".
  • Connection multiplexing — a shared proxy tier holds few persistent upstream connections on behalf of many client connections, so client-fleet fan-in doesn't map 1:1 to upstream connection establishment. systems/figcache is the canonical wiki instance; order-of-magnitude drop in Redis cluster connection counts post-rollout.
  • Client-scale-out rate limits — slow how fast a fleet can scale out, so connection establishment doesn't saturate.
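The "full jitter" variant from that AWS post can be sketched in a few lines (the reconnect loop and `connect()` below are illustrative, not a specific client library):

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """'Full jitter': sleep a uniform random amount in
    [0, min(cap, base * 2**attempt)], so a fleet that saw the same
    outage at the same instant spreads its reconnects over time."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# A reconnect loop would then look like:
# for attempt in itertools.count():
#     try:
#         return connect()
#     except ConnectionError:
#         time.sleep(backoff_delay(attempt))
```

The key property is the uniform draw down to zero: plain capped exponential backoff still clusters retries at the cap, re-synchronizing the herd at T + max.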

Hot-key expiry stampede

  • Probabilistic early expiration — re-fetch a small % of requests before TTL expires, so expiry is amortised across requests, not concentrated at the instant TTL hits.
  • Stale-while-revalidate — serve stale data while background refetch runs; only the first miss pays latency.
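Probabilistic early expiration can be sketched as the well-known "XFetch" rule (from the paper "Optimal Probabilistic Cache Stampede Prevention"; parameter names here are illustrative): each request independently decides to refresh early, with a probability that rises sharply as the TTL approaches.

```python
import math
import random

def should_refresh_early(now, expiry, recompute_cost, beta=1.0):
    """XFetch-style check. recompute_cost is how long a refresh takes;
    beta > 1 shifts refreshes earlier. -log(random()) is an Exp(1)
    sample, so each request looks ahead by a random multiple of the
    recompute cost; only a few requests, not the whole herd, land
    past the expiry and pay for the refresh."""
    return now - recompute_cost * beta * math.log(random.random()) >= expiry
```

Far from expiry the refresh probability is effectively zero; past expiry it is 1; in between, expected refreshes stay O(1) per recompute window rather than spiking at the instant the TTL hits.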

Cron alignment

  • Jitter the schedule per client (replace 0 * * * * with random(0-59) * * * *).
  • Batch at the server rather than having N clients each call home.
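One way to jitter deterministically — a sketch, with `client_id` and `command` as illustrative inputs — is to hash each client's identity into a stable minute offset, so the fleet spreads across all 60 minutes without any coordination:

```python
import hashlib

def jittered_crontab_line(client_id, command, period_minutes=60):
    """Hash a stable client identifier (hostname, machine ID) into a
    per-client minute, so hourly jobs spread across the hour instead
    of piling up at :00. Re-running gives the same minute, so each
    client's cadence stays regular."""
    digest = hashlib.sha256(client_id.encode()).digest()
    minute = int.from_bytes(digest[:4], "big") % period_minutes
    return f"{minute} * * * * {command}"
```

Hashing beats a random draw at install time only in that it is reproducible; either works, as long as the offset differs per client and stays fixed between runs.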

Named production incidents

Structural defence: don't let a shared boundary align clients

The strongest mitigation is to remove the synchronizer:

  • Decouple cache tier from front-end deploy — Figma LiveGraph's move.
  • Hold the expensive upstream connection in a separate tier that scales on its own axis, not with the client fleet — Figma FigCache's move.
  • Isolate per-tenant cadence so one tenant's outage doesn't sync the other tenants' retries.
  • Pre-warm before traffic shifts, so the cutover isn't the first traffic event.

This is the general shape of patterns/independent-scaling-tiers for caches.

Seen in
