
CONCEPT

Active multi-cluster blast radius

Active multi-cluster blast radius is the reliability property obtained by running N independent orchestration clusters active-active, each receiving a 1/N share of every service's real traffic. The key invariant: a cluster-scoped failure (or operator error) damages at most 1/N of in-flight requests, not the full fleet.

The usual setup:

  • N ≥ 2 clusters, disjoint control planes (each has its own API server, CoreDNS, etcd, scheduler, etc.).
  • Each service is deployed identically into every cluster.
  • Traffic splits approximately evenly — typically via DNS weights, a regional edge LB, or client-side per-cluster endpoint rotation.
  • Cluster-scoped operations (config rollout, upgrade, patch) proceed cluster-by-cluster so at any instant N-1 clusters are healthy.
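The even split can be sketched as client-side per-request rotation (cluster names here are illustrative, not from the source; DNS weights or an edge LB achieve the same shares):

```python
import random
from collections import Counter

# Hypothetical cluster endpoints; any N >= 2 works the same way.
CLUSTERS = ["cluster-a", "cluster-b", "cluster-c"]  # N = 3

def pick_cluster(clusters):
    """Per-request rotation: each cluster receives ~1/N of real traffic."""
    return random.choice(clusters)

# Over many requests, shares converge to ~1/3 per cluster.
counts = Counter(pick_cluster(CLUSTERS) for _ in range(30_000))
shares = {c: counts[c] / 30_000 for c in CLUSTERS}
```

Random choice rather than strict round-robin keeps the split even in aggregate without coordinating state across callers.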

What cluster-scoped failures look like

  • API-server / etcd outage in one cluster → new pods don't schedule there; existing pods keep serving if the data plane is independent.
  • CoreDNS wedged or recreated. Figma's concrete incident: an operator destroyed and recreated CoreDNS on one production cluster. On a single-cluster topology this would have been a full outage. On the 3-cluster active-active topology it cost 1/3 of request traffic, and most downstream callers recovered via retries against the other 2/3.
  • Bad cluster config (NetworkPolicy, cluster-scoped operator, cluster-wide resource limit) → blast radius ≤ 1/N.
  • Cluster upgrade / restart. Roll the upgrade through clusters one at a time; the rest keep serving.
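The retry path that makes a 1/N loss recoverable can be sketched as follows (a minimal sketch; the function and cluster names are assumptions, not Figma's code):

```python
class ClusterDown(Exception):
    """Raised when a cluster-scoped failure makes a cluster unusable."""

def call(cluster, request):
    # Stand-in for a real RPC; here cluster-b is wedged (e.g. CoreDNS gone).
    if cluster == "cluster-b":
        raise ClusterDown(cluster)
    return f"ok from {cluster}"

def call_with_failover(clusters, request):
    """Try the normally-routed cluster first, then fail over to the rest."""
    last_err = None
    for cluster in clusters:
        try:
            return call(cluster, request)
        except ClusterDown as err:
            last_err = err  # this cluster's 1/N share retries elsewhere
    raise last_err

# A request originally routed to the dead cluster recovers via the other 2/3.
result = call_with_failover(["cluster-b", "cluster-a", "cluster-c"], {"path": "/"})
```

This only works because the clusters are truly disjoint: a retry lands on an independent control plane rather than the same failing one.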

What it doesn't help with

  • Code/image bugs — deployed identically to every cluster, they break everywhere. The cluster split addresses orchestration-layer failure modes; application-layer deploys still need their own staged rollout.
  • Shared dependencies outside the cluster (databases, S3, upstream APIs). If all clusters talk to one RDS instance, that's a single failure domain.
  • User sessions with cross-cluster state. Long-lived websockets pinned to one cluster, user-scoped sticky-routing — if the user's cluster dies, their session dies with it. Usually addressed by making sessions reconnect-tolerant rather than by routing around the failure.
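Reconnect tolerance for pinned sessions can be sketched as a resume loop (the token and protocol shape are assumptions for illustration, not a described design):

```python
def connect(cluster, resume_from):
    # Stand-in for opening a long-lived connection; the pinned cluster has died.
    if cluster == "cluster-b":
        raise ConnectionError(cluster)
    return {"cluster": cluster, "resumed_at": resume_from}

def reconnect(clusters, resume_token):
    """Reattach to any surviving cluster and replay from the last acked event,
    instead of treating the pinned cluster's death as session loss."""
    for cluster in clusters:
        try:
            return connect(cluster, resume_token)
        except ConnectionError:
            continue
    raise RuntimeError("no cluster reachable")

# A session pinned to cluster-b resumes on cluster-a from event 41.
session = reconnect(["cluster-b", "cluster-a", "cluster-c"], resume_token=41)
```

The burden sits on the client and the session protocol, not the router: the dead cluster's share of sessions still drops, but drops recoverably.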

Contrast with AZ redundancy

AZ redundancy is often bundled with cluster redundancy but they're orthogonal axes:

  • A multi-AZ single cluster handles AZ-scoped failures (an AZ blackout), but a cluster-scoped bug (e.g. Figma's CoreDNS destruction) kills everything.
  • Active multi-cluster handles cluster-scoped bugs. Each cluster should also be multi-AZ for the AZ axis.

Requests-per-AZ and requests-per-cluster are multiplicatively independent blast-radius levers.
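The multiplicative independence is just arithmetic (N clusters × M AZs below are illustrative values):

```python
def cluster_blast(n_clusters: int) -> float:
    """A cluster-scoped bug takes out one whole cluster: 1/N of traffic."""
    return 1 / n_clusters

def az_blast(azs: int) -> float:
    """An AZ blackout hits that AZ in every cluster: 1/M of traffic."""
    return 1 / azs

def cell_blast(n_clusters: int, azs: int) -> float:
    """A failure scoped to one (cluster, AZ) cell hits 1/(N*M)."""
    return cluster_blast(n_clusters) * az_blast(azs)

# 3 clusters x 3 AZs: cluster bug -> 1/3, AZ blackout -> 1/3, one cell -> 1/9.
```

Each axis shrinks a different failure scope, which is why adding clusters never substitutes for multi-AZ and vice versa.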

Cost tradeoff

  • N× control-plane cost (triple, in Figma's three-cluster case). Each cluster runs its own API server, etcd, and system-namespace pods (Kyverno, CoreDNS, CNI, etc.).
  • More scaling-floor overhead. System pods have a minimum footprint per cluster, not per service.
  • Operational complexity. Tooling has to operate across N clusters (see the Figma post-migration tooling-UX regression where users had to specify a cluster name on every command, later addressed by auto-inferring it).

Figma's judgment: the reliability gain (and proof by incident) justified the tax.

Seen in

  • sources/2024-08-08-figma-migrated-onto-k8s-in-less-than-12-months — three active EKS clusters per environment. Operations proceed cluster-by-cluster. The CoreDNS destruction incident is the explicit proof point for the topology's value. Tooling-UX regressions from the N-cluster topology surfaced post-migration and were addressed with auto-inference.