CONCEPT
Active multi-cluster blast radius¶
Active multi-cluster blast radius is the reliability property obtained by running N independent orchestration clusters active-active, each receiving a 1/N share of every service's real traffic. The key invariant: a cluster-scoped failure (or operator error) damages at most 1/N of in-flight requests, not the full fleet.
The usual setup:
- N ≥ 2 clusters, disjoint control planes (each has its own API server, CoreDNS, etcd, scheduler, etc.).
- Each service is deployed identically into every cluster.
- Traffic splits approximately evenly — typically via DNS weights, a regional edge LB, or client-side per-cluster endpoint rotation.
- Cluster-scoped operations (config rollout, upgrade, patch) proceed cluster-by-cluster so at any instant N-1 clusters are healthy.
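The traffic-splitting half of this setup can be sketched as client-side per-cluster endpoint rotation with failover. A minimal sketch, assuming hypothetical endpoint names and a caller-supplied `send` transport hook (none of this is Figma's actual routing layer, which used DNS/edge-LB splitting):

```python
import random

# Hypothetical cluster endpoints -- illustrative names only.
CLUSTERS = [
    "https://cluster-a.internal.example.com",
    "https://cluster-b.internal.example.com",
    "https://cluster-c.internal.example.com",
]

def pick_cluster(clusters=CLUSTERS):
    """Uniform rotation: each request lands on a random cluster,
    so each cluster sees ~1/N of traffic."""
    return random.choice(clusters)

def call_with_failover(send, clusters=CLUSTERS):
    """Try clusters in a random order. A cluster-scoped failure costs
    the caller one retry, not the request -- the 'recovered via
    retries against the other 2/3' behavior from the incident."""
    last_err = None
    for endpoint in random.sample(clusters, len(clusters)):
        try:
            return send(endpoint)
        except ConnectionError as err:
            last_err = err  # this cluster is down; try the next one
    raise last_err
```

The same shape applies whether the rotation lives in the client, a sidecar, or an edge LB; the invariant is only that no single cluster owns more than 1/N of the traffic.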
What cluster-scoped failures look like¶
- API-server / etcd outage in one cluster → new pods don't schedule there; existing pods keep serving if the data plane is independent.
- CoreDNS wedged or recreated. Figma's concrete incident: an operator destroyed and recreated CoreDNS on one production cluster. On a single-cluster topology this would have been a full outage; on the 3-cluster active-active topology it cost 1/3 of request traffic, and most downstream callers recovered via retries against the other 2/3.
- Bad cluster config (NetworkPolicy, cluster-scoped operator, cluster-wide resource limit) → blast radius ≤ 1/N.
- Cluster upgrade / restart. Roll the upgrade through clusters one at a time; the rest keep serving.
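The cluster-by-cluster operational pattern in the last bullet is essentially a gated loop. A sketch, where `apply` and `healthy` stand in for whatever rollout and health-check tooling is actually in use:

```python
def rolling_cluster_operation(clusters, apply, healthy):
    """Apply a cluster-scoped change (upgrade, config rollout, patch)
    one cluster at a time. Halting on the first unhealthy cluster caps
    the blast radius at 1/N: the remaining clusters are never touched,
    and N-1 clusters keep serving throughout."""
    done = []
    for cluster in clusters:
        apply(cluster)
        if not healthy(cluster):
            return done, cluster   # (changed-so-far, first failure)
        done.append(cluster)
    return done, None              # all clusters changed cleanly
```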
What it doesn't help with¶
- Code/image bugs — deployed identically to every cluster, they break everywhere. The cluster split is about orchestration-layer failure modes; application-layer rollouts still need staged rollout.
- Shared dependencies outside the cluster (databases, S3, upstream APIs). If all clusters talk to one RDS instance, that's a single failure domain.
- User sessions with cross-cluster state. Long-lived websockets pinned to one cluster, user-scoped sticky-routing — if the user's cluster dies, their session dies with it. Usually addressed by making sessions reconnect-tolerant rather than by routing around the failure.
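The reconnect-tolerant approach from the last bullet can be sketched as a session loop that re-pins on disconnect. The `connect`/`handle` hooks are assumptions for illustration, not a real client library:

```python
import random
import time

def run_session(clusters, connect, handle, backoff=0.1):
    """Keep a long-lived session (e.g. a websocket) alive across the
    loss of its pinned cluster: on disconnect, pick a fresh cluster
    and reconnect instead of surfacing the failure to the user."""
    while True:
        endpoint = random.choice(clusters)   # re-pin on every attempt
        try:
            conn = connect(endpoint)
            handle(conn)                     # runs until clean close
            return
        except ConnectionError:
            time.sleep(backoff)              # brief backoff, then retry
```

The session still drops when its cluster dies; the point is that the client treats that as a reconnect event, not a terminal error.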
Contrast with AZ redundancy¶
AZ redundancy is often bundled with cluster redundancy but they're orthogonal axes:
- Multi-AZ single cluster handles AZ-scoped failures (an AZ blackout), but a cluster-scoped bug (e.g. Figma's CoreDNS destruction) kills everything.
- Active multi-cluster handles cluster-scoped bugs. Each cluster should also be multi-AZ for the AZ axis.
Requests-per-AZ and requests-per-cluster are multiplicatively independent blast-radius levers.
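As arithmetic, the two levers compose multiplicatively. A sketch of the worst-case impacted fractions, assuming an even traffic split (the 3×3 numbers below are illustrative):

```python
def blast_radius(n_clusters, n_azs):
    """Worst-case fraction of requests hit by a failure scoped to one
    cluster, one AZ, or one (cluster, AZ) cell, given an even split."""
    return {
        "cluster_scoped": 1 / n_clusters,         # e.g. CoreDNS destroyed
        "az_scoped": 1 / n_azs,                   # e.g. AZ blackout
        "cell_scoped": 1 / (n_clusters * n_azs),  # both axes at once
    }

radii = blast_radius(n_clusters=3, n_azs=3)
# A cluster-scoped failure hits 1/3 of traffic; a failure confined to
# one cluster's footprint in one AZ hits only 1/9.
```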
Cost tradeoff¶
- N× control-plane cost (3× in Figma's case). Each cluster runs its own API server, etcd, and system-namespace pods (Kyverno, CoreDNS, CNI, etc.).
- More scaling-floor overhead. System pods have a minimum footprint per cluster, not per service.
- Operational complexity. Tooling has to operate across N clusters (see the Figma post-migration tooling-UX regression where users had to specify the cluster name on every command, addressed by auto-inferring it).
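One plausible shape for that auto-inference is probing each cluster for the target resource. A sketch only; the `exists_in` hook is hypothetical, and the source doesn't describe Figma's actual mechanism beyond "auto-inferring":

```python
def infer_cluster(target, clusters, exists_in):
    """Infer which cluster a command should target by asking each
    cluster whether it owns the named resource, so users can drop
    the explicit cluster argument in the common case."""
    matches = [c for c in clusters if exists_in(c, target)]
    if len(matches) == 1:
        return matches[0]
    # Ambiguous or missing: fall back to requiring an explicit flag.
    raise LookupError(
        f"{target!r} found in {len(matches)} clusters; specify one explicitly"
    )
```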
Figma's judgment: the reliability gain (and proof by incident) justified the tax.
Seen in¶
- sources/2024-08-08-figma-migrated-onto-k8s-in-less-than-12-months — three active EKS clusters per environment. Operations proceed cluster-by-cluster. The CoreDNS destruction incident is the explicit proof point for the topology's value. Tooling-UX regressions from the N-cluster topology surfaced post-migration and were addressed with auto-inference.
Related¶
- patterns/multi-cluster-active-active-redundancy
- patterns/staged-rollout — complementary axis (cluster topology reduces blast radius of cluster-scoped failures; staged rollout reduces blast radius of application-code bugs)