CONCEPT
Active multi-cluster blast radius¶
Active multi-cluster blast radius is the reliability property obtained by running N independent orchestration clusters active-active, each receiving a 1/N share of every service's real traffic. The key invariant: a cluster-scoped failure (or operator error) damages at most 1/N of in-flight requests, not the full fleet.
The usual setup:
- N ≥ 2 clusters, disjoint control planes (each has its own API server, CoreDNS, etcd, scheduler, etc.).
- Each service is deployed identically into every cluster.
- Traffic splits approximately evenly — typically via DNS weights, a regional edge LB, or client-side per-cluster endpoint rotation.
- Cluster-scoped operations (config rollout, upgrade, patch) proceed cluster-by-cluster so at any instant N-1 clusters are healthy.
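The traffic-splitting half of this setup can be sketched as client-side per-cluster endpoint rotation with failover. A minimal sketch, assuming hypothetical endpoint names and a caller-supplied `send` transport hook (none of this is Figma's actual routing layer, which used DNS/edge-LB splitting):

```python
import random

# Hypothetical cluster endpoints -- illustrative names only.
CLUSTERS = [
    "https://cluster-a.internal.example.com",
    "https://cluster-b.internal.example.com",
    "https://cluster-c.internal.example.com",
]

def pick_cluster(clusters=CLUSTERS):
    """Uniform rotation: each request lands on a random cluster,
    so each cluster sees ~1/N of traffic."""
    return random.choice(clusters)

def call_with_failover(send, clusters=CLUSTERS):
    """Try clusters in a random order. A cluster-scoped failure costs
    the caller one retry, not the request -- the 'recovered via
    retries against the other 2/3' behavior from the incident."""
    last_err = None
    for endpoint in random.sample(clusters, len(clusters)):
        try:
            return send(endpoint)
        except ConnectionError as err:
            last_err = err  # this cluster is down; try the next one
    raise last_err
```

The same shape applies whether the rotation lives in the client, a sidecar, or an edge LB; the invariant is only that no single cluster owns more than 1/N of the traffic.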
What cluster-scoped failures look like¶
- API-server / etcd outage in one cluster → new pods don't schedule there; existing pods keep serving if the data plane is independent.
- CoreDNS wedged or recreated. Figma's concrete incident: an operator destroyed and recreated CoreDNS on one production cluster. On a single-cluster topology this would have been a full outage; on the 3-cluster active-active topology it cost 1/3 of request traffic, and most downstream callers recovered via retries against the other 2/3.
- Bad cluster config (NetworkPolicy, cluster-scoped operator, cluster-wide resource limit) → blast radius ≤ 1/N.
- Cluster upgrade / restart. Roll the upgrade through clusters one at a time; the rest keep serving.
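The cluster-by-cluster operational pattern in the last bullet is essentially a gated loop. A sketch, where `apply` and `healthy` stand in for whatever rollout and health-check tooling is actually in use:

```python
def rolling_cluster_operation(clusters, apply, healthy):
    """Apply a cluster-scoped change (upgrade, config rollout, patch)
    one cluster at a time. Halting on the first unhealthy cluster caps
    the blast radius at 1/N: the remaining clusters are never touched,
    and N-1 clusters keep serving throughout."""
    done = []
    for cluster in clusters:
        apply(cluster)
        if not healthy(cluster):
            return done, cluster   # (changed-so-far, first failure)
        done.append(cluster)
    return done, None              # all clusters changed cleanly
```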
What it doesn't help with¶
- Code/image bugs — deployed identically to every cluster, they break everywhere. The cluster split is about orchestration-layer failure modes; application-layer rollouts still need staged rollout.
- Shared dependencies outside the cluster (databases, S3, upstream APIs). If all clusters talk to one RDS instance, that's a single failure domain.
- User sessions with cross-cluster state. Long-lived websockets pinned to one cluster, user-scoped sticky-routing — if the user's cluster dies, their session dies with it. Usually addressed by making sessions reconnect-tolerant rather than by routing around the failure.
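The reconnect-tolerant approach from the last bullet can be sketched as a session loop that re-pins on disconnect. The `connect`/`handle` hooks are assumptions for illustration, not a real client library:

```python
import random
import time

def run_session(clusters, connect, handle, backoff=0.1):
    """Keep a long-lived session (e.g. a websocket) alive across the
    loss of its pinned cluster: on disconnect, pick a fresh cluster
    and reconnect instead of surfacing the failure to the user."""
    while True:
        endpoint = random.choice(clusters)   # re-pin on every attempt
        try:
            conn = connect(endpoint)
            handle(conn)                     # runs until clean close
            return
        except ConnectionError:
            time.sleep(backoff)              # brief backoff, then retry
```

The session still drops when its cluster dies; the point is that the client treats that as a reconnect event, not a terminal error.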
Contrast with AZ redundancy¶
AZ redundancy is often bundled with cluster redundancy but they're orthogonal axes:
- Multi-AZ single cluster handles AZ-scoped failures (an AZ blackout), but a cluster-scoped bug (e.g. Figma's CoreDNS destruction) kills everything.
- Active multi-cluster handles cluster-scoped bugs. Each cluster should also be multi-AZ for the AZ axis.
Requests-per-AZ and requests-per-cluster are multiplicatively independent blast-radius levers.
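As arithmetic, the two levers compose multiplicatively. A sketch of the worst-case impacted fractions, assuming an even traffic split (the 3×3 numbers below are illustrative):

```python
def blast_radius(n_clusters, n_azs):
    """Worst-case fraction of requests hit by a failure scoped to one
    cluster, one AZ, or one (cluster, AZ) cell, given an even split."""
    return {
        "cluster_scoped": 1 / n_clusters,         # e.g. CoreDNS destroyed
        "az_scoped": 1 / n_azs,                   # e.g. AZ blackout
        "cell_scoped": 1 / (n_clusters * n_azs),  # both axes at once
    }

radii = blast_radius(n_clusters=3, n_azs=3)
# A cluster-scoped failure hits 1/3 of traffic; a failure confined to
# one cluster's footprint in one AZ hits only 1/9.
```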
Cost tradeoff¶
- N× control-plane cost (3× in Figma's case). Each cluster runs its own API server, etcd, and system-namespace pods (Kyverno, CoreDNS, CNI, etc.).
- More scaling-floor overhead. System pods have a minimum footprint per cluster, not per service.
- Operational complexity. Tooling has to operate across N clusters (see the Figma post-migration tooling-UX regression where users had to specify the cluster name on every command, addressed by auto-inferring it).
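One plausible shape for that auto-inference is probing each cluster for the target resource. A sketch only; the `exists_in` hook is hypothetical, and the source doesn't describe Figma's actual mechanism beyond "auto-inferring":

```python
def infer_cluster(target, clusters, exists_in):
    """Infer which cluster a command should target by asking each
    cluster whether it owns the named resource, so users can drop
    the explicit cluster argument in the common case."""
    matches = [c for c in clusters if exists_in(c, target)]
    if len(matches) == 1:
        return matches[0]
    # Ambiguous or missing: fall back to requiring an explicit flag.
    raise LookupError(
        f"{target!r} found in {len(matches)} clusters; specify one explicitly"
    )
```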
Figma's judgment: the reliability gain (and proof by incident) justified the tax.
Seen in¶
- sources/2024-08-08-figma-migrated-onto-k8s-in-less-than-12-months — three active EKS clusters per environment. Operations proceed cluster-by-cluster. The CoreDNS destruction incident is the explicit proof point for the topology's value. Tooling-UX regressions from the N-cluster topology surfaced post-migration and were addressed with auto-inference.
Related¶
- patterns/multi-cluster-active-active-redundancy
- patterns/staged-rollout — complementary axis (cluster topology reduces blast radius of cluster-scoped failures; staged rollout reduces blast radius of application-code bugs)