PATTERN
Multi-cluster active-active redundancy¶
Multi-cluster active-active redundancy is the deployment shape of running N independent Kubernetes (or equivalent) clusters simultaneously, each receiving a 1/N share of every service's real traffic. All clusters are primary; there is no passive standby. The shape caps the blast radius of any cluster-scoped failure at 1/N of traffic.
The canonical modern instantiation is Figma's three active EKS clusters per environment (2024 ECS→EKS migration).
Shape¶
- N disjoint clusters, each a full deployment of every service. Each has its own control plane (EKS / self-hosted API server), system pods (CoreDNS, Kyverno, CNI), and worker node pool.
- Traffic split mechanism — typically weighted DNS (1/N per cluster's endpoint), an edge LB with weighted routing, or client libraries with per-cluster endpoint rotation.
- Cluster-by-cluster operations. Any cluster-scoped change (upgrade, system-pod config, NetworkPolicy) rolls out one cluster at a time. At any moment, N-1 clusters remain healthy.
- Start at N, don't grow to it. Don't launch with one cluster and add more after traffic is committed; run N clusters from day one, so the data plane has been shaped for N from the start.
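The traffic-split mechanism in the shape above can be sketched client-side. A minimal sketch, assuming a hypothetical client library that rotates over per-cluster endpoints (the endpoint names are invented; a real deployment would more likely use weighted DNS or an edge LB):

```python
import random

# Hypothetical per-cluster endpoints; one per active cluster (N = 3 here).
CLUSTERS = [
    "cluster-a.internal.example",
    "cluster-b.internal.example",
    "cluster-c.internal.example",
]

def pick_cluster(clusters, weights=None):
    """Weighted choice over cluster endpoints.

    Equal weights give each cluster a 1/N share of requests, the
    steady-state split this pattern calls for. Unequal weights let an
    operator drain a cluster (weight 0) during a cluster-scoped rollout.
    """
    if weights is None:
        weights = [1] * len(clusters)  # 1/N per cluster
    return random.choices(clusters, weights=weights, k=1)[0]
```

The same weights could equally live in weighted DNS records or an edge load balancer; the point is only that the split is explicit and per-cluster, so one cluster can be drained without touching the others.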
Failure modes it handles¶
- Cluster-scoped operator errors (Figma's cited case: operator destroyed + recreated CoreDNS on one cluster → 1/3 of requests affected, most recovered via retry against the other 2/3).
- API-server / etcd degradation — affects scheduling on that cluster, not the serving clusters.
- Cluster upgrade / patch — rolled through cluster-by-cluster.
- Cluster-wide resource limits, admission-controller bugs, CNI regressions, ingress-controller regressions — all become 1/N-blast events.
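The recovery path in the CoreDNS case (retry against the healthy 2/3) can be sketched as client-side failover. A hedged sketch: `send` is a hypothetical transport callable, not any real library API.

```python
def request_with_failover(endpoints, send):
    """Try each cluster's endpoint in turn; succeed if any cluster is healthy.

    When one cluster is the 1/N blast radius (e.g. its CoreDNS was
    destroyed), requests routed there fail, but a retry against a
    sibling cluster succeeds.
    """
    last_err = None
    for endpoint in endpoints:
        try:
            return send(endpoint)          # healthy cluster: done
        except ConnectionError as err:
            last_err = err                 # this cluster is degraded; move on
    raise last_err                         # all N clusters failed
```

This is why, in the cited incident, most requests recovered: the failure was cluster-scoped, so any retry that landed on a different cluster succeeded.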
Failure modes it does not handle¶
- Application-code bugs. A bug ships identically to every cluster; it needs an independent mitigation such as patterns/staged-rollout.
- Shared state outside the cluster (single RDS, single S3 bucket, upstream API). Cluster topology doesn't replicate the database.
- Long-lived stateful sessions pinned to one cluster — user sessions on websockets or similar. Usually the answer is to make sessions reconnect-tolerant, not to pin them.
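Reconnect tolerance, per the last bullet, means a session re-resolves its endpoint on every attempt instead of clinging to the cluster it first landed on. A minimal sketch, assuming hypothetical `resolve` and `dial` callables (nothing here is a real library API):

```python
import random
import time

def connect_session(resolve, dial, max_attempts=5):
    """Reconnect-tolerant session setup.

    `resolve` returns a cluster endpoint (ideally via the 1/N-weighted
    split), and `dial` opens the session or raises ConnectionError.
    Re-resolving on each attempt lets a reconnect land on any healthy
    cluster rather than the one that just failed.
    """
    for attempt in range(max_attempts):
        endpoint = resolve()               # fresh resolution, not a pinned cluster
        try:
            return dial(endpoint)
        except ConnectionError:
            # Jittered exponential backoff before the next attempt.
            time.sleep(min(0.1 * 2 ** attempt, 2.0) * random.random())
    raise ConnectionError("all reconnect attempts failed")
```

The design choice is in `resolve()` inside the loop: pinning would hoist it out, and then a cluster-scoped failure would take the session down for its entire lifetime.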
Tradeoffs¶
- + Reliability via blast-radius reduction, proven by Figma's CoreDNS incident.
- + Safer upgrade procedures — cluster upgrades are no longer global events.
- + Independent blast-radius axis on top of AZ redundancy.
- − N× control-plane and system-pod cost. Each cluster independently runs CoreDNS, Kyverno, metrics collectors, etc.
- − Added pipeline complexity. Deploys, config, observability all need cluster-awareness.
- − Tooling UX. Figma's post-migration pain: users had to specify cluster on every CLI command until tooling added auto-inference. A predictable regression worth planning for.
Contrast with blue/green¶
- patterns/blue-green-service-mesh-migration is a migration pattern — two environments exist transiently; the goal is to move all traffic to green and tear down blue.
- Multi-cluster active-active is a standing reliability pattern — all N clusters are permanent and steady-state.
- Both split traffic at the edge (DNS / ALB / CDN); different intent.
During Figma's ECS→EKS migration, both patterns were layered: the migration cut from ECS to EKS (effectively blue/green across substrates), and the EKS side was simultaneously N=3 for ongoing reliability.
Seen in¶
- sources/2024-08-08-figma-migrated-onto-k8s-in-less-than-12-months — three EKS clusters per environment, all active. Scoped in as a one-way-door decision during the migration; retrofitting later would have been a second migration. The post-cutover CoreDNS destruction incident is cited as the proof point.
Related¶
- concepts/active-multi-cluster-blast-radius — the reliability property this pattern realizes
- patterns/weighted-dns-traffic-shifting — the typical traffic split mechanism
- patterns/staged-rollout — complementary axis (code/config rollouts across the N clusters)
- patterns/blue-green-service-mesh-migration — contrast