
Multi-cluster active-active redundancy

Multi-cluster active-active redundancy is the deployment shape of running N independent Kubernetes (or equivalent) clusters simultaneously, each receiving a 1/N share of every service's real traffic. All clusters are primary; there is no passive standby. This reduces the blast radius of any cluster-scoped failure to 1/N of traffic.

The canonical modern instantiation is Figma's three active EKS clusters per environment (2024 ECS→EKS migration).

Shape

  1. N disjoint clusters, each a full deployment of every service. Each has its own control plane (EKS / self-hosted API server), system pods (CoreDNS, Kyverno, CNI), and worker node pool.
  2. Traffic split mechanism — typically weighted DNS (1/N per cluster's endpoint), an edge LB with weighted routing, or client libraries with per-cluster endpoint rotation.
  3. Cluster-by-cluster operations. Any cluster-scoped change (upgrade, system-pod config, NetworkPolicy) rolls out one cluster at a time. At any moment, N-1 clusters remain healthy.
  4. Start at N; don't grow to N later. Don't launch with one cluster and add more after traffic is committed; start with N clusters so the data plane is shaped for N from day one.
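The traffic-split mechanism in item 2 can be sketched client-side. This is an illustrative sketch, not Figma's implementation; the endpoint names and `pick_endpoint` helper are hypothetical, and any weighted-DNS or edge-LB scheme yielding a uniform 1/N split plays the same role.

```python
import random

# Hypothetical per-cluster endpoints, for illustration only.
CLUSTERS = [
    "https://cluster-a.internal.example",
    "https://cluster-b.internal.example",
    "https://cluster-c.internal.example",
]

def pick_endpoint(clusters=CLUSTERS):
    # Uniform 1/N selection: the client-library analogue of weighted DNS.
    return random.choice(clusters)
```

With weighted DNS or an edge LB instead, the weights would simply be equal across the N cluster endpoints.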

Failure modes it handles

  • Cluster-scoped operator errors (Figma's cited case: an operator destroyed and recreated CoreDNS on one cluster → 1/3 of requests affected; most recovered via retries against the other two clusters).
  • API-server / etcd degradation — affects scheduling on that cluster, not the serving clusters.
  • Cluster upgrade / patch — rolled through cluster-by-cluster.
  • Cluster-wide resource limits, admission-controller bugs, CNI regressions, ingress-controller regressions — all become 1/N-blast events.
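The retry behavior that contained Figma's CoreDNS incident can be sketched as cross-cluster failover. A minimal sketch, assuming a `send` callable per request; `ClusterError` and all names are hypothetical:

```python
import random

class ClusterError(Exception):
    """Cluster-scoped failure (e.g. DNS resolution dead inside one cluster)."""

def request_with_failover(clusters, send):
    # Try one cluster at random; on a cluster-scoped error, retry against
    # the remaining N-1. A single degraded cluster costs a retry, not an outage.
    last_err = None
    for endpoint in random.sample(clusters, len(clusters)):
        try:
            return send(endpoint)
        except ClusterError as err:
            last_err = err
    raise last_err
```

This is why a 1/N-blast event mostly surfaces as added latency (one retry) rather than user-visible errors.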

Failure modes it does not handle

  • Application-code bugs. Application code ships identically to every cluster, so a bad deploy hits all N at once; it needs an independent mitigation such as staged rollout (patterns/staged-rollout).
  • Shared state outside the cluster (single RDS, single S3 bucket, upstream API). Cluster topology doesn't replicate the database.
  • Long-lived stateful sessions pinned to one cluster — user sessions on websockets or similar. Usually the answer is to make sessions reconnect-tolerant, not to pin them.
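The reconnect-tolerant alternative to pinning can be sketched: keep the resume state with the client so a new connection to any cluster can continue the session. This is an illustrative sketch; `ReconnectingSession`, the `connect` callable, and the sequence-number scheme are all assumptions, not a specific library's API:

```python
import random

class ReconnectingSession:
    """Resume state (here, a last-applied sequence number) lives on the
    client, so a reconnect to *any* cluster can continue the session."""

    def __init__(self, clusters, connect):
        self.clusters = clusters
        self.connect = connect      # connect(endpoint, resume_from) -> connection
        self.last_seq = 0           # last event this client has applied
        self.conn = None

    def ensure_connected(self):
        if self.conn is None:
            endpoint = random.choice(self.clusters)   # no pinning: any cluster works
            self.conn = self.connect(endpoint, resume_from=self.last_seq)
        return self.conn

    def on_event(self, seq):
        self.last_seq = seq         # advance the resume point as events arrive

    def on_disconnect(self):
        self.conn = None            # next ensure_connected() re-homes the session
```

The design choice is that cluster loss degrades to a reconnect, the same failure clients already handle for network blips.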

Tradeoffs

  • + Reliability via blast-radius reduction, proven by Figma's CoreDNS incident.
  • + Safer upgrade procedures — cluster upgrades are no longer global events.
  • + Independent blast-radius axis on top of AZ redundancy.
  • − N× control-plane and system-pod cost. Each cluster independently runs CoreDNS, Kyverno, metrics collectors, etc.
  • − Added pipeline complexity. Deploys, config, observability all need cluster-awareness.
  • − Tooling UX. Figma's post-migration pain: users had to specify cluster on every CLI command until tooling added auto-inference. A predictable regression worth planning for.
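The safer-upgrade property rests on the cluster-by-cluster rule (Shape, item 3), which can be sketched as a health-gated rollout loop. All names here are hypothetical; `apply_change` and `healthy` stand in for whatever upgrade and health-check tooling a pipeline actually uses:

```python
def roll_out(clusters, apply_change, healthy):
    # Apply a cluster-scoped change (upgrade, system-pod config, NetworkPolicy)
    # one cluster at a time, gating on health so the untouched clusters always
    # remain on the known-good state.
    done = []
    for cluster in clusters:
        apply_change(cluster)
        if not healthy(cluster):
            raise RuntimeError(f"halting rollout: {cluster} unhealthy, "
                               f"{len(done)} of {len(clusters)} done")
        done.append(cluster)
    return done
```

Halting on the first unhealthy cluster is what keeps a bad change a 1/N event instead of a global one.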

Contrast with blue/green

  • patterns/blue-green-service-mesh-migration is a migration pattern — two environments exist transiently; the goal is to move all traffic to green and tear down blue.
  • Multi-cluster active-active is a standing reliability pattern — all N clusters are permanent and steady-state.
  • Both split traffic at the edge (DNS / ALB / CDN), but with different intent.

During Figma's ECS→EKS migration, both patterns were layered: the migration cut from ECS to EKS (effectively blue/green across substrates), and the EKS side was simultaneously N=3 for ongoing reliability.
