
Multi-cluster active-active redundancy

Multi-cluster active-active redundancy is the deployment shape of running N independent Kubernetes (or equivalent) clusters simultaneously, each receiving a 1/N share of every service's real traffic. All clusters are primary; there is no passive standby. This reduces the blast radius of any cluster-scoped failure to 1/N of traffic.

The canonical modern instantiation is Figma's three active EKS clusters per environment (2024 ECS→EKS migration).

Shape

  1. N disjoint clusters, each a full deployment of every service. Each has its own control plane (EKS / self-hosted API server), system pods (CoreDNS, Kyverno, CNI), and worker node pool.
  2. Traffic split mechanism — typically weighted DNS (1/N per cluster's endpoint), an edge LB with weighted routing, or client libraries with per-cluster endpoint rotation.
  3. Cluster-by-cluster operations. Any cluster-scoped change (upgrade, system-pod config, NetworkPolicy) rolls out one cluster at a time. At any moment, N-1 clusters remain healthy.
  4. Start at N; don't grow to N later. Don't launch with one cluster and add more after traffic is committed; start with N clusters so the data plane is shaped for N from day one.
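The traffic-split mechanism in item 2 can be sketched client-side. This is an illustrative sketch, not Figma's implementation; the endpoint names and `pick_endpoint` helper are hypothetical, and any weighted-DNS or edge-LB scheme yielding a uniform 1/N split plays the same role.

```python
import random

# Hypothetical per-cluster endpoints, for illustration only.
CLUSTERS = [
    "https://cluster-a.internal.example",
    "https://cluster-b.internal.example",
    "https://cluster-c.internal.example",
]

def pick_endpoint(clusters=CLUSTERS):
    # Uniform 1/N selection: the client-library analogue of weighted DNS.
    return random.choice(clusters)
```

With weighted DNS or an edge LB instead, the weights would simply be equal across the N cluster endpoints.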

Failure modes it handles

  • Cluster-scoped operator errors (Figma's cited case: an operator destroyed and recreated CoreDNS on one cluster → 1/3 of requests affected; most recovered via retries against the other two clusters).
  • API-server / etcd degradation — affects scheduling on that cluster, not the serving clusters.
  • Cluster upgrade / patch — rolled through cluster-by-cluster.
  • Cluster-wide resource limits, admission-controller bugs, CNI regressions, ingress-controller regressions — all become 1/N-blast events.
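The retry behavior that contained Figma's CoreDNS incident can be sketched as cross-cluster failover. A minimal sketch, assuming a `send` callable per request; `ClusterError` and all names are hypothetical:

```python
import random

class ClusterError(Exception):
    """Cluster-scoped failure (e.g. DNS resolution dead inside one cluster)."""

def request_with_failover(clusters, send):
    # Try one cluster at random; on a cluster-scoped error, retry against
    # the remaining N-1. A single degraded cluster costs a retry, not an outage.
    last_err = None
    for endpoint in random.sample(clusters, len(clusters)):
        try:
            return send(endpoint)
        except ClusterError as err:
            last_err = err
    raise last_err
```

This is why a 1/N-blast event mostly surfaces as added latency (one retry) rather than user-visible errors.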

Failure modes it does not handle

  • Application-code bugs. Application code ships identically to every cluster, so a bad deploy hits all N at once; it needs an independent mitigation such as staged rollout (patterns/staged-rollout).
  • Shared state outside the cluster (single RDS, single S3 bucket, upstream API). Cluster topology doesn't replicate the database.
  • Long-lived stateful sessions pinned to one cluster — user sessions on websockets or similar. Usually the answer is to make sessions reconnect-tolerant, not to pin them.
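The reconnect-tolerant alternative to pinning can be sketched: keep the resume state with the client so a new connection to any cluster can continue the session. This is an illustrative sketch; `ReconnectingSession`, the `connect` callable, and the sequence-number scheme are all assumptions, not a specific library's API:

```python
import random

class ReconnectingSession:
    """Resume state (here, a last-applied sequence number) lives on the
    client, so a reconnect to *any* cluster can continue the session."""

    def __init__(self, clusters, connect):
        self.clusters = clusters
        self.connect = connect      # connect(endpoint, resume_from) -> connection
        self.last_seq = 0           # last event this client has applied
        self.conn = None

    def ensure_connected(self):
        if self.conn is None:
            endpoint = random.choice(self.clusters)   # no pinning: any cluster works
            self.conn = self.connect(endpoint, resume_from=self.last_seq)
        return self.conn

    def on_event(self, seq):
        self.last_seq = seq         # advance the resume point as events arrive

    def on_disconnect(self):
        self.conn = None            # next ensure_connected() re-homes the session
```

The design choice is that cluster loss degrades to a reconnect, the same failure clients already handle for network blips.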

Tradeoffs

  • + Reliability via blast-radius reduction, proven by Figma's CoreDNS incident.
  • + Safer upgrade procedures — cluster upgrades are no longer global events.
  • + Independent blast-radius axis on top of AZ redundancy.
  • − N× control-plane and system-pod cost. Each cluster independently runs CoreDNS, Kyverno, metrics collectors, etc.
  • − Added pipeline complexity. Deploys, config, observability all need cluster-awareness.
  • − Tooling UX. Figma's post-migration pain: users had to specify cluster on every CLI command until tooling added auto-inference. A predictable regression worth planning for.
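The safer-upgrade property rests on the cluster-by-cluster rule (Shape, item 3), which can be sketched as a health-gated rollout loop. All names here are hypothetical; `apply_change` and `healthy` stand in for whatever upgrade and health-check tooling a pipeline actually uses:

```python
def roll_out(clusters, apply_change, healthy):
    # Apply a cluster-scoped change (upgrade, system-pod config, NetworkPolicy)
    # one cluster at a time, gating on health so the untouched clusters always
    # remain on the known-good state.
    done = []
    for cluster in clusters:
        apply_change(cluster)
        if not healthy(cluster):
            raise RuntimeError(f"halting rollout: {cluster} unhealthy, "
                               f"{len(done)} of {len(clusters)} done")
        done.append(cluster)
    return done
```

Halting on the first unhealthy cluster is what keeps a bad change a 1/N event instead of a global one.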

Contrast with blue/green

  • patterns/blue-green-service-mesh-migration is a migration pattern — two environments exist transiently; the goal is to move all traffic to green and tear down blue.
  • Multi-cluster active-active is a standing reliability pattern — all N clusters are permanent and steady-state.
  • Both split traffic at the edge (DNS / ALB / CDN), but with different intent.

During Figma's ECS→EKS migration, both patterns were layered: the migration cut from ECS to EKS (effectively blue/green across substrates), and the EKS side was simultaneously N=3 for ongoing reliability.
