PATTERN

Progressive cluster rollout

Problem

In a multi-cluster fleet where each cluster is a failure domain for a different tenant workload, how do you roll out a change (config, binary, operator version) without exposing the riskiest, highest-criticality tenants first? A flat "all clusters at once" rollout is the failure mode; a sequenced rollout that goes from low-criticality to high-criticality is the answer.

The shape

Sequence cluster rollouts by tenant criticality, not by cluster size or alphabetical order. The sequence is:

  1. Test clusters — synthetic traffic only, no real tenants. First place a change lands. Any crash-on-boot or config-parse bug surfaces here.
  2. Internal clusters — engineering / platform-team tenants. Real traffic, but the audience is the platform team itself; if something breaks, they're the ones on-call for it and can revert quickly.
  3. Application clusters — product-engineering tenants' services. If these break, product teams are affected, but the observability signal for the product still mostly flows (because infra clusters are still healthy).
  4. Infrastructure clusters — the clusters carrying the highest-criticality workloads (network / compute / mesh instrumentation). If these break, you're "flying blind" — the signal about why things are breaking is missing. Last to roll out, with the longest soak time before them.

A regression detected at any stage should halt the rollout before it reaches the next tier.
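The sequence-and-halt behavior above can be sketched as a small driver loop. This is a minimal illustration, not Airbnb's implementation; the `Tier` names mirror the four tiers in the list, and `deploy` / `is_healthy` are hypothetical callables standing in for a real deploy pipeline and health gate.

```python
from enum import IntEnum

class Tier(IntEnum):
    # Lower value = lower criticality = earlier in the sequence.
    TEST = 0
    INTERNAL = 1
    APPLICATION = 2
    INFRASTRUCTURE = 3

def progressive_rollout(clusters_by_tier, deploy, is_healthy):
    """Roll out tier by tier; halt before the next tier on any regression.

    clusters_by_tier: dict mapping Tier -> list of cluster names
    deploy: callable(cluster) that applies the change
    is_healthy: callable(cluster) -> bool, the post-deploy gate
    """
    for tier in sorted(Tier):
        for cluster in clusters_by_tier.get(tier, []):
            deploy(cluster)
            if not is_healthy(cluster):
                # Stop here: later (higher-criticality) tiers are untouched.
                return {"halted_at": tier.name, "cluster": cluster}
    return {"halted_at": None, "cluster": None}
```

The key property is that a failed health gate returns before the next iteration, so a regression caught in an internal cluster never touches application or infrastructure clusters.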

Why sequence by criticality, not size

  • Size-first rollouts ("smallest cluster first") minimize the number of tenants affected by a single regression, but say nothing about the severity of an outage. A small infrastructure cluster taking down observability is worse than a large application cluster dropping a few dashboards.
  • Criticality-first rollouts minimize the MTTR cost of a regression: a test-cluster regression has zero impact; an infrastructure-cluster regression blinds the entire company. Paying the MTTR cost later, with more signal, is the right trade-off.
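The difference between the two orderings is easy to see on a toy fleet. The cluster names, tier ranks, and tenant counts below are invented for illustration; the point is only that sorting by size and sorting by criticality produce different sequences, and size-first puts the small infrastructure cluster dangerously early.

```python
# Hypothetical fleet: (name, criticality rank 0-3, tenant count).
fleet = [
    ("infra-small", 3, 40),   # small, but highest-criticality
    ("app-large",   2, 900),
    ("internal",    1, 120),
    ("test",        0, 0),
]

# Size-first: fewest tenants exposed first, severity ignored.
size_first = [name for name, _, _ in sorted(fleet, key=lambda c: c[2])]

# Criticality-first: worst-case severity deferred to last.
criticality_first = [name for name, _, _ in sorted(fleet, key=lambda c: c[1])]

# Size-first deploys to "infra-small" second -- before the large
# application cluster -- exactly the wrong severity ordering.
```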

Supports a >99.9% availability target

Airbnb cites this pattern as one of the mechanisms that let them achieve >99.9% availability with multi-cluster federation in their metrics storage system — the rollout sequence buys enough learning time between tiers to catch regressions before they reach the tiers that would dominate the availability metric. (Source: sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system)

Pre-requisites

  • Tenant-cluster mapping is known — you need to be able to enumerate which tenants live in which cluster, and classify each cluster by workload criticality.
  • Automated deployment — manually sequencing clusters by hand is slow and error-prone. Airbnb uses Grafana's Kubernetes rollout operators to coordinate multi-AZ StatefulSet rolls per cluster.
  • Observability on the change itself — you need signal about whether the rollout is healthy in each tier before moving on.
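The first prerequisite, a tenant-cluster mapping with a criticality class per cluster, can be derived from a tenant inventory. A sketch, with a hypothetical inventory format and the rule (an assumption, not from the source) that a cluster's tier is the highest criticality of any tenant it hosts:

```python
from collections import defaultdict

# A cluster inherits the highest criticality among its tenants.
RANK = {"test": 0, "internal": 1, "application": 2, "infrastructure": 3}

def classify_clusters(tenants):
    """tenants: dict of tenant name -> (cluster, workload class)."""
    tiers = defaultdict(lambda: "test")   # empty clusters default to test
    for cluster, workload in tenants.values():
        if RANK[workload] > RANK[tiers[cluster]]:
            tiers[cluster] = workload
    return dict(tiers)

# Hypothetical inventory for illustration.
inventory = {
    "mesh-metrics":  ("c-infra-1", "infrastructure"),
    "checkout-svc":  ("c-app-7",   "application"),
    "platform-dash": ("c-int-2",   "internal"),
}
```

The resulting cluster-to-tier map is what feeds the rollout sequencing; keeping it derived from the inventory (rather than hand-maintained) prevents a cluster from silently gaining a high-criticality tenant while still being rolled early.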

When to deviate

  • Security patches may need a different sequencing that prioritizes exposed surface, not criticality.
  • Emergency hotfixes for an active incident may need to go directly to the affected cluster, bypassing the tier sequence.
  • Schema migrations that require ordering (e.g., readers-first, writers-second) overlay a separate sequencing on top.
