PATTERN (cited by 1 source)
Progressive cluster rollout¶
Problem¶
In a multi-cluster fleet where each cluster is a failure domain for a different tenant workload, how do you roll out a change (config, binary, operator version) without exposing the riskiest, highest-criticality tenants first? A flat "all clusters at once" rollout is the failure mode; a rollout sequenced from low-criticality to high-criticality tenants is the answer.
The shape¶
Sequence cluster rollouts by tenant criticality, not by cluster size or alphabetical order. The sequence is:
- Test clusters — synthetic traffic only, no real tenants. First place a change lands. Any crash-on-boot or config-parse bug surfaces here.
- Internal clusters — engineering / platform-team tenants. Real traffic, but the audience is the platform team itself; if something breaks, they're the ones on-call for it and can revert quickly.
- Application clusters — product-engineering tenants' services. If these break, product teams are affected, but the observability signal for the product still mostly flows (because infra clusters are still healthy).
- Infrastructure clusters — the clusters carrying the highest-criticality workloads (network / compute / mesh instrumentation). If these break, you're "flying blind" — the signal about why things are breaking is missing. Last to roll out, with the longest soak time in front of them.
A regression detected at any stage should halt the rollout before it reaches the next tier.
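The tier sequence and the stop-before-next-tier rule can be sketched as a small driver loop. This is a minimal illustration, not Airbnb's implementation: the tier names follow the pattern, but the cluster names, the `deploy`/`healthy` callbacks, and the soak parameter are assumptions.

```python
import time

# Tiers ordered from lowest to highest criticality.
# Cluster names are illustrative placeholders.
TIERS = [
    ("test", ["test-us-east"]),
    ("internal", ["internal-us-east", "internal-us-west"]),
    ("application", ["app-us-east", "app-us-west"]),
    ("infrastructure", ["infra-us-east", "infra-us-west"]),
]

def rollout(deploy, healthy, soak_seconds=0):
    """Deploy tier by tier; halt before the next tier on any regression.

    deploy(cluster)  -- applies the change to one cluster (assumed callback)
    healthy(cluster) -- True if the cluster looks regression-free (assumed)
    """
    for tier, clusters in TIERS:
        for cluster in clusters:
            deploy(cluster)
        # Soak: give regressions time to surface before checking health.
        # Real soak times grow with tier criticality.
        time.sleep(soak_seconds)
        if not all(healthy(c) for c in clusters):
            return f"halted at tier '{tier}'"  # next tier never sees the change
    return "complete"
```

The key property is that a failed health check in, say, the internal tier means no application or infrastructure cluster ever receives the change.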
Why sequence by criticality, not size¶
- Size-first rollouts ("smallest cluster first") minimize the number of tenants affected by a single regression, but say nothing about the severity of an outage. A small infrastructure cluster taking down observability is worse than a large application cluster dropping a few dashboards.
- Criticality-first rollouts minimize the MTTR cost of a regression: a test-cluster regression has zero impact; an infrastructure-cluster regression blinds the entire company. Paying the MTTR cost later, with more signal, is the right trade-off.
Supports a >99.9% availability target¶
Airbnb cites this pattern as one of the mechanisms that let them achieve >99.9% availability with multi-cluster federation in their metrics storage system — the rollout sequence buys enough learning time between tiers to catch regressions before they reach the tiers that would dominate the availability metric. (Source: sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system)
Pre-requisites¶
- Tenant-cluster mapping is known — you need to be able to enumerate which tenants live in which cluster, and classify each cluster by workload criticality.
- Automated deployment — manually sequencing clusters by hand is slow and error-prone. Airbnb uses Grafana's Kubernetes rollout operators to coordinate multi-AZ StatefulSet rolls per cluster.
- Observability on the change itself — you need signal about whether the rollout is healthy in each tier before moving on.
When to deviate¶
- Security patches may need a different sequencing that prioritizes exposed attack surface over criticality.
- Emergency hotfixes for an active incident may need to go directly to the affected cluster, bypassing the tier sequence.
- Schema migrations that require ordering (e.g., readers-first, writers-second) overlay a separate sequencing on top.
Seen in¶
- sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system — Airbnb rolls out metrics-storage changes through test → internal → application → infrastructure clusters. Tiered sequencing + automated Kubernetes rollout operators = >99.9% availability target.