PATTERN

Progressive cluster rollout

Problem

In a multi-cluster fleet where each cluster is a failure domain for a different tenant workload, how do you roll out a change (config, binary, operator version) without exposing the riskiest, highest-criticality tenants first? A flat "all clusters at once" rollout is the failure mode; a sequenced rollout that goes from low-criticality to high-criticality is the answer.

The shape

Sequence cluster rollouts by tenant criticality, not by cluster size or alphabetical order. The sequence is:

  1. Test clusters — synthetic traffic only, no real tenants. First place a change lands. Any crash-on-boot or config-parse bug surfaces here.
  2. Internal clusters — engineering / platform-team tenants. Real traffic, but the audience is the platform team itself; if something breaks, they're the ones on-call for it and can revert quickly.
  3. Application clusters — product-engineering tenants' services. If these break, product teams are affected, but the observability signal for the product still mostly flows (because infra clusters are still healthy).
  4. Infrastructure clusters — the clusters carrying the highest-criticality workloads (network / compute / mesh instrumentation). If these break, you're "flying blind" — the signal about why things are breaking is missing. Last to roll out, with the longest soak time before them.

A regression detected at any stage should halt the rollout before it reaches the next tier.
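The sequence-and-halt behavior above can be sketched as a small driver loop. This is a minimal illustration, not Airbnb's implementation; the `Tier` names mirror the four tiers in the list, and `deploy` / `is_healthy` are hypothetical callables standing in for a real deploy pipeline and health gate.

```python
from enum import IntEnum

class Tier(IntEnum):
    # Lower value = lower criticality = earlier in the sequence.
    TEST = 0
    INTERNAL = 1
    APPLICATION = 2
    INFRASTRUCTURE = 3

def progressive_rollout(clusters_by_tier, deploy, is_healthy):
    """Roll out tier by tier; halt before the next tier on any regression.

    clusters_by_tier: dict mapping Tier -> list of cluster names
    deploy: callable(cluster) that applies the change
    is_healthy: callable(cluster) -> bool, the post-deploy gate
    """
    for tier in sorted(Tier):
        for cluster in clusters_by_tier.get(tier, []):
            deploy(cluster)
            if not is_healthy(cluster):
                # Stop here: later (higher-criticality) tiers are untouched.
                return {"halted_at": tier.name, "cluster": cluster}
    return {"halted_at": None, "cluster": None}
```

The key property is that a failed health gate returns before the next iteration, so a regression caught in an internal cluster never touches application or infrastructure clusters.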

Why sequence by criticality, not size

  • Size-first rollouts ("smallest cluster first") minimize the number of tenants affected by a single regression, but say nothing about the severity of an outage. A small infrastructure cluster taking down observability is worse than a large application cluster dropping a few dashboards.
  • Criticality-first rollouts minimize the MTTR cost of a regression: a test-cluster regression has zero impact; an infrastructure-cluster regression blinds the entire company. Paying the MTTR cost later, with more signal, is the right trade-off.
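The difference between the two orderings is easy to see on a toy fleet. The cluster names, tier ranks, and tenant counts below are invented for illustration; the point is only that sorting by size and sorting by criticality produce different sequences, and size-first puts the small infrastructure cluster dangerously early.

```python
# Hypothetical fleet: (name, criticality rank 0-3, tenant count).
fleet = [
    ("infra-small", 3, 40),   # small, but highest-criticality
    ("app-large",   2, 900),
    ("internal",    1, 120),
    ("test",        0, 0),
]

# Size-first: fewest tenants exposed first, severity ignored.
size_first = [name for name, _, _ in sorted(fleet, key=lambda c: c[2])]

# Criticality-first: worst-case severity deferred to last.
criticality_first = [name for name, _, _ in sorted(fleet, key=lambda c: c[1])]

# Size-first deploys to "infra-small" second -- before the large
# application cluster -- exactly the wrong severity ordering.
```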

Supports a >99.9% availability target

Airbnb cites this pattern as one of the mechanisms that let them achieve >99.9% availability with multi-cluster federation in their metrics storage system — the rollout sequence buys enough learning time between tiers to catch regressions before they reach the tiers that would dominate the availability metric. (Source: sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system)

Pre-requisites

  • Tenant-cluster mapping is known — you need to be able to enumerate which tenants live in which cluster, and classify each cluster by workload criticality.
  • Automated deployment — manually sequencing clusters by hand is slow and error-prone. Airbnb uses Grafana's Kubernetes rollout operators to coordinate multi-AZ StatefulSet rolls per cluster.
  • Observability on the change itself — you need signal about whether the rollout is healthy in each tier before moving on.
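The first prerequisite, a tenant-cluster mapping with a criticality class per cluster, can be derived from a tenant inventory. A sketch, with a hypothetical inventory format and the rule (an assumption, not from the source) that a cluster's tier is the highest criticality of any tenant it hosts:

```python
from collections import defaultdict

# A cluster inherits the highest criticality among its tenants.
RANK = {"test": 0, "internal": 1, "application": 2, "infrastructure": 3}

def classify_clusters(tenants):
    """tenants: dict of tenant name -> (cluster, workload class)."""
    tiers = defaultdict(lambda: "test")   # empty clusters default to test
    for cluster, workload in tenants.values():
        if RANK[workload] > RANK[tiers[cluster]]:
            tiers[cluster] = workload
    return dict(tiers)

# Hypothetical inventory for illustration.
inventory = {
    "mesh-metrics":  ("c-infra-1", "infrastructure"),
    "checkout-svc":  ("c-app-7",   "application"),
    "platform-dash": ("c-int-2",   "internal"),
}
```

The resulting cluster-to-tier map is what feeds the rollout sequencing; keeping it derived from the inventory (rather than hand-maintained) prevents a cluster from silently gaining a high-criticality tenant while still being rolled early.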

When to deviate

  • Security patches may need a different sequencing that prioritizes exposed surface, not criticality.
  • Emergency hotfixes for an active incident may need to go directly to the affected cluster, bypassing the tier sequence.
  • Schema migrations that require ordering (e.g., readers-first, writers-second) overlay a separate sequencing on top.
