PATTERN Cited by 3 sources
Staged rollout¶
Progressively roll out a change (code, config, feature flag) starting in a limited scope and expanding only while health signals stay green. This explicitly bounds the blast radius of a bad change; tightly paired with fast-rollback.
Typical dimensions¶
- Environment: dev → staging → production.
- Geography / failure domain: one AWS zone, then region, then global.
- Host/pod percentage: 1% → 10% → 50% → 100% of Kubernetes pods.
- User slice: specific tenants, internal employees, cohort IDs.
- Time: automatic progression on a schedule (cron rollout) vs human-gated.
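These dimensions compose into a stage plan rather than being alternatives. A minimal sketch of one possible composition (all names and the `Stage` structure are illustrative, not any specific platform's API):

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """One checkpoint in a staged rollout (fields mirror the dimensions above)."""
    environment: str         # dev -> staging -> production
    scope: str               # failure domain: zone, region, or global
    pod_percent: int         # percentage of pods receiving the change
    user_slice: str = "all"  # tenants, internal employees, cohort IDs
    gate: str = "human"      # "human", "automatic", or "cron"

# A plausible plan: widen one dimension at a time, gating production by a human.
PLAN = [
    Stage("dev",        "single-zone", 100, gate="automatic"),
    Stage("staging",    "single-zone", 100, gate="automatic"),
    Stage("production", "single-zone",   1),
    Stage("production", "one-region",   10),
    Stage("production", "global",       50),
    Stage("production", "global",      100),
]
```

The ordering matters: each stage should change only one dimension relative to the previous one, so a regression can be attributed to the scope that was just added.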
Required machinery¶
- Per-stage evaluation. Health signals (errors, latency, business metrics) must be observable at the rollout granularity; otherwise the rollout is just a delay.
- Notification on regression. Author and stakeholders get paged when a stage flags a regression.
- Rollback trigger. Either automatic on regression or one-click human-initiated.
- Rollout orchestrator. A control-plane component that tracks state across stages; not something services manage themselves.
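The machinery above can be sketched as a single control-plane loop. The helper callables (`deploy`, `health_ok`, `notify`, `rollback`) are hypothetical stand-ins for real platform hooks; the loop itself is the orchestrator's stage-tracking, nothing more:

```python
def run_rollout(stages, deploy, health_ok, notify, rollback):
    """Advance through stages, expanding scope only while health stays green.

    `deploy`, `health_ok`, `notify`, `rollback` are supplied by the platform;
    services never run this loop themselves.
    """
    completed = []
    for stage in stages:
        deploy(stage)
        if not health_ok(stage):           # per-stage evaluation at rollout granularity
            notify(stage)                  # page author + stakeholders
            rollback(completed + [stage])  # fast-rollback everything shipped so far
            return {"status": "rolled_back", "failed_stage": stage}
        completed.append(stage)
    return {"status": "complete", "stages": completed}
```

Note that `health_ok` must observe signals at the granularity of `stage`; a fleet-wide metric here would reduce the whole exercise to a delay, as the first bullet warns.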
Seen in¶
- sources/2026-02-18-airbnb-sitar-dynamic-configuration — Airbnb Sitar makes staged rollout a first-class feature of its config platform. Control plane decides env, AWS zone, K8s pod %, and progression; each stage evaluates regressions, notifies stakeholders, and can fast-roll-back. Teams can pick automatic, manual, or cron rollout strategies.
- sources/2025-07-16-cloudflare-1111-incident-on-july-14-2025 — absence-of-pattern instance: the legacy Cloudflare addressing system that triggered the 07-14 1.1.1.1 outage explicitly lacks progressive deployment — peer-reviewed changes ship to every data center at once. The blog's stated remediation is to deprecate the legacy system in favour of the strategic one that supports staged deployment with health monitoring. See the configuration-plane specialisation patterns/progressive-configuration-rollout.
- sources/2026-01-19-cloudflare-what-came-first-the-cname-or-the-a-record — pattern-present-but-defect-invisible instance: the systems/cloudflare-1-1-1-1-resolver|1.1.1.1 CNAME-ordering change was progressively rolled out (2025-12-02 code commit → 2025-12-10 test environment → 2026-01-07 23:48 UTC global release start → 2026-01-08 17:40 UTC 90% fleet), but the defect was not observable at any pre-90% checkpoint because the affected client population (glibc getaddrinfo consumers for hostnames with partially-expired CNAME chains, plus three models of Cisco Catalyst switches configured to use 1.1.1.1) was small and uncorrelated with POP selection — so per-stage health metrics passed clean until fleet-wide coverage made the aggregate breakage visible. Companion lesson: staged rollouts catch crashes that scale with traffic volume, but not the subset of client-implementation-specific defects where the broken population is small and uniformly distributed. Pair with patterns/test-the-ambiguous-invariant as the first line of defence.
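The defect-invisible failure mode can be made concrete with back-of-envelope numbers (every value here is hypothetical, chosen only to show the shape of the problem, not taken from the incident): when the broken population is uniformly distributed across stages, the error-rate delta each stage observes is the same small constant at every checkpoint, so a rate-based alert never fires even though the absolute count of affected clients grows with coverage.

```python
ALERT_THRESHOLD = 0.005      # assumed: a stage flags if error-rate delta > 0.5%
BROKEN_SHARE = 0.001         # assumed: 0.1% of clients hit the bug
TOTAL_CLIENTS = 100_000_000  # assumed fleet-wide client population

checkpoints = []
for fraction in (0.01, 0.10, 0.50, 0.90, 1.00):
    # Uniform distribution, uncorrelated with POP selection: the *rate* seen
    # inside any stage is constant; only the absolute count scales.
    rate = BROKEN_SHARE
    affected = int(fraction * TOTAL_CLIENTS * BROKEN_SHARE)
    checkpoints.append((fraction, rate, affected, rate > ALERT_THRESHOLD))

for fraction, rate, affected, fired in checkpoints:
    print(f"{fraction:>4.0%} coverage: rate delta {rate:.1%}, "
          f"~{affected:,} clients broken, alert fired: {fired}")
```

Contrast with a traffic-proportional crash: there the rate itself climbs with each stage and a per-stage check catches it early, which is exactly the class of defect staged rollouts are built for.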
Related¶
- patterns/fast-rollback — the other half of staged rollouts
- patterns/progressive-configuration-rollout — the configuration-plane analogue of this pattern, for routing / addressing / feature-flag changes
- patterns/sidecar-agent — common enforcement point for pod-% rollouts
- patterns/cohort-percentage-rollout — specialisation for fleet-wide enforcement rollouts where per-device risk varies: sort the fleet by a telemetry-derived risk metric, advance cohorts lowest-to-highest-risk, pause at 98% for hardening, pair with a patterns/rollout-escape-hatch
- patterns/rollout-escape-hatch — per-user, time-bound, rollout-only revert path (retired at 100%); the individual-scope sibling to patterns/fast-rollback