PATTERN Cited by 3 sources

Staged rollout

Progressively roll out a change — code, config, feature flag — starting in a limited scope and expanding only if health signals stay green. Explicitly bounds the blast radius of bad changes; tightly paired with fast rollback.

Typical dimensions

  • Environment: dev → staging → production.
  • Geography / failure domain: one AWS zone, then region, then global.
  • Host/pod percentage: 1% → 10% → 50% → 100% of Kubernetes pods.
  • User slice: specific tenants, internal employees, cohort IDs.
  • Time: automatic progression on a schedule (cron rollout) vs human-gated.
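The dimensions above compose into a concrete stage ladder. A minimal sketch in Python; the `Stage` type and its fields are illustrative assumptions, not anything named by the sources:

```python
from dataclasses import dataclass

# Hypothetical stage model: each stage widens scope along one or more
# of the dimensions above. All names and fields are illustrative only.
@dataclass(frozen=True)
class Stage:
    name: str
    environment: str   # dev / staging / production
    pod_percent: int   # share of Kubernetes pods receiving the change
    human_gated: bool  # True = a person approves progression to the next stage

STAGES = [
    Stage("canary",   "staging",    1,   False),
    Stage("one-zone", "production", 10,  True),
    Stage("region",   "production", 50,  True),
    Stage("global",   "production", 100, True),
]

# Invariant: scope only ever widens from one stage to the next.
assert all(a.pod_percent <= b.pod_percent for a, b in zip(STAGES, STAGES[1:]))
```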

Required machinery

  • Per-stage evaluation. Health signals (errors, latency, business metrics) must be observable at the rollout granularity; otherwise the rollout is just a delay.
  • Notification on regression. Author + stakeholders get paged when a stage flags.
  • Rollback trigger. Either automatic on regression or one-click human-initiated.
  • Rollout orchestrator. A control-plane component that tracks state across stages; not something services manage themselves.
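Put together, the machinery above reduces to a small control loop: deploy a stage, let signals bake, evaluate, then either advance or unwind. A hedged sketch, where `deploy`, `healthy`, `notify`, and `rollback` are injected callables standing in for the real control-plane hooks (the sources name no such API):

```python
import time

# Minimal orchestrator loop. All four callables are hypothetical stand-ins
# for real control-plane hooks; nothing here is a specific platform's API.
def run_rollout(stages, deploy, healthy, notify, rollback, bake_seconds=0):
    completed = []
    for stage in stages:
        deploy(stage)
        time.sleep(bake_seconds)   # let health signals accumulate at this scope
        if not healthy(stage):     # per-stage evaluation, not fleet-wide averages
            notify(f"regression at {stage}; rolling back")
            for done in reversed(completed + [stage]):
                rollback(done)     # staged rollout is paired with fast rollback
            return False
        completed.append(stage)
    return True

# Example: a 1% -> 10% -> 50% -> 100% pod progression halts at 50%.
log = []
ok = run_rollout(
    ["1%", "10%", "50%", "100%"],
    deploy=log.append,
    healthy=lambda s: s != "50%",  # simulate a regression flagged at 50%
    notify=print,
    rollback=lambda s: log.append(f"rollback {s}"),
)
# ok is False and the 100% stage was never deployed
```

Note the design choice implied by the pattern: the orchestrator, not the service, owns `completed`, so rollback can unwind every stage already reached.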

Seen in

  • sources/2026-02-18-airbnb-sitar-dynamic-configuration — Airbnb Sitar makes staged rollout a first-class feature of its config platform. Control plane decides env, AWS zone, K8s pod %, and progression; each stage evaluates regressions, notifies stakeholders, and can fast-roll-back. Teams can pick automatic, manual, or cron rollout strategies.
  • sources/2025-07-16-cloudflare-1111-incident-on-july-14-2025 — absence-of-pattern instance: the legacy Cloudflare addressing system that triggered the 07-14 1.1.1.1 outage explicitly lacks progressive deployment — peer-reviewed changes ship to every data center at once. The blog's stated remediation is to deprecate the legacy system in favour of the strategic one that supports staged deployment with health monitoring. See the configuration-plane specialisation patterns/progressive-configuration-rollout.
  • sources/2026-01-19-cloudflare-what-came-first-the-cname-or-the-a-record — pattern-present-but-defect-invisible instance: the systems/cloudflare-1-1-1-1-resolver|1.1.1.1 CNAME-ordering change was progressively rolled out (2025-12-02 code commit → 2025-12-10 test environment → 2026-01-07 23:48 UTC global release start → 2026-01-08 17:40 UTC 90% fleet), but the defect was not observable at any pre-90% checkpoint because the affected client population (glibc getaddrinfo consumers for hostnames with partially-expired CNAME chains, plus three models of Cisco Catalyst switches configured to use 1.1.1.1) was small and uncorrelated with POP selection — so per-stage health metrics passed clean until fleet-wide coverage made the aggregate breakage visible. Companion lesson: staged rollouts catch crashes that scale with traffic volume, but not the subset of client-implementation-specific defects where the broken population is small and uniformly distributed. Pair with patterns/test-the-ambiguous-invariant as the first line of defence.