PATTERN
# Progressive configuration rollout
Progressive configuration rollout applies the staged-deployment discipline usually reserved for code (canary → small cohort → large cohort → fleet, gated by health signals at each step, with automated rollback on regression) to configuration changes: service topologies, routing tables, feature flags, firewall rules, ACLs, addressing.
It is the configuration-plane analogue of patterns/staged-rollout. Staged rollout bounds the blast radius of a bad build; progressive config rollout bounds the blast radius of a bad config value — and for control planes that drive BGP / DNS / routing, a bad config value can be far more damaging, far faster, than a bad binary.
## The pattern
- Describe changes structurally, not by hard-coded endpoint lists. When a change is a single logical edit to a well-typed abstraction ("this service runs in these regions"), the control plane can decide where and in what order to apply it. When the change is a hard-coded IP-and-DC list, the change is the deployment, and all you can do is hit "apply."
- Stage by failure domain. Canary at one POP / one availability zone / one rack before fanning out to a region, then globally. For addressing / BGP changes this staging is especially load-bearing, because a single advertisement propagates to every peer at once and the blast radius of an unstaged change is global by default.
- Evaluate per stage with real health signals. Latency, error rates, BGP route visibility, prefix reachability from external vantage points, resolver QPS — whatever the control plane actually influences. Without per-stage evaluation the rollout is just a time delay.
- Automated rollback on regression. The rollout orchestrator should detect a regression and revert without paging a human. patterns/fast-rollback is the reverse path.
- Emergency bypass. For incidents where the orchestrator itself is too slow (BGP withdrawal in progress, every second counts), there must be a documented emergency path. Cloudflare used this on 2025-07-14: the revert was first validated in a testing location, then accelerated beyond the normal progressive schedule to restore service.
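The steps above can be sketched as a small orchestrator loop. This is a minimal illustration, not any real control plane's API: `apply`, `revert`, and `healthy` are hypothetical stand-ins for the RPCs and external health probes a production system would use, and the stage names are invented.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stage plan: each stage is a failure domain, ordered by blast radius.
STAGES = ["canary-pop", "one-region", "half-fleet", "global"]

@dataclass
class RolloutResult:
    applied: List[str] = field(default_factory=list)
    rolled_back: bool = False

def progressive_rollout(
    apply: Callable[[str], None],    # push the config change to one failure domain
    revert: Callable[[str], None],   # undo the change in one failure domain
    healthy: Callable[[str], bool],  # per-stage evaluation: latency, errors, reachability
    stages: List[str] = STAGES,
) -> RolloutResult:
    result = RolloutResult()
    for stage in stages:
        apply(stage)
        result.applied.append(stage)
        if not healthy(stage):
            # Automated rollback: revert every stage touched so far, newest first,
            # without waiting for a human to be paged.
            for done in reversed(result.applied):
                revert(done)
            result.rolled_back = True
            break
    return result
```

The point of the sketch is the gating structure: the rollout is not a time delay but a sequence of checkpoints, each of which can halt and unwind the change. An emergency bypass would be a separate, documented entry point that skips ahead in `stages` after validation in the canary stage.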
## Why control-plane config needs this more than code does
Two asymmetries:
- Propagation speed. A bad code push is gated by a CI / CD / release pipeline — minutes to hours from merge to fleet. A bad config value on a global control plane can be effectively instantaneous: one orchestrator evaluation → one BGP withdrawal → global traffic drop within seconds.
- Latent-activation risk. Code tends to fail on first execution. Config — especially referentially-linked config — can be dormant for weeks and activate on an unrelated trigger. Progressive rollout of the trigger event then becomes the checkpoint that would have caught the original bug.
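The latent-activation shape can be made concrete with a toy example. The names and the clobbering mechanism here are illustrative only, not Cloudflare's actual system: a bad entry is merged weeks early, changes nothing at merge time, and only takes effect when a later, unrelated change forces a full re-derivation of routing state.

```python
# Toy model of referentially-linked config with latent-activation risk.
topology = {
    "resolver": {"prefixes": ["1.1.1.0/24"], "locations": ["all"]},
}

def add_entry(name, entry):
    # Step 1 (weeks earlier): a bad entry is merged. Nothing is evaluated yet,
    # so review and tests on *this* change observe no behaviour difference.
    topology[name] = entry

def reevaluate(topology):
    # Step 2 (the trigger): any later change re-derives the full routing state,
    # activating the dormant bug far from the change that introduced it.
    routes = {}
    for name, entry in topology.items():
        for prefix in entry["prefixes"]:
            routes[prefix] = entry["locations"]  # later entries silently win
    return routes

add_entry("test-service", {"prefixes": ["1.1.1.0/24"], "locations": ["one-test-dc"]})
# Dormant: production routes are unchanged until something calls reevaluate().
routes = reevaluate(topology)
# The resolver prefix is now pinned to a single test DC: the global-outage shape.
```

If `reevaluate` ran progressively, per failure domain with health gating, the first canary stage would catch the misrouted prefix; when it runs globally in one step, the trigger change inherits the full blast radius of the dormant bug.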
## Canonical wiki instance
sources/2025-07-16-cloudflare-1111-incident-on-july-14-2025 — Cloudflare's post-mortem explicitly calls out the absence of this pattern on the legacy addressing system as the root structural cause:
> This model also has a significant flaw in that updates to the configuration do not follow a progressive deployment methodology: Even though this release was peer-reviewed by multiple engineers, the change didn't go through a series of canary deployments before reaching every Cloudflare data center. Our newer approach is to describe service topologies without needing to hard-code IP addresses, which better accommodate expansions to new locations and customer scenarios while also allowing for a staged deployment model, so changes can propagate slowly with health monitoring.
The stated remediation is not "add more review" on the legacy system but deprecate the legacy system and migrate onto the strategic one that supports progressive deployment natively.
## When it's hard
- Large-scale atomic updates. BGP advertisement consistency across a fleet sometimes wants to be atomic (you can't half-announce a prefix). Staged rollouts have to be designed so each stage is itself a consistent state, not a half-state.
- Dual-system migrations. During a migration from legacy to strategic, the legacy system still exists and changes through it don't stage. The longer the migration, the longer you keep the non-progressive path as an active risk surface.
- Global mesh dependencies. Some config changes don't have a "small cohort" — they're intrinsically global (e.g. a DNS zone change). For those, pre-production equivalents (test zones, test topologies) become the canary surface. Cloudflare's "manually triggered action is validated in testing locations before being executed" is this shape.
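One way to express the "no small cohort" case is to make the stage plan itself conditional on change scope, substituting a pre-production twin as the canary surface. A minimal sketch, with hypothetical scope labels and stage names:

```python
from typing import List

def stage_plan(change_scope: str) -> List[str]:
    """Pick a rollout sequence for a config change (illustrative names only)."""
    if change_scope == "global":
        # e.g. a DNS zone change: no partial cohort exists, so the canary
        # surface is a pre-production equivalent (test zone / test topology).
        return ["test-zone", "global"]
    # Changes with a natural failure-domain decomposition stage normally.
    return ["canary-pop", "one-region", "global"]
```

The design point is that intrinsically global changes still get a gated stage before production; the gate just lives in a test environment rather than a small production cohort.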
## Relationship to adjacent patterns
- patterns/staged-rollout — sibling pattern for code deployments; same discipline, different artefact.
- patterns/fast-rollback — the reverse path that makes progressive rollout safe; the two are always paired.
- patterns/ab-test-rollout — % rollouts gated by business metrics rather than health signals; useful for feature flags but not for addressing / routing config.
## Seen in
- sources/2025-07-16-cloudflare-1111-incident-on-july-14-2025 — absence of this pattern on the legacy addressing system is the root structural cause of the 62-minute global 1.1.1.1 outage; the stated remediation is to move every addressing change onto the strategic system that supports progressive deployment.