
PATTERN

Phased rollout across release channels

Problem

An infrastructure change that is valid in CI and that passes schema validation and ChangeSet preview may still have unintended effects at runtime — side-effects only visible once the change actually executes. If that change is applied to every account or cluster in the fleet simultaneously, a bad change causes a fleet-wide incident before anyone can react.

The supertool failure mode makes this acute: the same tool that edits the config also executes the rollout, so a config bug that survives pre-deploy gates still hits every target at once.

Solution

Partition all production targets into release channels — ordered groups of accounts or clusters — and enforce that every change graduates through every channel in order:

playground  →  test  →  infra  →  …  →  production

Each channel is a bucket of targets (AWS accounts, Kubernetes clusters) at an advancing level of operational criticality. A change applied to channel N is only allowed to advance to channel N+1 after the channel-N deploy is observed to succeed. A bug caught in channel 1 never reaches production.

Zalando's canonical instance

From the 2024-01 metadpata postmortem:

"Our Kubernetes cluster rollout already included a phased rollout to different groups of clusters. This idea was extended to our AWS infrastructure. The rollout process adopted by our tooling now includes gradual rollout to different release channels, each associated with a few AWS account categories (e.g. playground, test, infra). All changes must go through all release channels before getting to production. This approach allows us to gradually deploy changes to different accounts, ensuring a more controlled propagation that catches errors early on with a limited blast radius. The trade-off here is of course that the rollout takes a longer time."

— sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools

Two substrates using the same phased-rollout shape:

  1. Kubernetes cluster rollout (pre-existing at Zalando) — groups of clusters advanced through sequentially.
  2. AWS infrastructure rollout (new, post-incident) — release channels map to AWS account categories: playground, test, infra, (…further categories…), production.

The metadpata incident was the forcing function for extending the existing Kubernetes shape to AWS accounts.

Mechanism

1. Partition production targets into ordered channels
   [C1, C2, …, Cn], with C1 = lowest-blast-radius,
   Cn = production.
2. Assign every target (account/cluster) to exactly one
   channel.
3. On every change:
   a. Apply to C1.
   b. Observe for soak interval + health signals.
   c. If success, advance to C2. Otherwise, roll back.
   d. Repeat through Cn.
4. Any channel failure blocks further advancement.
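The steps above can be sketched as a small orchestration loop. This is an illustrative model, not Zalando's actual tooling: the `Channel` type, the `apply_change` / `healthy_after_soak` / `rollback` callables, and the roll-back-everything failure policy are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Channel:
    name: str
    targets: list[str]  # AWS account IDs or cluster names in this channel

def phased_rollout(
    channels: list[Channel],                        # [C1, ..., Cn], ordered by blast radius
    apply_change: Callable[[str], None],            # deploy the change to one target
    healthy_after_soak: Callable[[Channel], bool],  # soak interval + health signals for a channel
    rollback: Callable[[str], None],                # undo the change on one target
) -> bool:
    """Apply a change channel by channel; any channel failure blocks advancement."""
    applied: list[str] = []
    for channel in channels:
        for target in channel.targets:
            apply_change(target)
            applied.append(target)
        if not healthy_after_soak(channel):
            # Failure in channel N: later channels are never touched,
            # and everything applied so far is rolled back in reverse order.
            for target in reversed(applied):
                rollback(target)
            return False
    return True
```

A bug that surfaces in the "test" channel therefore never reaches targets assigned to later channels, which is the limited-blast-radius property the pattern exists to provide.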

The graduation criterion is the load-bearing design choice — pure timer vs manual approval vs automated health gate. Zalando's post names the trade-off ("takes a longer time") without disclosing the specific gate.
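The three gate styles named above can be sketched as interchangeable predicates that the orchestrator polls before advancing a channel. All names here are illustrative assumptions; the Zalando post does not disclose which gate their tooling uses.

```python
import time
from typing import Callable

def timer_gate(soak_seconds: float) -> Callable[[], bool]:
    """Pure timer: advance once the soak interval has elapsed."""
    deadline = time.monotonic() + soak_seconds
    return lambda: time.monotonic() >= deadline

def manual_gate(approved: Callable[[], bool]) -> Callable[[], bool]:
    """Manual approval: advance only once an operator has signed off."""
    return approved

def health_gate(error_rate: Callable[[], float], threshold: float) -> Callable[[], bool]:
    """Automated health gate: advance only while errors stay under the threshold."""
    return lambda: error_rate() < threshold
```

The choice sets the trade-off directly: a pure timer is cheap but blind, a manual gate is attentive but slow and inconsistent, and a health gate is only as good as the per-channel signals feeding it.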

Substrate examples

  • Kubernetes clusters: test clusters → staging clusters → prod clusters
  • AWS accounts: playground → test → infra → production (Zalando)
  • Regions: dev region → staging region → small prod region → all prod
  • Fleets: canary fleet → pilot fleet → main fleet
  • SaaS tenants: internal tenants → volunteer tenants → all tenants

The pattern is substrate-agnostic; what matters is the ordering by risk.
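For the AWS-account substrate, explicit channel membership can be as simple as an ordered lookup table. The account IDs below are invented for the example; the only real requirement is that every target resolves to exactly one channel.

```python
# Ordered lowest blast radius -> highest; dicts preserve insertion order.
RELEASE_CHANNELS: dict[str, list[str]] = {
    "playground": ["123456789012"],
    "test":       ["234567890123", "345678901234"],
    "infra":      ["456789012345"],
    "production": ["567890123456"],
}

def channel_of(account_id: str) -> str:
    """Resolve a target to its single channel; unassigned targets are an error."""
    for channel, accounts in RELEASE_CHANNELS.items():
        if account_id in accounts:
            return channel
    raise KeyError(f"{account_id} is not assigned to any release channel")
```

Keeping this table in one place is what makes the ordering enforceable: the rollout tool iterates the channels in declaration order rather than trusting each deploy to pick its own targets.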

Trade-offs

  • Longer rollout time (named directly by Zalando). Rollout now takes ≥ channels × soak_interval.
  • Channel representativeness matters. If channel 1 is empty playground accounts with no production-class resources, passing channel 1 is weak evidence for prod safety.
  • More moving parts. Rollout tooling has to track per-channel state; the deployment pipeline is more complex.
  • Emergency bypass needed. A security patch may need to skip channels. Pattern must expose an explicit bypass rather than forcing teams to disable the tooling.
  • Intermediate-state bugs. A change can be valid during rollout but invalid at the boundary — e.g., a new resource in channel N calling an old API in channel N+1.

Prerequisites

  • Explicit channel membership — every target knows its channel.
  • Rollout orchestration — a tool that enforces the order automatically. If humans advance manually, the pattern degrades to "phased when we remember to."
  • Per-channel health signal — errors, alerts, metrics that can gate advancement.
  • Rollback path from each channel — if channel 2 fails, channel 1 can be rolled back.
  • Soak interval long enough to surface real issues.

When not to use it

  • Security patches under active exploitation — need fleet-wide apply, not phased.
  • Changes with no runtime effect (documentation, log messages) — overkill.
  • Fully stateless, easily-rollbackable service deploys — canary + auto-rollback may be enough without the full channel stack.

Composes with

Adjacent patterns

Seen in
