PATTERN Cited by 1 source

Release train rollout with canary

Problem

Staged rollouts come in two shapes, each with a well-known failure mode:

  • Strictly sequential release train (version advances one stage at a time, always gated on the previous stage's success) — safe, but by the time a change boards the train it is bundled with every change merged since the previous departure, so cumulative changes are tested together. If the N-th stage detects an issue, bisecting to the specific change is harder, and the operational cost of revert-and-re-release is amplified.
  • Every-stage-in-parallel canary (all stages get the new version simultaneously, with one stage flagged "canary") — fast, but the canary has no real "production workload specialisation"; a defect that only manifests under production-specific load or data is already everywhere by the time it's detected.

Solution

Split the stages into two roles:

  • One stage (the "canary") runs at the head of the line — it receives every new version as soon as it's available, regardless of whether the previous version has completed the rest of the train. The canary is the fast-feedback stage designed to catch issues close to when they were introduced.
  • The remaining stages run as a strict release train — version V only advances into stage 2 once the previous version has successfully completed all N stages (i.e., reached prod-N). The train is the slow, safety-padded rollout that bounds blast-radius over time.

The canary is not at the head of the train — it's a parallel fast path that receives new versions regardless of the train's state. The rest of the train only advances when the previous version has made it all the way through.
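
The two advancement rules can be sketched as a toy state machine. All names and structure here are hypothetical; the source does not describe Slack's actual tooling:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Rollout:
    """Toy model: a parallel canary plus a width-1 release train."""
    n_train_stages: int = 5                 # stages 2..6 of a 6-stage fleet
    canary: Optional[str] = None            # version on the canary stage
    in_train: Optional[str] = None          # version currently riding the train
    train_stage: int = 0                    # 1..n_train_stages while riding
    pending: List[str] = field(default_factory=list)
    done: List[str] = field(default_factory=list)

    def tick(self, new_version: Optional[str] = None) -> None:
        """One cadence tick (e.g. one hour)."""
        if new_version:
            self.canary = new_version       # fast path: no gating at all
            self.pending.append(new_version)
        if self.in_train:                   # train advances one stage per tick
            self.train_stage += 1
            if self.train_stage > self.n_train_stages:
                self.done.append(self.in_train)   # cleared the last stage
                self.in_train = None
        if self.in_train is None and self.pending:
            # Gate: the next version enters stage 2 only after the previous
            # one has completed every stage of the train.
            self.in_train = self.pending.pop(0)
            self.train_stage = 1
```

After three hourly ticks carrying V1, V2, V3, the canary is on V3 while the train is still walking V1 through its stages: the split-brain state the caveats below call out.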

Slack's instantiation

The canonical wiki instance is Slack's phase-2 Chef rollout (Source: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption):

  • Slack's six production environments (prod-1 through prod-6, split by AZ — see patterns/split-environment-per-az-for-blast-radius) are the stages.
  • prod-1 is the canary: "prod-1 is treated as our canary production environment. It receives updates every hour (assuming there have been new changes merged into the cookbooks) to ensure that changes are regularly exercised in a real production setting."
  • prod-2 through prod-6 are the release train: "Changes to prod-2 through prod-6 are rolled out using a release train model. Before a new version is deployed to prod-2, we ensure the previous version has successfully progressed through all production environments up to prod-6."

Why the canary is not at the head of the train

Verbatim motivation:

"If we waited for a version to pass through all environments before updating prod-1 — as we do with prod-2 onward — we'd end up testing artifacts with large, cumulative changes. By contrast, updating prod-1 frequently with the latest version allows us to detect issues closer to when they were introduced."

The canary-at-head-of-train variant would give:

  • V1 → canary → (whole train) → done
  • … meanwhile changes V2, V3, V4, V5 accumulate …
  • V5 → canary → carries ALL of the V2, V3, V4, V5 changes at once

Making the canary a parallel fast path gives:

  • V1 → canary (hour 1) → V1 → train (advances hourly).
  • V2 → canary (hour 2). Still hot-testing V2 close to when the V1 → V2 diff was created.
  • V3 → canary (hour 3). And so on — each canary run tests a smaller, incremental diff.
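
A toy calculation of how many changes each canary run carries under the two shapes, assuming one new version cut per hour and a 5-hour train. The numbers are illustrative, not from the source:

```python
def canary_diff_sizes(n_versions: int, train_hours: int, parallel: bool) -> list:
    """Changes tested per canary run when one version is cut each hour."""
    sizes, backlog, busy_until = [], 0, 0
    for hour in range(n_versions):
        backlog += 1                          # one new version this hour
        if parallel or hour >= busy_until:
            sizes.append(backlog)             # canary absorbs the backlog
            backlog = 0
            if not parallel:
                busy_until = hour + train_hours  # canary waits on the train
    return sizes

print(canary_diff_sizes(6, 5, parallel=False))  # canary at head: [1, 5]
print(canary_diff_sizes(6, 5, parallel=True))   # parallel path: [1, 1, 1, 1, 1, 1]
```

The at-head shape's second run bundles five changes at once; the parallel fast path never tests more than one diff per run.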

Numbers and cadence

Slack discloses:

  • Canary (prod-1) cadence: hourly (assuming new changes).
  • Release-train advancement cadence: :30 past the hour.
  • Train length: prod-2 through prod-6 — 5 stages after the canary, advancing one per hour when new versions are available.
  • Full rollout latency: if prod-1 gets version X at hour N, prod-2 gets X at N+0:30 (the same hour), then prod-3 at N+1:30, … prod-6 at N+4:30.

Between prod-2 and prod-6, the train adds ~4-5 hours of propagation latency — the time during which each AZ has a slightly different version. This latency is the blast-radius reduction: the whole fleet moves more slowly, but a defect surfacing in one AZ can be detected, and the train halted, before it reaches the remaining AZs.
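
The disclosed cadence can be written out as hour offsets. The :00 slot for prod-1 is an assumption here; the source only fixes the train's advancement at :30 past the hour:

```python
# Hour offsets for version X, relative to its canary deploy at hour N.
schedule = {"prod-1": "N+0:00"}               # assumed top of the hour
for i in range(2, 7):                         # prod-2 .. prod-6, one per hour
    schedule[f"prod-{i}"] = f"N+{i - 2}:30"   # train advances at :30

print(schedule["prod-2"], schedule["prod-6"])  # → N+0:30 N+4:30
```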

Why this shape is load-bearing

  • Fast detection (canary) — issues surface within 1 hour of the change being cut.
  • Bounded propagation (release train) — production fleet rollout takes multiple hours; a defect caught in prod-2 or prod-3 doesn't affect prod-4, prod-5, or prod-6.
  • Clear revert path (previous version is always in the environment pin, not garbage-collected).

Alternative: strict release train (no parallel canary)

The simpler shape — every stage strictly sequential, the first stage is "canary" — gives smaller blast-radius but larger cumulative-change risk. Works best when:

  • Change frequency is low (each advance carries only a small diff anyway).
  • Train depth is small (fewer intermediate stages to propagate through).
  • Revert cost is low (rolling back cumulative changes is not painful).

Slack's high-frequency cookbook-change environment plus six-stage train length is exactly the regime where the parallel-canary-plus-train hybrid pays off.

Alternative: cohort-percentage canary

An orthogonal rollout pattern (see patterns/cohort-percentage-rollout) is to gate progression by percentage of fleet (1% → 10% → 50% → 100%) with the canary being the 1% cohort. This composes with the environment-based shape: each environment in the release train could itself be rolled out at 1%/10%/etc. within its nodes.
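
A sketch of that composition, purely illustrative; neither the source nor the linked pattern specifies this exact structure:

```python
def rollout_steps(environments, cohorts):
    """Flatten the composed rollout: every (environment, cohort%) pair, in
    order. The outer train still gates each environment on the previous
    one completing its 100% cohort."""
    return [(env, pct) for env in environments for pct in cohorts]

steps = rollout_steps([f"prod-{i}" for i in range(2, 7)], [1, 10, 50, 100])
print(len(steps))  # → 20 gated steps instead of 5
```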

Composes with

  • patterns/split-environment-per-az-for-blast-radius: provides the per-AZ production environments that serve as the train's stages.
  • patterns/cohort-percentage-rollout: an orthogonal percentage-based gate that can run within each stage.

Caveats

  • Assumes change frequency justifies the canary. If cookbook changes happen weekly, an hourly canary is over-engineered. Slack's approach fits a regime where cookbooks change multiple times per day.
  • Hourly canary cadence assumes low-enough observation-signal latency. If the metric that catches issues takes hours to aggregate, an hourly canary fires before the previous hour's signal is fully in.
  • Release-train width (1 concurrent version) can cause backpressure. A rollout stuck in prod-3 means prod-2 can't accept a new version even if one is ready — a deliberate trade-off, but operators may be tempted to override and ship two versions in parallel through the train.
  • Asymmetry is operational debt. Two different rollout regimes (canary hourly + train at 5-6-hour cadence) means two different runbooks, two different alert thresholds, and a split-brain state where prod-1 is on version V+N and prod-6 is on V. Operational dashboards must show version-per-environment to make the state legible.

Seen in
