PATTERN Cited by 1 source

Three-mode rollout — off / shadow / exec

Intent

Roll out a new component that sits on the critical path — especially a control-plane or request-routing component — through a three-position configuration flag rather than a two-position on/off flag. The middle "shadow" position lets operators run the new component alongside the old one and diff their outputs before any traffic depends on the new component.

The three modes

  1. Off (False / legacy / disabled) — the existing production path is unchanged. New component is not deployed, or deployed-but-inert.
  2. Shadow (Pre / dark / pre-processing) — new component runs in the cluster in parallel with the existing path. It does the same work the new path will do in production (fetching, parsing, computing outputs), but production traffic still flows through the old path. Its output is exposed for inspection — typically via an HTTP endpoint — so operators can diff the new output against the old.
  3. Exec (Exec / live / enabled) — the new component is promoted to the production critical path. The old path is bypassed or retired.
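The three positions can be read as two independent questions: which path serves production, and whether the new component is doing any work. A minimal sketch in POSIX shell follows, assuming the mode names off / shadow / exec from the list above. The function names `serving_path` and `new_component_computes` are hypothetical and not taken from any real tool.

```shell
# Hypothetical helper: which path feeds production traffic in a given mode.
serving_path() {
  case "$1" in
    off|shadow) echo "legacy" ;;  # production still flows through the old path
    exec)       echo "new" ;;     # new component is on the critical path
    *)          echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: succeeds if the new component should be computing output.
new_component_computes() {
  case "$1" in
    shadow|exec) return 0 ;;
    off)         return 1 ;;      # not deployed, or deployed but inert
    *)           echo "unknown mode: $1" >&2; return 2 ;;
  esac
}
```

Shadow is the only position where the two answers diverge: the new component computes, but production does not depend on it.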

Canonical instance

Zalando's Route Server rollout (Source: sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster). The routesrv mode config flag has three values — False / Pre / Exec — controlled by a single entry in the cluster's config-defaults.yaml:

  • False: Skipper polls the Kubernetes API directly (the pre-migration production path).
  • Pre: routesrv is running and computing its own routing table in parallel. Skipper is still polling the API directly. Both routing tables are exposed:
curl 'http://127.0.0.1:9911/routes?limit=...&nopretty' > skipper_routes.eskip
curl 'http://127.0.0.1:9090/routes' > routesrv_routes.eskip
git diff --no-index -- skipper_routes.eskip routesrv_routes.eskip

Operators confirm git diff is empty cluster by cluster before advancing.

  • Exec: Skipper fetches its routing table from routesrv. Direct-API polling is disabled. Production now depends on routesrv.

Clusters advanced False → Pre → Exec tier by tier (test clusters first, bake 2 weeks; then production tiers). Result: zero downtime, zero GMV loss.
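The per-cluster check can be turned into an explicit gate. The sketch below is an assumption about how such a gate might look, not the script Zalando used: it takes two already-saved routing-table files (for example the two .eskip files from the curl commands) and only reports Exec as the next mode when they are byte-identical. The function names are hypothetical, and `cmp -s` stands in for the `git diff --no-index` check.

```shell
# Succeeds only if the two saved routing tables are byte-identical.
routes_match() {
  cmp -s -- "$1" "$2"
}

# Hypothetical gate: print the mode the cluster may move to next.
# While the tables differ, the cluster stays in Pre and the call fails.
promote_if_equal() {
  if routes_match "$1" "$2"; then
    echo "Exec"
  else
    echo "Pre"
    return 1
  fi
}
```

Usage would be along the lines of `promote_if_equal skipper_routes.eskip routesrv_routes.eskip`. A non-zero exit status is the signal to inspect the diff by hand before doing anything else.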

Why Pre is the non-negotiable middle step

Shadow mode is what de-risks the Exec cutover. It lets the team treat route-table equivalence ("the new component produces exactly the same output as the old") as a checked precondition for cutover rather than a hope. When the diff is non-empty, the discrepancy can be diagnosed without any production impact. From the post:

Remember, if our routing table is broken for some reason, we will have a downtime. That's why we had to be extra cautious and check any small difference in the routing table across all clusters.

A two-mode (off/on) rollout has no safe place to catch "works in test, breaks in prod" divergences.

Shape requirements for the new component

  • Pure / idempotent "compute" step that can run in parallel. Route Server's polling + parsing produces a routing table; it doesn't mutate upstream state. If the new component has side effects (writes to a shared store, emits externally visible events), shadow mode is harder and may require pointing the side effects at a test sink.
  • Observable output — the output has to be accessible for diffing. HTTP endpoints, metrics, log fingerprints, or written files all work. Route Server's /routes endpoint is the canonical shape.
  • Deterministic enough for diff — pure inputs to pure outputs given the same snapshot. If the new component is non-deterministic (timestamps, generation counters, hash salts) the diff needs canonicalisation.
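Where the output carries volatile fields, both sides can be passed through the same canonicalisation step before diffing. The sketch below assumes a made-up line-oriented format with one entry per line and a hypothetical ` ts=<digits>` field. It is not the eskip format, and sorting is only appropriate when line order carries no meaning.

```shell
# Hypothetical canonicaliser: reads entries on stdin, writes a stable form.
canonicalize() {
  # Drop the assumed volatile " ts=<digits>" field, then sort so that
  # ordering differences between the two components do not show up as a diff.
  sed 's/ ts=[0-9][0-9]*//' | LC_ALL=C sort
}
```

Both outputs would be piped through `canonicalize` and the results compared. Any difference that survives is a real divergence in content rather than noise.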

Operational properties

  • Each mode transition is a config-file change, not a code deploy. Promotion cadence is decoupled from the deploy pipeline.
  • Rollback is to the previous mode (Exec → Pre → False), and because False still has all the old paths intact, the rollback is genuinely safe.
  • Pairs with tiered cluster rollout: promote mode per cluster category (test → staging → prod tier-1 → prod tier-2 → …) with a bake period at each step.
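The transitions form a short linear chain, which keeps promotion and rollback symmetric. As an illustration only, with hypothetical function names and the False / Pre / Exec values from the canonical instance:

```shell
# Next mode when promoting a cluster; Exec is the end of the chain.
promote() {
  case "$1" in
    False) echo "Pre" ;;
    Pre)   echo "Exec" ;;
    Exec)  echo "Exec" ;;
    *)     echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Previous mode when rolling back; False is the floor.
rollback() {
  case "$1" in
    Exec)  echo "Pre" ;;
    Pre)   echo "False" ;;
    False) echo "False" ;;
    *)     echo "unknown mode: $1" >&2; return 1 ;;
  esac
}
```

Each call moves one step at a time, so a cluster never jumps from False straight to Exec without passing through the shadow comparison.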

When to apply

  • Component sits on the critical path (control plane, proxy, router, auth, routing decisions).
  • Output is observable and comparable to the existing system's output.
  • Stakes of silent divergence are high (revenue, SLO, correctness).
  • A two-mode dark-launch or canary can't easily distinguish "works" from "works the same".

When not to apply

  • Component output is not comparable (e.g. a new user-facing UI, or performance-only changes where the diff is a latency distribution, not a string diff).
  • Running two copies in parallel is economically infeasible (very large stateful workloads).