PATTERN Cited by 1 source
Three-mode rollout — off / shadow / exec¶
Intent¶
Roll out a new component that sits on the critical path — especially a control-plane or request-routing component — through a three-position configuration flag rather than a two-position on/off flag. The middle "shadow" position lets operators run the new component alongside the old one and diff their outputs before any traffic depends on the new component.
The three modes¶
- Off (False / legacy / disabled) — the existing production path is unchanged. New component is not deployed, or deployed-but-inert.
- Shadow (Pre / dark / pre-processing) — new component runs in the cluster in parallel with the existing path. It does the same work the new path will do in production (fetching, parsing, computing outputs), but production traffic still flows through the old path. Its output is exposed for inspection — typically via an HTTP endpoint — so operators can diff the new output against the old.
- Exec (Exec / live / enabled) — the new component is promoted to the production critical path. The old path is bypassed or retired.
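As a rough illustration of the three positions, the sketch below maps each flag value to what the new component does in that mode. The names `Mode`, `Behaviour`, `behaviour_for` and `parse_mode` are hypothetical and not taken from any real system. The choice to keep the output exposed in Exec mode is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    OFF = "False"    # legacy path only
    SHADOW = "Pre"   # new component runs, old path still serves
    EXEC = "Exec"    # new component serves production


@dataclass(frozen=True)
class Behaviour:
    run_new_component: bool  # is the new component doing its work?
    serve_from_new: bool     # does production depend on its output?
    expose_for_diff: bool    # is its output published for inspection?


def behaviour_for(mode: Mode) -> Behaviour:
    """Translate a mode into the behaviour described in the list above."""
    if mode is Mode.OFF:
        return Behaviour(False, False, False)
    if mode is Mode.SHADOW:
        return Behaviour(True, False, True)
    # Exec: assumed here to keep the output exposed for later inspection.
    return Behaviour(True, True, True)


def parse_mode(raw: str) -> Mode:
    """Parse the flag value; unknown values raise ValueError
    instead of silently falling back to some mode."""
    return Mode(raw)
```

Failing loudly on an unrecognised value is a deliberate choice in this sketch: a typo in the flag should not quietly select a mode.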
Canonical instance¶
Zalando's Route Server rollout (Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
The routesrv mode config flag has three values —
False / Pre / Exec — controlled by a single entry in
the cluster's config-defaults.yaml:
- False — Skipper polls the Kubernetes API directly (the pre-migration production path).
- Pre — routesrv is running and computing its own routing table in parallel. Skipper is still polling the API directly. Both routing tables are exposed:
curl 'http://127.0.0.1:9911/routes?limit=...&nopretty' > skipper_routes.eskip
curl 'http://127.0.0.1:9090/routes' > routesrv_routes.eskip
git diff --no-index -- skipper_routes.eskip routesrv_routes.eskip
Operators confirm git diff is empty cluster by cluster
before advancing.
- Exec — Skipper fetches its routing table from
routesrv. Direct-API polling is disabled. Production now
depends on routesrv.
Clusters advanced False → Pre → Exec tier by tier (test
clusters first, bake 2 weeks; then production tiers). Result:
zero downtime, zero GMV loss.
Why Pre is the non-negotiable middle step¶
Shadow mode is what de-risks the Exec cutover. It lets the team
treat route-table equivalence ("the new component produces
exactly the same output as the old") as a checked precondition
to cutover rather than a hope. When the diff is non-empty, the
discrepancy can be diagnosed without any production impact.
From the post:
Remember, if our routing table is broken for some reason, we will have a downtime. That's why we had to be extra cautious and check any small difference in the routing table across all clusters.
A two-mode (off/on) rollout has no safe place to catch "works in test, breaks in prod" divergences.
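The "checked precondition" idea can be sketched as a small gate that refuses promotion unless the two outputs are identical. This is only an illustration: it uses Python's `difflib` in place of the `git diff --no-index` shown earlier, and the function names `route_diff` and `may_promote_to_exec` are hypothetical.

```python
import difflib


def route_diff(old_dump: str, new_dump: str) -> list[str]:
    """Unified diff between the old path's output and the shadow output."""
    return list(
        difflib.unified_diff(
            old_dump.splitlines(),
            new_dump.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


def may_promote_to_exec(old_dump: str, new_dump: str) -> bool:
    """True only when the diff is empty, i.e. the outputs are identical."""
    return not route_diff(old_dump, new_dump)
```

When the gate returns false, the diff lines themselves are what an operator would inspect while production keeps running on the old path.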
Shape requirements for the new component¶
- Pure / idempotent "compute" step that can run in parallel. Route Server's polling + parsing produces a routing table; it doesn't mutate upstream state. If the new component has side effects (writes to a shared store, emits externally visible events), shadow mode is harder and may require pointing the side effects at a test sink.
- Observable output — the output has to be accessible for diffing. HTTP endpoints, metrics, log fingerprints, or written files all work. Route Server's /routes endpoint is the canonical shape.
- Deterministic enough for diff — pure inputs to pure outputs given the same snapshot. If the new component is non-deterministic (timestamps, generation counters, hash salts) the diff needs canonicalisation.
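One possible shape for the canonicalisation step is sketched below, assuming the component's output can be parsed into a list of records. The field names in `VOLATILE_KEYS` are invented for illustration; a real component would have its own set of non-deterministic fields.

```python
import json

# Hypothetical names for fields that legitimately differ between runs.
VOLATILE_KEYS = {"generated_at", "generation"}


def canonicalise(records: list[dict]) -> list[str]:
    """Drop volatile fields, serialise with sorted keys, and sort the
    lines so that ordering and timestamps cannot cause a spurious diff."""
    cleaned = [
        {k: v for k, v in record.items() if k not in VOLATILE_KEYS}
        for record in records
    ]
    return sorted(json.dumps(record, sort_keys=True) for record in cleaned)
```

Two outputs are then compared by diffing their canonical forms, so that only differences in the fields that matter remain visible.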
Operational properties¶
- Each mode transition is a config-file change, not a code deploy. Promotion cadence is decoupled from the deploy pipeline.
- Rollback is to the previous mode (Exec → Pre → False) — and because False still has all the old paths intact, the rollback is genuinely safe.
- Pairs with tiered cluster rollout: promote mode per cluster category (test → staging → prod tier-1 → prod tier-2 → …) with a bake period at each step.
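The one-step-at-a-time transitions and the tier ordering can be sketched as follows. The tier names are illustrative, the helper names are hypothetical, and the bake period is deliberately not modelled; this only shows the ordering logic.

```python
from typing import Optional

MODES = ["False", "Pre", "Exec"]  # promotion order from the text
# Illustrative tier names, in rollout order.
TIERS = ["test", "staging", "prod-tier-1", "prod-tier-2"]


def promote(mode: str) -> str:
    """Advance one step; Exec is the last position."""
    i = MODES.index(mode)
    return MODES[min(i + 1, len(MODES) - 1)]


def roll_back(mode: str) -> str:
    """Step back to the previous mode; False is the first position."""
    i = MODES.index(mode)
    return MODES[max(i - 1, 0)]


def next_tier_to_promote(state: dict[str, str], target: str) -> Optional[str]:
    """First tier, in rollout order, that has not yet reached `target`.
    Returns None when every tier is already there."""
    for tier in TIERS:
        if MODES.index(state[tier]) < MODES.index(target):
            return tier
    return None
```

Since the mode is plain configuration, a rollback in this model is just writing the previous value back for the affected tier.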
When to apply¶
- Component sits on the critical path (control plane, proxy, router, auth, routing decisions).
- Output is observable and comparable to the existing system's output.
- Stakes of silent divergence are high (revenue, SLO, correctness).
- A two-mode dark-launch or canary can't easily distinguish "works" from "works the same".
When not to apply¶
- Component output is not comparable (e.g. a new user-facing UI, or performance-only changes where the diff is a latency distribution, not a string diff).
- Running two copies in parallel is economically infeasible (very large stateful workloads).
Related¶
- patterns/shadow-mode-alert-before-paging — running alerts in shadow mode before wiring them to pagers; same shape applied to monitoring.
- concepts/shadow-mode-alert-validation — the concept behind shadow-mode observation in general.
- patterns/control-plane-proxy-with-etag-cache — the pattern this rollout shape was used to ship.
- systems/zalando-route-server — canonical instance.