Skip to content

PATTERN Cited by 1 source

Canary and shadow cluster rollout

Two-stage deployment pipeline for long-running fleet services: a canary tier catches correctness and performance regressions that surface quickly, and a parallel shadow cluster mirrors production traffic to catch regressions that only appear for long-running or post-compilation workloads. Both signals must be green before the new build is promoted fleet-wide.

Why two stages

A standard canary catches regressions visible in short queries or synthetic tests — compile errors, obvious latency spikes, crashes. It does not catch regressions that only surface in long-running queries (minute-scale or hour-scale), optimizer-plan changes that only bite specific query shapes, or slow memory leaks. A shadow cluster running the candidate build against mirrored production traffic catches exactly these: long-running post-compilation regressions where the canary percentile is green but a specific query family degrades.

Machinery required

  • Canary tier: a production-grade but small subset of the fleet running the candidate release. Catches most correctness / perf regressions on real traffic quickly.
  • Shadow cluster: a separate cluster running the candidate build, receiving a mirror of production queries. Results and resource usage are compared against production's; no customer traffic is affected by shadow-cluster output.
  • Promotion gate: both canary signals and shadow comparison must be green before the build goes fleet-wide.

Seen in

  • sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale — Meta's Presto release pipeline. A new Presto release first rolls into a canary tier; in parallel, a shadow Presto cluster runs alongside production and receives mirrored queries. The shadow cluster's results are compared against production's for correctness, and its performance counters and resource usage are compared for regressions. Only when both canary and shadow signals are green does Meta promote the build to the general fleet. Captures regressions that surface only in long-running queries — queries only validated post-compilation — which a canary-alone pipeline misses.
Last updated · 517 distilled / 1,221 read