CONCEPT Cited by 1 source
Shadow cluster
Definition
A shadow cluster is a parallel cluster running a candidate release of a system, which receives a mirror of production traffic so that its behaviour can be compared against the current production release before the candidate is promoted.
It is a sibling of the canary. A canary serves a slice of real production traffic and returns its results to users; a shadow receives a copy of production traffic but never returns its results to users. The two catch different failure modes:
- Canary catches regressions that appear quickly at any traffic share.
- Shadow catches regressions that appear only at production scale, or only on long-running workloads that a canary cluster does not exercise.
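The routing difference between the two can be sketched in a few lines. This is a minimal illustration, not any particular system's router: `Handler`, `route_canary`, `route_shadow`, and the `observations` list are hypothetical names, and a real mirror would normally send the shadow copy asynchronously rather than inline.

```python
import random
from typing import Callable

# Hypothetical handler type: takes a request string, returns a response string.
Handler = Callable[[str], str]


def route_canary(request: str, prod: Handler, canary: Handler,
                 canary_share: float, rng: random.Random) -> str:
    """Canary: a fraction of requests is served by the candidate,
    and the user receives the candidate's answer."""
    if rng.random() < canary_share:
        return canary(request)
    return prod(request)


def route_shadow(request: str, prod: Handler, shadow: Handler,
                 observations: list) -> str:
    """Shadow: production serves every request; a copy also runs on the
    candidate, whose answer is recorded for comparison and never returned."""
    prod_response = prod(request)
    try:
        shadow_response = shadow(request)
    except Exception as exc:
        # A failure on the shadow side must not affect the user.
        shadow_response = exc
    observations.append((request, prod_response, shadow_response))
    return prod_response
```

With the canary, a bad candidate is visible to the users in its slice; with the shadow, a bad candidate only shows up in `observations`.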
The Meta Presto example
sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale describes Meta's use of a Shadow Presto cluster specifically to catch post-compilation regressions on long-running queries:
- New Presto builds first go to a Canary tier, which catches the majority of correctness / performance issues.
- For long-running queries "where performance/correctness regressions can only be determined after a lot of work is done", a Shadow Presto cluster runs alongside production.
- Production queries are mirrored to the Shadow cluster; the Shadow cluster runs the candidate release.
- Results produced by Shadow are compared to results from production for correctness.
- Performance counters and resource usage are compared as well.
Only when both Canary and Shadow signals are green does the candidate release graduate to the general fleet.
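The graduation rule above amounts to a conjunction of two signals. The sketch below shows one way such a gate could be expressed; it is an assumption-laden illustration, not Meta's implementation. `ShadowRun`, the CPU-seconds counter, and the 10% regression budget (`max_cpu_ratio`) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ShadowRun:
    """One mirrored query as observed on production and on the shadow cluster."""
    query_id: str
    results_match: bool        # outcome of the correctness comparison
    prod_cpu_seconds: float    # hypothetical performance counter
    shadow_cpu_seconds: float


def shadow_is_green(runs: list[ShadowRun], max_cpu_ratio: float = 1.10) -> bool:
    """Green only if every result matched and total shadow CPU stays
    within the (assumed) budget relative to production."""
    if not runs:
        return False  # no evidence is not a pass
    if any(not run.results_match for run in runs):
        return False
    prod_cpu = sum(run.prod_cpu_seconds for run in runs)
    shadow_cpu = sum(run.shadow_cpu_seconds for run in runs)
    return shadow_cpu <= prod_cpu * max_cpu_ratio


def can_graduate(canary_green: bool, runs: list[ShadowRun]) -> bool:
    """The candidate graduates only when both signals are green."""
    return canary_green and shadow_is_green(runs)
```

Treating an empty set of shadow runs as "not green" is a deliberate choice in this sketch: a candidate that was never exercised has not been validated.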
Trade-offs
- Cost. A shadow cluster doubles the compute for whatever workload it mirrors. To contain the cost, the mirror can be narrowed to a representative sample of long-running queries.
- Side-effect safety. Shadow must not write to the same externally visible state as production. Read-only SELECT workloads on a query engine like Presto make this easier than, e.g., OLTP databases; statements that write (such as INSERT or CREATE TABLE AS) have to be excluded from the mirror or redirected.
- Result comparison is hard. Non-determinism (ordering, floating point, time-dependent predicates) forces the validator to compare semantically rather than byte-for-byte.
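As an illustration of semantic rather than byte-for-byte comparison, the sketch below ignores row order unless the query guarantees one and compares floating-point cells with a tolerance. It is a heuristic under stated assumptions: `results_match`, `rel_tol`, and `digits` are hypothetical names and defaults, rows are assumed to be tuples of simple values, and time-dependent predicates are not addressed at all.

```python
import math


def _sort_key(row, digits):
    """Order rows stably across clusters: floats are rounded so that tiny
    numeric noise does not change the sort position (a heuristic; values
    that straddle a rounding boundary can still sort differently)."""
    key = []
    for cell in row:
        if isinstance(cell, float):
            cell = round(cell, digits)
        # Pair each cell with its type name so mixed types (e.g. None) sort safely.
        key.append((type(cell).__name__, repr(cell)))
    return tuple(key)


def _cells_equal(a, b, rel_tol):
    """Floats compare within a relative tolerance; everything else exactly."""
    if isinstance(a, float) and isinstance(b, float):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


def results_match(prod_rows, shadow_rows, ordered=False,
                  rel_tol=1e-9, digits=6):
    """Compare two result sets semantically.

    ordered=False treats the results as a multiset of rows, which is the
    right reading when the query has no ORDER BY.
    """
    if len(prod_rows) != len(shadow_rows):
        return False
    if not ordered:
        prod_rows = sorted(prod_rows, key=lambda r: _sort_key(r, digits))
        shadow_rows = sorted(shadow_rows, key=lambda r: _sort_key(r, digits))
    return all(
        len(p) == len(s)
        and all(_cells_equal(a, b, rel_tol) for a, b in zip(p, s))
        for p, s in zip(prod_rows, shadow_rows)
    )


# Same rows, different order, and a float that differs only by rounding noise.
prod = [("us", 0.1 + 0.2), ("eu", 1.5)]
shadow = [("eu", 1.5), ("us", 0.3)]
print(results_match(prod, shadow))
```

A byte-for-byte diff would flag this pair as a mismatch twice over (row order and the last digits of the float), even though the two clusters agree.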
Seen in
- sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale — Shadow Presto cluster for long-running query validation.
Related
- patterns/canary-and-shadow-cluster-rollout — the combined pattern.
- patterns/staged-rollout — the broader family.
- patterns/shadow-migration — migration-time sibling of the same shadow traffic idea.
- concepts/blast-radius — shadow + canary bound blast radius at release time.