
PATTERN

Shadow migration (dual-run reconciliation)

Shadow migration (a.k.a. dual-run with reconciliation) is the pattern of running the new engine in parallel with the old: both receive the same inputs, both produce outputs, and the results are reconciled before any consumer sees the new engine's output. It is the canonical risk-reduction pattern for migrating business-critical data pipelines between engines.

Shape

  1. Manual shadow — pick a small representative subset; run the new engine against the same inputs as the old. Compare outputs.
  2. Automated shadow — wire the new engine into production alongside the old; every job runs in both engines simultaneously. Compare outputs continuously. Keep the old engine's output authoritative throughout.
  3. Reconciliation — two layers:
     • Dataset-level statistical equivalence — record counts, cardinalities, min/max/avg/distribution per column. Good for "did we lose any rows? did any column's distribution shift?"
     • Real-query reconciliation — run actual consumer queries (across multiple consumer engines, not just one) against both outputs and compare results. Good for "does a real reader see a difference?"
  4. Subscriber switchover — once the shadow agrees and production confidence is earned, move consumers off the old engine one at a time, with the freedom to reverse course per consumer (patterns/subscriber-switchover).
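The control flow of the automated-shadow and reconciliation steps can be sketched as a harness that serves only the old engine's output while logging disagreements. This is a minimal illustration, not the source's implementation; all names here are hypothetical:

```python
def shadow_run(inputs, old_engine, new_engine, reconcile, on_mismatch):
    """Run both engines on the same inputs; the old engine stays authoritative."""
    old_out = old_engine(inputs)
    new_out = new_engine(inputs)  # shadow run: its output is never served
    if not reconcile(old_out, new_out):
        on_mismatch(inputs, old_out, new_out)  # record for the migration team
    return old_out  # consumers only ever see the old engine's output

# Toy usage: two "engines" that disagree, so the mismatch is recorded
# but the consumer still gets the old engine's answer.
mismatches = []
result = shadow_run(
    [1, 2, 3],
    old_engine=lambda xs: sum(xs),
    new_engine=lambda xs: sum(xs) + 1,  # deliberately wrong
    reconcile=lambda a, b: a == b,
    on_mismatch=lambda i, a, b: mismatches.append((i, a, b)),
)
# result == 6 (old engine's answer); mismatches holds one entry
```

The key property is that `reconcile` failing changes nothing for consumers during the shadow phase; it only feeds the evidence that later justifies subscriber switchover.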

Cost side

  • You temporarily double the compute cost of the pipeline during the shadow phase. This is a real operational expense that must be budgeted, not wished away.
  • Amazon BDT: "If everything crashed and burned at this point, it would become a capital loss for the business and filed away as a hard lesson learned."

Why byte-for-byte is the wrong bar

At TB/PB scale, queries run on different compute frameworks against the same data almost never produce byte-equal results:

  • Decimal rounding differences
  • Non-deterministic execution plans (unstable sort)
  • Adding/removing metadata in Parquet files
  • Different value-overflow/underflow handling
  • Type-comparison opinions: -0 == 0? timezone-aware vs -naive timestamp equality?
  • Pre-Gregorian calendar date interpretation drift
  • Leap-second handling
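The -0 vs 0 item is easy to demonstrate even inside a single runtime: the two values compare equal, yet their IEEE-754 encodings are not byte-identical, which is exactly the gap between value equivalence and byte equivalence:

```python
import struct

# Value-equal under numeric comparison...
assert -0.0 == 0.0
# ...but not byte-equal: the sign bit differs in the IEEE-754 encoding.
assert struct.pack("<d", -0.0) != struct.pack("<d", 0.0)
```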

Trying to enforce byte-for-byte equivalence is a false bar that will burn out the migration team chasing non-bugs. Statistical plus real-query equivalence is the pragmatic bar that survives real cross-framework comparison.
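A minimal sketch of the dataset-level statistical check: exact match on row count and cardinality, tolerant match on numeric summary stats. The stats and tolerance here are chosen for illustration; a real DQ service would also compare per-column distributions:

```python
import math

def column_stats(rows, col):
    """Summary statistics for one column of a list-of-dicts dataset."""
    vals = [r[col] for r in rows if r[col] is not None]
    return {
        "count": len(vals),
        "distinct": len(set(vals)),
        "min": min(vals),
        "max": max(vals),
        "avg": sum(vals) / len(vals),
    }

def statistically_equivalent(old_rows, new_rows, col, rel_tol=1e-9):
    """Counts and cardinality must match exactly; numeric stats within tolerance."""
    a, b = column_stats(old_rows, col), column_stats(new_rows, col)
    if (a["count"], a["distinct"]) != (b["count"], b["distinct"]):
        return False
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol) for k in ("min", "max", "avg"))

old = [{"amt": 1.0}, {"amt": 2.0}, {"amt": 3.0}]
drifted = [{"amt": 1.0}, {"amt": 2.0}, {"amt": 3.0 + 1e-12}]  # rounding-level noise
truncated = old[:2]                                           # a lost row

ok = statistically_equivalent(old, drifted, "amt")     # True: within tolerance
bad = statistically_equivalent(old, truncated, "amt")  # False: row count differs
```

Note the asymmetry: rounding-level drift passes (it is a non-bug under the byte-for-byte argument above), while a lost row fails immediately.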

Amazon BDT's instantiation

On the Spark → Ray migration:

  • DQ Service (Ray-based) for dataset-level statistical equivalence — counts, cardinalities, min/max/avg, Parquet feature parity (e.g. Bloom filters present iff both sides produced them).
  • Data Reconciliation Service for real-query equivalence — ran queries through systems/amazon-redshift, systems/apache-spark, and systems/amazon-athena against both outputs.
  • Progression: manual shadow (2022) → automated 1:1 shadow (2023) → subscriber switchover (2024).

(Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)
