
PATTERN

Shadow migration (dual-run reconciliation)

Shadow migration (a.k.a. dual-run with reconciliation) is the pattern of running the new engine in parallel with the old: both receive the same inputs, both produce outputs, and the results are reconciled before any consumer sees the new engine's output. It is the canonical risk-reduction pattern for migrating business-critical data pipelines between engines.

Shape

  1. Manual shadow — pick a small representative subset; run the new engine against the same inputs as the old. Compare outputs.
  2. Automated shadow — wire the new engine into production alongside the old; every job runs in both engines simultaneously. Compare outputs continuously. Keep the old engine's output authoritative throughout.
  3. Reconciliation — two layers:
     • Dataset-level statistical equivalence — record counts, cardinalities, min/max/avg/distribution per column. Good for "did we lose any rows? did any column's distribution shift?"
     • Real-query reconciliation — run actual consumer queries (across multiple consumer engines, not just one) against both outputs and compare results. Good for "does a real reader see a difference?"
  4. Subscriber switchover — once the shadow agrees and production confidence is earned, move consumers off the old engine one at a time, with the freedom to reverse course per consumer (patterns/subscriber-switchover).
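The control flow of the automated-shadow and reconciliation steps can be sketched as a harness that serves only the old engine's output while logging disagreements. This is a minimal illustration, not the source's implementation; all names here are hypothetical:

```python
def shadow_run(inputs, old_engine, new_engine, reconcile, on_mismatch):
    """Run both engines on the same inputs; the old engine stays authoritative."""
    old_out = old_engine(inputs)
    new_out = new_engine(inputs)  # shadow run: its output is never served
    if not reconcile(old_out, new_out):
        on_mismatch(inputs, old_out, new_out)  # record for the migration team
    return old_out  # consumers only ever see the old engine's output

# Toy usage: two "engines" that disagree, so the mismatch is recorded
# but the consumer still gets the old engine's answer.
mismatches = []
result = shadow_run(
    [1, 2, 3],
    old_engine=lambda xs: sum(xs),
    new_engine=lambda xs: sum(xs) + 1,  # deliberately wrong
    reconcile=lambda a, b: a == b,
    on_mismatch=lambda i, a, b: mismatches.append((i, a, b)),
)
# result == 6 (old engine's answer); mismatches holds one entry
```

The key property is that `reconcile` failing changes nothing for consumers during the shadow phase; it only feeds the evidence that later justifies subscriber switchover.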

Cost side

  • You temporarily double the compute cost of the pipeline during the shadow phase. This is a real operational expense that must be budgeted, not wished away.
  • Amazon BDT: "If everything crashed and burned at this point, it would become a capital loss for the business and filed away as a hard lesson learned."

Why byte-for-byte is the wrong bar

At TB/PB scale, queries run on different compute frameworks against the same data almost never produce byte-equal results:

  • Decimal rounding differences
  • Non-deterministic execution plans (unstable sort)
  • Adding/removing metadata in Parquet files
  • Different value-overflow/underflow handling
  • Type-comparison opinions: -0 == 0? timezone-aware vs -naive timestamp equality?
  • Pre-Gregorian calendar date interpretation drift
  • Leap-second handling
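The -0 vs 0 item is easy to demonstrate even inside a single runtime: the two values compare equal, yet their IEEE-754 encodings are not byte-identical, which is exactly the gap between value equivalence and byte equivalence:

```python
import struct

# Value-equal under numeric comparison...
assert -0.0 == 0.0
# ...but not byte-equal: the sign bit differs in the IEEE-754 encoding.
assert struct.pack("<d", -0.0) != struct.pack("<d", 0.0)
```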

Trying to enforce byte-for-byte equivalence is a false bar that will burn out the migration team chasing non-bugs. Statistical plus real-query equivalence is the pragmatic bar that survives real cross-framework comparison.
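A minimal sketch of the dataset-level statistical check: exact match on row count and cardinality, tolerant match on numeric summary stats. The stats and tolerance here are chosen for illustration; a real DQ service would also compare per-column distributions:

```python
import math

def column_stats(rows, col):
    """Summary statistics for one column of a list-of-dicts dataset."""
    vals = [r[col] for r in rows if r[col] is not None]
    return {
        "count": len(vals),
        "distinct": len(set(vals)),
        "min": min(vals),
        "max": max(vals),
        "avg": sum(vals) / len(vals),
    }

def statistically_equivalent(old_rows, new_rows, col, rel_tol=1e-9):
    """Counts and cardinality must match exactly; numeric stats within tolerance."""
    a, b = column_stats(old_rows, col), column_stats(new_rows, col)
    if (a["count"], a["distinct"]) != (b["count"], b["distinct"]):
        return False
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol) for k in ("min", "max", "avg"))

old = [{"amt": 1.0}, {"amt": 2.0}, {"amt": 3.0}]
drifted = [{"amt": 1.0}, {"amt": 2.0}, {"amt": 3.0 + 1e-12}]  # rounding-level noise
truncated = old[:2]                                           # a lost row

ok = statistically_equivalent(old, drifted, "amt")     # True: within tolerance
bad = statistically_equivalent(old, truncated, "amt")  # False: row count differs
```

Note the asymmetry: rounding-level drift passes (it is a non-bug under the byte-for-byte argument above), while a lost row fails immediately.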

Amazon BDT's instantiation

On the Spark → Ray migration:

  • DQ Service (Ray-based) for dataset-level statistical equivalence — counts, cardinalities, min/max/avg, Parquet feature parity (e.g. Bloom filters present iff both sides produced them).
  • Data Reconciliation Service for real-query equivalence — ran queries through systems/amazon-redshift, systems/apache-spark, and systems/amazon-athena against both outputs.
  • Progression: manual shadow (2022) → automated 1:1 shadow (2023) → subscriber switchover (2024).

(Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)
