
PATTERN Cited by 1 source

Shadow-then-reverse-shadow migration

Definition

The shadow-then-reverse-shadow migration pattern is a three-phase migration shape for two parallel implementations of the same CDC (change data capture) pipeline (legacy + new), where the production-table writer swaps between the two systems mid-migration:

  1. Shadow phase — new system runs in pre-production, writes to a separate shadow table; old system continues to write the production table.
  2. Reverse shadow phase — writes swap: new system writes the production table; old system, still running, writes the shadow table.
  3. Cleanup phase — old system (writing the shadow table) is removed.
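
The three phases above can be sketched as a small state machine. This is a minimal illustration, not the source's implementation: the names `Phase`, `Assignment`, `ASSIGNMENTS`, and `promote` are hypothetical, and the labels "old" and "new" stand in for the legacy and replacement systems.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    SHADOW = "shadow"
    REVERSE_SHADOW = "reverse_shadow"
    CLEANUP = "cleanup"


@dataclass(frozen=True)
class Assignment:
    production_writer: str         # system writing the production table
    shadow_writer: Optional[str]   # system writing the shadow table, if any


# Writer-to-table assignment per phase, following the list above.
ASSIGNMENTS = {
    # New system writes only the shadow table; old system stays authoritative.
    Phase.SHADOW: Assignment(production_writer="old", shadow_writer="new"),
    # Writers swap: new system is authoritative, old system keeps running.
    Phase.REVERSE_SHADOW: Assignment(production_writer="new", shadow_writer="old"),
    # Old system (and with it the shadow writer) is removed.
    Phase.CLEANUP: Assignment(production_writer="new", shadow_writer=None),
}

_ORDER = [Phase.SHADOW, Phase.REVERSE_SHADOW, Phase.CLEANUP]


def promote(phase: Phase) -> Phase:
    """Advance to the next phase; CLEANUP is terminal."""
    i = _ORDER.index(phase)
    if i == len(_ORDER) - 1:
        raise ValueError("migration already complete")
    return _ORDER[i + 1]
```

Note that the production table always has exactly one writer; only the identity of that writer changes between phases.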

"In the first step of the lifecycle we set up shadow jobs in the pre-production environment to be delivered via the new system."

"Once the production job and the shadow job were running reliably in the production environment, we began the reverse shadow phase. In this phase, the shadow job's data was written to the production table, effectively making the shadow job the new production job. Meanwhile, the production job's data was written to the shadow table, so the original production job then acted as the shadow job."

"If no discrepancies were detected, the shadow job, now running on the old system, was removed. The new system then took over and continued delivering data through the production job, marking the completion of the migration."

Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale

Why this shape

Two structural properties make this pattern a good fit for CDC-system migrations:

  1. Continuous post-rollout signal. After the swap, the old system keeps running on the shadow table — providing a continuous comparison reference against the now-authoritative new system. Any divergence is detected via row-count + checksum comparison without consumer impact.
  2. Hot rollback substrate. "We could roll back fast if discrepancies were detected, without needing to recreate or reconfigure the old system job." The old job is still alive, already-running, already-configured; rollback = swap the writers back, not rebuild the old job.

Together these address the CDC bad-data propagation hazard: a data-quality bug introduced at rollout shows up in the live comparison and is rollback-able without consumer disruption.
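
A minimal sketch of the comparison-and-rollback decision during the reverse shadow phase. The source states only that tables are compared by row count and checksum; how the checksum is computed is not given, so the order-independent sum of per-row SHA-256 digests used here is an assumption, as are the names `fingerprint`, `tables_match`, and `next_action` and the tuple-based row representation.

```python
import hashlib
from typing import Iterable, Tuple

Row = Tuple  # hypothetical row representation: a tuple of column values


def fingerprint(rows: Iterable[Row]) -> Tuple[int, int]:
    """Return (row_count, checksum).

    The checksum sums per-row SHA-256 digests modulo 2**256, so it does
    not depend on the order in which rows are read (an assumed scheme).
    """
    count, total = 0, 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        total = (total + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, total


def tables_match(production: Iterable[Row], shadow: Iterable[Row]) -> bool:
    """True when both row count and checksum are identical."""
    return fingerprint(production) == fingerprint(shadow)


def next_action(production: Iterable[Row], shadow: Iterable[Row]) -> str:
    """Keep going on a match; otherwise swap the writers back.

    Rolling back is cheap here because the old job is still running on
    the shadow table and needs no rebuild.
    """
    return "continue" if tables_match(production, shadow) else "roll_back"
```

For example, `next_action([(1, "a"), (2, "b")], [(2, "b"), (1, "a")])` yields `"continue"` because the same rows in a different order produce the same fingerprint, while any missing, duplicated, or altered row yields `"roll_back"`.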

Three machine-checkable promotion criteria

Each phase transition is gated by:

  1. Data-quality match — row count + checksum identical between shadow and production tables.
  2. No landing-latency regression — new system delivers data on time at minimum, ideally faster.
  3. No resource-utilisation regression — compute + storage are equal-or-better than the legacy job.

For critical-table migrations, additional service-team-negotiated criteria apply.
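
The three gates could be expressed as a single check, sketched below. Everything here is illustrative: `JobMetrics`, its fields, and `promotion_failures` are hypothetical names, and using the legacy job's own latency and resource numbers as the baseline is an assumption (the source does not say how "on time" is measured). Additional criteria for critical tables are not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JobMetrics:
    row_count: int
    checksum: str
    landing_latency_s: float  # time for data to land, in seconds
    cpu_hours: float          # compute used per run
    storage_gb: float         # storage used by the output


def promotion_failures(legacy: JobMetrics, new: JobMetrics) -> List[str]:
    """Return the list of failed criteria; an empty list means promote."""
    failures = []
    # 1. Data-quality match: row count and checksum must be identical.
    if (new.row_count, new.checksum) != (legacy.row_count, legacy.checksum):
        failures.append("data-quality mismatch")
    # 2. No landing-latency regression (baseline assumed to be the legacy job).
    if new.landing_latency_s > legacy.landing_latency_s:
        failures.append("landing-latency regression")
    # 3. No resource-utilisation regression: compute and storage equal or better.
    if new.cpu_hours > legacy.cpu_hours or new.storage_gb > legacy.storage_gb:
        failures.append("resource-utilisation regression")
    return failures
```

Returning the list of failed criteria, rather than a bare boolean, lets an automated lifecycle record why a job was held back at a phase transition.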

Distinguishing from adjacent patterns

Pattern | Production-table writer during migration | Rollback substrate
Shadow-then-reverse-shadow (this) | Swaps: old, then new | Old job still running on shadow table
Parallel run (Newman) | Stays old | Both systems running; new system never authoritative
Notion double-write | Single writer to two stores, then switchover | Reverse the dual-write direction
Dual-write migration | Single writer to two stores | Switch reads back to old store

The shape's distinguishing characteristic: two separate writers, each writing to one store, with the writer-to-store assignment swapping at phase transitions. The fault model also differs from dual-write: if the single writer crashes in dual-write, both stores miss updates; here, only the store assigned to the crashed writer misses updates.

When to use

  • Two parallel implementations of the same CDC pipeline must coexist temporarily (e.g. migration between two ingestion systems against the same source).
  • Source data is the same for both systems — the comparison primitive only works if both writers are reading the same source.
  • Bad-data propagation is a defining hazard of the pipeline (any CDC system) — the swap-and-keep-old-running shape is what gives you both ongoing signal and hot rollback.
  • Tens of thousands of jobs need to be migrated — combined with patterns/automated-job-lifecycle-promotion, this scales beyond any manually-gated migration shape.

When NOT to use

  • The two systems read from different sources — the comparison primitive doesn't apply.
  • The new system has different semantics (not a behavioural re-implementation but a new pipeline producing different data) — checksum comparison will fail by design.
  • Single-job, low-stakes migration — the overhead of running two systems in pre-production then in production is justified only when scale or stakes warrant it.
  • Strong consistency is required between the two writers. This pattern allows transient divergence between the shadow and production tables, so external consumers that read both tables and expect cross-table consistency will not get it.

Composes with

Seen in
