CONCEPT Cited by 1 source

Reverse shadow phase

Definition

The reverse shadow phase is the phase of a Shadow → Reverse Shadow → Cleanup migration in which the roles of the two jobs swap: the previously-shadow job is promoted to write the production table; the previously-production job is demoted to write the shadow table.

"In this phase, the shadow job's data was written to the production table, effectively making the shadow job the new production job. Meanwhile, the production job's data was written to the shadow table, so the original production job then acted as the shadow job." — Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale

Two structural benefits of swap-not-side-by-side

"This approach provided two key benefits. First, we could still get ongoing data-quality signals after rollout by continuing to compare outputs from the two systems. Second, we could roll back fast if discrepancies were detected, without needing to recreate or reconfigure the old system job." — Source

Each is structurally important:

  1. Continuous post-rollout signal. Cutting the old job entirely at rollout time would mean any post-rollout issue is detected only through consumer-side signals, which are slow and noisy and erode confidence in the migration. Keeping the old job running but redirected to the shadow table preserves the side-by-side comparison primitive that the shadow phase relied on.
  2. Same-shape rollback. The old job is still alive: already running, already configured, already tested. Rollback means swapping the writers back, not rebuilding the old job from scratch. Because the rollback target is hot, not cold, rollback takes only as long as the table reassignment rather than a rebuild and redeploy.
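The swap and its rollback can be sketched as a single reassignment operation. This is a minimal illustration, not Meta's implementation: the `WriterAssignment` class and the job names `old_job` and `new_job` are hypothetical, and the source does not describe how table assignment is actually configured.

```python
from dataclasses import dataclass


@dataclass
class WriterAssignment:
    """Which job writes which table. Job names are hypothetical."""

    production_writer: str
    shadow_writer: str

    def swap(self) -> None:
        # Both jobs keep running; only their target tables change.
        self.production_writer, self.shadow_writer = (
            self.shadow_writer,
            self.production_writer,
        )


# Shadow phase: the old job serves production, the new job writes the shadow table.
assignment = WriterAssignment(production_writer="old_job", shadow_writer="new_job")

assignment.swap()  # enter reverse shadow: the new job now writes production
promoted = assignment.production_writer

assignment.swap()  # rollback: the same operation applied again
rolled_back = assignment.production_writer
```

Promotion and rollback are the same operation here, which is the point of the pattern: neither direction requires creating or reconfiguring a job.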

The bidirectional-backfill verification trick

During reverse shadow, Meta also runs backfill on both jobs to get early signals before consumer impact:

"To get the early signals, we triggered backfill on both production and shadow jobs. If the backfill results still matched it indicated the migration is successful. If the result did not match, the job would be rolled back immediately and data consumers would not be impacted."

This is a synthetic-load correctness test running in parallel to live traffic — both jobs replay history; both should produce identical output.
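The backfill comparison can be sketched as a gate that returns a keep-or-rollback decision. The source does not say how Meta compares the two backfill outputs, so everything here is an assumption for illustration: the function names, the partition-keyed dictionaries, and the order-insensitive SHA-256 digest are all hypothetical choices.

```python
import hashlib
from typing import Iterable


def partition_digest(rows: Iterable[tuple]) -> str:
    """Order-insensitive digest of one partition's rows (assumed comparison method)."""
    h = hashlib.sha256()
    for encoded in sorted(repr(row).encode("utf-8") for row in rows):
        h.update(encoded)
        h.update(b"\n")
    return h.hexdigest()


def backfill_verdict(
    prod_backfill: dict[str, list[tuple]],
    shadow_backfill: dict[str, list[tuple]],
) -> str:
    """Return "keep" if every backfilled partition matches, else "rollback"."""
    if prod_backfill.keys() != shadow_backfill.keys():
        return "rollback"
    for partition, rows in prod_backfill.items():
        if partition_digest(rows) != partition_digest(shadow_backfill[partition]):
            return "rollback"
    return "keep"


# Hypothetical backfill outputs keyed by partition.
new_job_out = {"2026-01-01": [(1, "a"), (2, "b")]}
old_job_out = {"2026-01-01": [(2, "b"), (1, "a")]}  # same rows, different order
old_job_bad = {"2026-01-01": [(1, "a"), (2, "c")]}  # one diverging row

matching = backfill_verdict(new_job_out, old_job_out)
diverging = backfill_verdict(new_job_out, old_job_bad)
```

With these inputs, `matching` is `"keep"` and `diverging` is `"rollback"`. Sorting the encoded rows before hashing makes the comparison independent of write order while still catching duplicated or missing rows.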

Comparison with adjacent patterns

  • vs canary deployment: canary exposes new code to a fraction of traffic; reverse shadow exposes new code to 100% of source-side ingest but keeps the old code running on the shadow output as a continuous correctness check.
  • vs classic rollback to previous binary: classic rollback requires the previous binary to be redeployed before traffic flips back; reverse shadow's rollback path is already deployed and running — only the table-assignment changes.
  • vs dual-write to two sinks: dual-write has one writer writing to both sinks; reverse shadow has two separate writers each writing to one sink, with the sink-assignment swapping per phase. Different fault model: if the writer crashes in dual-write, both sinks lose updates; if a writer crashes in reverse shadow, only its assigned sink loses updates.
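The fault-model difference in the last bullet can be shown with a toy simulation. It is a deliberately simplified sketch: the functions and the `crash_at` parameter are hypothetical, sinks are plain lists, and a crash is modeled as the writer stopping cleanly before a given update (partial writes between the two sinks are ignored).

```python
def dual_write(updates, crash_at=None):
    """One writer feeding two sinks; a crash stalls both."""
    sink_a, sink_b = [], []
    for i, update in enumerate(updates):
        if crash_at is not None and i >= crash_at:
            break  # the single writer is down: neither sink advances
        sink_a.append(update)
        sink_b.append(update)
    return sink_a, sink_b


def run_writer(updates, crash_at=None):
    """One independent writer feeding its own sink."""
    sink = []
    for i, update in enumerate(updates):
        if crash_at is not None and i >= crash_at:
            break  # only this writer's sink stops advancing
        sink.append(update)
    return sink


updates = ["u0", "u1", "u2", "u3"]

# Dual-write: the shared writer crashes before u2, so both sinks miss u2 and u3.
dual_a, dual_b = dual_write(updates, crash_at=2)

# Reverse shadow: the production writer crashes, the shadow writer does not.
prod_sink = run_writer(updates, crash_at=2)
shadow_sink = run_writer(updates)
```

In the two-writer case the surviving sink still holds all four updates, so a complete copy of the output remains available for comparison or rollback.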

Containment of CDC bad-data propagation

Reverse shadow, backfill on both jobs, and partition-level quality marking together bound how far a data-quality bug introduced at rollout can propagate. A divergence shows up in both the live comparison and the backfill-vs-backfill comparison; the bad partition is marked, so downstream propagation stops; and the system rolls back to the old job, which is still running against the shadow table, without consumer impact.
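Partition-level quality marking can be sketched as follows. The source names the idea but does not describe its mechanics, so this is an assumed shape only: the `Quality` enum, both functions, and the row-equality check are hypothetical.

```python
from enum import Enum


class Quality(Enum):
    GOOD = "good"
    BAD = "bad"


def mark_partitions(prod, shadow):
    """Mark each production partition by whether the shadow table agrees with it."""
    marks = {}
    for partition, rows in prod.items():
        matches = partition in shadow and sorted(rows) == sorted(shadow[partition])
        marks[partition] = Quality.GOOD if matches else Quality.BAD
    return marks


def readable_partitions(marks):
    """Partitions a downstream consumer would be allowed to read."""
    return sorted(p for p, q in marks.items() if q is Quality.GOOD)


# Hypothetical table contents keyed by partition.
prod_table = {
    "2026-01-01": [(1, "a"), (2, "b")],
    "2026-01-02": [(3, "c"), (4, "x")],  # diverges from the shadow table
}
shadow_table = {
    "2026-01-01": [(2, "b"), (1, "a")],
    "2026-01-02": [(3, "c"), (4, "d")],
}

marks = mark_partitions(prod_table, shadow_table)
safe = readable_partitions(marks)
```

Here only `2026-01-01` remains readable. Marking at partition granularity keeps a divergence confined to the affected partition instead of invalidating the whole table.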

Seen in
