
CONCEPT Cited by 1 source

Migration job lifecycle

Definition

A migration job lifecycle is the per-job state machine governing where a job is in a multi-phase migration between two parallel implementations of the same logical pipeline. Each phase is gated by explicit promotion criteria; each phase transition is reversible via demotion so that transient failures don't trap a job in a broken state.

"Our first step was to establish a clear migration job lifecycle to ensure data integrity and operational reliability throughout the process. Each job needed to be verified for correctness and had to meet defined success criteria before moving to the next step of the migration lifecycle." — Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale

Canonical phases

Meta's data-ingestion-system migration uses a three-phase lifecycle (Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale):

| Phase | What happens | Production-table writer |
| --- | --- | --- |
| Shadow Phase | New-system job runs in pre-production environment, consumes same source as production, writes to separate shadow table | Old system |
| Reverse Shadow Phase | Writes swap: new system writes production table; old system writes shadow table | New system |
| Migration Cleanup | Old-system job (now writing shadow table) is removed entirely | New system (sole writer) |
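
The writer-to-table assignment in the table above can be written down as a small lookup. This is a minimal sketch, not Meta's implementation: the names `Phase`, `WRITERS`, and `production_writer` are hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Dict, Optional


class Phase(Enum):
    """The three lifecycle phases named in the source."""
    SHADOW = "Shadow Phase"
    REVERSE_SHADOW = "Reverse Shadow Phase"
    CLEANUP = "Migration Cleanup"


# Hypothetical encoding: which system writes which table in each phase.
# None means the table has no writer (the old-system job has been removed).
WRITERS: Dict[Phase, Dict[str, Optional[str]]] = {
    Phase.SHADOW: {"production": "old", "shadow": "new"},
    Phase.REVERSE_SHADOW: {"production": "new", "shadow": "old"},
    Phase.CLEANUP: {"production": "new", "shadow": None},
}


def production_writer(phase: Phase) -> Optional[str]:
    """Return the system that is authoritative for the production table."""
    return WRITERS[phase]["production"]
```

Reading the mapping top to bottom shows the key property of the lifecycle: the production table always has exactly one writer, and only the identity of that writer changes between phases.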

Promotion criteria

Each phase transition is gated by machine-checkable criteria. The Meta lifecycle uses three (Source):

  1. No data quality issues — row count + checksum match between old and new system's outputs (see concepts/data-quality-checksum-comparison).
  2. No landing latency regression — new system delivers data on time at minimum, ideally faster.
  3. No resource utilization regression — compute + storage are equal or better than the legacy job.
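
Because all three criteria are machine-checkable, they can be expressed as a single comparison over per-run metrics. The sketch below is illustrative only: the `RunMetrics` fields, the `failed_criteria` function, and the choice to use the legacy job's latency and resource usage as the baseline are assumptions, not details given by the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunMetrics:
    """Hypothetical per-run measurements for one system's output."""
    row_count: int
    checksum: str
    landing_latency_s: float
    compute_units: float
    storage_bytes: int


def failed_criteria(old: RunMetrics, new: RunMetrics) -> List[str]:
    """Return the names of the promotion criteria the new system fails."""
    failures = []
    # 1. Data quality: row count and checksum must both match.
    if new.row_count != old.row_count or new.checksum != old.checksum:
        failures.append("data_quality")
    # 2. Landing latency: assumed baseline is the legacy job's latency.
    if new.landing_latency_s > old.landing_latency_s:
        failures.append("landing_latency")
    # 3. Resource utilization: compute and storage must be equal or better.
    if (new.compute_units > old.compute_units
            or new.storage_bytes > old.storage_bytes):
        failures.append("resource_utilization")
    return failures
```

An empty list means the job is eligible for promotion; a non-empty list names exactly what is holding it back, which is the information a job-level dashboard would need.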

For what the source calls "the critical table migration," additional team-specific criteria are negotiated on top of these three.

Demotion is structural, not exceptional

The lifecycle is bidirectional: a job that was promoted to a later phase can be automatically demoted if its criteria stop being met (e.g. a transient regression appears after promotion). This is what allows the lifecycle to be a continuous-control-loop primitive (patterns/automated-job-lifecycle-promotion) rather than a one-way state machine — without demotion, transient failures would require manual intervention to unstick.
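
A bidirectional transition can be sketched as one function that moves a job forward when its criteria hold and backward when they do not. The phase identifiers and the `next_phase` function are hypothetical. The sketch also assumes that Migration Cleanup is terminal, since the old-system job no longer exists at that point; the source does not describe how a demotion out of cleanup would work.

```python
# Hypothetical phase identifiers, in promotion order.
PHASES = ["shadow", "reverse_shadow", "cleanup"]


def next_phase(current: str, criteria_met: bool) -> str:
    """Return the phase a job should be in after one evaluation."""
    if current == "cleanup":
        # Assumption: cleanup is terminal because the old job was removed.
        return current
    i = PHASES.index(current)
    if criteria_met:
        return PHASES[i + 1]          # promote one step
    return PHASES[max(i - 1, 0)]      # demote one step, never below shadow
```

For example, a job in `reverse_shadow` that develops a transient regression falls back to `shadow`, which returns the production table to the old system; once the regression clears, the same function promotes it again without manual intervention.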

Why it scales to tens of thousands of jobs

The lifecycle's value is abstracting per-job migration risk into a universal state machine — once the criteria are stable, the gating logic is the same for every job, and the only per-job work is fixing whatever is keeping a job out of its next phase. The automated promotion loop then drives every job through the lifecycle in parallel, with the operator surface (system-level + job-level dashboards) showing only the jobs that are stuck or regressing.
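
The same gating logic applied across a whole job population might look like the loop below. It is a sketch under stated assumptions: `run_cycle`, the `check` callback, and the dictionary shapes are hypothetical, and cleanup is again treated as terminal. The second return value stands in for the operator surface, listing only jobs whose criteria failed.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical phase identifiers, in promotion order.
PHASES = ("shadow", "reverse_shadow", "cleanup")


def run_cycle(
    jobs: Dict[str, str],
    check: Callable[[str], List[str]],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Evaluate every job once.

    jobs maps job id -> current phase; check returns the failed criteria
    for a job (empty list means all criteria are met).
    Returns (updated phases, jobs needing attention with their failures).
    """
    updated: Dict[str, str] = {}
    needs_attention: Dict[str, List[str]] = {}
    for job_id, phase in jobs.items():
        if phase == "cleanup":
            updated[job_id] = phase   # assumed terminal, nothing to gate
            continue
        failures = check(job_id)
        i = PHASES.index(phase)
        if failures:
            updated[job_id] = PHASES[max(i - 1, 0)]   # demote or hold
            needs_attention[job_id] = failures
        else:
            updated[job_id] = PHASES[i + 1]           # promote
    return updated, needs_attention
```

Nothing in the loop is specific to any one job, which is the point: per-job effort is confined to whatever appears in `needs_attention`.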

  • vs parallel run: parallel run keeps the old system as authoritative throughout; this lifecycle has phases where the new system is authoritative (Reverse Shadow Phase) before the cleanup.
  • vs patterns/notion-double-write-backfill-verify-switchover: Notion's pattern uses one writer dual-writing to two stores; this lifecycle has two separate writers, each writing to one store, with the writer-to-table assignment swapping at phase transitions.
  • vs canary deployment: canary controls exposure of one binary to a fraction of traffic; migration job lifecycle controls which-system writes which-table across the entire job population.

Seen in
