PATTERN Cited by 1 source

Snapshot reuse from legacy during migration¶

Definition¶

The snapshot reuse from legacy during migration pattern is a CDC-specific optimisation for migrating between two CDC systems that share a source: reuse the legacy system's most recent snapshot output as the new system's seed snapshot, bypassing the slow + expensive first full-dump that the new system would otherwise have to run from the source.

"We also built creative solutions like reusing snapshot partitions delivered by the old system as snapshot initially to reduce the full dump load." — Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale

Why this matters: the full-dump tax¶

In a tri-layer CDC schema, a CDC job's first landed state is the full-dump snapshot — a periodic full scan of the source database. Meta calls out the cost explicitly:

"Due to the system's CDC design, a new job's first snapshot was landed via a full dump, which is typically slow and expensive."

Two distinct events trigger fresh full-dumps during migration:

First snapshot of a new job (one full-dump per migrated job).
Data-quality remediation if the first snapshot has issues that get fixed (another full-dump per affected job).

At tens of thousands of jobs, multiplied across a sometimes- needed-twice cycle, the aggregate full-dump cost is a load-bearing migration overhead — both compute-time on the new system and read-load on the source MySQL.

Why reuse from legacy works¶

Both systems are CDC systems against the same source MySQL. The legacy system has just produced a snapshot — the same source state at the same wall-clock moment that the new system would otherwise re-scan. Reusing that snapshot avoids:

The new system's full-source-scan compute cost.
The source MySQL's read-load for the duplicate scan.
The wall-clock latency of the duplicate scan (full-dumps are "slow").

The new system's CDC pipeline then starts from that seed and proceeds normally — applying deltas after the snapshot's timestamp to maintain its target table.

When this pattern is applicable¶

The structural prerequisites:

Both systems are CDC against the same source — the snapshot's contents have to be valid input for the new system's pipeline.
Snapshot semantics agree — both systems must agree on what the snapshot represents (e.g. consistent point-in-time state at timestamp T). If the systems' snapshot semantics differ subtly, the reused snapshot will produce diverging target tables.
The legacy system's snapshot is recent enough to make the saved scan worthwhile — if the snapshot is stale, the new system needs more delta-replay to catch up, which may exceed the saved full-dump cost.
The new system can ingest the legacy format — even with shared semantics, the on-disk format may differ. A transformation step or a shared snapshot format is the prerequisite.

Composes with¶

patterns/known-issue-exclusion-batch-selection — same spirit: minimise wasted full-dump cost by avoiding migrating jobs while known issues would force re-runs.
patterns/shadow-then-reverse-shadow-migration — the migration shape this optimisation lives inside; the seeded shadow job in phase 1 starts from the reused snapshot rather than its own fresh full-dump.

vs SELECT … INTO for cross-database copy: this is the same idea at the database layer; this pattern is the CDC-pipeline-layer instance that respects the new system's ingestion semantics.
vs bulk-restore from backup: backup-restore is for disaster recovery to the same system; this pattern is cross-system seeding for live migration.
vs dual-write start-of-time fork: dual-write can also bypass full-dumps by treating the dual-write start as the fork-point, but only works when both writers can run from the same start time; for migrations between CDC systems with different snapshot cadences, snapshot reuse provides a single-shot seeding primitive instead.

When NOT to use¶

The two systems have incompatible snapshot semantics — reusing produces diverging target tables.
The legacy system's snapshot format is too costly to translate — the translation overhead may exceed the saved full-dump cost.
Snapshots aren't available on a useful cadence — if the legacy system snapshots only quarterly but the migration is weekly-batch-paced, every reused snapshot is too stale.

Seen in¶

sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale — Meta's data-ingestion-system migration; canonical wiki instance avoiding the new system's first full-dump on thousands of migrated jobs.

concepts/full-dump-vs-delta-vs-target — the schema where the full-dump tax exists
concepts/change-data-capture — the substrate primitive
concepts/cdc-bad-data-propagation — the hazard the tri-layer schema is designed to bound
patterns/shadow-then-reverse-shadow-migration — the migration shape this lives inside
patterns/known-issue-exclusion-batch-selection — sibling optimisation
systems/meta-data-ingestion-system — canonical wiki instance
companies/meta — company hub