PATTERN Cited by 2 sources

Snapshot plus catch-up replication

Problem

Copying a live database's data to another system without taking the source offline has a fundamental tension:

  • A consistent copy requires a fixed point-in-time view of the source — normally achieved by blocking writes.
  • Not taking the source offline requires not blocking writes — which means data can change during the copy.

The naive solutions don't work:

  • Serialise everything on the source for hours — the source is effectively offline.
  • Copy rows one-by-one and hope — the destination contains rows from N different logical times; any cross-row invariant (foreign keys, derived indexes, ACID observations taken during the copy) is broken.
  • Retry from scratch on any write — does not converge.

Solution

Capture a consistent non-locking snapshot plus the replication position that corresponds to it, copy the snapshot's rows to the destination, and tail the source's change log from the captured position to catch up with any writes that happened during (and since) the copy.

The pattern has three phases:

1. Snapshot + position capture

Atomically open a snapshot and record the replication position that corresponds to it. The source remains fully available for writes during this step (or is briefly read-locked for milliseconds depending on the engine).

  • MySQL — LOCK TABLES <tbl> READ + START TRANSACTION WITH CONSISTENT SNAPSHOT + SELECT @@global.GTID_EXECUTED + UNLOCK TABLES. Captured state: consistent snapshot + GTID set.
  • PostgreSQL — BEGIN + SET TRANSACTION ISOLATION LEVEL REPEATABLE READ + pg_export_snapshot() + pg_current_wal_lsn() or a replication slot's LSN. Captured state: snapshot token + WAL LSN.
  • MongoDB — causal-consistency session + oplog timestamp.
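The invariant this step must preserve — the snapshot and the replication position refer to the same logical instant — can be illustrated with a toy in-memory model. The `Source` class, the dict-copy "snapshot", and the integer log position below are illustrative stand-ins for an engine's MVCC snapshot and GTID set / LSN, not any real API:

```python
import threading

class Source:
    """Toy in-memory source: a key-value table plus an append-only change log."""

    def __init__(self):
        self._lock = threading.Lock()
        self.table = {}
        self.log = []  # list of (key, value) change events

    def write(self, key, value):
        with self._lock:
            self.table[key] = value
            self.log.append((key, value))

    def capture(self):
        """Phase 1: atomically pair a point-in-time view with its log position.

        A real engine does this without copying (MVCC snapshot + GTID/LSN);
        the point is that both are taken under the same brief critical
        section, so they describe the same logical instant.
        """
        with self._lock:
            snapshot = dict(self.table)  # stands in for the MVCC snapshot
            position = len(self.log)     # stands in for the GTID set / LSN
        return snapshot, position
```

Writes that land after `capture()` returns are absent from the snapshot but present in the log past the captured position — exactly the rows the catch-up phase will replay.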

2. Row copy

Stream rows from the snapshot to the destination. Common optimisations:

  • Order rows by primary key so the source reads off the clustered index without a filesort.
  • Parallelise across tables (one stream per table or per destination shard).
  • Checkpoint per-key progress in durable storage so the copy is resumable from any interruption.
  • Interleave row-copy cycles with change-log catch-up cycles (Vitess vreplication_copy_phase_duration) to stay inside the source's binlog / WAL retention horizon — a hard constraint for long copies.
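A minimal sketch of the checkpointed, primary-key-ordered copy, assuming an in-memory snapshot and a one-slot dict standing in for durable checkpoint storage (Vitess persists the equivalent in the copy_state sidecar table); `fail_after` is a hypothetical hook used only to simulate an interruption:

```python
def copy_rows(snapshot, dest, checkpoint, batch=2, fail_after=None):
    """Phase 2 sketch: PK-ordered, checkpointed, resumable row copy.

    On restart we resume strictly after the last checkpointed key
    instead of starting over; re-copying the few rows past the last
    checkpoint is safe because the writes are idempotent upserts.
    """
    last = checkpoint.get("last_key")
    keys = [k for k in sorted(snapshot) if last is None or k > last]
    copied = 0
    for k in keys:
        if fail_after is not None and copied == fail_after:
            raise RuntimeError("simulated interruption")
        dest[k] = snapshot[k]
        copied += 1
        if copied % batch == 0:
            checkpoint["last_key"] = k  # persist progress once per batch
    if keys:
        checkpoint["last_key"] = keys[-1]
```

Iterating in key order mirrors reading the source's clustered index sequentially; the per-batch checkpoint is what makes the copy resumable from any interruption.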

3. Change-log catch-up

Once the row copy completes, tail the source's change log from the captured position to bring the destination current. This phase then runs continuously until cutover.

  • MySQL — COM_BINLOG_DUMP_GTID with the captured GTID set; stream binlog events from that position forward.
  • PostgreSQL — logical replication slot consumer; stream WAL-derived change-set from the captured LSN.
  • MongoDB — tail the oplog from the captured timestamp.
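The catch-up phase reduces to replaying change events past the captured position onto rows copied from the snapshot. A toy sketch, with a plain list of events standing in for the binlog / WAL / oplog stream:

```python
def catch_up(source_log, dest, position):
    """Phase 3 sketch: replay change-log events from the captured position.

    Real engines stream this (binlog dump / logical slot / oplog tail);
    here we apply everything past `position` and return the new position,
    which would be persisted so the tail itself is resumable.
    """
    for key, value in source_log[position:]:
        dest[key] = value  # idempotent upsert: safe to re-apply
    return len(source_log)
```

Because the position was captured atomically with the snapshot, replaying from it neither misses nor double-counts a logical change: every write is either in the snapshot or in the log past the position, never both in a conflicting way.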

Canonical wiki instance

Vitess VReplication — used under the hood by MoveTables for all PlanetScale customer migrations. Each per-target-shard stream:

  1. Takes a consistent non-locking snapshot of each table in sequence, records the source GTID.
  2. Streams rows out of the snapshot ordered by primary key, filtered to the destination shard per the sharding scheme, persisting per-key progress in the copy_state sidecar table.
  3. Interleaves copy with COM_BINLOG_DUMP_GTID catch-up cycles to stay inside binlog retention.
  4. On copy completion, transitions to continuous replication from the captured GTID, persisting the advancing GTID on every commit.

(Source: sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale.)

The post explicitly warns against skipping the catch-up interleave: "this regular catchup step is important to ensure that we don't complete the row copy only to then be unable to replicate from where we left off because the source MySQL instance no longer has binary log events that we need, as they have been purged, in which case we would be forced to start the entire migration over again."

Composes with

Seen in

  • sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale — canonical wiki instance of the pattern at petabyte scale, on MySQL, via Vitess VReplication. The full choreography is documented — LOCK TABLES READ briefly, consistent-snapshot, GTID capture, unlock, PK-ordered copy, per-stream destination shard filter, copy/catch-up interleave, post-copy continuous-replication via COM_BINLOG_DUMP_GTID with persisted GTID progress. Matt Lord's framing is explicit that this is the load-bearing pattern for any zero-downtime data motion at scale.

  • sources/2026-04-21-planetscale-behind-the-scenes-how-schema-reverts-work — the within-table, online-DDL application of the same pattern. Guevara + Noach walk the forward direction (consistent-snapshot batches + GTID capture + binlog catch-up applied to a shadow table being schema-evolved) at table scope. The post makes explicit that the snapshot-plus-catch-up shape applies regardless of whether the "copy" is moving data between keyspaces (the 2026-02-16 case) or between tables in the same keyspace (the online-DDL case). Extends the pattern family with the PlanetScale-specific observation that the snapshot-plus-catch-up stream does not have to terminate at cut-over — patterns/instant-schema-revert-via-inverse-replication re-uses the stream, now running in reverse, to keep a second table (the inverse shadow) continuously in sync, so a schema revert is a second swap of two tables that are already caught up.
