Skip to content

PATTERN Cited by 3 sources

Snapshot plus catch-up replication

Problem

Copying a live database's data to another system without taking the source offline has a fundamental tension:

  • A consistent copy requires a fixed point-in-time view of the source — normally achieved by blocking writes.
  • Not taking the source offline requires not blocking writes — which means data can change during the copy.

The naive solutions don't work:

  • Serialise everything on the source for hours — the source is effectively offline.
  • Copy rows one-by-one and hope — the destination contains rows from N different logical times; any cross-row invariant (foreign keys, derived indexes, ACID observations taken during the copy) is broken.
  • Retry from scratch on any write — does not converge.

Solution

Capture a consistent non-locking snapshot plus the replication position that corresponds to it, copy the snapshot's rows to the destination, and tail the source's change log from the captured position to catch up with any writes that happened during (and since) the copy.

The pattern has three phases:

1. Snapshot + position capture

Atomically open a snapshot and record the replication position that corresponds to it. The source remains fully available for writes during this step (or is briefly read-locked for milliseconds depending on the engine).

  • MySQLLOCK TABLES <tbl> READ + START TRANSACTION WITH CONSISTENT SNAPSHOT + SELECT @@global.GTID_EXECUTED + UNLOCK TABLES. Captured state: consistent snapshot + GTID set.
  • PostgreSQLBEGIN + SET TRANSACTION ISOLATION LEVEL REPEATABLE READ + pg_export_snapshot() + pg_current_wal_lsn() or replication slot's LSN. Captured state: snapshot token
  • WAL LSN.
  • MongoDB — causal-consistency session + oplog timestamp.

2. Row copy

Stream rows from the snapshot to the destination. Common optimisations:

  • Order rows by primary key so the source reads off the clustered index without a filesort.
  • Parallelise across tables (one stream per table or per destination shard).
  • Checkpoint per-key progress in durable storage so the copy is resumable from any interruption.
  • Interleave row-copy cycles with change-log catch-up cycles (Vitess vreplication_copy_phase_duration) to stay inside the source's binlog / WAL retention horizon — a hard constraint for long copies.

3. Change-log catch-up

Once the row copy completes, tail the source's change log from the captured position to bring the destination current. This phase runs indefinitely until cutover.

  • MySQLCOM_BINLOG_DUMP_GTID with the captured GTID set; stream binlog events from that position forward.
  • PostgreSQL — logical replication slot consumer; stream WAL-derived change-set from the captured LSN.
  • MongoDB — tail the oplog from the captured timestamp.

Canonical wiki instance

Vitess VReplication — used under the hood by MoveTables for all PlanetScale customer migrations. Each per-target-shard stream:

  1. Takes a consistent non-locking snapshot of each table in sequence, records the source GTID.
  2. Streams rows out of the snapshot ordered by primary key, filtered to the destination shard per the sharding scheme, persisting per-key progress in the copy_state sidecar table.
  3. Interleaves copy with COM_BINLOG_DUMP_GTID catch-up cycles to stay inside binlog retention.
  4. On copy completion, transitions to continuous replication from the captured GTID, persisting the advancing GTID on every commit.

(Source: .)

The post explicitly warns against skipping the catch-up interleave: "this regular catchup step is important to ensure that we don't complete the row copy only to then be unable to replicate from where we left off because the source MySQL instance no longer has binary log events that we need, as they have been purged, in which case we would be forced to start the entire migration over again."

Composes with

Seen in

  • sources/2025-03-18-redpanda-3-powerful-connectors-for-real-time-change-data-capturecanonical wiki disclosure of snapshot-plus-catch-up as the cross-engine CDC lifecycle that all four of Redpanda Connect's input connectors implement, varied by engine-specific offset boundary. Postgres uses an LSN via replication slot; MySQL uses a binlog position captured under a global read lock; MongoDB uses an oplog resume token; Spanner uses a change-stream partition-sequence token stored transactionally. Canonical verbatim: "Our connector reads the snapshot into Redpanda to capture the database's current state before transitioning to real-time change streaming." Extends the pattern family with intra-table parallel snapshot — splitting a single table or collection into chunks read concurrently during the snapshot phase, narrower than this pattern's existing "parallelise across tables" framing. Structural observation: every CDC source published on the wiki implements a per-engine variant of this pattern; the engine-specific detail is the mechanism of the transition-boundary marker, not the shape of the pattern itself.

  • — canonical wiki instance of the pattern at petabyte scale, on MySQL, via Vitess VReplication. The full choreography is documented — LOCK TABLES READ briefly, consistent-snapshot, GTID capture, unlock, PK-ordered copy, per-stream destination shard filter, copy/catch-up interleave, post-copy continuous-replication via COM_BINLOG_DUMP_GTID with persisted GTID progress. Matt Lord's framing is explicit that this is the load-bearing pattern for any zero-downtime data motion at scale.

  • sources/2026-04-21-planetscale-faster-backups-with-shardingcanonical backup-to-object-storage application of the pattern. PlanetScale's per-shard backup composes snapshot-plus-catchup in an inverted flow: the "snapshot" is the previous backup in S3/GCS (restored onto an ephemeral VTBackup instance), and the "catchup" reads binlog from the primary VTGate to bring the restored MySQL forward to a fresh checkpoint-in-time. The new backup is then taken off the caught-up instance and written back to object storage. Extends the pattern family with a new production shape canonicalised as dedicated-backup-instance-with-catchup-replication: every shard runs its own instance of the pattern in parallel, and the full backup choreography avoids streaming full-DB contents from the primary on every backup cycle. Measured production performance: 32-shard / 20 TB cluster = 1 h 39 min (~6.7 GB/s aggregate); 256-shard / ~230 TB cluster = 3 h 37 min (~35 GB/s aggregate).

  • sources/2026-04-21-planetscale-behind-the-scenes-how-schema-reverts-work — the within-table, online-DDL application of the same pattern. Guevara + Noach walk the forward direction (consistent-snapshot batches + GTID capture + binlog catch-up applied to a shadow table being schema-evolved) at table scope. The post makes explicit that the snapshot-plus-catch-up shape applies regardless of whether the "copy" is moving data between keyspaces (the 2026-02-16 case) or between tables in the same keyspace (the online-DDL case). Extends the pattern family with the PlanetScale-specific observation that the snapshot-plus-catch-up stream does not have to terminate at cut-over — patterns/instant-schema-revert-via-inverse-replication re-uses the stream, now running in reverse, to keep a second table (the inverse shadow) continuously in sync, so a schema revert is a second swap of two tables that are already caught up.
Last updated · 542 distilled / 1,571 read