PATTERN Cited by 1 source
VDiff verify before cutover¶
Problem¶
A zero-downtime migration has three parts: copy the source to the destination, keep the destination in sync via continuous replication, then cut over traffic. Each part can go silently wrong:
- The copy phase can skip rows (filter-rule bug, primary-key collision, charset normalisation), duplicate rows (interrupted-restart bug), or apply wrong values (type coercion, collation mismatch).
- Continuous replication can drop events (binlog-filtering bug, row-filter bug), double-apply events (idempotency gap), or apply them out of order.
- Sharding schemes can mis-route rows so they land on the wrong destination shard.
The failures are silent by default — the replication pipeline returns success codes, the customer's production traffic continues, and nobody notices until the cutover puts the destination on the serving path and queries start returning wrong answers or missing data.
Cutting over without explicit verification is therefore a trust fall into the engineering of the migration pipeline, and that pipeline, even when well engineered, is subject to ambient sources of bugs: MySQL version differences, collation changes, charset migrations, zero-date / timestamp semantics, edge cases in generated columns, and so on.
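As a concrete illustration of how such corruption stays silent, here is a minimal (hypothetical) example of the charset-normalisation case: two strings that render identically but differ at the byte level, so a destination that normalises text on write would pass a casual eyeball check yet fail a row-for-row comparison.

```python
import unicodedata

# Source stores "café" in NFC form (é is a single code point); a
# destination that normalises text on write might store NFD form
# (e plus a combining accent). The rows render identically, but
# their bytes differ: exactly the kind of silent divergence a
# row-for-row consistency check exists to catch.
source_value = "caf\u00e9"                                  # NFC
dest_value = unicodedata.normalize("NFD", source_value)     # NFD

assert source_value != dest_value             # byte-level mismatch
assert len(source_value.encode()) == 5        # NFC: 5 bytes in UTF-8
assert len(dest_value.encode()) == 6          # NFD: 6 bytes in UTF-8
```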
Solution¶
Run a full-table row-for-row consistency check between source and destination before any cutover, using snapshots that correspond to the same logical time on both sides, with the check itself designed to be zero-downtime and fault-tolerant.
The shape:
- Lock the replication workflow briefly so that the stream state is stable while the diff is initialised.
- Snapshot both sides at matching logical times. On the source, open a consistent non-locking snapshot (concepts/consistent-non-locking-snapshot) and capture the GTID / LSN position. On each destination shard, advance the replication stream until it reaches that position, then stop and open a consistent snapshot.
- Release the workflow lock. Continuous replication resumes applying new source writes to the destination while the diff scans from the frozen snapshots.
- Concurrently full-table-scan source and destination, comparing streamed results row-by-row, recording every discrepancy with detail.
- Persist diff progress in durable storage so the diff is resumable from any interruption — at scale, the diff itself takes long enough to outlive individual tablet restarts.
- Use read replicas for the scans where possible so the diff does not itself impact live production load on either side.
- Report discrepancies in a form the customer can act on before cutting over.
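The scan-and-compare step above can be sketched as a merge over two primary-key-ordered row streams. This is an illustrative model only (the `diff_streams` helper and `DiffReport` type are invented here, not Vitess APIs); it shows how missing, extra, and mismatched rows are recorded, and how the last-compared key doubles as a durable resume checkpoint.

```python
from dataclasses import dataclass, field

@dataclass
class DiffReport:
    missing: list = field(default_factory=list)    # on source, absent on dest
    extra: list = field(default_factory=list)      # on dest, absent on source
    mismatched: list = field(default_factory=list) # same PK, different values
    last_pk: object = None                         # checkpoint: resume point

def diff_streams(source, dest, report=None):
    """Merge-compare two PK-ordered streams of (pk, row) pairs.

    Persisting report.last_pk after each step, and skipping keys at or
    below it on restart, makes the diff resumable from interruption.
    """
    report = report or DiffReport()
    src, dst = iter(source), iter(dest)
    s, d = next(src, None), next(dst, None)
    while s is not None or d is not None:
        if d is None or (s is not None and s[0] < d[0]):
            report.missing.append(s[0])            # PK only on source
            report.last_pk = s[0]
            s = next(src, None)
        elif s is None or d[0] < s[0]:
            report.extra.append(d[0])              # PK only on destination
            report.last_pk = d[0]
            d = next(dst, None)
        else:
            if s[1] != d[1]:
                report.mismatched.append(s[0])     # same PK, wrong values
            report.last_pk = s[0]
            s, d = next(src, None), next(dst, None)
    return report
```

In the real system the two streams come from the frozen snapshots, ideally on read replicas, so the scan touches neither serving path.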
Canonical wiki instance¶
Vitess VDiff — invoked as a required (or at least strongly recommended) step before every MoveTables SwitchTraffic cutover. PlanetScale positions VDiff as the confidence gate between the long-running replication phase and the cutover itself.
Properties observed:
- Uses REPLICA tablets by default on both source and target for the row streaming — "to prevent any impact on the live production system."
- Fault-tolerant and resumable — "It will automatically pick up where it left off if any error is encountered." Diff state is persisted in VDiff sidecar tables on each target PRIMARY tablet.
- Incremental over long cutover-prep horizons — "if e.g. you are in the pre-cutover state for many weeks or even months, you can run an initial VDiff, and then resume that one as you get closer to the cutover point."
- Reports an ETA plus detailed discrepancies so the customer can investigate and reconcile before cutover.
(Source: sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale.)
Composes with¶
- patterns/snapshot-plus-catchup-replication — the pattern this is the verification-step on top of.
- patterns/routing-rule-swap-cutover — only flip routing rules after VDiff shows clean.
Seen in¶
- sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale — canonical wiki instance. Explicit mechanism description (workflow lock → source snapshot → per-shard START REPLICA UNTIL → target snapshots → workflow restart → lock release → concurrent full-table scan → persisted diff state → report). Framed as the confidence gate between the continuous-replication phase and the cutover. Noted to use REPLICA tablets by default to avoid source/target production impact, and to be resumable / incremental across long cutover-preparation horizons.
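The mechanism chain in that description can be modelled as a tiny coordinator. Everything here is an assumed simplification (integer positions standing in for GTIDs, a `Shard` class invented for illustration, not a Vitess API): each destination shard replays replication up to the captured source position, then opens its snapshot, so every side ends up with a view at the same logical time.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Shard:
    position: int = 0                  # stand-in for a GTID / LSN
    snapshot_at: Optional[int] = None

    def replicate_until(self, target: int) -> None:
        # Analogue of START REPLICA UNTIL: apply replication events
        # up to the captured source position, then stop.
        if self.position < target:
            self.position = target

    def open_snapshot(self) -> None:
        # Open a consistent snapshot at the now-frozen position.
        self.snapshot_at = self.position

def init_diff(source_position: int, shards: list) -> list:
    """Freeze matching logical times on every destination shard,
    then report where each snapshot was taken."""
    for shard in shards:
        shard.replicate_until(source_position)
        shard.open_snapshot()
    return [s.snapshot_at for s in shards]
```

After this initialisation the workflow lock is released, replication resumes past the snapshot point, and the diff scans the frozen views undisturbed.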
Related¶
- concepts/consistent-non-locking-snapshot
- concepts/gtid-position
- concepts/online-database-import
- concepts/fault-tolerant-long-running-workflow
- systems/vitess-vdiff
- systems/vitess-vreplication
- systems/vitess-movetables
- patterns/snapshot-plus-catchup-replication
- patterns/routing-rule-swap-cutover
- companies/planetscale