
PATTERN · Cited by 1 source

VDiff verify before cutover

Problem

A zero-downtime migration has three parts: copy the source to the destination, keep the destination in sync via continuous replication, then cut over traffic. Each part can go silently wrong:

  • The copy phase can skip rows (filter-rule bug, primary-key collision, charset normalisation), duplicate rows (interrupted-restart bug), or apply wrong values (type coercion, collation mismatch).
  • The continuous replication can drop events (binlog filtering bug, row-filter bug), double-apply events (idempotency gap), or apply them out of order.
  • On a sharded destination, the sharding scheme can mis-route rows so they land on the wrong destination shard.

The failures are silent by default — the replication pipeline returns success codes, the customer's production traffic continues, and nobody notices until the cutover puts the destination on the serving path and queries start returning wrong answers or missing data.

Cutting over without explicit verification is therefore a trust fall onto the engineering of the migration pipeline. That engineering, even when high-quality, is exposed to ambient sources of bugs: MySQL version differences, collation changes, charset migrations, zero-date and timestamp semantics, edge cases in generated columns, and so on.

Solution

Run a full-table row-for-row consistency check between source and destination before any cutover, using snapshots that correspond to the same logical time on both sides, with the check itself designed to be zero-downtime and fault-tolerant.
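A minimal sketch of what "snapshots at the same logical time" means in MySQL terms, assuming a GTID-based setup. The statements are standard MySQL; the coordination Vitess actually performs around them (workflow locking, doing the snapshot-open and GTID capture atomically) is more involved:

```sql
-- On the source: open a consistent non-locking snapshot and capture the
-- GTID position it corresponds to (simplified; making these two steps
-- atomic in production needs extra coordination).
START TRANSACTION WITH CONSISTENT SNAPSHOT;
SELECT @@GLOBAL.gtid_executed;

-- On each destination shard: replay up to exactly that position, stop,
-- then open a snapshot of your own. The two snapshots now describe the
-- same logical time.
STOP REPLICA;
START REPLICA UNTIL SQL_AFTER_GTIDS = '<gtid set captured above>';
-- (wait for replication to stop at that position)
START TRANSACTION WITH CONSISTENT SNAPSHOT;
```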

The shape:

  1. Lock the replication workflow briefly so that the replication stream's state is stable while the diff initialises.
  2. Snapshot both sides at matching logical times. On the source, open a concepts/consistent-non-locking-snapshot and capture the GTID / LSN position. On each destination shard, advance the replication stream until it reaches that position, then stop and open a consistent snapshot.
  3. Release the workflow lock. Continuous replication resumes applying new source writes to the destination while the diff scans from the frozen snapshots.
  4. Concurrently full-table-scan source and destination, comparing streamed results row-by-row, recording every discrepancy with detail.
  5. Persist diff progress in durable storage so the diff is resumable from any interruption — at scale, the diff itself takes long enough to outlive individual tablet restarts.
  6. Use read replicas for the scans where possible so the diff does not itself impact live production load on either side.
  7. Report discrepancies in a form the customer can act on before cutting over.
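Step 4 is essentially a merge over two primary-key-ordered row streams. A minimal Python sketch (the function and row shape are illustrative, not VDiff's actual code): each row is a `(pk, values)` pair, and both iterators are assumed sorted by primary key, as a full-table scan over the PK would produce.

```python
def stream_diff(source_rows, dest_rows):
    """Merge-compare two PK-sorted row streams; return all discrepancies."""
    discrepancies = []
    src_it, dst_it = iter(source_rows), iter(dest_rows)
    src, dst = next(src_it, None), next(dst_it, None)
    while src is not None or dst is not None:
        if dst is None or (src is not None and src[0] < dst[0]):
            # Row exists on the source but never arrived on the destination.
            discrepancies.append(("missing_on_destination", src[0]))
            src = next(src_it, None)
        elif src is None or dst[0] < src[0]:
            # Row exists on the destination with no source counterpart
            # (double-apply, mis-route, or stale leftover).
            discrepancies.append(("extra_on_destination", dst[0]))
            dst = next(dst_it, None)
        else:
            # Same primary key on both sides: compare the values.
            if src[1] != dst[1]:
                discrepancies.append(("value_mismatch", src[0]))
            src, dst = next(src_it, None), next(dst_it, None)
    return discrepancies
```

Because both streams are consumed strictly in order, memory use is constant regardless of table size, which is what makes a full-table diff feasible at petabyte scale.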

Canonical wiki instance

Vitess VDiff — invoked as a required (or at least strongly recommended) step before every MoveTables SwitchTraffic cutover. PlanetScale positions VDiff as the confidence gate between the long-running replication phase and the cutover itself.

Properties observed:

  • Uses REPLICA tablets by default on both source and target for the row streaming — "to prevent any impact on the live production system."
  • Fault-tolerant and resumable: "It will automatically pick up where it left off if any error is encountered." Diff state is persisted in VDiff sidecar tables on each target PRIMARY tablet.
  • Incremental over long cutover-prep horizons: "if e.g. you are in the pre-cutover state for many weeks or even months, you can run an initial VDiff, and then resume that one as you get closer to the cutover point."
  • Reports ETA + discrepancies with detail so the customer can investigate and reconcile before cutover.
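The resumability property amounts to checkpointing the scan position in durable storage. A hypothetical sketch, with sqlite3 standing in for the VDiff sidecar tables on the target PRIMARY (table and function names are illustrative):

```python
import sqlite3

def open_state(path):
    """Open (or create) the durable diff-progress store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS vdiff_state "
        "(table_name TEXT PRIMARY KEY, last_pk INTEGER)"
    )
    return db

def save_progress(db, table, last_pk):
    """Record the highest primary key compared so far for this table."""
    db.execute(
        "INSERT INTO vdiff_state VALUES (?, ?) "
        "ON CONFLICT(table_name) DO UPDATE SET last_pk = excluded.last_pk",
        (table, last_pk),
    )
    db.commit()

def resume_point(db, table):
    """Where to restart the scan after an interruption (WHERE pk > this)."""
    row = db.execute(
        "SELECT last_pk FROM vdiff_state WHERE table_name = ?", (table,)
    ).fetchone()
    return row[0] if row else 0
```

After a restart, the scan resumes from `WHERE pk > resume_point(...)` instead of re-reading the whole table, which is what lets a multi-week diff survive tablet restarts.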

(Source: sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale.)

Composes with

Seen in

  • sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale — canonical wiki instance. Explicit mechanism description (workflow lock → source snapshot → per-shard START REPLICA UNTIL → target snapshots → workflow restart → lock release → concurrent full-table scan → persisted diff state → report). Framed as the confidence gate between continuous-replication phase and cutover. Noted to use REPLICA tablets by default to avoid source/target production impact, and to be resumable / incremental across long cutover-preparation horizons.