
VDiff

What it is

VDiff (vitess.io docs) is Vitess's zero-downtime consistency checker for data-motion workflows. Given an active VReplication workflow, VDiff verifies that every row that should have been copied or replicated actually landed on the destination shards, correctly filtered per the sharding scheme and with values matching the source. It is the explicit pre-cutover confidence gate on any VReplication-based migration, resharding, or table move.

Why it exists

Zero-downtime migrations span hours, days, or even weeks at petabyte scale and involve many streams across many tablets on fleet hardware. Every step of VReplication is fault-tolerant, but even so: "at least one [VDiff] is run before the cutover to ensure that the data has been copied correctly and that the new system is in sync with the old." (Source: sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale.) See patterns/vdiff-verify-before-cutover for why verify-before-cutover is the canonical pre-switch discipline for any long-running replication pipeline.

How it works

Per-table diff (serial across tables within a workflow):

  1. Lock the workflow — acquire a named lock on the workflow in the target keyspace's topology server so the workflow can't be concurrently manipulated while VDiff initialises.
  2. Stop the workflow for table-diff initialisation.
  3. Consistent snapshot on source — exactly as in VReplication's copy phase: START TRANSACTION WITH CONSISTENT SNAPSHOT on the source tablet, record the resulting GTID position. See concepts/consistent-non-locking-snapshot.
  4. Per-target-shard START REPLICA UNTIL-equivalent — each target shard's stream runs until it reaches the source's captured GTID position, then stops. On each target shard, open a consistent snapshot. At this point the source and all target shards hold consistent snapshots of the table at exactly the same logical time.
  5. Restart the workflow — replication resumes applying new source events to the target shards while the diff scans from the frozen snapshots on each side.
  6. Release the workflow lock.
  7. Concurrent full-table scan on source and each target shard, comparing streamed results and noting any discrepancies (missing rows, mismatched values). Diff state persisted in the VDiff sidecar tables on each target shard.
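Step 7's concurrent scan is, at its core, a merge-join over two primary-key-ordered row streams from the frozen snapshots. A minimal sketch of how missing and mismatched rows fall out of that merge (hypothetical names throughout; this is not Vitess code):

```python
from typing import Iterable, Tuple

Row = Tuple[int, tuple]  # (primary_key, column_values) — simplified

def diff_streams(source: Iterable[Row], target: Iterable[Row]) -> dict:
    """Merge two PK-ordered row streams and report discrepancies.

    Sketch of a VDiff-style full-table scan: both sides stream rows in
    primary-key order from their frozen snapshots; the comparator always
    advances whichever side is behind.
    """
    src, tgt = iter(source), iter(target)
    s, t = next(src, None), next(tgt, None)
    report = {"rows_compared": 0, "missing_on_target": [],
              "extra_on_target": [], "mismatched": []}
    while s is not None or t is not None:
        if t is None or (s is not None and s[0] < t[0]):
            report["missing_on_target"].append(s[0])   # never copied
            s = next(src, None)
        elif s is None or t[0] < s[0]:
            report["extra_on_target"].append(t[0])     # should not exist
            t = next(tgt, None)
        else:
            report["rows_compared"] += 1
            if s[1] != t[1]:
                report["mismatched"].append(s[0])      # values diverge
            s, t = next(src, None), next(tgt, None)
    return report

# diff_streams([(1, ("a",)), (2, ("b",)), (3, ("c",))],
#              [(1, ("a",)), (3, ("x",))])
# → rows_compared=2, missing_on_target=[2], mismatched=[3]
```

Because both snapshots are frozen at the same GTID position, any discrepancy the merge reports is a real copy error, not replication lag.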

The diff reports to the user (see VDiff show): an ETA and rows compared while running, and on completion any discrepancies in detail and status per table.

Key properties

  • Zero downtime on the source production system. "The VDiff will choose REPLICA tablets by default on the source and target, for the data streaming (the work is still orchestrated by and the state still stored on the target PRIMARY tablets), to prevent any impact on the live production system." (Source: sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale.)
  • Fault-tolerant and resumable. "It will automatically pick up where it left off if any error is encountered." Important at petabyte scale — see concepts/fault-tolerant-long-running-workflow.
  • Incremental / resumable over long cutover-preparation horizons. "If e.g. you are in the pre-cutover state for many weeks or even months, you can run an initial VDiff, and then resume that one as you get closer to the cutover point."
  • Disturbs the workflow minimally. The stop / snapshot / restart dance takes only as long as the snapshot-setup itself; the full diff scan runs concurrently with normal VReplication catch-up.
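The resumability properties above amount to a checkpointed scan: the diff periodically persists the last compared primary key (VDiff keeps this state in sidecar tables on the target PRIMARY), so a restart re-enters the scan at the checkpoint rather than at row zero. A sketch under those assumptions, with hypothetical names:

```python
def resumable_diff(rows_after, checkpoint_store, table="t1", batch=2):
    """Scan rows in PK order, persisting progress so a crash resumes mid-table.

    rows_after(watermark) yields (pk, values) pairs with pk > watermark;
    checkpoint_store is any dict-like persistence (VDiff uses sidecar
    tables on the target shards).
    """
    last_pk = checkpoint_store.get(table, 0)   # resume point; 0 = from start
    compared = 0
    for pk, _values in rows_after(last_pk):
        compared += 1                          # (comparison itself elided)
        last_pk = pk
        if compared % batch == 0:              # checkpoint periodically
            checkpoint_store[table] = last_pk
    checkpoint_store[table] = last_pk          # final checkpoint
    return compared
```

An initial run over a long pre-cutover horizon pays the full scan once; each later resume only scans what the checkpoint has not yet covered, which is why incremental resumption can be the operational default.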

Why it shows up on this wiki

VDiff is the canonical wiki instance of a consistency verifier designed for zero-downtime data-motion workflows. The shape is reusable: lock the workflow → snapshot both sides at matching logical times → resume workflow → concurrently scan both sides → report. Any long-running replication pipeline (Debezium + target store, Postgres logical replication, AWS DMS, vendor-specific CDC pipelines) needs the same verification primitive before any cutover that trusts the destination for primary traffic. VDiff documents the architectural choice points: run on REPLICAs to avoid source-side load, persist state on target PRIMARY tablets, make it resumable from arbitrary failure points, and make incremental resumption the operational default so the verification cost amortises across the cutover-preparation horizon.
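The reusable shape can be written down as an orchestration skeleton. All names here are hypothetical; the lock/snapshot/scan primitives stand in for whatever your particular pipeline provides:

```python
def verify_before_cutover(workflow, source, targets):
    """Generic shape of a VDiff-style verifier for a replication pipeline:
    lock workflow -> snapshot both sides at matching logical times ->
    resume workflow -> concurrently scan both sides -> report.
    """
    workflow.lock()                       # prevent concurrent manipulation
    try:
        workflow.stop()
        position = source.open_snapshot()     # frozen view + logical position
        for t in targets:
            t.replicate_until(position)       # catch up exactly to position
            t.open_snapshot()                 # frozen view at same logical time
        workflow.start()                  # live replication resumes at once
    finally:
        workflow.unlock()
    # the expensive scan runs against frozen snapshots, off the live path
    return [t.scan_and_compare(source) for t in targets]
```

The key design point the skeleton makes visible: the workflow is only paused for the snapshot setup, while the costly comparison happens after replication has already resumed.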

Seen in

  • sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale — canonical wiki description of VDiff's workflow-locking, source-snapshot + target-START REPLICA UNTIL + restart, release-lock + concurrent-scan mechanism, plus the REPLICA-tablet preference for zero source-side impact and the resumable / incremental operation mode. VDiff is named as the explicit pre-cutover verification step on every PlanetScale migration.