
PATTERN

Reverse replication for rollback

Problem

A well-prepared zero-downtime database migration (data copied, continuous replication live, VDiff clean) can still go wrong after cutover for reasons that only manifest under real production traffic:

  • Query-plan regressions: the optimiser on the new engine version chooses different plans for the same queries.
  • Lock contention patterns that only emerge at production concurrency.
  • Collation / charset / generated-column differences that don't show up in test workloads.
  • Connection-pool behaviour that doesn't match the old system.
  • Customer-specific load patterns the new system hasn't seen before.

The customer discovers the regression in the first minutes of post-cutover production traffic — when the old system is already strictly stale because it stopped accepting writes at cutover time. The only way out is to reconcile the diverged states manually, which at petabyte scale is effectively a brand-new migration project.

The cutover has become a one-way door.

Solution

At cutover, create a reverse replication workflow (target → source) and keep it running until the customer explicitly finalises the migration. The reverse workflow applies every write landing on the new system back to the old system, so the old system stays current. If the new system misbehaves, traffic can be cut back to the old system — which now has every write that happened post-cutover — without data loss.

The cutover is now a revolving door: the customer can switch back and forth as many times as needed, and call Complete only when they are 100% confident in the new system.

Mechanics

At cutover time (in Vitess this is MoveTables SwitchTraffic):

  1. Before flipping routing rules, ensure there are viable PRIMARY tablets in the source keyspace that can accept the reverse stream.
  2. Lock source + target keyspaces + workflow.
  3. Stop writes on source, buffer queries at proxy.
  4. Wait for forward replication to fully catch up.
  5. Create reverse VReplication workflow (target → source), primed at the current logical time so no writes are lost across the flip.
  6. Flip schema routing rules to target.
  7. Release buffered queries to target.
  8. Start reverse workflow. Writes now land on target and stream back to source.
  9. Freeze the original forward workflow (state retained, cannot be manipulated).
  10. Release locks.
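The sequence above can be sketched as a toy, single-process model. All class and method names here are illustrative, not Vitess APIs; the keyspace locks (steps 1–2, 10) and GTID priming of the reverse stream (step 5) are elided:

```python
# Toy model of the SwitchTraffic cutover sequence. Illustrative only --
# real Vitess performs these steps inside vtctld across keyspaces.

class Side:
    def __init__(self, name):
        self.name = name
        self.rows = {}          # committed data
        self.accepting = True   # whether this side takes writes

class Migration:
    def __init__(self):
        self.source = Side("source")
        self.target = Side("target")
        self.routed_to = self.source   # routing rules point at source
        self.buffer = []               # queries buffered at the proxy
        self.reverse_running = False
        self.forward_frozen = False

    def replicate(self, frm, to):
        # Catch-up: apply every row the other side is missing.
        to.rows.update(frm.rows)

    def write(self, key, val):
        if self.routed_to.accepting:
            self.routed_to.rows[key] = val
        else:
            self.buffer.append((key, val))   # proxy buffers during cutover

    def switch_traffic(self):
        self.source.accepting = False            # 3. stop writes, buffer
        self.replicate(self.source, self.target) # 4. forward catch-up
        self.routed_to = self.target             # 6. flip routing rules
        self.target.accepting = True
        for key, val in self.buffer:             # 7. release buffered queries
            self.target.rows[key] = val
        self.buffer.clear()
        self.reverse_running = True              # 8. start reverse stream
        self.forward_frozen = True               # 9. freeze forward workflow

m = Migration()
m.write("a", 1)                   # lands on source pre-cutover
m.switch_traffic()
m.write("b", 2)                   # lands on target post-cutover
m.replicate(m.target, m.source)   # reverse stream keeps the old system current
print(sorted(m.source.rows))      # → ['a', 'b']: source never went stale
```

The key invariant is visible at the end: because the reverse stream follows the flip, the old system holds every post-cutover write, which is exactly what makes cutting back loss-free.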

Post-cutover:

  • Customer can call MoveTables ReverseTraffic to swap back. Same cutover sequence, reversed direction.
  • Customer can call MoveTables SwitchTraffic again. Same forward sequence.
  • Ping-pong as many times as needed.
  • When confident, customer calls MoveTables Complete, which tears down the reverse workflow and the migration-related routing-rule artefacts. At this point the migration is committed and the old system is no longer kept in sync.

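The revolving-door property can be sketched the same way: as long as a reverse stream runs after every switch, both sides converge and no switch loses writes (toy model, not the MoveTables implementation):

```python
# Toy revolving door: the active side takes writes; after each batch the
# reverse stream catches the passive side up before any switch-back.
# Illustrative only -- real Vitess streams continuously via VReplication.

def ping_pong(batches):
    source, target = {}, {}
    active, passive = target, source   # post-SwitchTraffic: target is live
    for batch in batches:
        for k, v in batch.items():
            active[k] = v              # writes land on the active side
        passive.update(active)         # reverse stream catches passive up
        active, passive = passive, active   # ReverseTraffic / SwitchTraffic

    return source, target

src, tgt = ping_pong([{"a": 1}, {"b": 2}, {"c": 3}])
print(src == tgt == {"a": 1, "b": 2, "c": 3})  # → True: no switch lost data
```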
(Source: sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale.)

When to apply

Aggressively, on any cross-version / cross-engine / sharding-topology / hosting-vendor migration. The cost of running the reverse workflow is linear in write rate × migration-observation horizon — for most customers a small fraction of steady-state database cost, and cheap relative to the option value of being able to cut back without data loss.

Less aggressively on within-version / within-engine / same-topology migrations where regressions are unlikely — but even then a running reverse workflow makes incident response vastly easier.
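The cost claim can be made concrete with back-of-envelope numbers (all figures below are illustrative assumptions, not from the source):

```python
# Back-of-envelope: reverse-workflow cost ~ write rate x observation window.
# All numbers are illustrative assumptions.
writes_per_sec = 5_000          # sustained write rate on the new system
bytes_per_event = 512           # average replication event size shipped back
observation_days = 14           # how long the customer keeps the escape hatch

bytes_shipped = writes_per_sec * bytes_per_event * observation_days * 86_400
print(f"{bytes_shipped / 1e12:.1f} TB replicated back over the window")
# → 3.1 TB replicated back over the window
```

A few terabytes of replication traffic over two weeks is small next to the petabyte-scale datasets being migrated, which is what makes the "keep the escape hatch open" default affordable.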

Composes with

Seen in

  • sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale — canonical wiki instance. Matt Lord's explicit framing: "Reverse replication is put in place so that if for any reason we need to revert the cutover, we can do so without data loss or downtime (this can be done back and forth as many times as necessary)." The reverse workflow is an explicit step in MoveTables SwitchTraffic and is held running until the customer calls MoveTables Complete. Named explicitly as risk-mitigation against "you may be going from one MySQL, or even MariaDB, version to another and from an unsharded database to a sharded one" kinds of regressions — i.e. exactly the cross-version + topology-change cases where surprises are most likely.