
PATTERN

Reverse replication for rollback

Problem

A well-prepared zero-downtime database migration (data copied, continuous replication live, VDiff clean) can still go wrong after cutover for reasons that only manifest under real production traffic:

  • Query-plan regressions: the optimiser on the new engine version chooses different plans for the same queries.
  • Lock contention patterns that only emerge at production concurrency.
  • Collation / charset / generated-column differences that don't show up in test workloads.
  • Connection-pool behaviour that doesn't match the old system.
  • Customer-specific load patterns the new system hasn't seen before.

The customer discovers the regression in the first minutes of post-cutover production traffic — when the old system is already strictly stale because it stopped accepting writes at cutover time. The only way out is to reconcile the diverged states manually, which at petabyte scale is effectively a brand-new migration project.

The cutover has become a one-way door.

Solution

At cutover, create a reverse replication workflow (target → source) and keep it running until the customer explicitly finalises the migration. The reverse workflow applies every write landing on the new system back to the old system, so the old system stays current. If the new system misbehaves, traffic can be cut back to the old system — which now has every write that happened post-cutover — without data loss.

The cutover is now a revolving door: the customer can switch back and forth as many times as needed, and call Complete only when they are 100% confident in the new system.

Mechanics

At cutover time (in Vitess this is MoveTables SwitchTraffic):

  1. Before flipping routing rules, ensure there are viable PRIMARY tablets in the source keyspace that can accept the reverse stream.
  2. Lock source + target keyspaces + workflow.
  3. Stop writes on source, buffer queries at proxy.
  4. Wait for forward replication to fully catch up.
  5. Create reverse VReplication workflow (target → source), primed at the current logical time so no writes are lost across the flip.
  6. Flip schema routing rules to target.
  7. Release buffered queries to target.
  8. Start reverse workflow. Writes now land on target and stream back to source.
  9. Freeze the original forward workflow (state retained, cannot be manipulated).
  10. Release locks.
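The sequence above can be sketched as a toy, single-process model. All class and method names here are illustrative, not Vitess APIs; the keyspace locks (steps 1–2, 10) and GTID priming of the reverse stream (step 5) are elided:

```python
# Toy model of the SwitchTraffic cutover sequence. Illustrative only --
# real Vitess performs these steps inside vtctld across keyspaces.

class Side:
    def __init__(self, name):
        self.name = name
        self.rows = {}          # committed data
        self.accepting = True   # whether this side takes writes

class Migration:
    def __init__(self):
        self.source = Side("source")
        self.target = Side("target")
        self.routed_to = self.source   # routing rules point at source
        self.buffer = []               # queries buffered at the proxy
        self.reverse_running = False
        self.forward_frozen = False

    def replicate(self, frm, to):
        # Catch-up: apply every row the other side is missing.
        to.rows.update(frm.rows)

    def write(self, key, val):
        if self.routed_to.accepting:
            self.routed_to.rows[key] = val
        else:
            self.buffer.append((key, val))   # proxy buffers during cutover

    def switch_traffic(self):
        self.source.accepting = False            # 3. stop writes, buffer
        self.replicate(self.source, self.target) # 4. forward catch-up
        self.routed_to = self.target             # 6. flip routing rules
        self.target.accepting = True
        for key, val in self.buffer:             # 7. release buffered queries
            self.target.rows[key] = val
        self.buffer.clear()
        self.reverse_running = True              # 8. start reverse stream
        self.forward_frozen = True               # 9. freeze forward workflow

m = Migration()
m.write("a", 1)                   # lands on source pre-cutover
m.switch_traffic()
m.write("b", 2)                   # lands on target post-cutover
m.replicate(m.target, m.source)   # reverse stream keeps the old system current
print(sorted(m.source.rows))      # → ['a', 'b']: source never went stale
```

The key invariant is visible at the end: because the reverse stream follows the flip, the old system holds every post-cutover write, which is exactly what makes cutting back loss-free.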

Post-cutover:

  • Customer can call MoveTables ReverseTraffic to swap back. Same cutover sequence, reversed direction.
  • Customer can call MoveTables SwitchTraffic again. Same forward sequence.
  • Ping-pong as many times as needed.
  • When confident, customer calls MoveTables Complete, which tears down the reverse workflow and the migration-related routing-rule artefacts. At this point the migration is committed and the old system is no longer kept in sync.

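The revolving-door property can be sketched the same way: as long as a reverse stream runs after every switch, both sides converge and no switch loses writes (toy model, not the MoveTables implementation):

```python
# Toy revolving door: the active side takes writes; after each batch the
# reverse stream catches the passive side up before any switch-back.
# Illustrative only -- real Vitess streams continuously via VReplication.

def ping_pong(batches):
    source, target = {}, {}
    active, passive = target, source   # post-SwitchTraffic: target is live
    for batch in batches:
        for k, v in batch.items():
            active[k] = v              # writes land on the active side
        passive.update(active)         # reverse stream catches passive up
        active, passive = passive, active   # ReverseTraffic / SwitchTraffic

    return source, target

src, tgt = ping_pong([{"a": 1}, {"b": 2}, {"c": 3}])
print(src == tgt == {"a": 1, "b": 2, "c": 3})  # → True: no switch lost data
```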
(Source: sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale.)

When to apply

Aggressively, on any cross-version / cross-engine / sharding-topology / hosting-vendor migration. The cost of running the reverse workflow is linear in write rate × migration-observation horizon — for most customers a small fraction of steady-state database cost, and cheap relative to the option value of being able to cut back without data loss.

Less aggressively on within-version / within-engine / same-topology migrations where regressions are unlikely — but even then a running reverse workflow makes incident response vastly easier.
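The cost claim can be made concrete with back-of-envelope numbers (all figures below are illustrative assumptions, not from the source):

```python
# Back-of-envelope: reverse-workflow cost ~ write rate x observation window.
# All numbers are illustrative assumptions.
writes_per_sec = 5_000          # sustained write rate on the new system
bytes_per_event = 512           # average replication event size shipped back
observation_days = 14           # how long the customer keeps the escape hatch

bytes_shipped = writes_per_sec * bytes_per_event * observation_days * 86_400
print(f"{bytes_shipped / 1e12:.1f} TB replicated back over the window")
# → 3.1 TB replicated back over the window
```

A few terabytes of replication traffic over two weeks is small next to the petabyte-scale datasets being migrated, which is what makes the "keep the escape hatch open" default affordable.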

Composes with

Seen in

  • sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale — canonical wiki instance. Matt Lord's explicit framing: "Reverse replication is put in place so that if for any reason we need to revert the cutover, we can do so without data loss or downtime (this can be done back and forth as many times as necessary)." The reverse workflow is an explicit step in MoveTables SwitchTraffic and is held running until the customer calls MoveTables Complete. Named explicitly as risk-mitigation against "you may be going from one MySQL, or even MariaDB, version to another and from an unsharded database to a sharded one" kinds of regressions — i.e. exactly the cross-version + topology-change cases where surprises are most likely.