PATTERN
Reverse replication for rollback¶
Problem¶
A well-prepared zero-downtime database migration (data copied, continuous replication live, VDiff clean) can still go wrong after cutover for reasons that only manifest under real production traffic:
- Query-plan regressions, because the optimiser on the new engine version chooses different plans.
- Lock contention patterns that only emerge at production concurrency.
- Collation / charset / generated-column differences that don't show up in test workloads.
- Connection-pool behaviour that doesn't match the old system.
- Customer-specific load patterns the new system hasn't seen before.
The customer discovers the regression in the first minutes of post-cutover production traffic — when the old system is already strictly stale because it stopped accepting writes at cutover time. The only way out is to reconcile the diverged states manually, which at petabyte scale is effectively a brand-new migration project.
The cutover has become a one-way door.
Solution¶
At cutover, create a reverse replication workflow (target → source) and keep it running until the customer explicitly finalises the migration. The reverse workflow applies every write landing on the new system back to the old system, so the old system stays current. If the new system misbehaves, traffic can be cut back to the old system — which now has every write that happened post-cutover — without data loss.
The cutover is now a revolving door: the customer can switch back and forth as many times as needed, and call Complete only when they are 100% confident in the new system.
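The core guarantee can be sketched in a few lines. This is an illustrative model, not Vitess code: a reverse stream that replays every write landing on the new system back onto the old one, so a switchback never loses data.

```python
# Illustrative sketch only (not Vitess internals): a reverse replication
# stream keeps the old primary current after cutover.

class Database:
    def __init__(self, name):
        self.name = name
        self.rows = {}

    def apply(self, key, value):
        self.rows[key] = value

class ReverseStream:
    """Replays every write on the new system back onto the old one."""
    def __init__(self, source, target):
        self.source, self.target = source, target

    def write(self, key, value):
        self.target.apply(key, value)   # write lands on the new system...
        self.source.apply(key, value)   # ...and streams back to the old one

old, new = Database("old"), Database("new")
stream = ReverseStream(source=old, target=new)

# Post-cutover production traffic hits the new system:
stream.write("order:1", "paid")
stream.write("order:2", "shipped")

# If the new system misbehaves, the old one is not stale:
assert old.rows == new.rows
```

Because the old system never falls behind, "cut back" is just another routing flip, not a reconciliation project.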
Mechanics¶
At cutover time (in Vitess this is MoveTables SwitchTraffic):
- Before flipping routing rules, ensure there are viable PRIMARY tablets in the source keyspace that can accept the reverse stream.
- Lock source + target keyspaces + workflow.
- Stop writes on source, buffer queries at proxy.
- Wait for forward replication to fully catch up.
- Create reverse VReplication workflow (target → source), primed at the current logical time so no writes are lost across the flip.
- Flip schema routing rules to target.
- Release buffered queries to target.
- Start reverse workflow. Writes now land on target and stream back to source.
- Freeze the original forward workflow (state retained, cannot be manipulated).
- Release locks.
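The sequence above can be sketched as an ordered list of steps. The step names are paraphrases of the bullets, not Vitess APIs; the point is the ordering invariant, namely that the reverse workflow is primed at the current logical time before routing flips, so no write can fall into a gap between the forward and reverse streams.

```python
# Hedged sketch of the SwitchTraffic ordering; step names are illustrative.

def switch_traffic(log):
    log.append("lock source+target keyspaces and workflow")
    log.append("stop writes on source; buffer queries at proxy")
    log.append("wait for forward replication to catch up")
    log.append("create reverse workflow primed at current position")
    log.append("flip routing rules to target")
    log.append("release buffered queries to target")
    log.append("start reverse workflow")
    log.append("freeze forward workflow")
    log.append("release locks")
    return log

steps = switch_traffic([])

# The invariant that makes the flip lossless: the reverse workflow is
# primed before any query can reach the target.
assert steps.index("create reverse workflow primed at current position") \
     < steps.index("flip routing rules to target")
```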
Post-cutover:
- Customer can call MoveTables ReverseTraffic to swap back. Same cutover sequence, reversed direction.
- Customer can call MoveTables SwitchTraffic again. Same forward sequence.
- Ping-pong as many times as needed.
- When confident, customer calls MoveTables Complete, which tears down the reverse workflow and the migration-related routing-rule artefacts. At this point the migration is committed and the old system is no longer kept in sync.
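The post-cutover lifecycle is a small state machine: SwitchTraffic and ReverseTraffic can alternate freely, and only Complete is terminal. A minimal sketch, with method names that mirror the MoveTables commands but are otherwise invented for illustration:

```python
# Illustrative state machine for the post-cutover lifecycle; not Vitess code.

class Migration:
    def __init__(self):
        self.serving = "source"
        self.reverse_workflow_running = False
        self.completed = False

    def switch_traffic(self):
        assert not self.completed
        self.serving = "target"
        self.reverse_workflow_running = True   # old system kept current

    def reverse_traffic(self):
        assert not self.completed
        self.serving = "source"                # same sequence, reversed

    def complete(self):
        assert self.serving == "target"
        self.reverse_workflow_running = False  # reverse workflow torn down
        self.completed = True                  # one-way door from here on

m = Migration()
m.switch_traffic()
m.reverse_traffic()   # regression found, cut back losslessly
m.switch_traffic()    # try again; ping-pong as needed
m.complete()          # only now is the migration committed
```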
(Source: sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale.)
When to apply¶
Aggressively, on any cross-version / cross-engine / sharding-topology / hosting-vendor migration. The cost of running the reverse workflow is linear in write rate × migration-observation horizon — for most customers a small fraction of steady-state database cost, and far cheaper than the option value of being able to cut back without data loss.
Less aggressively on within-version / within-engine / same-topology migrations where regressions are unlikely — but even then a running reverse workflow makes incident response vastly easier.
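As a back-of-envelope illustration of the "write rate × observation horizon" cost claim, with figures that are entirely invented for the example:

```python
# Invented example figures; only the arithmetic shape matters.
writes_per_sec = 5_000
bytes_per_write = 512
observation_days = 14

replicated_bytes = writes_per_sec * bytes_per_write * 86_400 * observation_days
print(f"~{replicated_bytes / 1e12:.1f} TB streamed over the observation window")
```

A few terabytes of replicated change stream over two weeks is small next to a petabyte-scale dataset, which is why the option value usually dominates the cost.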
Composes with¶
- patterns/routing-rule-swap-cutover — the reverse workflow exists to make the routing-rule flip reversible.
- patterns/snapshot-plus-catchup-replication — the reverse workflow is itself a snapshot-plus-catch-up stream, just in the opposite direction.
- concepts/query-buffering-cutover — query buffering at the proxy is what makes each flip invisible to the application.
Seen in¶
- sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale
— canonical wiki instance. Matt Lord's explicit framing: "Reverse replication is put in place so that if for any reason we need to revert the cutover, we can do so without data loss or downtime (this can be done back and forth as many times as necessary)." The reverse workflow is an explicit step in MoveTables SwitchTraffic and is held running until the customer calls MoveTables Complete. Named explicitly as risk-mitigation against "you may be going from one MySQL, or even MariaDB, version to another and from an unsharded database to a sharded one" kinds of regressions — i.e. exactly the cross-version + topology-change cases where surprises are most likely.
Related¶
- concepts/reverse-replication-workflow
- concepts/online-database-import
- concepts/query-buffering-cutover
- concepts/schema-routing-rules
- systems/vitess-vreplication
- systems/vitess-movetables
- systems/vitess
- patterns/routing-rule-swap-cutover
- patterns/snapshot-plus-catchup-replication
- companies/planetscale