Skip to content

PATTERN Cited by 1 source

Shadow-table online schema change

Problem

MySQL's native ALTER TABLE is often too blocking to run against a production table. Some schema changes qualify for ALGORITHM=INSTANT (a metadata-only flip, e.g. adding a column at the end) or ALGORITHM=INPLACE (an in-place rebuild the engine can sometimes do without copying every row). But the common shapes teams run in production — dropping a column, changing a column's type, rebuilding a primary key, changing charset on a large text column — require a full table rewrite. Running that rewrite against the live table either:

  • Blocks writes for the duration (worst case; tens of minutes to hours on a large table), or
  • Causes heavy IOPS contention + replication lag even if not strictly blocking.

The problem is to apply an arbitrary DDL to a live table without the applications on top of it seeing either downtime or sustained performance degradation.

Solution

Build a second table with the target schema, backfill it from the live table under a consistent snapshot, track concurrent writes through the change log, and atomically swap the two tables at a brief cut-over. The four-step shape, verbatim from the post:

"In short, online schema change tools copy the production table without data, apply the schema change to the copy, sync the data, and swap the tables."

The four steps in detail:

  1. Build shadow. CREATE TABLE <name>_shadow LIKE <name>, then apply the DDL to the shadow: ALTER TABLE <name>_shadow <ddl>. Shadow is empty at this point, so applying the DDL is cheap regardless of how expensive it would be on the filled production table.
  2. Backfill rows. Copy rows from production to shadow, ordered by primary key, in batches inside consistent non-locking snapshots so the reads don't block writes. Record the GTID position at which each batch was taken so the tool knows which binlog events to replay.
  3. Track concurrent writes. Tail the binlog for events that landed after the recorded GTID and touch rows already in the shadow; apply them to the shadow through the DDL's column-projection. Interleave catch-up and copy so the shadow converges on the production table — the classic snapshot + catch-up shape at table scope.
  4. Cut over. At a cut- over freeze point, briefly write-lock the production table, apply any residual binlog events to the shadow, mark the freeze-point GTID, RENAME TABLE production TO production_old, production_shadow TO production, release the lock. Downstream proxies like VTGate buffer queries during the lock so clients never see errors.

Instantiations

Multiple tools implement this same pattern with different degrees of operational rigour:

  • pt-online-schema-change (Percona Toolkit) — the original. Triggers on the production table fire synchronously on every write to mirror it into the shadow. Simple but imposes a write-path latency cost and requires trigger DDL privileges.
  • gh-ost (GitHub, originally by Shlomi Noach — author of the PlanetScale post) — triggerless. Tails the binlog from a replica instead of using triggers. Zero write-path cost at the source; the tool can throttle itself based on replication lag / load signals.
  • Vitess VReplication — online DDL as a first-class workflow. Backfill + catch- up phases are interleaved by GTID-set comparison; sidecar state is persisted transactionally so restart is the recovery path for any failure. Uniquely among the three, Vitess does not terminate the stream after cut-over, which enables instant schema revert via inverse replication.

Trade-offs

  • Double storage during the migration. The shadow is a full copy of the production table + the new schema. On a TB-scale table this is a non-trivial resource commitment.
  • Binlog retention becomes operationally load-bearing. If the shadow can't catch up inside the retention window, the migration has to restart from scratch. Forces a tuning relationship between copy-phase duration, interleaving cadence, and binlog retention.
  • DDL semantics must be projectable by the tool, not just by MySQL. For gh-ost and VReplication, the tool has to encode the column mapping (what forward-projection turns an old-schema row into a new-schema row); complex DDL (charset changes, generated columns, reordered primary keys) can surface correctness bugs the tool has to be aware of.
  • Cut-over is still write-locked briefly. The whole migration is non-blocking except for the freeze point — a brief table-level write lock at swap time. At the proxy layer this is invisible (VTGate buffers queries); at the MySQL layer it is a real lock.

Composes with

Seen in

  • sources/2026-04-21-planetscale-behind-the-scenes-how-schema-reverts-work — canonical wiki formulation. Guevara + Noach introduce the pattern as the common denominator of all online-DDL tools before diving into the VReplication-specific extension that enables schema reverts. The four-step-shape framing ("copy the production table without data, apply the schema change to the copy, sync the data, and swap the tables") is directly quoted. The post also names the five VReplication design properties that distinguish its instantiation of the pattern from pt-online-schema-change / gh-ost: copy-and-changelog-both-tracked, GTID-precise transaction mapping, GTID-driven interleaving, transactional sidecar-state coupling, and non-termination after cut-over.
Last updated · 319 distilled / 1,201 read