PATTERN Cited by 1 source
Notion four-step sharding migration¶
Notion's four-step framework for migrating from an unsharded database to a sharded one is Justin Gage's canonical pedagogy framing of a dual-write-plus-verify cutover, with each step named as a distinct phase. Source: Notion's engineering post "Sharding Postgres at Notion" as surfaced in (sources/2026-04-21-planetscale-what-is-database-sharding-and-how-does-it-work):
"1. Double-write: Incoming writes get applied to both the old and new databases. 2. Backfill: Once double-writing has begun, migrate the old data to the new database. 3. Verification: Ensure the integrity of data in the new database. 4. Switch-over: Actually switch to the new database. This can be done incrementally, e.g. double-reads, then migrate all reads."
The four phases¶
1. Double-write¶
New writes land in both the old (unsharded) and new (sharded) databases simultaneously. Old reads continue to be authoritative. The goal: the new database starts accumulating the tail of new writes from the double-write start moment onward, without the team having to pre-reconcile the historical data.
Double-write is the first phase not because it's the safest, but because it sets a fixed point in time from which data-level parity becomes possible: everything from T_double_write_on onward is dual-written; everything before is what backfill needs to copy.
2. Backfill¶
While double-write continues, a background process migrates historical data (everything written before T_double_write_on) from the old database to the new. The backfill reads from the old DB and writes to the new; it runs at a rate chosen to avoid impacting live traffic. On completion, the new DB holds both the historical rows and every write since double-write was enabled.
The ordering matters: double-write first, backfill second. If the order were reversed, writes that happened during the backfill window would be missing in the new database.
3. Verification¶
Before cutting traffic over, the team runs row-level integrity checks comparing old-DB and new-DB contents — row counts, checksums, sampled row comparisons, targeted query equivalence. Any divergence is investigated and closed before the next phase starts.
The verification phase is the operational gate — the moment the team decides whether the new database is trustworthy enough to become authoritative.
4. Switch-over (incremental)¶
Writes have been dual for a while; reads still go to the old DB. Switch-over flips reads over to the new DB, usually incrementally:
- Step 4a. Enable double-reads (new DB reads served but compared against old; discrepancies logged).
- Step 4b. Flip a fraction of read traffic (e.g. 1% → 10% → 50% → 100%) to the new DB as primary.
- Step 4c. Stop double-writing once new-DB reads are 100% and trusted.
- Step 4d. Decommission the old DB.
The double-reads → shift reads → stop double-write order is what makes the cutover non-destructive: at every point along the progression, the team can reverse the flip by sending reads back to the still-dual-written old DB.
Why these four steps in this order¶
Each phase establishes a precondition for the next:
| Phase | Establishes | Enables |
|---|---|---|
| Double-write | Fresh writes are in both DBs | Backfill has a bounded window |
| Backfill | Historical writes are in both DBs | Verification has complete data |
| Verification | Data parity is confirmed | Switch-over is safe to start |
| Switch-over | Reads move to new DB | Old DB can be decommissioned |
Violating the order leaves gaps (backfill before double-write → missing writes in the gap window) or unverified cutovers (switch-over before verification → trusting data you haven't checked).
Sharding vs single-DB migration¶
Gage's framing: a shard migration has more failure modes than a migration to a single new DB provider. Each phase applies to each shard in the new layout — the mental model is per-shard × 4-phase-cadence, not global-4-phase. The team can progress shards through phases at different rates, bounded by the weakest-shard's verification outcome.
Relationship to adjacent patterns¶
- Dual-write migration is the generic pattern; Notion's four-step is the specific composition with explicit verification-before-switch and incremental-switch-over steps. Dual-write-migration's Airbnb OTLP instance used the same "dual-emit → soak → flip" shape without the explicit verify step named.
- Expand-migrate-contract is the same-shape pattern at the schema-column altitude; Notion's four-step is at the database-cluster altitude. The expand = double-write, migrate = backfill, contract = switch-over-and-decomm correspondence is exact.
- Reshard-online-via-VReplication is the substrate-layer implementation of the four-step pattern at the Vitess level — VReplication performs backfill-plus-tail;
VDiffis verification;SwitchTrafficis switch-over. Teams building on Vitess get the four-step pattern for free from the substrate. - Unsharded-to-sharded migration — the migration axis this pattern addresses. Notion hand-built the pattern in 2021; by 2024-era Vitess it's a UI workflow.
When hand-rolling the pattern is still justified¶
Hand-rolling the four-step is warranted when:
- The substrate doesn't offer it. Teams on Postgres without Citus, or on custom sharded substrates, still need to hand-build the pattern.
- The data layer crosses substrates. Migrating from MySQL to a sharded Postgres deployment (or vice versa) can't use either substrate's built-in reshard.
- The cutover has external dependencies. Search indexes, CDC consumers, analytics pipelines need their own cutovers coordinated with the primary DB cutover.
Caveat: downtime remains a possibility¶
Gage's closing framing: "Each of these steps still introduces the possibility of downtime; it's just a risk you're going to have to take for changes at this scale." The four-step reduces expected downtime; it does not eliminate the possibility. Teams should have a per-phase rollback plan.
Seen in¶
- sources/2026-04-21-planetscale-what-is-database-sharding-and-how-does-it-work — Justin Gage (guest post for PlanetScale, 2023-04-06) surfaces the pattern from Notion's 2021 Sharding Postgres at Notion post and canonicalises the four-step naming + the incremental-switch-over elaboration ("This can be done incrementally, e.g. double-reads, then migrate all reads."). Gage's framing is the load-bearing contribution — the Notion post itself is one layer of indirection from the wiki.