PATTERN
Canary-shard substrate migration¶
Pattern¶
When migrating a horizontally-sharded database fleet to a new storage substrate (e.g. EBS → direct-attached NVMe, gp3 → io2, rotating disk → SSD), migrate one shard first — specifically, the busiest one — let it soak for a few days, then roll the remaining shards if the canary validates. The pattern inverts the naive cautious-rollout instinct (start with the quietest shard) in favour of maximum-signal-first.
The PlanetScale Insights instance¶
"To do this, we picked 1 of our 8 MySQL shards, the busiest one, to upgrade first. … Upgrading a test shard to Metal causes a substantial decrease in latency across all the measured percentiles. After the Metal upgrade, our busiest shard with the highest latencies started executing queries faster than the other shards by a significant margin. After letting the first upgrade soak for a few days, we upgraded the remaining shards and saw nearly identical improvement in performance."
(Source: sources/2026-04-21-planetscale-upgrading-query-insights-to-metal.)
PlanetScale's Insights pipeline was an 8-shard MySQL cluster driven by 800 concurrent writer threads; the busiest shard was the one with the worst latencies on EBS. Picking it first:
- Maximised signal-to-noise — the shard most likely to visibly improve from the substrate swap.
- Kept a clean baseline — 7 un-migrated shards remain as a live control group on the same real workload.
- Bounded blast radius — each shard is already an independent failure domain, so a bad migration affects at most 1/N of customers.
- Validated the new substrate under real load — canary soak under the worst workload is a better test than canary under the easiest workload.
Mechanics¶
- Pick the busiest shard — highest per-shard load by the metric the substrate swap is supposed to improve (write latency, read QPS, IOPS saturation). The noisiest shard produces the clearest before/after signal.
- Migrate the substrate only — no application, schema, or sharding-configuration changes. Single-variable test.
- Observe per-shard percentiles side-by-side — p50 / p90 / p95 / p99 graphs with one line per shard make the substrate swap visible as a single shard diverging from the pack. The Insights post's per-shard graphs, where "the purple line corresponds to our busiest shard", are the canonical visualisation.
- Soak for a few days — validate the upgrade under:
- Peak daily-cycle load
- Any weekly / business-hour load patterns
- Backup windows
- Replication topology events (failover, reparent)
- Any substrate-specific failure modes that only manifest over time (noisy-neighbour, correlated storage events, drive-wear artefacts)
- Roll out to remaining shards — if the canary soak is clean, migrate the other N−1 shards. The Insights post's outcome: "we upgraded the remaining shards and saw nearly identical improvement in performance."
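The selection and rollout-ordering step above can be sketched as follows. This is a minimal illustration, not PlanetScale's tooling: the shard names, the `p99_write_ms` metric, and the `plan_rollout` helper are all hypothetical, and the latency figures are made up.

```python
# Hypothetical sketch: pick the busiest shard (worst p99 on the old
# substrate) as the canary, then order the remaining N-1 for rollout.
from dataclasses import dataclass


@dataclass
class Shard:
    name: str
    p99_write_ms: float  # per-shard p99 write latency on the old substrate


def plan_rollout(shards):
    """Busiest (worst-p99) shard first; the rest migrate after the soak."""
    ordered = sorted(shards, key=lambda s: s.p99_write_ms, reverse=True)
    return ordered[0], ordered[1:]


# Illustrative 8-shard fleet; shard-1 is the busiest.
shards = [Shard(f"shard-{i}", p99) for i, p99 in
          enumerate([12.0, 35.5, 9.8, 14.1, 11.2, 10.4, 13.7, 12.9])]
canary, rest = plan_rollout(shards)
# Migrate `canary`, soak for a few days, then migrate `rest`
# only if the canary's percentiles hold up.
```

The single-variable discipline matters here: `plan_rollout` orders shards by the one metric the swap is supposed to improve, so any divergence in the per-shard percentile graphs is attributable to the substrate.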
Why busiest-first¶
The common alternative (quietest-first) has two flaws:
- Low signal: a quiet shard won't show substrate-swap benefits because it wasn't being stressed on the old substrate. Operators can't distinguish "the swap works" from "the load was never high enough to matter."
- Worst-case blind spot: if the new substrate has a high-concurrency failure mode, a quiet shard will never trigger it. The canary has to exercise the new substrate to validate it.
Busiest-first is the empirical equivalent of load-testing under production traffic — the canary shard runs the worst-case shape of the workload on the new substrate, so the soak period is a production stress test of the migration.
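A toy model makes the signal argument concrete. Assume (purely for illustration; none of these numbers come from the source) that per-query latency is a fixed compute cost plus storage wait per I/O, and that the substrate swap cuts storage wait from 0.5 ms to 0.1 ms. A quiet shard issuing one I/O per query barely moves; a busy shard queueing twenty I/Os per query shows a dramatic shift:

```python
# Toy model: why the busiest shard yields the clearest before/after signal.
# All numbers are illustrative, not measurements from the Insights post.
def latency_ms(base_ms, ios_per_query, storage_wait_ms):
    # fixed compute cost + per-I/O storage wait
    return base_ms + ios_per_query * storage_wait_ms

quiet_before = latency_ms(2.0, ios_per_query=1, storage_wait_ms=0.5)   # 2.5 ms
quiet_after  = latency_ms(2.0, ios_per_query=1, storage_wait_ms=0.1)   # 2.1 ms
busy_before  = latency_ms(2.0, ios_per_query=20, storage_wait_ms=0.5)  # 12.0 ms
busy_after   = latency_ms(2.0, ios_per_query=20, storage_wait_ms=0.1)  # 4.0 ms

quiet_delta = 1 - quiet_after / quiet_before  # ~16% — easily lost in noise
busy_delta  = 1 - busy_after / busy_before    # ~67% — unambiguous signal
```

In this model the quiet shard's improvement is small enough to disappear into normal day-to-day variance, while the busy shard's improvement is unmistakable — which is exactly the "maximum-signal-first" inversion the pattern describes.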
Composition with sharding topology¶
This pattern is only possible because horizontal sharding already creates N independent failure domains. The migration operation is scoped to a single shard; the rest of the fleet is unaffected. This composes cleanly with:
- patterns/sharding-as-iops-scaling — the same shard decomposition that spreads IOPS cost is what enables per-shard substrate migration.
- concepts/sharded-failure-domain-isolation — blast radius of a bad migration is capped at 1/N.
- patterns/operator-scheduled-cutover — the per-shard cutover is a per-shard Vitess operator operation.
- concepts/shard-parallel-backup — if the migration requires a backup/restore to the new substrate, each shard can migrate in parallel at the limit.
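Because the post-canary shards are independent failure domains, their migrations need not be serialised. A minimal sketch of the parallel tail of the rollout, where `migrate_shard` is a placeholder for whatever per-shard backup/restore/cutover operation the real migration uses:

```python
# Illustrative sketch: after the canary soak validates, the remaining
# N-1 shards can migrate concurrently, since each is an independent
# failure domain. migrate_shard() stands in for the real operation.
from concurrent.futures import ThreadPoolExecutor


def migrate_shard(name: str) -> str:
    # placeholder: backup on old substrate, restore onto new, cut over
    return f"{name}: migrated"


remaining = [f"shard-{i}" for i in range(2, 9)]  # canary already migrated
with ThreadPoolExecutor(max_workers=len(remaining)) as pool:
    results = list(pool.map(migrate_shard, remaining))
```

Whether the tail actually runs fully parallel or in waves is an operational choice; the point is that nothing in the topology forces it to be sequential.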
When to use¶
- Substrate migration (storage type, instance family, region, AZ layout) on a horizontally-sharded fleet.
- Any change where per-shard metrics differ significantly — the canary signal requires per-shard visibility.
- New substrate has real uncertainty — enough that a days-long soak is valuable vs. an immediate full rollout.
When not to use¶
- Un-sharded system. No per-shard decomposition, no canary-shard surface.
- Migration cannot be scoped to one shard. If the new substrate requires cluster-wide topology changes, this pattern doesn't apply; use [[patterns/progressive-delivery-per-database|progressive delivery per database]] instead.
- Uneven shards. If shard sizes differ wildly (1 huge shard + 7 tiny shards), the "busiest" shard may also be the only shard where migration is risky; operators may prefer a medium-load shard first.
- Blue/green is available at the shard level. Per-shard blue/green (dual-write + cutover) eliminates the rollback pain of an in-place substrate swap.
Trade-offs vs alternatives¶
| Approach | When it's better |
|---|---|
| Canary-shard busiest-first (this pattern) | Sharded fleet + per-shard independent substrate + real substrate uncertainty |
| Canary-shard quietest-first | Substrate is known-stable but the migration process is novel |
| All-shard simultaneous migration | Substrate + migration both known-stable; speed > caution |
| Blue/green per shard | In-place rollback is impossible; parallel old+new clusters acceptable |
| patterns/progressive-delivery-per-database | Un-sharded; need customer-level granularity instead |
Seen in¶
- sources/2026-04-21-planetscale-upgrading-query-insights-to-metal — canonical instance. PlanetScale's Insights Kafka → sharded-MySQL pipeline migrated from EBS (with provisioned IOPS) to Metal direct-attached NVMe. Busiest of 8 MySQL shards upgraded first; "purple line" in the per-shard percentile plots went from worst to best across p50/p90/p95/p99. Soaked for "a few days", then the remaining 7 shards were upgraded to "nearly identical improvement in performance."
Related¶
- patterns/direct-attached-nvme-with-replication
- patterns/sharding-as-iops-scaling
- patterns/progressive-delivery-per-database
- patterns/operator-scheduled-cutover
- concepts/horizontal-sharding
- concepts/sharded-failure-domain-isolation
- concepts/io-latency-sensitive-workload
- concepts/shard-parallel-backup
- systems/planetscale-metal
- systems/vitess