
Canary-shard substrate migration

Pattern

When migrating a horizontally-sharded database fleet to a new storage substrate (e.g. EBS → direct-attached NVMe, gp3 → io2, rotating disk → SSD), migrate one shard first — specifically, the busiest one — let it soak for a few days, then roll the remaining shards if the canary validates. The pattern inverts the naive cautious-rollout instinct (start with the quietest shard) in favour of maximum-signal-first.

The PlanetScale Insights instance

"To do this, we picked 1 of our 8 MySQL shards, the busiest one, to upgrade first. … Upgrading a test shard to Metal causes a substantial decrease in latency across all the measured percentiles. After the Metal upgrade, our busiest shard with the highest latencies started executing queries faster than the other shards by a significant margin. After letting the first upgrade soak for a few days, we upgraded the remaining shards and saw nearly identical improvement in performance."

(Source: sources/2026-04-21-planetscale-upgrading-query-insights-to-metal.)

PlanetScale's Insights pipeline was an 8-shard MySQL cluster driven by 800 concurrent writer threads; the busiest shard was the one with the worst latencies on EBS. Picking it first:

  1. Maximised signal-to-noise — the shard most likely to visibly improve from the substrate swap.
  2. Kept a clean baseline — 7 un-migrated shards remain as a live control group on the same real workload.
  3. Bounded blast radius — each shard is already an independent failure domain, so the migration puts at most 1/N of customers at risk at a time.
  4. Validated the new substrate under real load — canary soak under the worst workload is a better test than canary under the easiest workload.
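The control-group idea in point 2 can be sketched as a per-shard percentile comparison: the canary validates only if it beats every un-migrated shard at every measured percentile. The shard names, sample counts, and latency distributions below are illustrative, not from the post; NumPy's `percentile` does the aggregation:

```python
import numpy as np

# Hypothetical per-shard query-latency samples (ms), one array per shard.
# Shard "s3" is the migrated canary; the other seven are the live control group.
rng = np.random.default_rng(0)
latencies = {f"s{i}": rng.lognormal(mean=3.0, sigma=0.6, size=10_000) for i in range(8)}
latencies["s3"] = rng.lognormal(mean=2.2, sigma=0.5, size=10_000)  # post-migration canary

def shard_percentiles(samples, ps=(50, 90, 95, 99)):
    """p50/p90/p95/p99 for one shard's latency samples."""
    return {p: float(np.percentile(samples, p)) for p in ps}

canary = shard_percentiles(latencies["s3"])
control = {name: shard_percentiles(s) for name, s in latencies.items() if name != "s3"}

# The canary validates if it beats even the best control shard at every percentile.
for p, v in canary.items():
    best_control = min(c[p] for c in control.values())
    print(f"p{p}: canary={v:.1f}ms best-control={best_control:.1f}ms better={v < best_control}")
```

With real data the `latencies` dict would come from the metrics pipeline, one entry per shard, which is exactly the per-shard visibility the pattern depends on.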

Mechanics

  1. Pick the busiest shard — highest per-shard load by the metric the substrate swap is supposed to improve (write latency, read QPS, IOPS saturation). The noisiest shard produces the clearest before/after signal.
  2. Migrate the substrate only — no application, schema, or sharding-configuration changes. Single-variable test.
  3. Observe per-shard percentiles side-by-side — p50 / p90 / p95 / p99 graphs with one line per shard make the substrate swap visible as a single shard diverging from the pack. The Insights post's graphs ("the purple line corresponds to our busiest shard") are the canonical visualisation.
  4. Soak for a few days — validate the upgrade under:
     • Peak daily-cycle load
     • Any weekly / business-hour load patterns
     • Backup windows
     • Replication topology events (failover, reparent)
     • Any substrate-specific failure modes that only manifest over time (noisy-neighbour, correlated storage events, drive-wear artefacts)
  5. Roll out to remaining shards — if the canary soak is clean, migrate the other N−1 shards. The Insights post's outcome: "we upgraded the remaining shards and saw nearly identical improvement in performance."
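Steps 1 and 5 reduce to a selection rule and a rollout gate. A minimal sketch, assuming per-shard metrics are available as plain dicts; the shard names, p99 values, and the 20% improvement threshold are all illustrative assumptions, not from the post:

```python
# Illustrative per-shard write-latency p99s (ms) sampled before the migration.
pre = {"s0": 41.0, "s1": 38.5, "s2": 55.2, "s3": 72.9,
       "s4": 44.1, "s5": 39.8, "s6": 47.3, "s7": 50.6}

def pick_canary(per_shard_metric):
    """Step 1: the busiest/worst shard gives the clearest before/after signal."""
    return max(per_shard_metric, key=per_shard_metric.get)

def soak_is_clean(pre_p99, soak_p99s, min_improvement=0.2):
    """Step 5 gate: every daily post-migration p99 observed during the soak
    must improve on the canary's own pre-migration baseline."""
    return all(p <= pre_p99 * (1 - min_improvement) for p in soak_p99s)

canary = pick_canary(pre)                # the shard with the worst p99
soak = [31.0, 29.4, 33.2, 30.8]          # hypothetical daily p99s during the soak
rollout = [s for s in pre if s != canary] if soak_is_clean(pre[canary], soak) else []
print(canary, rollout)
```

The single-variable discipline of step 2 lives outside this sketch: between the `pre` samples and the soak samples, only the storage substrate changes.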

Why busiest-first

The common alternative (quietest-first) has two flaws:

  • Low signal: a quiet shard won't show substrate-swap benefits because it wasn't being stressed on the old substrate. Operators can't distinguish "the swap works" from "the load was never high enough to matter."
  • Worst-case blind spot: if the new substrate has a high-concurrency failure mode, a quiet shard will never trigger it. The canary has to exercise the new substrate to validate it.

Busiest-first is the empirical equivalent of load-testing under production traffic — the canary shard runs the worst-case shape of the workload on the new substrate, so the soak period is a production stress test of the migration.

Composition with sharding topology

This pattern is only possible because horizontal sharding already creates N independent failure domains: the migration operation is scoped to a single shard, and the rest of the fleet is unaffected. The same per-shard isolation underlies the alternative rollout strategies compared below.

When to use

  • Substrate migration (storage type, instance family, region, AZ layout) on a horizontally-sharded fleet.
  • Any change where per-shard metrics differ significantly — the canary signal requires per-shard visibility.
  • New substrate has real uncertainty — enough that a days-long soak is valuable vs. an immediate full rollout.

When not to use

  • Un-sharded system. No per-shard decomposition, no canary-shard surface.
  • Migration cannot be scoped to one shard. If the new substrate requires cluster-wide topology changes, this pattern doesn't apply; use [[patterns/progressive-delivery-per-database|progressive delivery per database]] instead.
  • Uneven shards. If shard sizes differ wildly (1 huge shard + 7 tiny shards), the "busiest" shard may also be the only shard where migration is risky; operators may prefer a medium-load shard first.
  • Blue/green is available at the shard level. Per-shard blue/green (dual-write + cutover) eliminates the rollback pain of an in-place substrate swap.

Trade-offs vs alternatives

| Approach | When it's better |
| --- | --- |
| Canary-shard busiest-first (this pattern) | Sharded fleet + per-shard independent substrate + real substrate uncertainty |
| Canary-shard quietest-first | Substrate is known-stable but the migration process is novel |
| All-shard simultaneous migration | Substrate + migration both known-stable; speed > caution |
| Blue/green per shard | In-place rollback is impossible; parallel old+new clusters acceptable |
| [[patterns/progressive-delivery-per-database]] | Un-sharded; need customer-level granularity instead |

Seen in

  • sources/2026-04-21-planetscale-upgrading-query-insights-to-metal — canonical instance. PlanetScale's Insights Kafka → sharded-MySQL pipeline migrated from EBS (with provisioned IOPS) to Metal direct-attached NVMe. Busiest of 8 MySQL shards upgraded first; "purple line" in the per-shard percentile plots went from worst to best across p50/p90/p95/p99. Soaked for "a few days", then the remaining 7 shards were upgraded to "nearly identical improvement in performance."