PLANETSCALE 2023-11-20

PlanetScale — Three surprising benefits of sharding a MySQL database

Summary

Brian Morrison II's 2023-11-20 PlanetScale pedagogy post frames horizontal sharding as a scaling technique whose non-throughput benefits are often under-appreciated. Beyond the usual "shard to handle more traffic" motivation, sharding yields three distinct architectural wins: (1) reduced failure blast radius — a shard going down affects only the customers assigned to that shard, not the whole application; (2) operational ease at scale — both backups and ghost-table schema migrations run N-way in parallel across shards, dramatically reducing wall-clock time; (3) linear, cheaper cost scaling — adding shards of the same spec is predictable and avoids cloud-storage IOPS cliffs that force premium-tier volumes. The post is short pedagogy (no benchmarks, no diagrams beyond stock illustrations), but it canonicalises a framing — "two is one, and one is none" applied to the database tier — that later PlanetScale sharding posts (Dicken's backup and IOPS pieces) ground with production numbers from specific angles.

Canonical framings

"Two is one, and one is none"

Morrison opens with the infrastructure-architecture adage:

"There's an old saying in architecting infrastructure: two is one, and one is none. The implication is that you should never have one of anything, as it creates a single point of failure. This is true for your database as well, perhaps more so since it is a critical part of your application. In a typical MySQL environment, if the database server goes down, the entire application goes down with it."

This is the redundancy / failure-domain framing applied at the database tier. Unsharded = single point of failure (entire app goes down); sharded = failure domain spread such that one shard's outage only affects a slice of customers.

Worked scenario: five-shard customer-range sharding

Morrison uses a concrete teaching example: sharding by customer-ID range across five shards.

"Consider a scenario where you shard based on ranges of customers using the customer ID. If shard A goes down, it will make a bad day for customers 1-5, but the remaining shards are actually still online and can serve data with no problem."

The scenario is a clean teaching instance of sharded failure-domain isolation: a single shard's outage caps the blast radius to ~1/N of the customer base. Outage impact is contained to affected customers + their teams, and the remaining shards continue to serve traffic normally.

Morrison extends this to revenue impact: "This does not consider any lost revenue from the outage, which is also minimized." Lost-revenue exposure is ~1/N of the unsharded case under uniform customer distribution.
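
The range map in this scenario reduces to a lookup table. A minimal sketch — the shard names, the 5-customer ranges, and both helper functions are illustrative, not from the post:

```python
# Hypothetical five-shard, customer-ID-range layout from Morrison's scenario.
SHARD_RANGES = {
    "shard_a": range(1, 6),    # customers 1-5
    "shard_b": range(6, 11),   # customers 6-10
    "shard_c": range(11, 16),
    "shard_d": range(16, 21),
    "shard_e": range(21, 26),
}

def shard_for(customer_id: int) -> str:
    """Route a customer to the shard owning their ID range."""
    for shard, ids in SHARD_RANGES.items():
        if customer_id in ids:
            return shard
    raise KeyError(f"no shard owns customer {customer_id}")

def affected_customers(down_shard: str) -> list[int]:
    """Blast radius of a single-shard outage: only that shard's range."""
    return list(SHARD_RANGES[down_shard])

# shard_a going down is a bad day for customers 1-5 only;
# shards b-e keep serving their ranges normally.
print(affected_customers("shard_a"))  # [1, 2, 3, 4, 5]
```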

Operational ease — backup and schema-migration parallelism

Morrison gives the backup comparison at teaching altitude:

"Consider backing up a 1TB database. Not only does the process take a long time, but it can have a significant impact on how fast your database responds to queries. Now let's take that same database and create a sharded environment where the data is evenly split across five shards, similar to the previous example. Not only is backing up 5× 200 GB databases quicker, but if you ever have to restore data from those databases, that process will be faster as well."

This is the shard-parallel backup property at pedagogy altitude (the load-bearing production numbers are canonicalised by Dicken 2024-07-30 Faster backups with sharding: 20 TB / 32 shards / 1 h 39 min instead of ~63 h unsharded).
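
The backup comparison is simple arithmetic: parallel wall-clock is the max of the per-shard times, not the sum. A sketch with an assumed per-server backup throughput (the 100 GB/h rate is invented for illustration):

```python
# Back-of-envelope sketch of the shard-parallel backup property.
def backup_hours(size_gb: float, rate_gb_per_hour: float) -> float:
    return size_gb / rate_gb_per_hour

RATE = 100.0  # assumed per-server backup throughput, GB/h

# One 1 TB backup vs. five 200 GB backups running concurrently:
unsharded = backup_hours(1_000, RATE)
sharded = max(backup_hours(200, RATE) for _ in range(5))

print(f"unsharded: {unsharded:.1f} h, sharded wall-clock: {sharded:.1f} h")
# unsharded: 10.0 h, sharded wall-clock: 2.0 h — wall-clock scales ~1/N
```

The same max-not-sum property applies to restore, which is the direction that actually extends an outage.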

Morrison then extends the same property to ghost-table schema migrations:

"Schema migrations are another task that can be performed more efficiently. For example, when you merge in a Deploy Request on PlanetScale, we'll create a new table on the target database branch with the updated version of the schema and sync data from the live table into this 'ghost table'. Once the changes are merged in, the old table is dropped and the 'ghost table' becomes the new production table. Using the same scenario from above, performing this operation on the smaller databases in parallel will dramatically reduce the time it takes to complete."

This is a load-bearing extension: the ghost-table online DDL mechanism is bottlenecked on the full-table copy, which at 1 TB scale takes hours. Running N parallel 1/N-sized ghost-table migrations on per-shard databases dramatically reduces the wall-clock time. The same property that makes shard-parallel backup work makes shard-parallel ghost-table migration work — both are embarrassingly parallel under horizontal sharding.
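
The ghost-table flow Morrison describes can be sketched as a statement sequence. The SQL shapes below are illustrative only — PlanetScale/Vitess performs the row copy with change capture and a coordinated cutover, not a plain INSERT…SELECT:

```python
# Minimal sketch of the ghost-table online-DDL flow, in illustrative SQL.
def ghost_table_migration(table: str, ddl_change: str) -> list[str]:
    ghost = f"_{table}_ghost"
    return [
        f"CREATE TABLE {ghost} LIKE {table}",          # 1. shadow copy of the schema
        f"ALTER TABLE {ghost} {ddl_change}",           # 2. apply the change to the ghost
        f"INSERT INTO {ghost} SELECT * FROM {table}",  # 3. bulk row copy (the bottleneck)
        f"RENAME TABLE {table} TO _{table}_old, {ghost} TO {table}",  # 4. atomic swap
        f"DROP TABLE _{table}_old",                    # 5. retire the old table
    ]

# On a 5-shard deployment, step 3 copies 1/N of the rows per shard and the
# N migrations run concurrently, so wall-clock drops roughly by N.
for stmt in ghost_table_migration("customers", "ADD COLUMN plan VARCHAR(32)"):
    print(stmt)
```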

Cost — linear scaling + commodity disks

Morrison frames the naive counter-argument ("more servers = more cost") and then rebuts it on two axes.

First axis — vertical scaling's compounding overhead:

"When you provision a server, you need enough resources (CPU, memory, IOPS) to run whatever it is you are trying to run, as well as the necessary overhead to accommodate usage spikes. As the application scales, you'll eventually start reaching the limits of your server and need to bump resources along with even more overhead to support the service. This cycle continues, resulting in you always paying for more than you actually use."

The unsharded path pays overhead-on-overhead as workload grows. Sharded alternative:

"Whenever the load exceeds what the allocated resources can handle, you add another server into the environment with the same specs and rebalance the load across those servers. There may still be some overhead, but it's significantly lower than what's required when scaling vertically. Plus, since you are adding another server with the same specs, the overall cost increases more linearly and predictably, something your finance team will appreciate."
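
The two scaling paths can be put into toy arithmetic. Every number below — the headroom factor, the tier boundary, the premium multiplier, the per-shard capacity — is an assumption for illustration, not from the post:

```python
import math

HEADROOM = 1.3  # assumed ~30% provisioning headroom above current load

def vertical_cost(load_units: float, price_per_unit: float) -> float:
    """One big server sized for load + headroom; capacity past an assumed
    tier boundary (8 units) pays a 1.5x per-unit premium."""
    capacity = load_units * HEADROOM
    premium = 1.5 if capacity > 8 else 1.0
    return capacity * price_per_unit * premium

def horizontal_cost(load_units: float, price_per_unit: float,
                    shard_capacity: float = 2.0) -> float:
    """Add identical shards until capacity covers load + headroom;
    cost grows linearly in shard count."""
    shards = math.ceil(load_units * HEADROOM / shard_capacity)
    return shards * shard_capacity * price_per_unit

# Vertical starts cheaper but crosses the premium tier as load grows;
# horizontal adds the same increment each step.
for load in (2, 4, 8, 16):
    print(load, vertical_cost(load, 100), horizontal_cost(load, 100))
```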

Second axis — commodity vs premium storage:

"Another way that sharding can save you money is by utilizing commodity disks in cloud infrastructure. As your database is used more and more, it increases the demand on the underlying storage in the form of more required IOPS. Lower-cost virtual disks often have a set limit to the amount of IOPS granted to them before you have to select a more costly option. This can creep up on cloud architects if it's not accounted for. By sharding your database across multiple, lower-cost disks, you can save money by avoiding the additional costs of their more expensive counterparts."

This is the same architectural claim Dicken 2024-08-19 Increase IOPS and throughput with sharding canonicalises with specific numbers (worked $20,520 RDS + io1 vs $13,992 8-shard PlanetScale at 8× workload). Morrison's 2023-11-20 post predates Dicken's by ~9 months and canonicalises the framing without the arithmetic. Both map onto the linear-vs-superlinear cost scaling concept: cloud-storage IOPS pricing is a staircase with risers (gp3 → io1/io2 regime shift), and sharding avoids the risers by keeping each shard below the cheap-tier ceiling.
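
The staircase argument reduces to a per-shard ceiling check. In the sketch below, 16,000 is gp3's documented per-volume IOPS maximum; the aggregate workload number is illustrative:

```python
GP3_MAX_IOPS = 16_000  # AWS-documented gp3 per-volume IOPS ceiling

def fits_on_commodity_disks(aggregate_iops: int, shards: int) -> bool:
    """True if each shard's share of the IOPS demand stays under the
    cheap-tier ceiling, avoiding the io1/io2 premium tier."""
    return aggregate_iops / shards <= GP3_MAX_IOPS

workload = 100_000  # assumed aggregate IOPS demand
print(fits_on_commodity_disks(workload, 1))  # False: one volume needs io1/io2
print(fits_on_commodity_disks(workload, 8))  # True: 12,500 IOPS per shard
```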

Key takeaways

  1. Sharding's three non-throughput benefits (from the title): blast-radius reduction, operational ease, and cost predictability. The throughput benefit is the obvious one; Morrison's contribution is canonicalising that these three are comparable in engineering value to the throughput win itself.
  2. "Two is one, and one is none" applies at the database tier. Unsharded = single point of failure. Sharded = N smaller failure domains, each bounding its customer exposure to ~1/N of the fleet. First-order effect is availability; second-order effect is revenue-loss containment.
  3. Backup wall-clock scales by 1/N under sharding because backup is embarrassingly parallel. Same property applies in reverse to restore (the more operationally important direction — slow restore is an extended outage).
  4. Ghost-table schema migrations inherit the same parallelism. PlanetScale's deploy-request mechanism creates a ghost table, copies rows from live → ghost, swaps names atomically. Running N ghost-table migrations in parallel across shards reduces a 1 TB migration to N parallel 1/N-sized migrations.
  5. Cost scales linearly in shard count, not super-linearly. Each shard of the same spec adds the same incremental cost — "something your finance team will appreciate". The alternative (scaling up a single instance) pays compounding overhead as the instance crosses resource tiers.
  6. Commodity disks work when each shard's IOPS demand fits under the cheap-tier ceiling. gp3-style commodity volumes have IOPS limits; sharding's per-shard IOPS is ~1/N of the aggregate, keeping each shard below the ceiling and avoiding the premium-tier cost multiplier.
  7. The benefit dimensions are additive, not overlapping (operational ease counted twice: backups and migrations). An 8-shard deployment simultaneously (a) caps blast radius at 1/8, (b) parallelises backups 8-way, (c) parallelises schema migrations 8-way, (d) fits on 8 cheap gp3 volumes instead of 1 expensive io1 volume. A single shard-count choice amortises across all four dimensions.

Systems extracted

  • PlanetScale — the product whose customers access this set of benefits. The post is pedagogy shaped around PlanetScale's sharded MySQL offering (Vitess under the hood).
  • Vitess — the underlying sharded-MySQL substrate. Morrison doesn't name it explicitly in this post but the ghost-table + deploy-request mechanics are Vitess + PlanetScale-native.
  • AWS EBS — the "lower-cost virtual disks" + "set limit to the amount of IOPS" reference is a thinly-veiled gp3-vs-io1/io2 framing.

Concepts extracted

  • concepts/horizontal-sharding — the root technique this post advocates. Morrison's scope is specifically about benefits beyond throughput.
  • concepts/sharded-failure-domain-isolation — new canonical concept: sharding caps a single-shard outage's blast radius to ~1/N of customers, and lost-revenue exposure along with it. First benefit Morrison names.
  • concepts/blast-radius — parent framing for the failure-domain-isolation claim. Morrison's "impact of an outage is more isolated" language is a direct blast-radius framing at the database-tier altitude.
  • concepts/shard-parallel-backup — the second benefit's mechanism. Morrison doesn't give production numbers; Dicken's 2024-07-30 faster-backups post is the load-bearing reference.
  • concepts/ghost-table-migration — the specific schema-migration primitive Morrison invokes as the parallelism example. Canonical PlanetScale name for the shadow-table online DDL primitive.
  • concepts/linear-vs-superlinear-cost-scaling — the cost-framing concept. Morrison's "cost increases more linearly and predictably" + "avoiding the additional costs of their more expensive counterparts" both map onto linear-vs-staircase cost scaling.
  • concepts/iops-throttle-network-storage — the "set limit to the amount of IOPS granted" framing is the IOPS-throttle concept at pedagogy altitude. The cheap-tier IOPS ceiling is what the cost argument is built around.

Patterns extracted

None listed; patterns are referenced only in the caveats below.

Operational numbers

None. Morrison gives no benchmarks, no measured wall-clock times, no cost numbers. All arithmetic is qualitative ("5× 200 GB is quicker than 1× 1 TB") or stock illustration (the "sharding-saves-money-and-improves-performance-diagram" and "vertical-vs-horizontal-scaling-cost-comparison" images). The load-bearing production numbers live in the sibling Dicken posts (2024-07-30 faster-backups, 2024-08-19 IOPS-and-throughput).

Caveats

  • Pedagogy voice, zero numbers. Morrison's post is a short ~600-word framing piece aimed at an audience that may not yet be convinced sharding is worth it. No architecture diagrams (the two images are stock marketing illustrations), no benchmarks, no production incident content.
  • Marketing framing. Published on the PlanetScale blog, the post is pitched at developers deciding whether to adopt a sharded-MySQL vendor. Every benefit is one PlanetScale delivers. The structural claims are generic to horizontal sharding (any Vitess-based or in-house-sharded stack sees the same benefits) but the post doesn't engage with the non-PlanetScale alternatives (Vitess-self-managed, CockroachDB, TiDB, Spanner).
  • Uniform-distribution assumption. The "1/N blast radius" framing assumes customers are uniformly distributed across shards. A hot shard (one customer generating 10× the load of their peers) concentrates blast-radius risk onto that single shard's outage — the "bad day for customers 1-5" framing ignores that customer 3 might be 90% of the outage's revenue impact. See concepts/shard-key for shard-key distribution considerations.
  • Range-based sharding choice. Morrison's worked example uses range-based sharding (customers 1-5 on shard A, 6-10 on shard B, etc.), which is one of three sharding strategies — the others being directory-based and hash-based (Guevara's 2024-07-08 strategies post). Range-based is pedagogically clean but hotspot-prone in practice (customer-ID ranges often correlate with signup time, and the newest customers are typically the heaviest workloads). Hash-based sharding on customer-ID gives better failure-domain uniformity for the same N.
  • No downside acknowledgment. Morrison doesn't engage with the scatter-gather cost of cross-shard queries, the SQL-feature-set shrinkage under sharding (foreign keys, cross-shard joins, globally unique indexes), or the operational overhead of managing N replicas per shard. The post is pure benefits-framing.
  • Cross-shard schema migrations get more complex, not simpler. While per-shard ghost-table copies run in parallel, the coordination of a multi-shard schema change (ensuring all shards apply the change atomically from the application's perspective) is more complex than a single-instance DDL. See concepts/multi-shard-schema-sync + patterns/near-atomic-multi-change-deployment for the PlanetScale-canonical coordination mechanism.
  • "Commodity disks" is AWS-specific. The gp3-vs-io1/io2 tier structure is EBS-specific. GCP (Persistent Disk Balanced vs Extreme) and Azure (Premium SSD v2 vs Ultra Disk) have analogous tiers but different ceiling values; on-prem NVMe dissolves the staircase entirely (patterns/direct-attached-nvme-with-replication + systems/planetscale-metal).
  • PlanetScale Metal (2025-03) partially obsoletes the "commodity disks" framing. When storage is direct-attached NVMe rather than network-attached EBS, the IOPS-ceiling regime shift doesn't exist — so the "shard to stay on cheap disks" argument is moot for Metal customers. Morrison's post predates Metal by ~15 months; the cost-saving claim stands for standard (non-Metal) PlanetScale but the structural framing is evolving.
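
The range-vs-hash caveat above can be made concrete. In the sketch below, the shard count, range width, and hash choice are all illustrative assumptions:

```python
import hashlib

N = 5
RANGE_WIDTH = 1_000  # assumed customers per shard under range sharding

def range_shard(customer_id: int) -> int:
    """Range-based: consecutive IDs (i.e. signup order) land together."""
    return min((customer_id - 1) // RANGE_WIDTH, N - 1)

def hash_shard(customer_id: int) -> int:
    """Hash-based: placement is decorrelated from signup order."""
    digest = hashlib.sha256(str(customer_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % N

# The 25 newest signups (often the heaviest workloads) all land on the
# last shard under range sharding, but spread out under hash sharding.
newest = range(4_001, 4_026)
print(sorted({range_shard(c) for c in newest}))  # [4] — one hot shard
print(sorted({hash_shard(c) for c in newest}))   # multiple shards
```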
