
PATTERN Cited by 2 sources

Shared-nothing storage topology

Problem

On a cluster where all nodes share a storage fabric (e.g. EBS on AWS), one node's storage failure is correlated with other nodes' storage failures — see concepts/correlated-ebs-failure. The naive replication design ("put replicas on different volumes in the same AZ") does not deliver independent failure domains, because the volumes share a fabric.

At fleet scale, the probability of at least one active impacting event becomes near-certain (see concepts/blast-radius-multiplier-at-fleet-scale): a single-database-cluster example on 768 gp3 volumes in the same AZ hits 99.65% probability of an active degradation at any given moment. The fix is structural — not more health-check heuristics.
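The arithmetic behind that figure can be checked directly. A minimal sketch, where the per-volume probability (~0.73%) is an assumption back-derived from the quoted 99.65% cluster figure, not a measured number:

```python
def p_any_active_event(per_volume_p: float, n_volumes: int) -> float:
    """Probability that at least one of n independent volumes is degraded
    at a given moment: 1 - P(no volume is degraded)."""
    return 1 - (1 - per_volume_p) ** n_volumes

# An assumed ~0.73% chance that any single gp3 volume is in a degraded
# state reproduces the quoted cluster-level figure for 768 volumes:
print(f"{p_any_active_event(0.00734, 768):.2%}")  # ≈ 99.65%
```

This is why the fix has to be structural: with hundreds of volumes, any realistic per-volume degradation rate compounds to near-certainty at the cluster level.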

Solution

Give each cluster node its own storage. Share nothing at the storage layer.

Every node owns:

  • A direct-attached NVMe drive (local to the instance, no network hop).
  • Its own copy of the dataset (or its shard), maintained via application-level replication.
  • Its own OS-page-cache + storage-stack queue — no cross-node contention.

The cluster shares only the application-level replication protocol — itself a low-bandwidth, well-understood channel.

"With a shared-nothing architecture that uses local storage instead of network-attached storage like EBS, the rest of the shards and nodes in a database are able to continue to operate without problem." (Source: sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs.)

Structure

Shared-storage (EBS):                   Shared-nothing (Metal):
  ┌──────┐                                ┌──────┐ 
  │Node 1│─┐                              │Node 1│──local NVMe
  └──────┘ │                              └──────┘
  ┌──────┐ │ shared                       ┌──────┐
  │Node 2│─┼─fabric                       │Node 2│──local NVMe
  └──────┘ │                              └──────┘
  ┌──────┐ │                              ┌──────┐
  │Node 3│─┘                              │Node 3│──local NVMe
  └──────┘                                └──────┘
                                             └─ MySQL replication
     fabric event hits all 3                    protocol only
       → correlated failure                  fabric event on
                                               node 1 doesn't
                                               touch node 2/3

When to use

  • OLTP databases at fleet scale. The storage-fabric-variance-floor problem is acute for OLTP — every commit is an IO, and an IO latency spike is a user-facing incident. systems/planetscale-metal is the canonical wiki instance.
  • Replication-native workloads. The pattern works when the application already has a replication protocol (MySQL / Postgres streaming replication, Cassandra, Kafka, DynamoDB). The cluster already knows how to heal a lost node; moving to shared-nothing just means "lose a drive" triggers the same protocol.
  • Stateful workloads where local IO latency matters. Saves the ~5× network-round-trip penalty on every IO — see concepts/network-attached-storage-latency-penalty.
  • When the variance floor of the provider's network storage is a customer-visible incident. Manual health monitoring + reparent (see patterns/automated-volume-health-monitoring + patterns/zero-downtime-reparent-on-degradation) is operational bandage; shared-nothing is the structural fix.
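The latency point above can be made concrete with back-of-envelope numbers. The microsecond figures here are illustrative assumptions (not provider SLAs), chosen only to show how a per-IO penalty compounds on commit paths:

```python
# Assumed per-IO write latencies, in microseconds (illustrative only):
LOCAL_NVME_WRITE_US = 100       # direct-attached NVMe, no network hop
NETWORK_STORAGE_WRITE_US = 500  # network-attached volume, ~5x per-IO penalty

def commit_latency_us(sequential_ios: int, per_io_us: int) -> int:
    """OLTP commits are IO-bound: each sequential IO (e.g. log write,
    then fsync-style flush) pays the full per-IO cost."""
    return sequential_ios * per_io_us

print(commit_latency_us(2, LOCAL_NVME_WRITE_US))       # 200
print(commit_latency_us(2, NETWORK_STORAGE_WRITE_US))  # 1000
```

Because the penalty is paid on every IO, it cannot be amortized away for write-heavy OLTP workloads.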

When not to use

  • Stateless workloads. Stateless services don't own data, so storage fabric failure rates don't affect them directly. Pay for the elasticity of EBS; don't bother with local NVMe.
  • Elastic-capacity-required workloads. EBS volumes can be resized in place. Local NVMe is fixed at provision time — resizing requires replacing the instance. If the workload has unpredictable storage growth and can't tolerate replace-and-migrate cycles, EBS wins.
  • Workloads already well-served by read-replicas or caching. Workloads where the hot dataset fits in RAM and reads dominate can hide EBS variance behind a cache layer.

Trade-offs

  • Durability is now the application's problem. Local drives fail. The cluster replication protocol has to be robust to independent node loss + auto-detect + auto-heal. See patterns/direct-attached-nvme-with-replication.
  • Capacity elasticity is slower. Resizing = migrate to a bigger instance, not a volume modify-in-place. Still doable without downtime on a sharded cluster, but with more moving parts.
  • Backups no longer come for free from the storage layer. No EBS snapshot API; the cluster has to take its own backups.
  • Per-instance cost calculus changes. Local-NVMe EC2 instance types (i4i, i3en, im4gn) are storage-optimized and priced differently; the TCO calculation shifts from "cheap EC2 + separately priced EBS" to "all-inclusive storage instance".
  • Correlated-failure envelope shrinks but doesn't vanish. AZ-wide power event still takes down an entire AZ's nodes. Shared-nothing eliminates fabric-level correlated failure but not AZ-level.
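The durability and correlated-failure trade-offs above reduce to simple probability. A sketch under assumed, illustrative failure rates (the `p_drive` and `p_fabric` values are hypothetical, not measured):

```python
def p_all_replicas_lost(per_node_p: float, replicas: int) -> float:
    """With independent local drives, data loss requires every replica's
    drive to fail within the same repair window."""
    return per_node_p ** replicas

p_drive = 0.001   # assumed chance one local drive dies in a repair window
p_fabric = 0.001  # assumed chance of a fabric event in the same window

# Shared-nothing: failures multiply, so the loss probability collapses.
independent_loss = p_all_replicas_lost(p_drive, 3)   # ~1e-9

# Shared fabric: one event can degrade all replicas at once, so the loss
# floor stays at p_fabric itself rather than p_fabric ** 3.
correlated_loss_floor = p_fabric                     # 1e-3

print(correlated_loss_floor / independent_loss)      # ~1e6x worse
```

The same arithmetic explains why the envelope shrinks but does not vanish: an AZ-wide power event is itself a correlated term that replication within one AZ cannot multiply away.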

Seen in
