
PATTERN Cited by 2 sources

Shared-nothing storage topology

Problem

On a cluster where all nodes share a storage fabric (e.g. EBS on AWS), one node's storage failure is correlated with other nodes' storage failures — see concepts/correlated-ebs-failure. The naive replication design ("put replicas on different volumes in the same AZ") does not deliver independent failure domains, because the volumes share a fabric.

At fleet scale, the probability of at least one active impacting event becomes near-certain (see concepts/blast-radius-multiplier-at-fleet-scale): a single-database-cluster example on 768 gp3 volumes in the same AZ hits 99.65% probability of an active degradation at any given moment. The fix is structural — not more health-check heuristics.
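The arithmetic behind that figure can be checked directly. A minimal sketch, where the per-volume probability (~0.73%) is an assumption back-derived from the quoted 99.65% cluster figure, not a measured number:

```python
def p_any_active_event(per_volume_p: float, n_volumes: int) -> float:
    """Probability that at least one of n independent volumes is degraded
    at a given moment: 1 - P(no volume is degraded)."""
    return 1 - (1 - per_volume_p) ** n_volumes

# An assumed ~0.73% chance that any single gp3 volume is in a degraded
# state reproduces the quoted cluster-level figure for 768 volumes:
print(f"{p_any_active_event(0.00734, 768):.2%}")  # ≈ 99.65%
```

This is why the fix has to be structural: with hundreds of volumes, any realistic per-volume degradation rate compounds to near-certainty at the cluster level.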

Solution

Give each cluster node its own storage. Share nothing at the storage layer.

Every node owns:

  • A direct-attached NVMe drive (local to the instance, no network hop).
  • Its own copy of the dataset (or its shard), maintained via application-level replication.
  • Its own OS-page-cache + storage-stack queue — no cross-node contention.

The cluster shares only the application-level replication protocol — itself a low-bandwidth, well-understood channel.

"With a shared-nothing architecture that uses local storage instead of network-attached storage like EBS, the rest of the shards and nodes in a database are able to continue to operate without problem." (Source: sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs.)

Structure

Shared-storage (EBS):                   Shared-nothing (Metal):
  ┌──────┐                                ┌──────┐ 
  │Node 1│─┐                              │Node 1│──local NVMe
  └──────┘ │                              └──────┘
  ┌──────┐ │ shared                       ┌──────┐
  │Node 2│─┼─fabric                       │Node 2│──local NVMe
  └──────┘ │                              └──────┘
  ┌──────┐ │                              ┌──────┐
  │Node 3│─┘                              │Node 3│──local NVMe
  └──────┘                                └──────┘
                                             └─ MySQL replication
     fabric event hits all 3                    protocol only
       → correlated failure                  fabric event on
                                               node 1 doesn't
                                               touch node 2/3

When to use

  • OLTP databases at fleet scale. The storage-fabric-variance-floor problem is acute for OLTP — every commit is an IO, and an IO latency spike is a user-facing incident. systems/planetscale-metal is the canonical wiki instance.
  • Replication-native workloads. The pattern works when the application already has a replication protocol (MySQL / Postgres streaming replication, Cassandra, Kafka, DynamoDB). The cluster already knows how to heal a lost node; moving to shared-nothing just means "lose a drive" triggers the same protocol.
  • Stateful workloads where local IO latency matters. Saves the ~5× network-round-trip penalty on every IO — see concepts/network-attached-storage-latency-penalty.
  • When the variance floor of the provider's network storage is a customer-visible incident. Manual health monitoring + reparent (see patterns/automated-volume-health-monitoring + patterns/zero-downtime-reparent-on-degradation) is operational bandage; shared-nothing is the structural fix.
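The latency point above can be made concrete with back-of-envelope numbers. The microsecond figures here are illustrative assumptions (not provider SLAs), chosen only to show how a per-IO penalty compounds on commit paths:

```python
# Assumed per-IO write latencies, in microseconds (illustrative only):
LOCAL_NVME_WRITE_US = 100       # direct-attached NVMe, no network hop
NETWORK_STORAGE_WRITE_US = 500  # network-attached volume, ~5x per-IO penalty

def commit_latency_us(sequential_ios: int, per_io_us: int) -> int:
    """OLTP commits are IO-bound: each sequential IO (e.g. log write,
    then fsync-style flush) pays the full per-IO cost."""
    return sequential_ios * per_io_us

print(commit_latency_us(2, LOCAL_NVME_WRITE_US))       # 200
print(commit_latency_us(2, NETWORK_STORAGE_WRITE_US))  # 1000
```

Because the penalty is paid on every IO, it cannot be amortized away for write-heavy OLTP workloads.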

When not to use

  • Stateless workloads. Stateless services don't own data, so storage fabric failure rates don't affect them directly. Pay for the elasticity of EBS; don't bother with local NVMe.
  • Elastic-capacity-required workloads. EBS volumes can be resized in place. Local NVMe is fixed at provision time — resizing requires replacing the instance. If the workload has unpredictable storage growth and can't tolerate replace-and-migrate cycles, EBS wins.
  • Workloads already well-served by read-replicas or caching. Workloads where the hot dataset fits in RAM and reads dominate can hide EBS variance behind a cache layer.

Trade-offs

  • Durability is now the application's problem. Local drives fail. The cluster replication protocol has to be robust to independent node loss + auto-detect + auto-heal. See patterns/direct-attached-nvme-with-replication.
  • Capacity elasticity is slower. Resizing = migrate to a bigger instance, not a volume modify-in-place. Still doable without downtime on a sharded cluster, but with more moving parts.
  • Backups no longer come for free from the storage layer. No EBS snapshot API; the cluster has to take its own backups.
  • Per-instance cost calculus changes. Local-NVMe EC2 instance types (i4i, i3en, im4gn) are storage-optimized and priced differently; the TCO calculation shifts from "cheap EC2 + separately priced EBS" to "all-inclusive storage instance".
  • Correlated-failure envelope shrinks but doesn't vanish. AZ-wide power event still takes down an entire AZ's nodes. Shared-nothing eliminates fabric-level correlated failure but not AZ-level.
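The durability and correlated-failure trade-offs above reduce to simple probability. A sketch under assumed, illustrative failure rates (the `p_drive` and `p_fabric` values are hypothetical, not measured):

```python
def p_all_replicas_lost(per_node_p: float, replicas: int) -> float:
    """With independent local drives, data loss requires every replica's
    drive to fail within the same repair window."""
    return per_node_p ** replicas

p_drive = 0.001   # assumed chance one local drive dies in a repair window
p_fabric = 0.001  # assumed chance of a fabric event in the same window

# Shared-nothing: failures multiply, so the loss probability collapses.
independent_loss = p_all_replicas_lost(p_drive, 3)   # ~1e-9

# Shared fabric: one event can degrade all replicas at once, so the loss
# floor stays at p_fabric itself rather than p_fabric ** 3.
correlated_loss_floor = p_fabric                     # 1e-3

print(correlated_loss_floor / independent_loss)      # ~1e6x worse
```

The same arithmetic explains why the envelope shrinks but does not vanish: an AZ-wide power event is itself a correlated term that replication within one AZ cannot multiply away.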

Seen in
