PATTERN
Shared-nothing storage topology¶
Problem¶
On a cluster where all nodes share a storage fabric (e.g. EBS on AWS), one node's storage failure is correlated with other nodes' storage failures — see concepts/correlated-ebs-failure. The naive replication design ("put replicas on different volumes in the same AZ") does not deliver independent failure domains, because the volumes share a fabric.
At fleet scale, the probability of at least one active impacting event becomes near-certain (see concepts/blast-radius-multiplier-at-fleet-scale): an example fleet of 768 gp3 volumes in the same AZ hits a 99.65% probability of an active degradation at any given moment. The fix is structural, not more health-check heuristics.
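The fleet-scale arithmetic is the standard independent-events calculation: if each volume is degraded with probability p at any instant, a fleet of n volumes has at least one active degradation with probability 1 − (1 − p)^n. A minimal sketch; the per-volume rate here is back-solved from the 99.65% figure for illustration, not a number quoted by the source:

```python
def p_any_degraded(per_volume_p: float, n_volumes: int) -> float:
    """Probability that at least one of n independent volumes
    is actively degraded at a given moment."""
    return 1 - (1 - per_volume_p) ** n_volumes

# Back-solve the per-volume probability implied by 99.65% across
# 768 volumes (illustrative assumption, not from the source).
implied_p = 1 - (1 - 0.9965) ** (1 / 768)

print(f"implied per-volume p: {implied_p:.4%}")   # well under 1% per volume
print(f"fleet-wide p(any):    {p_any_degraded(implied_p, 768):.2%}")
```

The point of the exercise: a per-volume degradation probability under 1% still compounds to near-certainty across a fleet, which is why more health-check heuristics can't fix it.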
Solution¶
Give each cluster node its own storage. Share nothing at the storage layer.
Every node owns:
- A direct-attached NVMe drive (local to the instance, no network hop).
- Its own copy of the dataset (or its shard), maintained via application-level replication.
- Its own OS-page-cache + storage-stack queue — no cross-node contention.
The cluster shares only the application-level replication protocol — itself a low-bandwidth, well-understood channel.
"With a shared-nothing architecture that uses local storage instead of network-attached storage like EBS, the rest of the shards and nodes in a database are able to continue to operate without problem." (Source: sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs.)
Structure¶
Shared-storage (EBS):            Shared-nothing (Metal):

┌──────┐                         ┌──────┐
│Node 1│─┐                       │Node 1│──local NVMe
└──────┘ │                       └──────┘
┌──────┐ │ shared                ┌──────┐
│Node 2│─┼─fabric                │Node 2│──local NVMe
└──────┘ │                       └──────┘
┌──────┐ │                       ┌──────┐
│Node 3│─┘                       │Node 3│──local NVMe
└──────┘                         └──────┘
                                    └─ MySQL replication
fabric event hits all 3                protocol only
→ correlated failure
                                 fabric event on node 1
                                 doesn't touch node 2/3
When to use¶
- OLTP databases at fleet scale. The storage-fabric-variance-floor problem is acute for OLTP: every commit is an IO, and an IO spike is a user-facing latency event. systems/planetscale-metal is the canonical wiki instance.
- Replication-native workloads. The pattern works when the application already has a replication protocol (MySQL / Postgres streaming replication, Cassandra, Kafka, DynamoDB). The cluster already knows how to heal a lost node; moving to shared-nothing just means "lose a drive" triggers the same protocol.
- Stateful workloads where local IO latency matters. Saves the ~5× network-round-trip penalty on every IO — see concepts/network-attached-storage-latency-penalty.
- When the variance floor of the provider's network storage is a customer-visible incident. Manual health monitoring + reparent (see patterns/automated-volume-health-monitoring + patterns/zero-downtime-reparent-on-degradation) is operational bandage; shared-nothing is the structural fix.
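The replication-native point above ("lose a drive" triggers the same protocol as "lose a node") can be sketched as a control loop. The types and promotion logic here are hypothetical simplifications, not any real orchestrator's API:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    healthy: bool = True
    is_primary: bool = False

def heal(cluster: list[Node]) -> list[Node]:
    """On local-drive failure, run the same heal path as node loss:
    reparent if the primary died, then replace the dead node and let
    application-level replication rebuild its copy of the data."""
    survivors = [n for n in cluster if n.healthy]
    if not any(n.is_primary for n in survivors):
        # promote the most caught-up replica (simplified: first survivor)
        survivors[0].is_primary = True
    # Provision a fresh node with an empty local NVMe; replication
    # streams the dataset back -- there is no shared volume to recover.
    replacement = Node(name=f"{len(cluster)}-replacement")
    return survivors + [replacement]

cluster = [Node("a", is_primary=True), Node("b"), Node("c")]
cluster[0].healthy = False            # primary's local drive fails
cluster = heal(cluster)
assert sum(n.is_primary for n in cluster) == 1   # exactly one primary
```

The design point is that no storage-specific recovery code exists: drive failure, instance failure, and AZ evacuation all reduce to the one heal path the cluster already exercises.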
When not to use¶
- Stateless workloads. Stateless services don't own data, so storage fabric failure rates don't affect them directly. Pay for the elasticity of EBS; don't bother with local NVMe.
- Elastic-capacity-required workloads. EBS volumes can be resized in place. Local NVMe is fixed at provision time — resizing requires replacing the instance. If the workload has unpredictable storage growth and can't tolerate replace-and-migrate cycles, EBS wins.
- Workloads already well-served by read-replicas or caching. Workloads where the hot dataset fits in RAM and reads dominate can hide EBS variance behind a cache layer.
Trade-offs¶
- Durability is now the application's problem. Local drives fail. The cluster replication protocol has to be robust to independent node loss + auto-detect + auto-heal. See patterns/direct-attached-nvme-with-replication.
- Capacity elasticity is slower. Resizing = migrate to a bigger instance, not a volume modify-in-place. Still doable without downtime on a sharded cluster, but with more moving parts.
- Backups no longer come for free from the storage layer. No EBS snapshot API; the cluster has to take its own backups.
- Per-instance cost calculus changes. Local-NVMe EC2 instance types (i4i, i3en, im4gn) are storage-specific and priced differently; the TCO calculation shifts from "cheap EC2 + separately priced EBS" to "all-inclusive storage instance".
- Correlated-failure envelope shrinks but doesn't vanish. An AZ-wide power event still takes down an entire AZ's nodes. Shared-nothing eliminates fabric-level correlated failure, but not AZ-level.
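The capacity-elasticity trade-off above can be sketched as the usual replica-swap cycle: resizing on shared-nothing means replacing the node, not calling a volume-modify API. The step functions and instance types below are illustrative stubs, not a real provisioning API:

```python
steps: list[str] = []   # records the cycle for inspection

def provision(itype: str) -> str:
    steps.append(f"provision {itype}")        # fresh instance, empty local NVMe
    return itype

def join_as_replica(node: str) -> None:
    steps.append(f"replicate -> {node}")      # app-level replication streams data

def wait_for_catch_up(node: str) -> None:
    steps.append(f"catch-up {node}")          # replication lag reaches zero

def reparent(old: str, new: str) -> None:
    steps.append(f"reparent {old} -> {new}")  # traffic moves, no downtime

def retire(node: str) -> None:
    steps.append(f"retire {node}")

def resize(old: str, bigger: str) -> None:
    """Replace-and-migrate: more moving parts than one EBS modify-in-place
    call, but zero downtime on a replicated cluster."""
    new = provision(bigger)
    join_as_replica(new)
    wait_for_catch_up(new)
    reparent(old, new)
    retire(old)

resize("i4i.2xlarge", "i4i.4xlarge")
```

The ordering is the whole trick: the old node serves traffic until the new one has fully caught up, so the "slower elasticity" cost is operational complexity, not downtime.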
Seen in¶
- sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs — primary source. PlanetScale's argument that shared-nothing-on-local-NVMe is the structural fix for EBS fleet-scale degradation.
- sources/2025-03-13-planetscale-io-devices-and-latency — complementary latency-side framing of the same Metal architecture.
Related¶
- patterns/direct-attached-nvme-with-replication
- patterns/automated-volume-health-monitoring
- patterns/zero-downtime-reparent-on-degradation
- concepts/correlated-ebs-failure
- concepts/performance-variance-degradation
- concepts/blast-radius-multiplier-at-fleet-scale
- concepts/network-attached-storage-latency-penalty
- concepts/compute-storage-separation
- systems/aws-ebs
- systems/planetscale-metal
- systems/vitess