# Sharded failure-domain isolation
## Definition
Sharded failure-domain isolation is the architectural property that a horizontally sharded database's outage blast radius is bounded to the customers assigned to the failing shard, rather than the whole customer fleet. With N evenly distributed shards, a single shard going down affects ~1/N of customers; the remaining shards continue to serve traffic normally.
The property is structural — it follows from the sharding topology itself, not from any application-layer code. The shard boundary is the failure-domain boundary.
## Morrison's canonical framing
Brian Morrison II (sources/2026-04-21-planetscale-three-surprising-benefits-of-sharding-a-mysql-database, 2023-11-20) opens with the infrastructure-architecture adage:
"There's an old saying in architecting infrastructure: two is one, and one is none. The implication is that you should never have one of anything, as it creates a single point of failure. This is true for your database as well, perhaps more so since it is a critical part of your application. In a typical MySQL environment, if the database server goes down, the entire application goes down with it."
And the sharded alternative:
"In sharded environments, this failure domain is actually spread out. Consider a scenario where you shard based on ranges of customers using the customer ID. If shard A goes down, it will make a bad day for customers 1-5, but the remaining shards are actually still online and can serve data with no problem. Since the impact of an outage is more isolated, there is less of an impact on various teams across your organization as they work to communicate with customers and recover from the failure. This does not consider any lost revenue from the outage, which is also minimized."
## The three effects
- Availability containment. Only the affected shard's customers see an outage. In the PlanetScale teaching example (5 shards), 80% of customers see no impact at all. The incident is a partial outage from the company's perspective but a full outage from the affected customer's perspective.
- Response-team load reduction. The incident-response surface shrinks proportionally: customer-success communicates with 1/N of customers, engineering debugs 1/N of the data, status-page messaging is scoped to an affected-customer list. At unsharded scale, every customer is potentially affected, so every customer must be communicated to.
- Revenue-loss containment. Lost-revenue exposure from the outage is bounded to the affected-customer slice's revenue contribution — approximately 1/N under uniform distribution. This is the second-order effect the first-order availability containment buys.
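The three effects above can be sketched numerically. A minimal Python sketch, with the shard map, customer IDs, and revenue figures invented for illustration (only the 1/N arithmetic comes from the text):

```python
# Minimal sketch of the three effects for a single failed shard.
# The shard map, customer IDs, and revenue figures are invented for
# illustration; only the 1/N arithmetic comes from the text above.

def blast_radius(customers, shard_of, revenue, failed_shard):
    """Return (affected customers, customer fraction, revenue fraction)."""
    affected = [c for c in customers if shard_of(c) == failed_shard]
    lost_rev = sum(revenue[c] for c in affected)
    total_rev = sum(revenue[c] for c in customers)
    return affected, len(affected) / len(customers), lost_rev / total_rev

# Morrison-style example: 25 customers, 5 range-based shards of 5 each.
customers = list(range(1, 26))
shard_of = lambda c: (c - 1) // 5        # customers 1-5 -> shard 0, etc.
revenue = {c: 100 for c in customers}    # uniform-revenue assumption

affected, cust_frac, rev_frac = blast_radius(customers, shard_of, revenue, 0)
print(affected)   # [1, 2, 3, 4, 5] -- the scoped status-page audience
print(cust_frac)  # 0.2 -> 80% of customers see no impact
print(rev_frac)   # 0.2 -> revenue exposure bounded to 1/N
```

The same function covers all three effects: the affected list is the incident-response and status-page scope, the customer fraction is the availability containment, and the revenue fraction is the revenue-loss bound.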
## Relationship to blast radius
Sharded failure-domain isolation is the database-tier instantiation of the general blast radius architectural principle. The shard is a specific named blast-radius boundary on the standard ladder:
- AWS partition (sovereignty)
- AWS Region (availability / geography)
- AWS Availability Zone (datacentre fault)
- AWS account (security + quota + billing)
- Cluster / fleet / service deployment
- Database shard ← this concept
- Tenant (authorization inside a shared environment)
- Request
Sharding adds a structural boundary between "tenant" and "cluster": customers are partitioned into N groups of roughly (total customers)/N each, and each group's failure domain is its own shard. A tenant-level authorization boundary inside an unsharded database catches authorization bugs but not infrastructure outages; a shard boundary catches both.
## Composition with other isolation mechanisms
Sharded failure-domain isolation composes multiplicatively with other isolation primitives:
- Per-shard multi-AZ replication (patterns/multi-az-vitess-cluster) — each shard's primary + replicas are spread across AZs, so an AZ failure is caught at the sub-shard level rather than taking a shard down. Combined: an AZ failure doesn't take any shard down; a data-loss event on one shard's primary is caught by replica promotion; a cross-shard catastrophe (a shard lost entirely) affects only that shard's customers.
- Shuffle sharding — an orthogonal isolation technique applied inside a single service tier. Where sharding-by-customer-range puts all of customer X's data on shard A (failure of A = total loss for customer X), shuffle sharding distributes each customer's workload across a randomised K-sized subset of N nodes; a bad customer or a bad node can fully impact only the customers whose K-subset matches the affected one, roughly a 1-in-(N choose K) chance for any given pair. Shuffle sharding is finer-grained but doesn't address database-tier failure domains directly.
- Active multi-cluster — the same principle applied at the service-cluster tier. Sharding is the storage-tier analog.
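The shuffle-sharding overlap math can be made concrete with standard combinatorics. The node count N and subset size K below are illustrative, not from the source:

```python
from math import comb

# Shuffle-sharding sketch: N nodes, each customer pinned to a random
# K-subset. A fully poisoned subset takes out another customer entirely
# only if their subsets are identical; partial overlap only degrades them.

def full_overlap_probability(n_nodes: int, k: int) -> float:
    # Chance two customers share the exact same K-subset: 1 / C(N, K).
    return 1 / comb(n_nodes, k)

def expected_shared_nodes(n_nodes: int, k: int) -> float:
    # Expected intersection of two independent K-subsets: K^2 / N.
    return k * k / n_nodes

print(full_overlap_probability(8, 2))  # 1/28, about 0.036
print(expected_shared_nodes(8, 2))     # 0.5 nodes shared on average
```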
## When the property fails
The 1/N bound assumes uniform distribution of customers and workload across shards. Failure modes that break the bound:
- Skewed shard key — if shard A has 3× the customers of shard B, losing shard A affects 3× the blast radius of losing shard B. Range-based sharding on customer-ID is particularly vulnerable to this: signup order correlates with usage intensity, so the "newest customers" shard may hold 60% of the active traffic.
- Hot customer concentration — one whale customer can account for 90% of workload. Losing their shard is effectively losing the whole revenue base for the duration. Morrison's 1-5/6-10/11-15/16-20/21-25 illustration hides this risk; hash-based sharding mitigates but does not eliminate it.
- Cross-shard dependencies — if customer X's requests touch shards A, B, and C (via cross-shard transactions or scatter-gather queries), then losing shard A degrades customer X's experience even though they're "primarily" on shard B. The property assumes a customer's data is confined to their one shard.
- Shared infrastructure outages — if all shards share a network path, a DNS zone, an auth service, or a control plane, a failure in that shared primitive takes all shards down simultaneously. The isolation is only as strong as its weakest shared dependency. See "parts in the critical path have as few dependencies as possible" (Englander, 2025-07-03).
- Correlated failures — if all shards run on the same hardware type, same AMI, same database version, and an issue hits that version/type, all shards fail simultaneously. Sharding gives you independence of location but not implementation. This is the same failure mode that correlated EBS failure captures at the volume-level altitude.
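The first two failure modes (skewed shard keys and whale customers) reduce to the same arithmetic: blast radius is measured in workload share, not shard count. A sketch with invented weights:

```python
# Invented shard weights; the point is that the 1/N bound only holds
# when the per-shard workload weights are uniform.

def blast_radius_pct(shard_weights, failed_shard):
    """Fraction of total workload lost when one shard fails."""
    return shard_weights[failed_shard] / sum(shard_weights.values())

uniform = {s: 20 for s in "ABCDE"}                      # ideal: 20% each
skewed = {"A": 60, "B": 10, "C": 10, "D": 10, "E": 10}  # hot shard A

print(blast_radius_pct(uniform, "A"))  # 0.2 -- matches the 1/N bound
print(blast_radius_pct(skewed, "A"))   # 0.6 -- 3x the nominal blast radius
```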
## Quantitative framing
Under uniform distribution with N shards:
| N shards | Outage blast radius | Unaffected customers |
|---|---|---|
| 1 (unsharded) | 100% | 0% |
| 2 | 50% | 50% |
| 5 (Morrison's example) | 20% | 80% |
| 10 | 10% | 90% |
| 100 | 1% | 99% |
| 1000 | 0.1% | 99.9% |
The marginal isolation benefit diminishes fast: going from 1 shard → 2 shards halves exposure; going from 100 → 1000 shards drops exposure by 10× at roughly 10× the shard-management cost. The sweet spot for most workloads is 4-64 shards, with hyperscalers (Facebook, Google, PlanetScale's largest customers) running hundreds or thousands.
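The table and the diminishing-returns claim both follow directly from blast radius = 1/N under the uniform-distribution assumption; a quick check:

```python
# Uniform-distribution assumption: blast radius = 1/N.
rows = {n: 1 / n for n in (1, 2, 5, 10, 100, 1000)}
for n, blast in rows.items():
    print(f"{n:>4} shards: {blast:6.1%} blast radius, {1 - blast:6.1%} unaffected")

# Marginal benefit diminishes: each step removes less absolute exposure.
print(rows[1] - rows[2])       # 0.5 -- sharding at all halves exposure
print(rows[100] - rows[1000])  # ~0.009 -- 10x the shards, <1 point gained
```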
## Seen in
- sources/2026-04-21-planetscale-three-surprising-benefits-of-sharding-a-mysql-database — canonical wiki disclosure. Brian Morrison II (PlanetScale, 2023-11-20) names this as the first of three surprising sharding benefits, framed via the "two is one, and one is none" infrastructure-architecture adage applied at the database tier. Worked teaching example: 5-shard customer-range sharding where "if shard A goes down, it will make a bad day for customers 1-5, but the remaining shards are actually still online and can serve data with no problem." Explicitly connects the availability containment to revenue-loss containment ("lost revenue from the outage, which is also minimized").
## Related
- concepts/horizontal-sharding
- concepts/blast-radius
- concepts/isolation-as-fault-tolerance-principle
- concepts/active-multi-cluster-blast-radius
- concepts/shuffle-sharding
- concepts/shard-key
- concepts/correlated-ebs-failure
- patterns/shared-nothing-storage-topology
- patterns/multi-az-vitess-cluster
- systems/planetscale
- systems/vitess