Blast-radius multiplier at fleet scale¶
Definition¶
Blast-radius multiplier at fleet scale is the statistical
reality that, as the number of components in a fleet grows, the
probability of at least one active impacting event tends
rapidly toward 1 — even when any single component's failure
probability is small. For a fleet of N components, each with
independent probability p of being in a failing state, the
probability of at least one failing is 1 − (1 − p)^N.
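The formula is trivial to evaluate directly; a minimal sketch (the function name is ours, not from the source):

```python
def p_any_failing(p: float, n: int) -> float:
    """P(at least one of n independent components failing) = 1 - (1 - p)**n."""
    return 1.0 - (1.0 - p) ** n

# Even a 0.5% per-component probability is near-certainty at fleet scale:
print(f"{p_any_failing(0.005, 1):.2%}")    # single component: 0.50%
print(f"{p_any_failing(0.005, 768):.2%}")  # 768-component fleet: ~98%
```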
The wiki's canonical worked example is EBS, from sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs:
- 1 gp3 volume: expected to be under 90% of provisioned IOPS 1% of the time (14 min/day).
- 256-shard database × (1 primary + 2 replicas) = 768 gp3 volumes.
- Under the PlanetScale-post assumptions (50% application tolerance, 10-min event length, 1%–89% uniform severity), the per-volume probability of being in an impacting window at any given moment is small — but across 768 volumes: "there is a 99.65% chance you have at least one node experiencing a production-impacting event at any given time."
- io2 fleet of the same size: "roughly one third of the time in any given year" you'd expect the database to be in a failure condition.
The operational consequence¶
If the fleet-wide probability of an active event is ~1, then the operational question shifts:
- No longer: "will we have an incident today?" (yes, somewhere)
- New question: "how fast can we detect and mitigate, and how bounded is the impact window?"
That's exactly the framing that drives patterns/automated-volume-health-monitoring + patterns/zero-downtime-reparent-on-degradation: clamp the impact-window length, because eliminating the events is not on the menu.
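One more number makes the shift concrete: while P(at least one) saturates near 1, the expected count of simultaneously impacted components keeps growing linearly. A sketch, using the per-volume probability from the arithmetic section below:

```python
# By linearity of expectation, the expected number of concurrently
# impacted components is simply p * N, regardless of correlation structure.
p = 0.00486  # per-volume probability of being in an impacting window
n = 768      # fleet size
print(f"expected concurrently impacted volumes: {p * n:.1f}")  # ~3.7
```

So the fleet doesn't just have "an" active event somewhere; on average it has several, which is why detection and mitigation must be continuous rather than episodic.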
Structural patterns¶
The multiplier can be flattened in three directions:
- Reduce p structurally. Switch to a substrate with a lower per-component failure probability — e.g. direct-attached NVMe instead of network-attached block storage. See systems/planetscale-metal + patterns/direct-attached-nvme-with-replication.
- Reduce N. Cut the fleet size via larger nodes + fewer shards. Works up to a point (CPU / memory bottlenecks) but caps the application's scale.
- Break independence positively. Fate-sharing so that one failure mitigates many upstream symptoms (e.g. cluster-aware load balancing that drains degraded replicas). Doesn't reduce p × N but shortens the impact window.
Correlated failure makes the multiplier worse, not better — see concepts/correlated-ebs-failure for the EBS-AZ instance.
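The first two levers can be compared numerically. A sketch under the worked-example p; the NVMe figure is purely illustrative (an assumed 10× lower substrate rate, not a number from the post):

```python
def p_any(p: float, n: int) -> float:
    # P(at least one of n independent components impacted)
    return 1 - (1 - p) ** n

P_EBS = 0.00486       # per-volume p from the worked example
P_NVME = P_EBS / 10   # assumption for illustration only

print(f"baseline  (p=0.486%, N=768): {p_any(P_EBS, 768):.1%}")   # ~97.6%
print(f"reduce N  (p=0.486%, N=96):  {p_any(P_EBS, 96):.1%}")
print(f"reduce p  (p=0.049%, N=768): {p_any(P_NVME, 768):.1%}")
```

Note the asymmetry: cutting N by 8× or p by 10× each pulls the fleet-wide probability well below saturation, but only the p reduction does so without capping scale.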
Sibling wiki primitives¶
- concepts/tail-latency-at-scale — the latency-percentile version: as fan-out grows, a single-digit-percent tail becomes near-certain on the aggregated response time. This page is the failure-probability version: as the fleet grows, a single-digit-percent SLO window becomes near-certain at the fleet level.
- concepts/correlated-failure — positive-correlation term that makes the independent-event formula an under-estimate.
- concepts/noisy-neighbor — often the underlying per-component p at the storage or compute layer.
Canonical arithmetic¶
From the PlanetScale post (paraphrased + annotated):
Assumptions per individual volume:
- Event rate: uniform over time, lasts 10 minutes each.
- Severity: uniform 1%–89% of provisioned IOPS lost.
- Application tolerance: 50% of throughput.
=> ~43 events/month per volume in total, of which ~21 cross the tolerance threshold.
Let p = per-volume probability of being in an impacting window
at any given moment. From 21 impacting events/month × 10 minutes
p ≈ 21 × 10 / (30 × 24 × 60) ≈ 0.486%.
Fleet of N = 768 volumes:
P(at least one impacting) = 1 − (1 − p)^768
≈ 1 − (0.99514)^768
≈ 97.6%.
Close to the post's quoted "99.65%"; the residual gap presumably reflects assumption details (e.g. the exact severity-to-threshold mapping) that the post doesn't fully specify, not a problem with the formula.
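The derivation can be run end to end. A sketch under the stated assumptions (uniform events, 10-minute duration, ~21 impacting events per volume per month); it lands near the post's figure, though the post's exact inputs aren't published:

```python
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200
IMPACTING_EVENTS = 21              # per volume per month, from the post
EVENT_MINUTES = 10
N_VOLUMES = 768                    # 256 shards x (1 primary + 2 replicas)

# Per-volume probability of being inside an impacting window right now:
p = IMPACTING_EVENTS * EVENT_MINUTES / MINUTES_PER_MONTH
# Fleet-wide probability of at least one active impacting event:
fleet = 1 - (1 - p) ** N_VOLUMES
print(f"p per volume ~ {p:.3%}")                      # ~0.486%
print(f"P(at least one of {N_VOLUMES}) ~ {fleet:.2%}")
```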
Caveats¶
- The per-volume arithmetic is pedagogical under the stated assumptions. Real EBS event distribution is not uniform over time or severity; PlanetScale doesn't publish measured distributions.
- The formula assumes independence between volumes. On EBS, correlated-AZ-failure (see concepts/correlated-ebs-failure) makes events positively correlated, which reduces the count of distinct impacting events but usually extends the impact window per event. Net effect on user-perceived availability is workload-specific.
- Application tolerance is the most important knob. A database that tolerates 80% throughput loss before erroring sees a much smaller fraction of events as impacting; a database that errors at 30% throughput loss sees a larger fraction. The 99.65% figure assumes 50%.
- The pattern is not EBS-specific. It applies to any large fleet with a non-zero per-component failure rate — instance types, database shards, Lambda invocations, NVMe drives, network links, certificates, DNS entries. EBS is the canonical wiki instance because the gp3 SLO gives a clean arithmetic floor.
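The tolerance knob can be made concrete under the uniform-severity assumption: if severity is uniform on 1%–89% of IOPS lost, the fraction of events crossing a tolerance threshold t is (89 − t)/88 for t inside that range. This mapping is our simplification of the post's model, not the post's own; at t = 50 it yields ~19 events/month rather than the post's ~21, so treat it as illustrative:

```python
def impacting_fraction(tolerance_pct: float) -> float:
    """Fraction of events whose severity (uniform on 1..89% IOPS lost)
    exceeds the application's tolerance threshold."""
    lo, hi = 1.0, 89.0
    if tolerance_pct >= hi:
        return 0.0
    if tolerance_pct <= lo:
        return 1.0
    return (hi - tolerance_pct) / (hi - lo)

TOTAL_EVENTS = 43  # per volume per month, from the worked example
for t in (30, 50, 80):
    events = TOTAL_EVENTS * impacting_fraction(t)
    print(f"tolerates {t}% loss -> ~{events:.0f} impacting events/month")
```

The sweep shows the leverage: moving application tolerance from 30% to 80% loss cuts the impacting-event count by roughly 6×, which feeds directly into the per-volume p and hence the fleet-wide multiplier.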
Seen in¶
- sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs — canonical wiki instance. 768 gp3 volumes → 99.65%; same fleet on io2 → ~33% of the year in a failure condition; correlated-AZ-failure extends the multiplier further.
Related¶
- concepts/performance-variance-degradation
- concepts/tail-latency-at-scale
- concepts/correlated-ebs-failure
- concepts/slow-is-failure
- concepts/correlated-failure
- concepts/noisy-neighbor
- systems/aws-ebs
- systems/planetscale-metal
- patterns/automated-volume-health-monitoring
- patterns/zero-downtime-reparent-on-degradation
- patterns/shared-nothing-storage-topology