CONCEPT

Blast-radius multiplier at fleet scale

Definition

Blast-radius multiplier at fleet scale is the statistical reality that, as the number of components in a fleet grows, the probability of at least one active impacting event tends rapidly toward 1 — even when any single component's failure probability is small. For a fleet of N components, each with independent probability p of being in a failing state, the probability of at least one failing is 1 − (1 − p)^N.
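The formula fits in a couple of lines of Python (a minimal sketch; the function name is ours, not from any source):

```python
def p_any_failing(p: float, n: int) -> float:
    """Probability that at least one of n independent components is
    currently failing, given per-component failure probability p."""
    return 1 - (1 - p) ** n

# Even a 1% per-component probability becomes a ~63% fleet-wide
# probability at n = 100:
print(p_any_failing(0.01, 100))   # ≈ 0.634
```

The exponent is what makes the multiplier bite: halving p roughly halves the exponent's pull, but growing N pushes the result toward 1 geometrically.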

The wiki's canonical worked example is EBS, drawn from sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs:

  • 1 gp3 volume: expected to be under 90% of provisioned IOPS 1% of the time (≈14.4 min/day).
  • 256-shard database × (1 primary + 2 replicas) = 768 gp3 volumes.
  • Under the PlanetScale post's assumptions (50% application tolerance, 10-min event length, 1%–89% uniform severity), the per-volume probability of being in an impacting window at any given moment is small — but across 768 volumes: "there is a 99.65% chance you have at least one node experiencing a production-impacting event at any given time."
  • io2 fleet of the same size: "roughly one third of the time in any given year" you'd expect the database to be in a failure condition.
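The io2 figure can be sanity-checked with the same formula, assuming io2's tighter IOPS SLO (provisioned performance 99.9% of the time, vs gp3's 99%) scales the per-volume impacting probability down roughly 10×. This is our back-of-envelope assumption, not arithmetic from the post:

```python
p_gp3 = 0.00486            # per-volume impacting probability (see Canonical arithmetic)
p_io2 = p_gp3 / 10         # assumption: ~10x lower under io2's 99.9% SLO

fleet_io2 = 1 - (1 - p_io2) ** 768
print(fleet_io2)           # ≈ 0.31 -- "roughly one third of the time"
```

The result lands close to the post's "roughly one third of the time in any given year," which suggests the 10× scaling is about the model the post had in mind.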

The operational consequence

If the fleet-wide probability of an active event is ~1, then the operational question shifts:

  • No longer: "will we have an incident today?" (yes, somewhere)
  • New question: "how fast can we detect and mitigate, and how bounded is the impact window?"

That's exactly the framing that drives patterns/automated-volume-health-monitoring + patterns/zero-downtime-reparent-on-degradation: clamp the impact-window length, because eliminating the events is not on the menu.
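A toy model shows why clamping the window is the lever that matters, using this page's event numbers (~21 impacting events per volume per month, 10-minute natural length) and a hypothetical mean detect-and-mitigate time:

```python
EVENTS_PER_VOLUME_PER_MONTH = 21   # impacting events (from Canonical arithmetic)
EVENT_MINUTES = 10                 # natural event length
FLEET = 768

def impacted_minutes_per_day(detect_and_mitigate_min: float) -> float:
    """Fleet-wide impacted minutes per day when each event's impact is
    clamped at the detect-and-mitigate time (toy model: one mitigation
    per event, no overlapping events)."""
    events_per_day = EVENTS_PER_VOLUME_PER_MONTH * FLEET / 30
    return events_per_day * min(EVENT_MINUTES, detect_and_mitigate_min)

# Ride out every event (10 min) vs. reparent within 2 min:
print(impacted_minutes_per_day(10))   # ≈ 5376
print(impacted_minutes_per_day(2))    # ≈ 1075
```

Cutting the mean mitigation time from 10 minutes to 2 cuts fleet-wide impacted minutes about 5×, without touching p or N.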

Structural patterns

The multiplier can be flattened in three directions:

  1. Reduce p structurally. Switch to a substrate with a lower per-component failure probability — e.g. direct-attached NVMe instead of network-attached block storage. See systems/planetscale-metal + patterns/direct-attached-nvme-with-replication.
  2. Reduce N. Cut the fleet size via larger nodes + fewer shards. Works up to a point (CPU / memory bottlenecks) but caps the application's scale.
  3. Break independence positively. Fate-sharing so that one failure mitigates many upstream symptoms (e.g. cluster-aware load balancing that drains degraded replicas). Doesn't reduce p × N but shortens the impact window.
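The first two levers can be compared numerically with the page's formula (the 10× substrate improvement and the 32-shard fleet are illustrative assumptions, not measured figures):

```python
def fleet_p(p: float, n: int) -> float:
    """P(at least one of n independent components is failing)."""
    return 1 - (1 - p) ** n

baseline = fleet_p(0.00486, 768)       # 256 shards x 3, network-attached storage
lower_p  = fleet_p(0.000486, 768)      # lever 1: substrate with ~10x lower p (assumed)
fewer_n  = fleet_p(0.00486, 96)        # lever 2: 32 shards x 3 (hypothetical)

print(baseline, lower_p, fewer_n)      # ≈ 0.976, 0.312, 0.374
```

Both levers pull the fleet-wide probability off its ceiling near 1; the third lever (fate-sharing) instead shortens how long each event hurts.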

Correlated failure makes the multiplier worse, not better — see concepts/correlated-ebs-failure for the EBS-AZ instance.

Sibling wiki primitives

  • concepts/tail-latency-at-scale — the latency-percentile version: as fan-out grows, a single-digit-percent tail becomes near-certain on the aggregated response time. This page is the failure-probability version: as the fleet grows, a single-digit-percent SLO window becomes near-certain at the fleet level.
  • concepts/correlated-failure — positive-correlation term that makes the independent-event formula an under-estimate.
  • concepts/noisy-neighbor — often the underlying per-component p at the storage or compute layer.

Canonical arithmetic

From the PlanetScale post (paraphrased + annotated):

Assumptions per individual volume:
  - Event rate: uniform over time, lasts 10 minutes each.
  - Severity: uniform 1%–89% of provisioned IOPS lost.
  - Application tolerance: 50% of throughput.
  => ~43 events/month total, ~21 cross the threshold.

Let p = per-volume probability of being in an impacting window
at any given moment. From 21 impacting events/month × 10 minutes
  p ≈ 21 × 10 / (30 × 24 × 60) ≈ 0.486%.

Fleet of N = 768 volumes:
  P(at least one impacting) = 1 − (1 − p)^768
                            ≈ 1 − (0.99514)^768
                            ≈ 99.7%.

Matches the post's "99.65% chance" within rounding.
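The same arithmetic in runnable form, under the same assumptions:

```python
MINUTES_PER_MONTH = 30 * 24 * 60            # 43,200
p = 21 * 10 / MINUTES_PER_MONTH             # per-volume impacting probability
fleet = 1 - (1 - p) ** 768                  # P(at least one volume impacting)

print(f"p = {p:.4%}, fleet = {fleet:.1%}")  # p ≈ 0.486%, fleet ≈ 97.6%
```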

Caveats

  • The per-volume arithmetic is pedagogical under the stated assumptions. Real EBS event distribution is not uniform over time or severity; PlanetScale doesn't publish measured distributions.
  • The formula assumes independence between volumes. On EBS, correlated-AZ-failure (see concepts/correlated-ebs-failure) makes events positively correlated, which reduces the count of distinct impacting events but usually extends the impact window per event. Net effect on user-perceived availability is workload-specific.
  • Application tolerance is the most important knob. A database that tolerates 80% throughput loss before erroring sees a much smaller fraction of events as impacting; a database that errors at 30% throughput loss sees a larger fraction. The 99.65% figure assumes 50%.
  • The pattern is not EBS-specific. It applies to any large fleet with a non-zero per-component failure rate — instance types, database shards, Lambda invocations, NVMe drives, network links, certificates, DNS entries. EBS is the canonical wiki instance because the gp3 SLO gives a clean arithmetic floor.
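The tolerance knob can be made concrete under the uniform-severity assumption. Here p is scaled from the page's 0.486% baseline at 50% tolerance (a sketch of the sensitivity, not the post's arithmetic):

```python
def impacting_fraction(tolerated_loss_pct: float) -> float:
    """Fraction of events whose severity (uniform on 1%-89% of IOPS
    lost) exceeds what the application tolerates."""
    lo, hi = 1.0, 89.0
    t = min(max(tolerated_loss_pct, lo), hi)
    return (hi - t) / (hi - lo)

BASE_P, BASE_TOLERANCE = 0.00486, 50.0   # from Canonical arithmetic

def fleet_prob(tolerated_loss_pct: float, n: int = 768) -> float:
    """Fleet-wide P(at least one impacting) at a given tolerance,
    scaling p proportionally from the 50%-tolerance baseline."""
    scale = impacting_fraction(tolerated_loss_pct) / impacting_fraction(BASE_TOLERANCE)
    return 1 - (1 - BASE_P * scale) ** n

# A hardier application (tolerates 80% loss) vs a fragile one (30%):
print(fleet_prob(80))   # ≈ 0.58
print(fleet_prob(30))   # ≈ 0.997
```

Moving the tolerance threshold is the cheapest way to shrink the effective p, because it reclassifies events rather than preventing them.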
