CONCEPT Cited by 12 sources
Blast radius¶
What it is¶
Blast radius is the scope of damage that a single fault — bug, misconfiguration, vulnerability, runaway workload, compromised credential — can reach before isolation mechanisms contain it. System design choices trade off blast-radius size against operational complexity and cost:
- Smaller blast radius ⇒ safer failure modes, but more boundaries to provision/maintain/monitor.
- Larger blast radius ⇒ simpler ops, cheaper substrate, but a single fault can affect many customers / regions / products.
The term is the concrete unit architects use when discussing isolation trade-offs: "what's the blast radius of this change?" is the design-review heuristic for rolling out anything risky.
Load-bearing in account-per-tenant isolation¶
One of the five benefits ProGlove cites for account-per-tenant is stated as the shared-account risk it avoids:
"An accidental misconfiguration or vulnerability could expose multiple tenants."
This is a blast-radius framing. The AWS account boundary limits the blast radius of any per-tenant fault to exactly one tenant's data, compute, IAM, and network. In contrast, the shared-account shape makes the blast radius the whole tenant fleet. (Source: sources/2026-02-25-aws-6000-accounts-three-people-one-platform)
Named blast-radius boundaries in the wiki¶
Roughly ordered by decreasing size:
- AWS partition — sovereignty / jurisdictional boundary; concepts/aws-partition.
- AWS Region — availability / geographic boundary.
- AWS Availability Zone — DC-fault boundary.
- AWS account — security + quota + billing + IAM boundary. (concepts/account-per-tenant-isolation)
- Kubernetes cluster / EC2 fleet / service deployment — operational-unit boundary. (concepts/active-multi-cluster-blast-radius, patterns/multi-cluster-active-active-redundancy)
- Tenant — authorization boundary inside a shared environment. (concepts/tenant-isolation)
- Database shard — storage-tier failure-domain boundary; concepts/sharded-failure-domain-isolation. A shard's outage caps customer impact at ~1/N of the fleet when customers are distributed uniformly across N shards. It sits one step below cluster-level isolation and one step above tenant-level authorization.
- Request — the unit a single fault might affect.
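The ~1/N figure quoted for shards above depends on the uniform-distribution assumption. A minimal sketch of that arithmetic follows; the fleet size, customer names, and the `impacted_fraction` helper are hypothetical and chosen only for illustration.

```python
from collections import Counter


def impacted_fraction(placement, failed_unit):
    """Return the share of customers placed on the unit that failed."""
    if not placement:
        raise ValueError("placement must contain at least one customer")
    per_unit = Counter(placement.values())
    return per_unit[failed_unit] / len(placement)


# Hypothetical fleet: 40 customers spread evenly over N = 8 shards.
uniform = {f"customer-{i}": f"shard-{i % 8}" for i in range(40)}

# Same fleet, but customers 1..7 have been moved onto shard-0 (a hot shard).
skewed = dict(uniform)
for i in range(1, 8):
    skewed[f"customer-{i}"] = "shard-0"

uniform_impact = impacted_fraction(uniform, "shard-0")  # 5 of 40 customers
skewed_impact = impacted_fraction(skewed, "shard-0")    # 12 of 40 customers
```

With even placement the failed shard carries 1/8 of the customers; once placement is skewed, the same boundary count no longer bounds impact at 1/N, which is why the uniform-distribution qualifier matters.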
Related framings in the wiki¶
- concepts/active-multi-cluster-blast-radius — same concept at the service/cluster level: running services across multiple active clusters so that any single cluster failure doesn't take the whole service down.
- concepts/performance-isolation — performance-domain cousin; "how large a perf regression can one tenant cause for another?"
- patterns/cross-partition-failover — blast-radius response at the largest possible scope (partition-level human-disaster containment).
Caveats¶
- A blast radius of zero is not always the goal. Shrinking the radius costs more (more accounts, more clusters, more separate fleets). The right size is set by regulatory requirements, customer commitments, and economic feasibility, not by a universal "smaller is better" rule.
- Structural boundaries vs application-level boundaries. An account boundary is load-bearing whether or not the application is multi-tenant-aware; a per-request tenant_id check is only load-bearing if the code path actually enforces it. Structural boundaries are the ones that survive bugs.
- Cross-boundary primitives re-introduce risk. Anything that works "across" a blast-radius boundary — cross-account IAM roles, multi-region replication, shared admin tooling — lives outside the boundary and must be hardened separately. ProGlove's explicit call-out: "Observability tooling should be centralized, but without reintroducing the very risks that accounts are meant to isolate."
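The structural-vs-application caveat above can be made concrete with a small sketch of an application-level boundary. Everything here (`Record`, `TenantScopedStore`, `TenantMismatch`) is a hypothetical illustration, not code from any cited source. The tenant_id check only protects reads that go through `get`; any code path that touches the underlying mapping directly bypasses it.

```python
from dataclasses import dataclass


class TenantMismatch(Exception):
    """Raised when a caller asks for a record owned by another tenant."""


@dataclass(frozen=True)
class Record:
    tenant_id: str
    payload: str


class TenantScopedStore:
    """Hypothetical shared store guarded by a per-request tenant_id check."""

    def __init__(self, records):
        self._records = dict(records)

    def get(self, caller_tenant_id, key):
        record = self._records[key]
        # Application-level boundary: holds only if every read path runs this check.
        if record.tenant_id != caller_tenant_id:
            raise TenantMismatch(f"{caller_tenant_id} may not read {key}")
        return record


store = TenantScopedStore({"k1": Record("tenant-a", "a-data")})
own_read = store.get("tenant-a", "k1")      # allowed: same tenant
leaked = store._records["k1"]               # a path that skips the check entirely
```

The last line is the failure mode the caveat describes: nothing structural stops it, so the boundary is only as strong as the discipline of every caller. A separate account or cluster per tenant would leave no such shared mapping to reach into.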
Seen in¶
- sources/2025-06-20-redpanda-behind-the-scenes-redpanda-clouds-response-to-the-gcp-outage — canonical wiki instance of cell-based architecture as an explicit blast-radius reduction strategy. Redpanda's single-binary broker + per-customer cluster deployment bounds the blast radius of any fault to one customer's cluster; the 2025-06-12 GCP global outage validated the design — "Out of hundreds of clusters, we were lucky that only one cluster was affected." Explicit contrast with "centralized metadata and diskless" streaming architectures that "likely experienced the full weight of this global outage."
- Canonical wiki application of blast-radius containment to the webhook-sender service tier (Mike Coutermarsh, 2023-11-21, re-fetched 2026-04-21). PlanetScale runs its webhook sender as a dedicated Kubernetes service on isolated machines so webhook-specific abuse (SSRF attempts, amplification floods, slow-receiver resource-tie-up) is structurally bounded to the webhook tier. Verbatim: "In the event that our other mitigations fail, we run our webhooks queue on isolated machines to protect against webhooks impacting the availability of other PlanetScale services. … If our webhooks are being abused, we do not want that to impact the reliability of the rest of our systems. They can be easily paused or disabled in the event of an incident." Adds the isolated-service-per-failure-mode rung to the blast-radius ladder at the Kubernetes-service altitude — one step below fleet/cluster and one step above tenant. Composes with patterns/isolated-egress-proxy-for-user-urls and patterns/defense-in-depth-webhook-abuse-mitigation.
- Brian Morrison II (PlanetScale, 2023-11-20) canonicalises horizontal sharding as a database-tier blast-radius primitive. The sharded-failure-domain framing: "in sharded environments, this failure domain is actually spread out. [...] If shard A goes down, it will make a bad day for customers 1-5, but the remaining shards are actually still online and can serve data with no problem." Adds the shard-rung to the blast-radius ladder above — a structural boundary below cluster and above tenant. Introduces the "two is one, and one is none" infrastructure adage as the load-bearing motivation for putting a shard boundary between customer and database. Also names the revenue-loss-containment corollary: a 1/N availability blast radius implies a ~1/N revenue-loss blast radius under uniform distribution. See concepts/sharded-failure-domain-isolation for the full concept page.
- sources/2026-02-25-aws-6000-accounts-three-people-one-platform — ProGlove's primary framing for the account-per-tenant architectural choice; blast-radius containment is one of five named benefits.
- sources/2026-01-30-aws-sovereign-failover-design-digital-sovereignty — partition-level blast radius (human-disaster / geopolitical containment) addressed by patterns/cross-partition-failover.
- sources/2025-09-25-mongodb-carrying-complexity-delivering-agility — MongoDB's dedicated-cluster framing of architectural isolation is explicitly a blast-radius containment claim: "with a MongoDB Atlas dedicated cluster… the blast radius of a problem elsewhere stops at your door." The anti-shared-wall stance is a boundary-per-tenant-at-VM-and-VPC rung on the blast-radius ladder, one step below account-per-tenant.
- sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system — blast-radius reasoning stacked at two altitudes simultaneously: (1) concepts/shuffle-sharding inside a cluster bounds any single tenant's damage to their K-node shuffle set; (2) multi-cluster architecture with dedicated clusters for specialised workloads (compute / mesh / application tiers) bounds cluster-scoped failures to 1/N of the fleet. Rollout ordering (patterns/progressive-cluster-rollout) bounds the blast radius of deploy regressions along a criticality axis — test → internal → application → infrastructure clusters, with infra last so a regression doesn't cause "flying blind".
- sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption — Blast-radius reasoning at the fleet-configuration-management substrate altitude. Slack's phase-2 Chef rollout splits a single prod Chef environment into six AZ-bucketed environments (prod-1…prod-6), mapping each new instance to its AZ's bucket at boot via Poptart Bootstrap. The load-bearing insight is that per-node cron staggering doesn't protect the newly-provisioned-nodes axis: "any newly provisioned nodes would immediately pick up the latest (possibly bad) changes from that shared environment. This became a significant reliability risk, especially during large scale-out events, where dozens or hundreds of nodes could start up with a broken configuration." The AZ-bucketed split is a cell-based-architecture instantiation at the configuration-management altitude — see concepts/az-bucketed-environment-split and patterns/split-environment-per-az-for-blast-radius. The post also canonicalises the architectural ceiling of the pattern (per-service isolation) as the motivation for building Shipyard as the legacy EC2 platform's successor.
- sources/2024-12-12-stripe-the-secret-life-of-dns-packets-investigating-complex-networks — Stripe's central DNS-server cluster as a canonical choke-point at infrastructure boundary failure mode. The AWS VPC resolver's 1,024-pps-per-ENI rate limit turns any small cluster of ENIs that fronts DNS for a large fleet into a hard saturation ceiling; a Hadoop reverse-DNS workload filled it every hour. The fix is a blast-radius-reduction move at the topological altitude: distribute the DNS-forwarding workload off a few central ENIs onto every application host's local Unbound, so one ENI's saturation is localised rather than fleet-wide. See patterns/distribute-dns-load-to-host-resolver.
- concepts/active-multi-cluster-blast-radius — related cluster-level shape.
- sources/2024-01-22-zalando-tale-of-metadpata-the-revenge-of-the-supertools — Blast-radius framing at the destructive-automation altitude. Zalando's 2024-01 postmortem of a November 2022 DNS outage names a new class — the supertool: an application with fleet-wide destructive authority that runs on a normal-looking config change. A single p-typo (metadata→metadpata) collapsed an account-lifecycle job's account-in-scope set to empty, which its decommission logic interpreted as "all accounts", triggering Route 53 hosted-zone deletion fleet-wide. The 5-layer containment stack Zalando shipped after the incident is a blast-radius-reduction recipe: scream test (1-week reversible Network ACL + DNS delegation removal), cost-weighted deletion deferral (low-savings resources excluded from automation entirely), PR preview of CloudFormation ChangeSet (per-account delta visible in review), triple-redundant jsonschema validation (IDE + pre-commit + CI on one schema), and phased rollout across release channels (playground → test → infra → production). Adds a new rung to the blast-radius ladder at the destructive-automation-applied-across-accounts altitude — see concepts/destructive-automation-blast-radius.
- sources/2026-04-24-atlassian-rovo-dev-driven-development — five-lever blast-radius safety net for agent-driven development. The Fireworks post explicitly reframes blast-radius as the load-bearing safety net when manual code review is no longer the primary correctness gate. Five named levers: "CI/CD pipelines (automated quality gate), Sharding (limit the blast radius of any single change), RBAC / JIT access (control who — and what — can write), Progressive rollouts & canary deploys across multiple clusters, AI-written e2e tests (primary validation harness)." Canonical wiki disclosure of sharding + RBAC + JIT + canary rollout as the defence-in-depth stack specifically targeted at LLM-written production code; "dev shards" (patterns/dev-shard-iteration-loop) provide the per-developer blast-radius bulkhead during development, and "canary deploys across multiple clusters" provide the production-rollout bulkhead. patterns/rbac-jit-as-agent-safety-net names the access-control lever explicitly. First wiki framing of blast-radius as the access-control-and-exposure tier that compensates for weakened manual-review gates in agentic development.
- sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures — Canonical quantified production-AZ-outage instance of cell-as-blast-radius-unit at the serverless-database regional-composition altitude. systems/lakebase composes each region from N identically-shaped cells (each a complete Kubernetes + control plane + compute + storage stack); during the 2026-05-08 AWS us-east-1 thermal-event AZ outage, "one of the cells had issues failing over to healthy nodes. The impact was contained to that cell. The other seven cells in the region failed over correctly, so the incident affected only ~13% of databases in the region. In this case, the cell-based architecture reduced the impact by roughly an order of magnitude." The canonical-instance datums: 8 cells in us-east-1 (~13% impact = ~1/8 = the cell ratio); 7 cells failed over correctly + 1 imperfectly during the AZ event. New rung in the blast-radius ladder at the regional-cell-composition altitude — one step below "AWS Region" (since cells are within-region) and one step above "Kubernetes cluster" (since each cell is a Kubernetes cluster + supporting stack). Distinct from Redpanda's per-customer-cluster cells (customer-bounded) and from AWS Hybrid Multi-Tenant cells (AWS-account-bounded) — Lakebase's cells are fleet-bounded multi-customer cells inside a single regional AWS account. First wiki canonical instance of:
  - Quantified production-AZ-outage blast-radius validation on a serverless-database regional architecture.
  - Cell as scaling-unit-and-blast-radius-unit jointly — the cell boundary is sized by Kubernetes + control-plane scaling limits, not by a customer / AWS-account boundary.
  - Whole-AZ-network-partition drill regime sized to one cell — see concepts/whole-az-network-partition-simulation + patterns/whole-az-network-partition-drill; the cell boundary is what makes the drill safe.
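The Zalando supertool entry in the list above turns on one detail: an empty account-in-scope set was interpreted as "all accounts". The sketch below shows one way a destructive job could refuse that interpretation. It is not Zalando's implementation; `resolve_decommission_scope`, `ScopeError`, and the `max_fraction` cap are hypothetical names and assumptions introduced here for illustration.

```python
class ScopeError(Exception):
    """Raised when a destructive job's target set looks unsafe."""


def resolve_decommission_scope(all_accounts, selected, max_fraction=0.1):
    """Return the accounts to decommission, rejecting suspicious scopes.

    max_fraction is an assumed safety cap, not something from the postmortem.
    """
    known = set(all_accounts)
    targets = set(selected)
    if not targets:
        # Never fall back to "everything" when the selection is empty.
        raise ScopeError("empty selection; refusing to treat it as all accounts")
    unknown = targets - known
    if unknown:
        raise ScopeError(f"unknown accounts in selection: {sorted(unknown)}")
    if len(targets) > max_fraction * len(known):
        raise ScopeError("selection exceeds the allowed share of the fleet")
    return sorted(targets)


# Hypothetical fleet of 20 accounts; one account is marked for decommission.
fleet = [f"acct-{i}" for i in range(20)]
plan = resolve_decommission_scope(fleet, ["acct-3"])
```

The design choice is that the job fails closed: an empty or oversized target set stops the run and needs a human decision, so a config typo bounds the damage to zero accounts instead of the whole fleet.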