EC2 launch failure mode
Definition
The EC2 launch failure mode is the operational fault class in
which running EC2 instances continue to serve traffic normally
but RunInstances (or equivalent launch API calls) cannot succeed
in an affected region, so any workflow that implicitly provisions
a new instance fails or queues until capacity returns.
It is a control-plane failure with data-plane consequences: the already-running data plane is fine, but every process that depends on adding instances at runtime is broken until the regional EC2 control plane can satisfy new launches.
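As a concrete illustration, here is a minimal boto3 probe of the asymmetry (the region, AMI ID, and instance type are placeholders, not details from any incident): the read path over running instances keeps answering while the launch path errors or hangs.

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

# Running capacity: describing existing instances still works; the
# already-launched fleet is healthy and serving.
running = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)
print("running reservations:", len(running["Reservations"]))

# Growth: during this failure mode the launch call itself fails or queues,
# regardless of which AMI, instance type, or AZ is requested.
try:
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="m5.large",          # placeholder SKU
        MinCount=1,
        MaxCount=1,
    )
except ClientError as err:
    print("launch blocked:", err.response["Error"]["Code"])
```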
Why it matters in design
Most availability models reason about instance loss — what happens when a running VM dies. EC2 launch failure is the mirror-image fault: no loss of running capacity, but no ability to grow.
Three categories of workflow become part of the blast radius:
- Reactive autoscalers that add capacity under load. Any reactive or predictive autoscaler that launches instances in response to load cannot satisfy its target during the event. The fleet holds its current size through the outage; peak traffic hits with whatever size the fleet had when launches stopped.
- Replacement-launch workflows. ASG health replacement, bad-host auto-drain, fleet-drain-and-retire pipelines, rolling deploys that launch a new instance before terminating the old one — all queue or fail.
- Workflows that implicitly launch an instance. This is the subtle category: any operation whose mechanism includes "launch a fresh EC2 and do X on it" becomes part of the blast radius even if the operator thought of it as a data-plane operation. Canonical example from PlanetScale 2025-10-20: backup. PlanetScale's standard backup procedure "launches an additional replica which restores the previous backup and catches up on replication before taking a new backup to avoid reducing the capacity and fault-tolerance of the database during backups." Every scheduled backup during the event needed an EC2 that couldn't be launched.
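A sketch of why such a backup workflow lands in the blast radius, under the assumption that it is orchestrated with boto3; the restore, catch-up, and backup helpers are hypothetical stand-ins for the database-side steps, and the point is only that step 1 is a RunInstances call:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical placeholders for the database-side steps described above.
def restore_previous_backup(instance_id: str, database_id: str) -> None: ...
def catch_up_replication(instance_id: str, database_id: str) -> None: ...
def take_new_backup(instance_id: str, database_id: str) -> None: ...

def run_scheduled_backup(database_id: str) -> None:
    # Step 1: launch a throwaway replica host so the backup never reduces
    # the capacity or fault tolerance of the serving fleet. This is the
    # hidden dependency on the EC2 launch path.
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="r6i.2xlarge",       # placeholder SKU
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # Steps 2-4 never run if step 1 queues or fails: the whole backup job
    # is blocked even though it reads like a data-plane operation.
    restore_previous_backup(instance_id, database_id)
    catch_up_replication(instance_id, database_id)
    take_new_backup(instance_id, database_id)

    ec2.terminate_instances(InstanceIds=[instance_id])
```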
Canonical PlanetScale framing
Verbatim from the 2025-10-20 post-mortem:
Customers could attempt to create or resize database branches but, because we could not launch new EC2 instances, these requests could not be completed; they remained queued until the incident was resolved. Their existing MySQL or Postgres servers remained available while requests to launch new EC2 instances were queued.
The key distinction is crisp: "existing servers remained available while requests to launch new EC2 instances were queued." Running capacity is OK; growing capacity is not.
Contrast with more familiar fault modes
| Failure | Running capacity | New capacity |
|---|---|---|
| AZ outage | Instances in AZ lost | Possible in surviving AZs |
| EC2 launch failure | OK | Blocked regionally |
| Regional outage | All lost | All blocked |
| Account / limit exhaustion | OK | Blocked until limit raised |
| Control-plane outage | OK | API-dependent — may be blocked for all mutating ops |
EC2 launch failure is specifically the row where running capacity is fine but new capacity is blocked region-wide: a surgical asymmetry in which the steady state is intact but elastic growth is gone.
Operational implications
The architectural response converges on three moves:
- Conserve what you have. Stop routine churn (see patterns/suspend-routine-capacity-churn-during-dependency-outage; a minimal sketch follows this list): pause 30-day drain-and-terminate cycles, hold vacated instances for reuse, cancel pending backups that would need a new EC2.
- Densify existing capacity. Bin-pack tighter than steady state to cover expected peak load (see patterns/conservative-capacity-bin-packing-during-incident).
- Shed elastic demand. Redirect new-resource creation to unaffected regions; advise customers with diurnal autoscalers to shed non-critical load.
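The first move can be mechanized wherever replacement launches are driven by Auto Scaling groups. A sketch, assuming boto3 and ASG-managed fleets (the group names are illustrative; PlanetScale's actual tooling is not described at this level): suspending the churn-generating processes stops the group from terminating instances whose replacements could not be launched.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Processes that generate routine churn, i.e. terminate instances on the
# assumption that a replacement can always be launched.
CHURN_PROCESSES = ["ReplaceUnhealthy", "Terminate", "InstanceRefresh"]

def freeze_churn(asg_names: list[str]) -> None:
    """Pause replacement-launch churn for the duration of the outage."""
    for name in asg_names:
        autoscaling.suspend_processes(
            AutoScalingGroupName=name,
            ScalingProcesses=CHURN_PROCESSES,
        )

def resume_churn(asg_names: list[str]) -> None:
    """Restore normal behavior once launches succeed again."""
    for name in asg_names:
        autoscaling.resume_processes(
            AutoScalingGroupName=name,
            ScalingProcesses=CHURN_PROCESSES,
        )

# freeze_churn(["vtgate-fleet", "mysql-replica-fleet"])  # illustrative names
```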
Distinguishing from capacity exhaustion
- Capacity exhaustion (InsufficientInstanceCapacity) is scoped to the specific request: the requested instance type isn't available in the requested AZ at that moment. Heterogeneous instance-type pools (e.g. Karpenter's multi-SKU NodePools) mitigate it.
- EC2 launch failure mode as documented in the 2025-10-20 incident is region-wide: it's not about SKU availability but about the regional launch path being broken upstream. No amount of SKU flexibility helps; the launch itself can't complete.
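A sketch of the difference in code, assuming boto3 (the SKU list and IDs are illustrative): SKU fallback recovers from InsufficientInstanceCapacity because that error is scoped to one instance type in one AZ, but when the regional launch path is broken every attempt fails the same way and the fallback loop simply exhausts its list.

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

# Illustrative multi-SKU fallback pool, in the spirit of a heterogeneous
# Karpenter NodePool.
FALLBACK_SKUS = ["m5.large", "m5a.large", "m6i.large", "c5.large"]

def launch_with_sku_fallback(image_id: str, subnet_id: str) -> str:
    last_error = None
    for sku in FALLBACK_SKUS:
        try:
            resp = ec2.run_instances(
                ImageId=image_id,
                InstanceType=sku,
                SubnetId=subnet_id,
                MinCount=1,
                MaxCount=1,
            )
            return resp["Instances"][0]["InstanceId"]
        except ClientError as err:
            if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                # SKU- and AZ-scoped shortage: the next type may succeed.
                last_error = err
                continue
            # Any other launch failure (e.g. a broken regional launch path)
            # is not fixed by trying another SKU.
            raise
    raise last_error
```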
Seen in
- sources/2025-11-03-planetscale-aws-us-east-1-incident-2025-10-20 — PlanetScale, Richard Crowley, 2025-11-03. Canonical wiki disclosure of this fault class. Phase 2 of the 2025-10-20 AWS us-east-1 incident: ~10 h 27 min window during which PlanetScale could not launch new EC2 instances in us-east-1. Running MySQL / Postgres servers remained available; database creation, branch resize, and the launch-a-replica backup workflow queued. Names the three workflow categories above and the operator-side response (redirect default region, freeze drain-and-terminate, hold vacated instances, delay + cancel pending backups, bin-pack vtgate tighter).
Related
- systems/aws-ec2 — the substrate whose launch API is the failure boundary.
- concepts/control-plane-impact-without-data-plane-impact — the clean outcome-shape from phase 1 of the same incident; this concept is the phase-2 variant where data plane is partially affected because some data-plane workflows need fresh EC2.
- concepts/control-plane-data-plane-separation — the design principle being tested.
- concepts/blast-radius — what this fault class enlarges beyond naive expectation.
- patterns/suspend-routine-capacity-churn-during-dependency-outage — response pattern.
- patterns/conservative-capacity-bin-packing-during-incident — response pattern.
- patterns/shed-load-during-capacity-shortage — response pattern.
- concepts/diurnal-autoscaling-risk — the amplifier when this fault class coincides with a scheduled scale-up window.