
CONCEPT Cited by 1 source

EC2 launch failure mode

Definition

The EC2 launch failure mode is the operational fault class in which running EC2 instances continue to serve traffic normally but RunInstances (or equivalent launch API calls) cannot succeed in the affected region, so any workflow that implicitly provisions a new instance fails or queues until launches succeed again.

It is a control-plane-ish failure with data-plane consequences: the already-running data plane is fine, but every process that depends on adding instances at runtime is broken until the regional EC2 control plane can satisfy new launches.

Why it matters in design

Most availability models reason about instance loss — what happens when a running VM dies. EC2 launch failure is the mirror-image fault: no loss of running capacity, but no ability to grow.
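The asymmetry can be illustrated with a toy simulation. This is a minimal sketch, not a model of any real autoscaler: `simulate_fleet`, the demand series, and the `blocked_ticks` window are all hypothetical names and values chosen for illustration, and the fleet only ever scales out.

```python
from typing import Iterable, List, Set


def simulate_fleet(demand: Iterable[int], start_size: int,
                   blocked_ticks: Set[int]) -> List[int]:
    """Return the fleet size after each tick for a scale-out-only fleet.

    Running instances are never lost; during blocked ticks the fleet
    simply cannot grow toward the demanded size.
    """
    size = start_size
    sizes = []
    for tick, wanted in enumerate(demand):
        if wanted > size and tick not in blocked_ticks:
            size = wanted  # launches succeed: grow to the target
        sizes.append(size)
    return sizes


demand = [4, 6, 8, 8, 5]
healthy = simulate_fleet(demand, 4, blocked_ticks=set())
incident = simulate_fleet(demand, 4, blocked_ticks={1, 2})
print(healthy)   # [4, 6, 8, 8, 8]
print(incident)  # [4, 4, 4, 8, 8]
```

In the incident run the fleet never shrinks, but it sits at its pre-incident size through the demand peak and only catches up once launches work again.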

Three categories of workflow become part of the blast radius:

  1. Autoscalers that add capacity under load. Any reactive or predictive autoscaler that launches instances in response to load cannot reach its target during the event. The fleet holds its current size through the outage; peak traffic hits with whatever size the fleet had when launches stopped.
  2. Replacement-launch workflows. ASG health replacement, bad-host auto-drain, fleet-drain-and-retire pipelines, rolling deploys that launch a new instance before terminating the old one — all queue or fail.
  3. Workflows that implicitly launch an instance. This is the subtle category: any operation whose mechanism includes "launch a fresh EC2 and do X on it" becomes part of the blast radius even if the operator thought of it as a data-plane operation. Canonical example from PlanetScale 2025-10-20: backup. PlanetScale's standard backup procedure "launches an additional replica which restores the previous backup and catches up on replication before taking a new backup to avoid reducing the capacity and fault-tolerance of the database during backups." Every scheduled backup during the event needed an EC2 that couldn't be launched.
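One way to surface the third category ahead of time is to inventory workflows by whether any step launches an instance. The sketch below assumes a hand-maintained inventory; the `Workflow` type, the `launches_instance` flag, and the workflow names are hypothetical and not taken from the PlanetScale post-mortem.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Workflow:
    name: str
    launches_instance: bool  # True if any step provisions a fresh EC2 instance


def blast_radius(workflows: Iterable[Workflow]) -> List[str]:
    """Names of workflows that fail or queue while launches are blocked."""
    return [w.name for w in workflows if w.launches_instance]


# Hypothetical inventory for illustration only.
WORKFLOWS = [
    Workflow("serve-queries", launches_instance=False),
    Workflow("autoscale-out", launches_instance=True),
    Workflow("health-replacement", launches_instance=True),
    # Looks like a data-plane task, but it restores onto a new replica.
    Workflow("scheduled-backup", launches_instance=True),
]

print(blast_radius(WORKFLOWS))
```

The point of the flag is that it is set from the workflow's mechanism, not from how operators think of it, which is what puts the backup job on the list.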

Canonical PlanetScale framing

Verbatim from the 2025-10-20 post-mortem:

Customers could attempt to create or resize database branches but, because we could not launch new EC2 instances, these requests could not be completed; they remained queued until the incident was resolved. Their existing MySQL or Postgres servers remained available while requests to launch new EC2 instances were queued.

The key distinction is crisp: "existing servers remained available while requests to launch new EC2 instances were queued." Running capacity is OK; growing capacity is not.
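The queue-until-resolved behavior described in the quote can be sketched as follows. This is an illustrative model only: `LaunchQueue`, `LaunchBlocked`, and `fake_launch` are hypothetical names, and nothing here reflects how PlanetScale actually implements its request queue.

```python
from collections import deque
from typing import Callable, Deque, List


class LaunchBlocked(Exception):
    """Hypothetical error standing in for a failed launch call."""


class LaunchQueue:
    def __init__(self, launch: Callable[[str], str]) -> None:
        self._launch = launch
        self._pending: Deque[str] = deque()

    def submit(self, request_id: str) -> None:
        self._pending.append(request_id)

    def drain(self) -> List[str]:
        """Try pending requests in order; stop at the first blocked launch."""
        completed = []
        while self._pending:
            request_id = self._pending[0]
            try:
                self._launch(request_id)
            except LaunchBlocked:
                break  # leave this and later requests queued
            self._pending.popleft()
            completed.append(request_id)
        return completed

    @property
    def pending(self) -> List[str]:
        return list(self._pending)


state = {"launches_ok": False}  # toggled to simulate the incident ending


def fake_launch(request_id: str) -> str:
    if not state["launches_ok"]:
        raise LaunchBlocked(request_id)
    return f"i-{request_id}"


queue = LaunchQueue(fake_launch)
queue.submit("create-branch-1")
queue.submit("resize-branch-2")
during_incident = queue.drain()   # nothing completes
still_pending = queue.pending     # both requests remain queued
state["launches_ok"] = True       # incident resolved
after_incident = queue.drain()    # both requests complete in order
```

Requests are never dropped while launches are blocked; they complete in submission order once the launch path recovers.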

Contrast with more familiar fault modes

Failure                    | Running capacity      | New capacity
---------------------------+-----------------------+---------------------------------------------------
AZ outage                  | Instances in AZ lost  | Possible in surviving AZs
EC2 launch failure         | OK                    | Blocked regionally
Regional outage            | All lost              | All blocked
Account / limit exhaustion | OK                    | Blocked until limit raised
Control-plane outage       | OK                    | API-dependent; may be blocked for all mutating ops

EC2 launch failure is specifically the row where running capacity is OK but new capacity is blocked regionally: a surgical asymmetry where the steady state is intact but elastic growth is gone.

Operational implications

The architectural response converges on three moves:

Distinguishing from capacity exhaustion

  • Capacity exhaustion (InsufficientInstanceCapacity) is scoped to an instance type and AZ: AWS does not have enough capacity of the requested instance type in the requested AZ at that moment. It is not caused by the account's own limits. Heterogeneous instance-type pools (e.g. Karpenter's multi-SKU NodePools) mitigate it.
  • EC2 launch failure mode as documented in the 2025-10-20 incident is region-wide: it's not about SKU availability but about the regional launch path being broken upstream. No amount of SKU flexibility helps; the launch itself can't complete.
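A launcher can act on this distinction by branching on the error code returned by the launch call. The sketch below is an assumption-laden illustration: the code strings are documented EC2 API error codes, but the source does not say which errors the 2025-10-20 incident actually returned, so the grouping and the `classify_launch_error` helper are hypothetical.

```python
# Grouping of EC2 API error codes; the grouping itself is an assumption.
CAPACITY_CODES = {"InsufficientInstanceCapacity"}
ACCOUNT_LIMIT_CODES = {"InstanceLimitExceeded", "VcpuLimitExceeded"}
SERVICE_SIDE_CODES = {"InternalError", "ServiceUnavailable", "Unavailable"}

RESPONSES = {
    "capacity": "try another instance type or AZ",
    "account_limit": "request a quota increase",
    "launch_path": "queue and retry later; SKU flexibility will not help",
    "unknown": "inspect the error before retrying",
}


def classify_launch_error(code: str) -> str:
    """Map a launch error code to a coarse fault class."""
    if code in CAPACITY_CODES:
        return "capacity"
    if code in ACCOUNT_LIMIT_CODES:
        return "account_limit"
    if code in SERVICE_SIDE_CODES:
        return "launch_path"
    return "unknown"


for code in ("InsufficientInstanceCapacity", "VcpuLimitExceeded", "Unavailable"):
    fault = classify_launch_error(code)
    print(f"{code}: {fault} -> {RESPONSES[fault]}")
```

In practice the code would come from the exception raised by the SDK call; it is passed in as a plain string here to keep the sketch self-contained.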

Seen in

  • sources/2025-11-03-planetscale-aws-us-east-1-incident-2025-10-20 — PlanetScale, Richard Crowley, 2025-11-03. Canonical wiki disclosure of this fault class. Phase 2 of the 2025-10-20 AWS us-east-1 incident: ~10 h 27 min window during which PlanetScale could not launch new EC2 instances in us-east-1. Running MySQL / Postgres servers remained available; database creation, branch resize, and the launch-a-replica backup workflow queued. Names the three workflow categories above and the operator-side response (redirect default region, freeze drain-and-terminate, hold vacated instances, delay + cancel pending backups, bin-pack vtgate tighter).