
PLANETSCALE 2025-11-03

PlanetScale — Report on our investigation of the 2025-10-20 incident in AWS us-east-1

Richard Crowley (PlanetScale, 2025-11-03). Post-mortem of the October 20, 2025 AWS us-east-1 incident as it impacted PlanetScale. The incident ran through two distinct phases with different failure shapes: Phase 1 was a control-plane outage driven by a DNS misconfiguration in an upstream SaaS provider that cascaded through S3 → STS → DynamoDB dependencies; Phase 2 was an EC2 capacity crunch plus partial cross-AZ network partitions that affected a subset of customer databases. Canonical for the wiki because it is a rare first-person disclosure of a production control-plane / data-plane separation surviving a real upstream outage — PlanetScale's design let databases stay up while its own control plane was dark.

TL;DR / Summary

  • Timeline: 07:13 UTC phase-1 alerts → 09:30 UTC phase-1 resolved → 10:05 UTC phase-2 alerts (EC2-launch failures) → 14:30 UTC network partitions observed → 18:30–19:30 UTC partitions heal → 20:32 UTC incident commander declares resolved.
  • Phase 1 (control plane): the PlanetScale control plane depends on internal secret-distribution → Amazon S3 → AWS STS → Amazon DynamoDB. When the upstream DNS misconfiguration took DynamoDB down, the chain broke. Customer databases were unaffected — the data plane kept serving queries. The dashboard (hosted on a SaaS provider also in us-east-1) and SSO were intermittent; the status page itself was unavailable for at least 30 minutes.
  • Phase 2 (capacity + network): PlanetScale could not launch new EC2 instances in us-east-1. Requests to create / resize databases queued; existing MySQL / Postgres servers kept running. Diurnal autoscalers for vtgate (which ramp up before US East Coast peak) became the main risk surface.
  • Operator interventions (phase 2): disallow new databases in us-east-1 (re-default to us-east-2); delay + cancel pending backups (backup is a launch-EC2-and-restore flow); advise customers with vtgate autoscaling to shed load (pause ETLs, delay queues); pause the continuous drain-and-terminate loop for >30-day instances; stop terminating vacant EC2 instances (hold for reuse); and — most important — temporarily bin-pack vtgate processes tighter than usual, running closer to CPU capacity to provide peak capacity without launching.
  • Partial partitions: 14:30–19:30 UTC, some databases were reachable from the Internet but could not communicate cross-AZ for query routing or replication; some replicas could reach container registries but could not replicate from their primary; some had internal-DNS resolution failures. Manual reparents moved primaries to healthier AZs or co-located with the customer's app.
  • Recovery long tail: a small number of edge load-balancer and vtgate processes did not self-recover after partitions healed and had to be restarted manually.

Key takeaways

1. Control-plane / data-plane separation held

PlanetScale frames this in its existing Principles of Extreme Fault Tolerance ("isolation and static stability"). The test of that principle was whether a total internal control-plane outage would touch customer databases. It did not: "Throughout this period, no database branches lost capacity or connectivity." This is the canonical wiki example of control-plane impact without data-plane impact — see also the generalised control plane / data plane separation page.

2. SaaS + cloud-service dependency chain was the control plane's single point of failure

The control-plane failure chain was: "The service responsible for creating, resizing, and configuring database branches, which is hosted in AWS us-east-1, was unavailable. It depends on our internal secret-distribution service which depends on Amazon S3 which depends on AWS STS which was impacted by the Amazon DynamoDB outage." Verbatim dependency graph:

new-branch/resize/config service
  └─ internal secret-distribution service
        └─ Amazon S3
              └─ AWS STS
                    └─ Amazon DynamoDB   ← DNS misconfig here

Four transitive hops between the PlanetScale control plane and the thing that actually broke. Canonicalised as runtime dependency on SaaS provider; extends systems/aws-s3 (consumer-side dependency surface), systems/aws-sts (inside the S3 dependency), systems/dynamodb (first wiki ingest recording its outage as the 2025-10-20 originating event).
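Why chain depth matters can be shown with standard serial-availability arithmetic (the per-hop figures below are illustrative, not PlanetScale's or AWS's real numbers): a service that hard-depends on a chain of upstreams is only as available as the product of every hop.

```python
# Illustrative sketch: availability of a serial dependency chain.
# A hard runtime dependency means ALL hops must be up simultaneously,
# so per-hop availabilities multiply. Numbers are assumptions.

def chain_availability(hop_availabilities):
    """Availability of a service whose hops must all be up (serial chain)."""
    a = 1.0
    for hop in hop_availabilities:
        a *= hop
    return a

# Four transitive hops, each 99.9% available on its own:
hops = [0.999, 0.999, 0.999, 0.999]
print(round(chain_availability(hops), 5))  # 0.99601 — each hop compounds the risk
```

The point is structural, not numeric: every transitive hop added to a control-plane path multiplies in another party's failure modes.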

PlanetScale's remediation commitment (verbatim): "We are taking steps to better understand and become resilient to the failure modes of SaaS we depend on, including for CI/CD, SSO, Web application hosting and incident communication. We are investigating more ambitious ways to reduce our runtime dependence on both internal and AWS services."

3. EC2-launch failure is a distinct failure mode from EC2 running

Phase 2 canonicalises — for the first time in the wiki — the operational distinction that "inability to launch new instances" is a separate surface from "running instances broken." Verbatim: "Customers could attempt to create or resize database branches but, because we could not launch new EC2 instances, these requests could not be completed; they remained queued until the incident was resolved. Their existing MySQL or Postgres servers remained available while requests to launch new EC2 instances were queued."

The operational implication: every workflow that implicitly launches an instance becomes part of the blast radius. At PlanetScale, three workflows were affected:

  1. New database branch creation (launches new EC2).
  2. Branch resize (launches new EC2 at the new size).
  3. Backup — PlanetScale's backup procedure "launches an additional replica which restores the previous backup and catches up on replication before taking a new backup to avoid reducing the capacity and fault-tolerance of the database during backups."

Canonicalised as EC2 launch failure mode; see also suspend routine capacity churn during dependency outage for the response pattern.
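The queue-don't-fail behaviour described in the verbatim quote can be sketched as follows. All names here are invented for illustration; PlanetScale did not disclose its implementation.

```python
# Hypothetical sketch (assumed names, not PlanetScale code): control-plane
# requests that implicitly launch an EC2 instance are queued, not rejected,
# while the provider cannot fulfil launches, then drained on recovery.

from collections import deque

class LaunchQueue:
    def __init__(self, can_launch):
        self.can_launch = can_launch   # callable probing launch health
        self.pending = deque()

    def submit(self, request):
        if self.can_launch():
            return f"launched:{request}"
        self.pending.append(request)   # held until the incident resolves
        return f"queued:{request}"

    def drain(self):
        """Retry queued requests once launches succeed again."""
        completed = []
        while self.pending and self.can_launch():
            completed.append(f"launched:{self.pending.popleft()}")
        return completed

launches_ok = [False]
q = LaunchQueue(lambda: launches_ok[0])
print(q.submit("create-branch"))   # queued:create-branch
print(q.submit("resize-branch"))   # queued:resize-branch
launches_ok[0] = True
print(q.drain())                   # both complete after recovery
```

The design choice worth noting: queuing preserves customer intent across a multi-hour outage, at the cost of a recovery-time backlog.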

4. Diurnal autoscaling was the specific risk amplifier

Verbatim framing: "Given that the US East Coast was about to start their Monday, the inability to launch new EC2 instances presented a risk to some of our largest customers who use diurnal autoscaling for the vtgate component of their Vitess clusters. Some were going to be coming into their peak weekly traffic with less than half the vtgate capacity they had the week prior."

Three things make this canonical:

  1. It names which component autoscales: systems/vtgate, the Vitess stateless query router — not the database itself. (MySQL primaries don't scale out; vtgate is the elastic tier.)
  2. It names the time correlation: incident coincides with the pre-peak scale-up window → the autoscaler wants to double capacity in a window where launching any new instance is impossible.
  3. It quantifies the risk: "less than half the vtgate capacity they had the week prior."

Canonicalised as diurnal autoscaling risk.
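The "<50% of prior-week capacity" exposure is just the ratio of the overnight trough to the prior peak. The fleet sizes below are invented to make the arithmetic concrete; the post gives no actual counts.

```python
# Illustrative arithmetic only (assumed fleet sizes): a diurnal autoscaler
# scales vtgate in overnight and back out before peak. Freeze launches at
# the trough and the customer enters Monday peak with the trough fleet.

off_peak_vtgates = 10      # overnight trough after scale-in
prior_week_peak = 24       # what the fleet grew to last Monday

frozen_fraction = off_peak_vtgates / prior_week_peak
print(f"{frozen_fraction:.0%} of prior-week peak")  # 42% of prior-week peak
```

Any trough-to-peak ratio under 0.5 reproduces the post's "less than half" risk statement.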

5. Operator response: mint inventory from what's already running

With no ability to add capacity, PlanetScale's playbook converged on conserving and densifying existing capacity:

  • Freeze the 30-day drain-and-terminate loop. Normally PlanetScale continuously drains EC2 instances older than 30 days and terminates them (presumably to cycle instance generations). That entire loop was paused. (Pattern: patterns/suspend-routine-capacity-churn-during-dependency-outage.)
  • Hold terminating-vacant instances for reuse instead of terminating.
  • Delay scheduled backups + cancel pending backups waiting to launch a replica. (Pattern: patterns/shed-load-during-capacity-shortage.)
  • Redirect new databases to us-east-2 (change default region, disallow creation in us-east-1).
  • Most important: bin-pack vtgate processes tighter than normal. Verbatim: "we bin-packed vtgate processes more tightly than usual, running closer to CPU capacity than is typical, in order to provide ample capacity for the US work day." This is canonical for conservative capacity bin-packing during incident — trade headroom for peak coverage, reverse when launch capability returns.
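The bin-packing trade can be sketched with a first-fit-decreasing heuristic. Host sizes, process sizes, and the packing limits below are all assumptions; PlanetScale did not disclose how the tighter packing was implemented.

```python
# Sketch of the headroom-for-peak trade (invented numbers, not PlanetScale's
# scheduler): raising the per-host CPU packing limit lets the same hosts
# absorb more vtgate processes without launching new instances.

def hosts_needed(process_cpus, host_capacity, limit):
    """First-fit-decreasing packing; returns the number of hosts used.

    limit is the fraction of host CPU the packer is allowed to fill.
    """
    hosts = []  # current CPU load per host
    for cpu in sorted(process_cpus, reverse=True):
        for i, load in enumerate(hosts):
            if load + cpu <= host_capacity * limit:
                hosts[i] = load + cpu
                break
        else:
            hosts.append(cpu)   # no host fits; open a new one
    return len(hosts)

vtgates = [2.0] * 12  # twelve vtgate processes, 2 vCPU each (assumed)
normal = hosts_needed(vtgates, host_capacity=8.0, limit=0.75)  # usual headroom
tight  = hosts_needed(vtgates, host_capacity=8.0, limit=1.0)   # incident mode
print(normal, tight)  # 4 3 — same fleet, one host's worth of capacity freed
```

The reverse move — re-spreading once launches work again — is implied by the "trade headroom for peak coverage" framing.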

6. Partial network partitions are a genuinely distinct failure

Verbatim: "Some database servers were reachable from the Internet but couldn't communicate across availability zones for query routing, replication, or both. Some replicas could reach container registries when they started up but could not replicate from their primary MySQL or Postgres. Some servers had trouble resolving internal DNS names and others had trouble connecting to the internal services those DNS names resolved."

Three distinct manifestations of a partial partition in a single paragraph:

  1. Internet reachable + cross-AZ unreachable (split between public and private-network fabrics).
  2. Container registry reachable + primary unreachable (split between two private-network destinations).
  3. Internal-DNS split (DNS resolves fine in some paths, fails in others).

PlanetScale's mitigation where possible: "we manually sent reparent requests to move primary databases to availability zones known to be healthier or known to be colocated with the customer's application." Canonicalised as zonal reparenting to healthy AZ — reactive, operator-driven, not automatic.
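The reparent heuristic the post describes — move primaries to healthier AZs, or co-locate with the customer's application — can be sketched as a selection function. AZ names, probe values, and the tie-breaking rule are invented for illustration.

```python
# Hedged sketch of the zonal-reparenting decision (assumed inputs): pick the
# AZ with the best observed cross-AZ reachability, preferring the AZ that
# hosts the customer's app when it is equally healthy.

def pick_primary_az(reachability, app_az=None):
    """reachability: az -> fraction of peers that AZ can currently reach."""
    best = max(reachability.values())
    if app_az is not None and reachability.get(app_az, 0.0) >= best:
        return app_az                       # co-locate when it's also healthiest
    return max(reachability, key=reachability.get)

probes = {"use1-az1": 0.40, "use1-az2": 1.00, "use1-az4": 0.85}
print(pick_primary_az(probes))                      # use1-az2
print(pick_primary_az(probes, app_az="use1-az1"))   # use1-az2 (app AZ unhealthy)
```

As the summary notes, the real mitigation was reactive and operator-driven — this sketch only captures the selection criterion, not any automation.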

7. Some processes did not self-recover after partitions healed

Verbatim: "Once the network partitions healed, we found a small number of processes (PlanetScale's edge load balancer as well as vtgate) which were not able to recover on their own due to the way they experienced the network partition. We restarted these and restored service."

The generalisation: surviving a partition is not the same as recovering from one. Some processes acquire state during a partition (stuck TCP connections, stale DNS caches, confused leader-election state) that only a restart clears. Worth distinguishing from split-brain, which is a disagreement problem; this is a stuck-connection problem. No new page for this yet — noted here as a candidate concept if it recurs.
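The candidate pattern — probe after the partition heals, restart only what is still failing — can be sketched minimally. Process names and the health predicate are invented; the post only says a "small number" of edge load-balancer and vtgate processes needed restarts.

```python
# Minimal sketch (assumed names): post-partition sweep that restarts only
# the processes whose health checks still fail after the network recovers,
# since stuck connections / stale caches persist until a restart clears them.

def recover(processes, healthy, restart_log):
    restarted = []
    for p in processes:
        if not healthy(p):
            restart_log.append(p)   # restart clears state acquired mid-partition
            restarted.append(p)
    return restarted

stuck = {"edge-lb-3", "vtgate-7"}
log = []
print(recover(["edge-lb-1", "edge-lb-3", "vtgate-7"],
              lambda p: p not in stuck, log))  # ['edge-lb-3', 'vtgate-7']
```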

8. AZ topology observation: us-east-1 has six AZs

Closing paragraph verbatim: "Per AWS's Well-Architected Framework, the use of three availability zones allows us to tolerate the failure of one but only if network connectivity between the other two remains reliable. AWS us-east-1 happens to have six availability zones and we're looking into how PlanetScale can better use them all to become more resilient to both zonal outages and network partitions between them."

Two canonical points: (1) the 3-AZ quorum design tolerates a single AZ failure only if the remaining two AZs stay connected — partial partitions between two of three AZs break the assumption; (2) using more than 3 AZs when available changes the probability surface. This is a future-direction remark (no specific new design surfaced in the post), noted to extend concepts/availability-zone-balance.
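The arithmetic behind point (1) is the standard majority-quorum formula, not anything PlanetScale-specific: n zones tolerate floor((n-1)/2) zone failures, and a partition between the survivors can still break quorum.

```python
# Standard majority-quorum arithmetic (not a PlanetScale formula): the number
# of availability-zone failures a quorum spread over n_azs can tolerate.

def tolerated_failures(n_azs):
    return (n_azs - 1) // 2

print(tolerated_failures(3))  # 1: the classic 3-AZ design in the quote
print(tolerated_failures(5))  # 2: spreading over 5 of us-east-1's 6 AZs
```

This makes the caveat in the quote concrete: with 3 AZs, losing one leaves exactly a bare majority, so any partition between the remaining two is fatal to quorum; more AZs add slack between "zone lost" and "quorum lost".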

Numbers

  • Incident duration, phase 1: 07:13–09:30 UTC (~2 h 17 min).
  • Gap between phases: ~35 minutes (09:30 → 10:05 UTC).
  • Incident duration, phase 2: 10:05–20:32 UTC (~10 h 27 min).
  • Network partition window: ~14:30–19:30 UTC (~5 hours), with gradual healing 18:30–19:30 UTC.
  • Total incident duration: ~13 hours 19 minutes (07:13 → 20:32 UTC).
  • Status page unavailable for at least 30 minutes during phase 1.
  • Customers on diurnal autoscaling risked <50% of prior-week vtgate capacity entering Monday peak.
  • us-east-1 AZ count: 6 (most AWS regions have 3; this is a forward-looking number for PlanetScale's resilience roadmap).
  • Dependency chain depth control-plane → DynamoDB: 4 hops.
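The durations above follow directly from the timeline's UTC timestamps; a quick cross-check:

```python
# Cross-check of the phase durations from the timeline timestamps (UTC,
# all within one calendar day, so plain time-of-day subtraction suffices).
from datetime import datetime

def span(start, end):
    fmt = "%H:%M"
    d = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    h, m = divmod(d.seconds // 60, 60)
    return f"{h} h {m} min"

print(span("07:13", "09:30"))  # 2 h 17 min  (phase 1)
print(span("10:05", "20:32"))  # 10 h 27 min (phase 2)
print(span("07:13", "20:32"))  # 13 h 19 min (total)
```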

Architectural disclosures

Control-plane dependency graph (verbatim reconstruction)

  • Branch creation / resize / config service (in us-east-1)
  • → Internal secret-distribution service
  • → Amazon S3
  • → AWS STS
  • → Amazon DynamoDB ← DNS misconfig originated here

What kept working during phase 1

  • All customer database branches (the data plane).
  • Existing vtgate / vttablet routing.
  • Existing replication topology.
  • (Likely) credential / secret material already cached inside running data-plane processes — implied by S3 + STS being the bottleneck for new config operations, not for in-flight traffic.

What didn't work during phase 1

  • New database creation / resize / config changes.
  • PlanetScale dashboard (intermittent; hosted by a SaaS provider itself in us-east-1).
  • SSO logins (for customers not already logged in).
  • Updates to https://planetscalestatus.com (and the site itself, for >30 min).

What didn't work during phase 2

  • New EC2 instance launches in us-east-1 (platform-wide).
  • Therefore: database creation, branch resize, and backup (which launches a replica-restore EC2).
  • Cross-AZ query routing + replication for a subset of databases (14:30–19:30 UTC).
  • Some edge load-balancer + vtgate processes stuck in post-partition state, requiring manual restart.

Caveats

  • Vendor-authored post-mortem. PlanetScale-favourable framing throughout; the "isolation and static stability worked" claim is grounded in what the post discloses but the post is not an independent third-party account.
  • Upstream SaaS provider left unnamed. The DNS misconfiguration is attributed to "one of PlanetScale's service providers" and the dashboard-hosting provider is "a provider that, like the PlanetScale control plane, is hosted in AWS us-east-1" — not identified. AWS-side root-cause reporting for the 2025-10-20 DynamoDB outage is external to this post.
  • No quantified customer impact during phase 2 partitions. "The network partitions caused a significant percentage of some customers' queries to fail. Not all database branches were affected..." — no numbers on affected-customer count, query-failure rate, or per-customer duration.
  • No disclosure of how vtgate bin-packing was implemented. Operator-triggered scheduler hint? Manual pod edits? Emergency deployment? — not said.
  • No disclosure of how many EC2 instances were held vs drained during the 30-day-drain-pause window. The operational load of pausing that loop (and the cost of not draining) isn't sized.
  • Tier-3 on-scope. PlanetScale is Tier 3 in the wiki classifier; this post is a production post-mortem with architectural substance (control-plane dependency chain, 6-AZ resilience roadmap, bin-packing during incident, autoscaling failure mode), which clears the scope filter decisively — first-person narratives of real incidents are among the highest-signal posts in the corpus.
