
PATTERN Cited by 1 source

Suspend routine capacity churn during dependency outage

Problem

Running fleets continuously perform routine capacity churn — draining and terminating old instances, replacing them with new ones, retiring hosts past a lifespan threshold, recycling container hosts, rotating images. In steady state this churn is net-neutral: for every instance torn down, a replacement is launched.

When an upstream provisioning fault (e.g. EC2 launch failure) breaks the launch path, the churn stops being net-neutral. Each drain-and-terminate cycle becomes a one-way loss: the old instance goes away, the replacement never comes up. The fleet shrinks on a timer for the duration of the outage.

Solution

Pause the routine capacity-churn loop for the duration of the dependency outage. Specifically:

  1. Stop draining and terminating old-but-working instances. Even if they are past the normal retirement threshold, they are serving traffic; keeping them running costs nothing compared to losing their capacity (see the sketch after this list).
  2. Hold vacated instances for reuse instead of terminating them. Instances that would normally be torn down after a workload finishes are kept alive as warm inventory — the next scheduling need can reuse them directly.
  3. Cancel or delay background workflows that implicitly launch new capacity. Backup jobs, rehydration tasks, periodic rebalance passes — anything that calls RunInstances as part of its mechanism.
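
A minimal sketch of steps 1 and 2, assuming a hypothetical fleet API. The names provisioning_healthy, fleet, drain, terminate, and warm_pool are illustrative, not PlanetScale's implementation:

```python
# Hypothetical sketch: a churn tick gated by a provisioning-health check,
# plus a hold-for-reuse path for vacated instances.
from datetime import datetime, timedelta, timezone

RETIREMENT_AGE = timedelta(days=30)
warm_pool = []  # vacated instances held as warm inventory during the outage


def churn_tick(fleet, provisioning_healthy) -> None:
    """Retire instances past the lifespan threshold, but only while launches work."""
    if not provisioning_healthy():
        # Dependency outage: every drain-and-terminate is a one-way capacity
        # loss, so skip routine retirement entirely until launches recover.
        return
    now = datetime.now(timezone.utc)
    for instance in fleet.running():
        if now - instance.launched_at > RETIREMENT_AGE:
            instance.drain()
            instance.terminate()  # safe only because a replacement can launch


def on_instance_vacated(instance, provisioning_healthy) -> None:
    """Normally terminate-on-vacant; during the outage, hold for reuse."""
    if provisioning_healthy():
        instance.terminate()
    else:
        warm_pool.append(instance)  # the next scheduling need reuses it directly
```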

Verbatim from PlanetScale's 2025-10-20 incident post:

We also took steps to avoid terminating any running EC2 instances:

  • Paused our continuous process of draining and terminating EC2 instances more than 30 days old.
  • Stopped terminating any EC2 instances that became vacant, instead holding them for reuse.

Separately on the workflow side:

Delayed scheduling additional backups and canceled pending backups that were waiting to launch an EC2 instance. (PlanetScale's standard backup procedure launches an additional replica which restores the previous backup and catches up on replication before taking a new backup to avoid reducing the capacity and fault-tolerance of the database during backups.)

Mechanics

Implementation depends on how the churn loop is structured:

  • Scheduled/cron drain loops — disable the schedule (crontab entry, scheduled Lambda, Airflow DAG) for the duration of the incident. Preserve configuration so re-enabling is a single change.
  • Policy-driven drain (age / lifespan-based) — update the policy parameter ("retire after N days") to effectively infinity. Revert when provisioning returns.
  • Termination-on-idle / auto-recycle — suspend the Auto Scaling processes that terminate or replace instances (or set scale-in protection); instances become sticky (see the sketch after this list).
  • Workflow-level — identify workflows with implicit RunInstances side effects (backups, migrations, disk resizes) and cancel them at the orchestrator layer.
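
A hedged boto3 sketch of the first and third bullets — not PlanetScale's actual tooling. The ASG and rule names are placeholders; verify which scaling processes your churn actually relies on before suspending them:

```python
import boto3

autoscaling = boto3.client("autoscaling")
events = boto3.client("events")


def pause_churn(asg_name: str, drain_schedule_rule: str) -> None:
    # Scheduled/cron drain loop: disable the EventBridge rule that fires it,
    # preserving its configuration so re-enabling is a single change.
    events.disable_rule(Name=drain_schedule_rule)
    # Termination-on-idle / auto-recycle: suspend the scaling processes that
    # terminate or replace instances; existing instances become sticky.
    autoscaling.suspend_processes(
        AutoScalingGroupName=asg_name,
        ScalingProcesses=["Terminate", "ReplaceUnhealthy", "InstanceRefresh"],
    )


def resume_churn(asg_name: str, drain_schedule_rule: str) -> None:
    # Symmetric revert once provisioning returns.
    events.enable_rule(Name=drain_schedule_rule)
    autoscaling.resume_processes(
        AutoScalingGroupName=asg_name,
        ScalingProcesses=["Terminate", "ReplaceUnhealthy", "InstanceRefresh"],
    )
```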

When this is right

  • Routine churn is purely lifecycle-driven, not health-driven. Stopping drain of old-but-healthy instances is safe because the churn was preventive maintenance, not a response to an observed problem.
  • The stopped churn has a bounded duration. Pausing the 30-day retirement loop for 12 hours costs nothing; pausing it for 6 months might mean you're running instances that are now 180 days out of date.
  • The outage is not itself caused by old-instance bugs. If you paused churn during an outage that was caused by a bad image on old instances, you'd be making things worse.

When this is wrong

  • Security-critical churn. Patching cadence, credential rotation, compliance-driven instance refresh — these can't be paused for convenience even during an incident.
  • The churn is itself part of the remediation. If your drain loop is trying to evict a misbehaving instance, pausing it leaves the bad instance alive.
  • Capacity constraints don't apply to the churn's replacements. If the replacements come from a different pool (different region, different account, different provider), the churn isn't eating into the constrained capacity anyway.

Composition

This pattern is the conserve-what-you-have lever of the incident-response playbook. It composes with:

Structural invariant

The governing rule: during a capacity-provisioning outage, treat every running instance as irreplaceable until proven otherwise. Every normal-time safety net that assumes "we can just spin up a new one" is inverted. This applies recursively — the churn-and-replace pattern is cheap only when launches succeed; when they don't, every termination is a decision about keeping the fleet, not about maintenance.
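
One way to make the invariant mechanical is to gate every routine termination on a launch-path health signal. A minimal sketch, assuming a hypothetical recent_launch_success_rate signal and threshold (neither is something PlanetScale describes):

```python
# Hypothetical sketch: refuse routine terminations while the launch path is
# degraded. recent_launch_success_rate is an assumed signal, e.g. the share
# of RunInstances calls that succeeded over the last few minutes.
MIN_LAUNCH_SUCCESS_RATE = 0.95


def safe_to_terminate(reason: str, recent_launch_success_rate) -> bool:
    if recent_launch_success_rate() < MIN_LAUNCH_SUCCESS_RATE:
        # Treat every running instance as irreplaceable: a replacement may
        # never come up, so routine maintenance does not justify termination.
        return reason == "emergency"  # only explicitly flagged cases proceed
    return True
```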

Seen in

  • sources/2025-11-03-planetscale-aws-us-east-1-incident-2025-10-20 — PlanetScale, Richard Crowley, 2025-11-03. Canonical wiki application. Phase 2 of the 2025-10-20 AWS us-east-1 incident. PlanetScale paused its continuous 30-day drain-and-terminate loop, switched from terminate-on-vacant to hold-for-reuse, and cancelled or delayed pending backups — which implicitly launch a replica-restore EC2 instance. No disclosure on how many instances were held vs. normally drained during the window.