
PATTERN

Conservative capacity bin-packing during incident

Problem

An upstream capacity-provisioning failure (e.g. EC2 launch failure) has frozen the fleet at its current size. A known peak-traffic window is about to hit. The usual response — let the autoscaler add instances — is not available. Dropping traffic is not acceptable: these are paid customers, and the request load is real.

The fleet has finite running capacity and finite CPU headroom on each existing process. The question is how to trade that headroom for peak coverage during the incident.

Solution

Bin-pack the workload more tightly than steady-state scheduling normally allows. Co-locate more processes per host than usual, or raise per-process utilisation ceilings, so the existing fleet can serve peak demand without needing new instances. Accept that the fleet is running closer to CPU capacity than is typical — and reverse the tighter packing once provisioning capability returns.
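The core arithmetic is simple: if the fleet is frozen, the peak must be absorbed by raising per-process utilisation instead of adding instances. A minimal sketch of that check, with entirely hypothetical numbers (the incident post discloses none):

```python
def peak_ceiling(per_instance_cores: float,
                 steady_cores_used: float,
                 peak_multiplier: float) -> float:
    """Per-instance CPU utilisation the frozen fleet must sustain to
    absorb the peak without adding instances. Assumes load spreads
    evenly and demand scales linearly with traffic."""
    steady_util = steady_cores_used / per_instance_cores
    return steady_util * peak_multiplier

# Hypothetical: 4-core instances at 2.4 cores (60%) steady-state,
# expecting a 1.5x peak during the US work day:
needed = peak_ceiling(4.0, 2.4, 1.5)
print(f"required per-instance utilisation: {needed:.0%}")  # prints 90%
```

If the required ceiling lands above roughly 95%, tighter packing alone won't cover the peak and demand-side levers are needed too.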

Verbatim from PlanetScale's 2025-10-20 incident post:

The most important intervention, though, was to temporarily change how we schedule vtgate processes for customers with autoscaling configured. We bin-packed vtgate processes more tightly than usual, running closer to CPU capacity than is typical, in order to provide ample capacity for the US work day.

The 2025-10-20 post-mortem names this as "the most important intervention" of the phase-2 response playbook.

Mechanics

The scheduling change typically has two dials:

  • Density — more processes per host. For Kubernetes-scheduled workloads, lower the per-pod CPU request so more pods fit on each node; for process-per-host deployments, co-locate services that usually run on separate hosts.
  • Utilisation ceiling — accept higher steady-state CPU percentage per process. Typical SRE practice is to keep fleets at 40–70 % of CPU to absorb spikes; during the incident, let them sit closer to 85–95 % and rely on short-burst peaks being absorbed by whatever headroom remains.
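For the density dial on Kubernetes, the packing is driven by the pod's CPU request: the scheduler fits pods onto a node until requests exhaust the node's allocatable CPU. A sketch with hypothetical numbers, considering CPU requests only (memory, pod-count limits, and other constraints ignored):

```python
def pods_per_node(node_allocatable_millicores: int,
                  pod_request_millicores: int) -> int:
    """How many pods fit on one node by CPU request alone --
    integer division, since a pod either fits or it doesn't."""
    return node_allocatable_millicores // pod_request_millicores

# A 16-core node with ~15,000m allocatable after system reservations:
# halving the per-pod request doubles the packing density.
print(pods_per_node(15000, 1000))  # 15 pods at a 1000m request
print(pods_per_node(15000, 500))   # 30 pods at a 500m request
```

Note that lowering the request changes only where pods land, not how much CPU they actually consume; the actual consumption is what pushes the node toward its ceiling.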

The trade: latency percentiles degrade (less headroom for GC pauses, less slack for request-queue spikes); error rates may tick up under transient surges; and cache-locality and noisy-neighbour effects worsen.

When this is right

  • The alternative is dropping traffic. When the choice is "slight tail-latency degradation" vs "failed requests or queued admissions," tighter packing wins.
  • The incident has a known or expected duration. Operating at 90% CPU for a 12-hour EC2-launch outage is tolerable; operating there indefinitely is a bad steady state.
  • The workload is mostly stateless. Stateless proxies (vtgate in the 2025-10-20 case, Envoy, any stateless gateway) are good fits — per-process state is small, re-bin-packing is a scheduling change rather than a data migration.
  • You can reverse it quickly. When capacity returns, loosen the packing back to steady-state so you don't accumulate tail-latency debt as a new normal.

When this is wrong

  • Stateful workloads. Databases and caches have per-node memory footprints that don't compress the way CPU does; tightening a database's bin-pack means evicting working-set pages, not just running hotter.
  • Workloads with hard tail-latency SLOs. Some paths really can't tolerate 85% CPU — real-time trading, real-time ad bidding, some video-streaming paths.
  • The fleet is already near its ceiling. If steady-state is already 75% CPU, there isn't room to tighten further without crossing the cliff into starvation / GC-spiral regimes.
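The cliff shape is visible even in the crudest queueing model. In an M/M/1 queue the mean delay grows as ρ/(1−ρ), so each step up in utilisation costs disproportionately more latency. This is a toy single-queue model, not a claim about any real fleet's behaviour, but it shows why a fleet already at 75% has far less room than the percentages suggest:

```python
def mm1_wait_factor(utilisation: float) -> float:
    """Mean M/M/1 queueing delay in units of one service time:
    rho / (1 - rho). Diverges as utilisation approaches 1."""
    assert 0.0 <= utilisation < 1.0
    return utilisation / (1.0 - utilisation)

for rho in (0.60, 0.75, 0.90, 0.95):
    print(f"{rho:.0%} CPU -> {mm1_wait_factor(rho):.1f}x service time queued")
# 60% -> 1.5x, 75% -> 3.0x, 90% -> 9.0x, 95% -> 19.0x
```

Going from 60% to 90% utilisation sextuples the modelled queueing delay; from 90% to 95% doubles it again. The "cliff" is this hyperbola, compounded in practice by GC and scheduler effects the model ignores.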

Composition with other incident-response moves

Conservative bin-packing rarely appears alone; it is one lever in a broader playbook that conserves existing capacity while reducing demand for new capacity.

Together, such levers let a frozen-size fleet survive a peak-traffic window that would normally require an autoscaler ramp.

Seen in

  • sources/2025-11-03-planetscale-aws-us-east-1-incident-2025-10-20 — PlanetScale, Richard Crowley, 2025-11-03. Canonical wiki application. Phase 2 of the 2025-10-20 AWS us-east-1 incident: EC2-launch-failure window meets US-East-Coast Monday-morning vtgate-autoscale ramp. PlanetScale bin-packs vtgate processes tighter than usual, named as "the most important intervention" of the playbook. No numbers disclosed for the pre / post CPU utilisation or per-customer effect; the claim is qualitative ("ample capacity for the US work day").