

Diurnal autoscaling risk

Definition

Diurnal autoscaling risk is the structural vulnerability that arises when a fleet scales on a predictable 24-hour pattern — scale up before a known peak-traffic window, scale down after it — and an upstream capacity-provisioning failure coincides with the scale-up window. The fleet enters its peak traffic period holding the previous trough's capacity because the usual morning ramp never happened.

It is an intersection risk, not a class-of-autoscaler risk: both reactive and predictive autoscalers are affected equally when the underlying provisioning system (e.g. EC2 RunInstances) is down.
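
The mechanics reduce to a scheduled target plus a launch call. A minimal sketch, assuming a fixed hour-of-day schedule; desired_replicas, reconcile, launch_instance, and ProvisioningError are hypothetical names for illustration, not PlanetScale's tooling or AWS's API.

```python
from datetime import datetime

class ProvisioningError(Exception):
    """Hypothetical stand-in for a failed launch call (e.g. EC2 RunInstances)."""

def desired_replicas(now: datetime, trough: int = 10, peak: int = 25) -> int:
    """Hypothetical diurnal schedule: weekend/overnight trough, weekday peak
    covering roughly the US East Coast business day (hours in UTC)."""
    if now.weekday() >= 5:            # Saturday/Sunday: stay at the trough
        return trough
    return peak if 12 <= now.hour < 22 else trough

def reconcile(current: int, now: datetime, launch_instance) -> int:
    """One autoscaler tick. Whether the target comes from a forecast
    (predictive) or from live load (reactive), the bottleneck is the same
    launch call; if it fails, the fleet stays pinned near the trough."""
    target = desired_replicas(now)
    while current < target:
        try:
            launch_instance()         # the capacity-provisioning dependency
        except ProvisioningError:
            break                     # the ramp never happens; retry next tick
        current += 1
    return current
```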

The PlanetScale 2025-10-20 worked example

Verbatim from the 2025-10-20 incident post:

Given that the US East Coast was about to start their Monday, the inability to launch new EC2 instances presented a risk to some of our largest customers who use diurnal autoscaling for the vtgate component of their Vitess clusters. Some were going to be coming into their peak weekly traffic with less than half the vtgate capacity they had the week prior.

Three load-bearing specifics:

  1. Which component autoscales: systems/vtgate, the stateless Vitess query router. Database primaries don't scale out; vtgate is the elastic tier.
  2. The timing coincidence: phase 2 of the incident (~10:05 UTC to ~20:32 UTC) straddles the US East Coast Monday-morning ramp.
  3. The quantified risk: "less than half the vtgate capacity they had the week prior" — customers facing peak weekly traffic while their fleet is stuck at the weekend trough size. The arithmetic sketch after this list makes the figure concrete.
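
The numbers below are illustrative, not from the post (it does not give fleet counts); they just show how a weekend trough plus a failed Monday ramp produces a "less than half" figure.

```python
# Illustrative numbers only: suppose last Monday's peak ran on 50 vtgates
# and the weekend scale-down left the fleet at 20.
prev_monday_peak_vtgates = 50
weekend_trough_vtgates = 20

# The Monday-morning ramp fails, so the fleet enters peak weekly traffic
# still at the trough size:
fraction_of_prior_peak = weekend_trough_vtgates / prev_monday_peak_vtgates
print(fraction_of_prior_peak)   # 0.4 -> "less than half the vtgate capacity"
```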

Why diurnal patterns are the worst case for this risk

  • Scale-up window is daily and predictable. Operators and customers expect the morning ramp. When it doesn't happen, SLOs break within the same business day.
  • Scale-up compounds. A fleet that missed Monday's ramp also won't ramp for Tuesday unless the underlying capacity problem is fixed — every subsequent peak is at risk until provisioning works again.
  • Scale-down already happened. The overnight trough destroyed the capacity headroom an incident-day peak would need; there's no pre-built buffer to fall back on.

Contrast with rare-event autoscaling (Black Friday / Cyber Monday prep, launch-day spike prep, incident-response scale-out): those events have operators manually pre-provisioning capacity hours or days in advance, so even a short EC2-launch outage doesn't catch them flat-footed.

Mitigations

Response levers from the 2025-10-20 playbook:

  • Bin-pack tighter than usual — run the existing fleet closer to CPU capacity to cover peak without needing more instances. See patterns/conservative-capacity-bin-packing-during-incident.
  • Shed non-critical load on the caller side — advise customers using autoscaling to "shed whatever load they could by e.g. delaying queue processing or pausing ETL processes." See patterns/shed-load-during-capacity-shortage. A caller-side sketch follows this list.
  • Redirect elastic creation to unaffected regions — new resources go to us-east-2 during the us-east-1 outage.
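
A sketch of the first two levers from the caller's side, under assumed names (CAPACITY_SHORTAGE, should_run); the thresholds and workload names are illustrative, not from the post.

```python
import time

# Hypothetical incident switch; in practice this would be a feature flag or
# operator-set config, not a module-level constant.
CAPACITY_SHORTAGE = True

# Bin-pack tighter: while capacity is short, let each instance run closer to
# its CPU ceiling before treating the fleet as "needs more instances".
CPU_UTILIZATION_CEILING = 0.85 if CAPACITY_SHORTAGE else 0.60

NON_CRITICAL_WORK = {"etl", "queue_drain", "backfill"}

def should_run(workload: str) -> bool:
    """Caller-side load shedding: pause ETL and delay queue processing while
    the capacity shortage lasts; keep serving interactive queries."""
    return not (CAPACITY_SHORTAGE and workload in NON_CRITICAL_WORK)

def queue_worker(process_next_batch) -> None:
    while True:
        if should_run("queue_drain"):
            process_next_batch()
        else:
            time.sleep(60)   # back off and re-check once the shortage clears
```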

Deeper structural levers (not taken during the 2025-10-20 incident but named as future direction in PlanetScale's post-mortem):

  • Less-diurnal fleets. Keep a bigger baseline so the morning ramp is smaller relative to steady-state. Trade cost for resilience against capacity-provisioning outages.
  • Cross-region capacity pools. If vtgate autoscales across regions for the same customer, a single-region launch outage is absorbable.
  • Decouple compute-location from capacity-signal. Pre-warm instances in the healthy region ahead of the anticipated ramp — makes diurnal scale-up a cross-region hot-swap rather than a single-region launch event. A minimal sketch of the last two levers follows this list.
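
A minimal sketch of the cross-region and pre-warm levers, with assumed names and an illustrative lead time; neither the region list nor the scheduling is from PlanetScale's post-mortem.

```python
from datetime import datetime, timedelta

RAMP_START_HOUR_UTC = 12                  # hour the diurnal ramp normally begins
PREWARM_LEAD = timedelta(minutes=45)      # illustrative pre-warm lead time

def pick_launch_region(provisioning_healthy: dict) -> str | None:
    """Cross-region capacity pool: prefer the home region, fall back to any
    region whose launch API is currently healthy."""
    for region in ("us-east-1", "us-east-2", "us-west-2"):
        if provisioning_healthy.get(region):
            return region
    return None   # nowhere to launch; fall back to bin-packing and shedding

def should_prewarm(now: datetime) -> bool:
    """Decouple the capacity signal from compute location: start warming
    instances ahead of the predictable ramp instead of at the ramp."""
    ramp = now.replace(hour=RAMP_START_HOUR_UTC, minute=0, second=0, microsecond=0)
    return ramp - PREWARM_LEAD <= now < ramp
```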

Failure pattern name for postmortems

Diurnal autoscaling risk is a variant of the broader "incident plus ramp" compound-failure pattern: an infrastructure fault that is mild in isolation becomes severe when it coincides with scheduled or expected traffic growth. It's worth naming on its own because:

  • The mitigations are specific (bin-pack, shed, cross-region redirect).
  • The diagnostic signal is specific ("the incident is within the morning ramp window" + "fleet size is closer to the overnight trough than the previous day's peak"); a minimal check is sketched after this list.
  • The risk is asymmetric against reactive autoscalers: the autoscaler wants to add capacity — it's trying — it just can't.
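
That diagnostic signal translates into a simple triage check. The ramp window, field names, and thresholds below are hypothetical, for illustration only.

```python
def looks_like_diurnal_autoscaling_risk(
    incident_hour_utc: int,
    current_fleet: int,
    overnight_trough: int,
    prev_day_peak: int,
    ramp_window_utc: range = range(10, 15),   # illustrative ramp window
) -> bool:
    """Triage/postmortem check for the compound pattern: the incident falls
    in the morning ramp window AND the fleet is still closer to the overnight
    trough than to the previous day's peak."""
    in_ramp_window = incident_hour_utc in ramp_window_utc
    closer_to_trough = (current_fleet - overnight_trough) < (prev_day_peak - current_fleet)
    return in_ramp_window and closer_to_trough
```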

Seen in

  • sources/2025-11-03-planetscale-aws-us-east-1-incident-2025-10-20 — PlanetScale, Richard Crowley, 2025-11-03. Canonical wiki entry. Phase 2 of the 2025-10-20 incident names vtgate diurnal autoscaling as the specific risk amplifier — at-risk customers entering Monday peak with "less than half the vtgate capacity they had the week prior" because the morning ramp couldn't acquire new EC2 instances.