Diurnal autoscaling risk¶
Definition¶
Diurnal autoscaling risk is the structural vulnerability that arises when a fleet scales on a predictable 24-hour pattern — scale up before a known peak-traffic window, scale down after it — and an upstream capacity-provisioning failure coincides with the scale-up window. The fleet enters its peak traffic period holding the previous trough's capacity because the usual morning ramp never happened.
It is an intersection risk, not a class-of-autoscaler risk: both reactive and predictive autoscalers are affected equally when the underlying provisioning system (e.g. EC2 RunInstances) is down.
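The mechanism can be sketched in a few lines. The schedule hours and replica counts below are illustrative assumptions, not PlanetScale's configuration; the point is the asymmetry in the reconcile step: scale-down always works because terminating instances needs no provisioning, but scale-up is blocked when the launch API fails.

```python
from datetime import time

# Hypothetical diurnal schedule for a stateless tier like vtgate:
# desired replica counts by UTC hour (illustrative numbers only).
SCHEDULE = {
    time(6, 0): 40,   # pre-ramp scale-up ahead of the US East Coast morning
    time(14, 0): 48,  # midday peak
    time(23, 0): 18,  # overnight trough
}

def desired_replicas(now: time, schedule=SCHEDULE) -> int:
    """Most recent schedule entry at or before `now`, wrapping past midnight."""
    past = [t for t in sorted(schedule) if t <= now]
    key = past[-1] if past else max(schedule)  # before 06:00, still on the trough
    return schedule[key]

def reconcile(current: int, now: time, launches_available: bool) -> int:
    """One reconcile step. Scale-down always succeeds (terminating needs
    no provisioning); scale-up needs the launch API (e.g. EC2 RunInstances)."""
    target = desired_replicas(now)
    if target <= current:
        return target
    if not launches_available:
        return current  # stuck at the trough size: the diurnal risk
    return target
```

With launches failing through the morning window, `reconcile(18, time(7, 0), launches_available=False)` holds the fleet at 18 replicas while the schedule wants 40.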
The PlanetScale 2025-10-20 worked example¶
Verbatim from the 2025-10-20 incident post:
Given that the US East Coast was about to start their Monday, the inability to launch new EC2 instances presented a risk to some of our largest customers who use diurnal autoscaling for the vtgate component of their Vitess clusters. Some were going to be coming into their peak weekly traffic with less than half the vtgate capacity they had the week prior.
Three load-bearing specifics:
- Which component autoscales — systems/vtgate, the stateless Vitess query router. Database primaries don't scale out; vtgate is the elastic tier.
- The timing coincidence — phase 2 of the incident (~10:05 UTC to ~20:32 UTC) straddles the US East Coast Monday-morning ramp.
- The quantified risk — "less than half the vtgate capacity they had the week prior" — customers facing peak weekly traffic while their fleet is stuck at the weekend trough size.
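The quantified risk is easy to put in illustrative numbers (these are assumptions, not figures from the post):

```python
# Illustrative fleet sizes: a customer's vtgate tier peaked at 50
# replicas last week and scaled down to a weekend trough of 22. If
# launches fail during Monday's ramp, the fleet enters peak traffic
# frozen at the trough size.
last_week_peak = 50
weekend_trough = 22

capacity_ratio = weekend_trough / last_week_peak
print(f"{capacity_ratio:.0%} of last week's peak capacity")
assert capacity_ratio < 0.5  # "less than half the vtgate capacity"
```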
Why diurnal patterns are the worst case for this risk¶
- Scale-up window is daily and predictable. Operators and customers expect the morning ramp. When it doesn't happen, SLOs break within the same business day.
- Scale-up compounds. A fleet that missed Monday's ramp also won't ramp for Tuesday unless the underlying capacity problem is fixed — every subsequent peak is at risk until provisioning works again.
- Scale-down already happened. The overnight trough destroyed the capacity headroom an incident-day peak would need; there's no pre-built buffer to fall back on.
Contrast with rare-event autoscaling (Black Friday / Cyber Monday prep, launch-day spike prep, incident-response scale-out): for those events, operators manually pre-provision capacity hours or days in advance, so even a short EC2-launch outage doesn't catch them flat-footed.
Mitigations¶
Response levers from the 2025-10-20 playbook:
- Bin-pack tighter than usual — run the existing fleet closer to CPU capacity to cover peak without needing more instances. See patterns/conservative-capacity-bin-packing-during-incident.
- Shed non-critical load on the caller side — advise customers using autoscaling to "shed whatever load they could by e.g. delaying queue processing or pausing ETL processes." See patterns/shed-load-during-capacity-shortage.
- Redirect elastic creation to unaffected regions — new resources go to us-east-2 during the us-east-1 outage.
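The bin-pack lever can be put in rough numbers. The per-instance throughput and utilization targets below are assumptions for illustration:

```python
def servable_qps(instances: int, per_instance_qps: float, target_util: float) -> float:
    """Load the fleet can serve at a given CPU-utilization target."""
    return instances * per_instance_qps * target_util

# Illustrative: 22 vtgates stuck at trough size, each handling
# 10,000 QPS at full CPU (both numbers are assumptions).
normal = servable_qps(22, 10_000, 0.60)  # ordinary headroom target
tight = servable_qps(22, 10_000, 0.85)   # incident-mode bin-packing
# Running hotter buys roughly 40% more servable load from the same
# fleet, at the cost of the headroom that usually absorbs spikes.
```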
Deeper structural levers (not taken during the 2025-10-20 incident but named as future direction in PlanetScale's post-mortem):
- Less-diurnal fleets. Keep a bigger baseline so the morning ramp is smaller relative to steady-state. Trade cost for resilience against capacity-provisioning outages.
- Cross-region capacity pools. If vtgate autoscales across regions for the same customer, a single-region launch outage is absorbable.
- Decouple compute-location from capacity-signal. Pre-warm instances on the healthy region ahead of the anticipated ramp — makes diurnal scale-up a cross-region hot-swap rather than a single-region launch-event.
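The cross-region levers share one mechanism: stop treating a single region as the only launch target. A minimal sketch, with a hypothetical `launch` callable standing in for the cloud provisioning API:

```python
def launch_with_fallback(count: int, regions: list, launch) -> int:
    """Acquire `count` instances, preferring earlier regions. `launch` is
    a hypothetical callable (region, n) -> instances actually acquired."""
    acquired = 0
    for region in regions:
        if acquired >= count:
            break
        acquired += launch(region, count - acquired)
    return acquired
```

During a us-east-1 launch outage, `launch("us-east-1", n)` returns 0 and the shortfall spills into us-east-2, so the diurnal ramp still completes somewhere.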
Failure pattern name for postmortems¶
Diurnal autoscaling risk is a variant of the broader "incident plus ramp" compound-failure pattern: an infrastructure fault that is mild in isolation becomes severe when it coincides with scheduled or expected traffic growth. It's worth naming on its own because:
- The mitigations are specific (bin-pack, shed, cross-region redirect).
- The diagnostic signal is specific ("the incident is within the morning ramp window" + "fleet size is closer to the overnight trough than the previous day's peak").
- The risk is asymmetric against reactive autoscalers: the autoscaler wants to add capacity — it's trying — it just can't.
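The diagnostic signal can be written as a single predicate. The ramp window and the closeness heuristic below are assumptions for illustration, not thresholds from the post:

```python
def diurnal_autoscaling_risk(hour_utc: int, fleet: int,
                             trough: int, prior_peak: int,
                             ramp_window=(10, 14)) -> bool:
    """True when an incident overlaps the morning-ramp window AND the
    fleet is closer to the overnight trough than to the prior peak."""
    in_ramp = ramp_window[0] <= hour_utc < ramp_window[1]
    closer_to_trough = (fleet - trough) < (prior_peak - fleet)
    return in_ramp and closer_to_trough
```

A fleet of 24 at 11:00 UTC against a trough of 22 and a prior peak of 50 trips the signal; the same fleet at 45 replicas, or the same incident at 20:00 UTC, does not.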
Seen in¶
- sources/2025-11-03-planetscale-aws-us-east-1-incident-2025-10-20 — PlanetScale, Richard Crowley, 2025-11-03. Canonical wiki entry. Phase 2 of the 2025-10-20 incident names vtgate diurnal autoscaling as the specific risk amplifier — at-risk customers entering Monday peak with "less than half the vtgate capacity they had the week prior" because the morning ramp couldn't acquire new EC2 instances.
Related¶
- concepts/predictive-autoscaling — schedule-based autoscaler, the most common implementation substrate for diurnal scaling.
- concepts/reactive-autoscaling — metric-driven autoscaler, equally affected when launches fail.
- concepts/ec2-launch-failure-mode — the capacity-side fault class that coincides with the diurnal ramp.
- concepts/control-plane-data-plane-separation — the architectural context in which vtgate autoscaling sits.
- patterns/conservative-capacity-bin-packing-during-incident — the response pattern.
- patterns/shed-load-during-capacity-shortage — the customer-side response pattern.
- systems/vtgate — the specific component whose autoscaling was the risk surface in the 2025-10-20 incident.