Predictive auto-scaling¶
Predictive auto-scaling is capacity control that scales a workload's resources before observed load arrives, using a forecast of future demand, rather than reacting to observed over/under-utilisation after it happens. It is the forecast-driven branch of the autoscaling taxonomy; its sibling is reactive auto-scaling.
The defining operational property: predictive auto-scaling can hide scaling latency from the latency-critical path by starting the scaling operation before load rises, so the new capacity is live by the time the spike hits. Reactive scaling is structurally incapable of this because it needs the spike to have already happened (or be happening) to trigger.
Why it exists¶
Reactive scaling's latency is bounded by the sum of four terms:
```
latency = detection_time      # wait for sustained over/underload
        + decision_time       # pick new size
        + provisioning_time   # allocate / boot
        + warmup_time         # traffic live on new capacity
```
MongoDB Atlas's pre-2025 reactive auto-scaler, in Atlas's own retrospective, "scales up after a few minutes of overload, or a few hours of underload… the scaling operation itself takes a few minutes to change the replica set's server sizes" (Source: sources/2026-04-07-mongodb-predictive-auto-scaling-an-experiment). Several minutes of detection plus a several-minute scaling operation means the p99 during a spike stays degraded for the full sum, no matter how aggressive the thresholds are.
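To make the arithmetic concrete, a toy comparison of how long p99 stays degraded under each regime. All durations are illustrative, not from the source; the point is only that predictive scaling can zero out the degraded window when its lead time covers the provisioning chain:

```python
def reactive_degraded_seconds(detection, decision, provisioning, warmup):
    """Reactive: the spike is already live, so p99 is degraded for the
    whole chain -- detection through warmup."""
    return detection + decision + provisioning + warmup


def predictive_degraded_seconds(decision, provisioning, warmup, lead):
    """Predictive: detection time is zero (the forecast fires before the
    spike); p99 is degraded only if the lead time doesn't cover the rest
    of the chain."""
    return max(0, (decision + provisioning + warmup) - lead)


# Illustrative numbers: 3 min detection, 10 s decision, 5 min provisioning,
# 1 min warmup, and a 15-minute forecast lead (as in the MongoDB planner).
reactive = reactive_degraded_seconds(180, 10, 300, 60)          # 550 s degraded
predictive = predictive_degraded_seconds(10, 300, 60, lead=900)  # 0 s degraded
```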
Two further structural reactive limits named in the same source:
- One tier at a time — M60 → M50, never M60 → M10, even if demand collapses; "if the customer's demand changes dramatically, it takes several scaling operations to reach the optimum server size."
- Overloaded servers interfere with their own scaling — "if it's really slammed, it could interfere with the scaling operation itself."
Predictive scaling dissolves both: scale ahead (no overload present during the op) and scale directly to the forecasted-right size (no multi-step convergence).
What makes it possible¶
Two empirical preconditions named by MongoDB:
- Workload seasonality — most Atlas replica sets have daily seasonality, ~25% have weekly. Daily / weekly cycles are predictable from several weeks of history via MSTL-family time-series decomposition.
- Short-term trends — when seasonality is absent, the last 1–2 hours of data often extrapolate forward better than naïve last-observation (MongoDB: 68% win rate, 29% error reduction).
The complement — replica sets that are neither seasonal nor trended at short horizon — are "non-predictable" and must fall back to reactive scaling.
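A minimal illustration of the seasonality precondition, using a seasonal-naïve forecast ("same hour yesterday") as a crude stand-in for the MSTL-family decomposition the source names. The data is synthetic and the comparison is against naïve last-observation, as in the source's framing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic QPS series with daily seasonality, sampled hourly over 4 weeks.
hours = np.arange(24 * 28)
qps = 100 + 40 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, hours.size)

# With seasonality present, "same hour yesterday" already beats carrying the
# last observation forward over a multi-hour horizon.
horizon = 6                                       # forecast 6 hours ahead
actual = qps[-horizon:]
seasonal_naive = qps[-horizon - 24:-24]           # same hours, previous day
last_observation = np.full(horizon, qps[-horizon - 1])

err_seasonal = np.abs(seasonal_naive - actual).mean()
err_naive = np.abs(last_observation - actual).mean()
assert err_seasonal < err_naive                   # seasonality is exploitable
```

On a series with no seasonal component and no short-horizon trend, the two errors converge, which is exactly the "non-predictable" case that falls back to reactive scaling.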
Required primitives¶
A predictive auto-scaler needs, at minimum:
- A forecaster over future demand. MongoDB used MSTL + ARIMA for the long-term and trend interpolation for the short-term. See two-forecaster pattern.
- A demand→capacity estimator — given a forecasted workload and a candidate instance size, predict CPU% (or whichever resource is the bottleneck). MongoDB used a boosted-decision-tree regressor over 25 M training samples.
- A planner — pick the cheapest instance size that holds the forecasted demand under a ceiling (MongoDB: 15 minutes ahead, ≤75% CPU). See patterns/forecast-then-size-planner.
- Exogenous input metrics — forecast QPS / connections / scanned-objects rate, not CPU directly, to avoid the circular-forecast hazard.
- A confidence gate — only act on the forecast when its recent accuracy justifies trust. MongoDB used self-censoring: forecaster scores its own recent error, emits prediction only when it's small enough.
Missing any one of these collapses the system; without the confidence gate, for instance, an unconditioned point forecast makes wrong calls often enough to worsen net p99.
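A compressed sketch of how the four primitives compose into one control step. Tier names, capacities, and thresholds are hypothetical; the real Forecaster (MSTL + ARIMA) and Estimator (boosted decision trees over 25 M samples) are far richer than these stand-ins:

```python
TIERS = {"M10": 1.0, "M30": 4.0, "M60": 16.0}  # tier -> relative capacity (made up)
CPU_CEILING = 0.75                              # <=75% CPU, as in the source


def forecast_qps(history, horizon_steps):
    """Forecaster stand-in: extrapolate the recent linear trend."""
    recent = history[-12:]
    slope = (recent[-1] - recent[0]) / (len(recent) - 1)
    return history[-1] + slope * horizon_steps


def estimate_cpu(qps, tier):
    """Estimator stand-in: forecasted demand -> predicted CPU fraction
    on a candidate tier (exogenous input: QPS, not CPU itself)."""
    return qps / (1000.0 * TIERS[tier])


def recent_forecast_error(history, k=3):
    """Confidence-gate input: relative error of the forecast we'd have
    made k steps ago, backtested against what actually happened."""
    predicted = forecast_qps(history[:-k], horizon_steps=k)
    return abs(predicted - history[-1]) / max(history[-1], 1e-9)


def plan(history, current_tier, error_threshold=0.15):
    """Gate + planner: self-censor when the forecast has been wrong lately;
    otherwise pick the cheapest tier that holds forecasted demand under
    the CPU ceiling -- jumping directly, not one tier at a time."""
    if recent_forecast_error(history) > error_threshold:
        return current_tier                      # fall back to reactive
    qps = forecast_qps(history, horizon_steps=3)  # e.g. 15 min at 5-min samples
    for tier in sorted(TIERS, key=TIERS.get):     # cheapest first
        if estimate_cpu(qps, tier) <= CPU_CEILING:
            return tier
    return max(TIERS, key=TIERS.get)              # demand exceeds every tier
```

Note the direct jump: a steady-state workload that fits only the largest tier goes there in one step, with no multi-step convergence through intermediate sizes.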
Cost & carbon framing¶
From the MongoDB source: "An underloaded server costs the customer more than necessary. An overloaded server is bad for performance." Both directions carry cost:
- Over-scaled hours — customer billed more, MongoDB spends more on the cloud provider, infrastructure carbon higher.
- Under-scaled hours — latency / timeout / availability regressions; customer-visible incident.
"If we could anticipate each customer's needs and perfectly scale their servers up and down… that would save our customers money and reduce our carbon emissions." Predictive scaling's theoretical optimum is the curve of "perfectly right-sized at every moment," moving the promise of elasticity from "the customer never thinks about it" to "the customer is never charged for idle capacity."
Asymmetric risk: scale-up vs scale-down¶
MongoDB's production predictive auto-scaler (November 2025 rollout) ships scale-up-only: "we rely on the existing reactive algorithm to scale them down afterward." The implicit risk framing:
- Incorrect forecast → unnecessary scale-up → cost regression (the customer is billed slightly more; reactive scaler will scale down later).
- Incorrect forecast → unnecessary scale-down → latency / overload regression (customer-visible; hard to refund the incident after the fact).
Asymmetric risk is a standard shape in control systems: take the cheaper-if-wrong action on forecasts, reserve the more-dangerous-if-wrong action for the reactive backstop that observes ground truth before acting.
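The scale-up-only gate is small enough to sketch directly (tier names and ordering illustrative): the predictive layer may only propose moves that are cheap to undo if the forecast was wrong.

```python
TIER_ORDER = ["M10", "M30", "M50", "M60"]  # ascending capacity (illustrative)


def apply_predictive(current, proposed):
    """Scale-up-only gate: act on the forecast only in the cheaper-if-wrong
    direction. Scale-down is left to the reactive layer, which observes
    ground truth before acting."""
    if TIER_ORDER.index(proposed) > TIER_ORDER.index(current):
        return proposed   # wrong-forecast cost: a few over-billed hours,
                          # later undone by the reactive scaler
    return current        # never predictively scale down: the wrong-forecast
                          # cost would be a customer-visible overload
```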
Relationship to reactive auto-scaling¶
They're complementary, not alternatives:
- Predictive handles predictable load changes with zero observed-latency cost.
- Reactive handles unpredictable load changes and forecast failures.
MongoDB's production architecture explicitly runs both: "All customers who enabled auto-scaling (about a third) will soon have both predictive and reactive auto-scaling." The predictive layer acts when it's confident; the reactive layer is the backstop otherwise.
Exclusion set¶
Not every workload is predictive-scalable. MongoDB's 2023 prototype excluded 13% of replica sets from predictive scaling because the Estimator's CPU prediction was too inaccurate. The decision framework:
- Low Estimator accuracy on this replica set → exclude (no basis for a plan).
- Low Forecaster accuracy on this replica set (no seasonality, no useful short-term trend) → fall back to reactive only.
- High accuracy on both → predictive scaling.
The exclusion is per-replica-set, not global — same framework, different replica sets get different treatment.
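The per-replica-set routing reduces to a three-way decision. The labels mirror the list above; the accuracy thresholds themselves are not given in the source, so the inputs here are already-judged booleans:

```python
def scaling_mode(estimator_accurate: bool, forecaster_accurate: bool) -> str:
    """Per-replica-set routing: same framework, different treatment."""
    if not estimator_accurate:
        return "excluded"        # no demand->capacity mapping, no basis for a plan
    if not forecaster_accurate:
        return "reactive-only"   # no seasonality, no usable short-term trend
    return "predictive"          # confident on both: forecast-driven scaling
```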
Seen in¶
- sources/2026-04-07-mongodb-predictive-auto-scaling-an-experiment — MongoDB's 2023 research prototype + 2025 production rollout retrospective; canonical managed-database instance of the concept.
- sources/2025-07-29-google-simulating-large-systems-with-regression-language-models — sibling at Google Borg cluster-scheduler layer: predict MIPS-per-GCU via a Regression Language Model, gate on predicted-distribution width; not strictly auto-scaling but the same forecast-and-act substrate.
- sources/2025-10-17-google-solving-virtual-machine-puzzles-lava — VM-scheduler analogue: predict lifetime distributions, place + reschedule accordingly. Capacity decision made once at allocation time rather than continuously rescaled.
Related¶
- concepts/reactive-autoscaling — the sibling-and-backstop concept.
- concepts/scaling-latency — the latency class predictive scaling is designed to hide.
- concepts/elasticity — the architectural property predictive scaling is trying to deliver with less waste on both sides.
- concepts/customer-driven-metrics — the input the Forecaster must use to avoid circular forecasts.
- concepts/seasonality-daily-weekly — the main signal predictive scaling exploits at long horizon.
- concepts/self-censoring-forecast — the confidence-gate primitive that makes fallback safe.
- concepts/self-invalidating-forecast — the hazard class "predict CPU then scale to flatten CPU" falls into.
- concepts/tier-based-instance-sizing — the instance-size abstraction the planner picks from.
- concepts/performance-prediction — the Estimator's problem class; predictive auto-scaling is a deployment of performance prediction inside a control loop.
- concepts/spiky-traffic — the traffic class that's too fast for reactive scaling; predictive scaling is the answer when the spikes are predictable.
- patterns/forecast-then-size-planner — the canonical three-component architecture.
- patterns/short-plus-long-term-forecaster — the two-forecaster-on-same-metric shape that handles seasonal + non-seasonal replica sets within one pipeline.
- patterns/cheap-approximator-with-expensive-fallback — sibling at the "use model when confident, fall back otherwise" design axis.
- systems/mongodb-atlas — the canonical wiki deployment instance.