
Predictive auto-scaling

Predictive auto-scaling is capacity control that scales a workload's resources before observed load arrives, using a forecast of future demand, rather than reacting to observed over/under-utilisation after it happens. It is the forecast-driven branch of the autoscaling taxonomy; its sibling is reactive auto-scaling.

The defining operational property: predictive auto-scaling can hide scaling latency from the latency-critical path by starting the scaling operation before load rises, so the new capacity is live by the time the spike hits. Reactive scaling is structurally incapable of this because it needs the spike to have already happened (or be happening) to trigger.

Why it exists

Reactive scaling's latency is bounded by the sum of four terms:

latency = detection_time      # wait for sustained over/underload
        + decision_time       # pick new size
        + provisioning_time   # allocate / boot
        + warmup_time         # traffic live on new capacity

MongoDB Atlas's pre-2025 reactive auto-scaler, in Atlas's own retrospective, "scales up after a few minutes of overload, or a few hours of underload… the scaling operation itself takes a few minutes to change the replica set's server sizes" (Source: sources/2026-04-07-mongodb-predictive-auto-scaling-an-experiment). Several minutes of detection plus a several-minute scaling operation means the p99 during a spike is degraded for the full sum, no matter how aggressive the thresholds are.
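The sum can be made concrete with a back-of-the-envelope sketch. The minute values below are illustrative assumptions consistent with "a few minutes" per phase, not figures from the MongoDB source:

```python
# Illustrative reactive-latency budget; the specific minute values are
# assumptions, not numbers from the MongoDB retrospective.
DETECTION_MIN = 3.0      # wait for sustained overload
DECISION_MIN = 0.5       # pick new size
PROVISIONING_MIN = 4.0   # allocate / boot new servers
WARMUP_MIN = 1.0         # traffic live on new capacity

# Reactive: every term sits on the latency-critical path.
reactive_latency = DETECTION_MIN + DECISION_MIN + PROVISIONING_MIN + WARMUP_MIN

# Predictive: start this far ahead of the spike and the whole sum is
# hidden — detection is replaced by the forecast's lead time.
required_lead_time = DECISION_MIN + PROVISIONING_MIN + WARMUP_MIN

print(reactive_latency)    # minutes of degraded p99 per spike, reactively
print(required_lead_time)  # forecast horizon needed to hide the scaling op
```

With these placeholder numbers, a forecast horizon of about six minutes is enough to take scaling off the critical path entirely.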

Two further structural reactive limits named in the same source:

  • One tier at a time — M60 → M50, never M60 → M10, even if demand collapses; "if the customer's demand changes dramatically, it takes several scaling operations to reach the optimum server size."
  • Overloaded servers interfere with their own scaling — "if it's really slammed, it could interfere with the scaling operation itself."

Predictive scaling dissolves both: scale ahead of demand (no overload present during the operation) and jump directly to the forecasted right size (no multi-step convergence).

What makes it possible

Two empirical preconditions named by MongoDB:

  1. Workload seasonality — most Atlas replica sets have daily seasonality, ~25% have weekly. Daily / weekly cycles are predictable from several weeks of history via MSTL-family time-series decomposition.
  2. Short-term trends — when seasonality is absent, the last 1–2 hours of data often extrapolate forward better than naïve last-observation (MongoDB: 68% win rate, 29% error reduction).

The complement — replica sets that are neither seasonal nor trended at short horizon — are "non-predictable" and must fall back to reactive scaling.
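The short-term path can be sketched as a comparison between the naïve last-observation baseline and a linear extrapolation of the trailing window. This is a minimal pure-Python least-squares sketch; the window size and sample data are illustrative, not MongoDB's implementation:

```python
def naive_forecast(history):
    """Last-observation-carried-forward baseline."""
    return history[-1]

def trend_forecast(history, window=8):
    """One-step-ahead extrapolation from a least-squares line fit over
    the trailing window — a stand-in for MongoDB's trend interpolation."""
    ys = history[-window:]
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return intercept + slope * n   # one step past the window

# QPS ramping up over the last two hours (15-minute buckets, illustrative)
qps = [100, 110, 122, 131, 140, 152, 161, 170]
print(naive_forecast(qps))   # 170 — assumes the ramp has stopped
print(trend_forecast(qps))   # ~181 — follows the ramp forward
```

On a sustained ramp like this, the naïve baseline is always one step behind; the trend fit is the kind of extrapolation behind the reported 68% win rate.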

Required primitives

A predictive auto-scaler needs, at minimum:

  • A forecaster over future demand. MongoDB used MSTL + ARIMA for the long-term and trend interpolation for the short-term. See two-forecaster pattern.
  • A demand→capacity estimator — given a forecasted workload and a candidate instance size, predict CPU% (or whichever resource is the bottleneck). MongoDB used a boosted-decision-tree regressor over 25 M training samples.
  • A planner — pick the cheapest instance size that holds the forecasted demand under a ceiling (MongoDB: 15 minutes ahead, ≤75% CPU). See patterns/forecast-then-size-planner.
  • Exogenous input metrics — forecast QPS / connections / scanned-objects rate, not CPU directly, to avoid the circular-forecast hazard.
  • A confidence gate — only act on the forecast when its recent accuracy justifies trust. MongoDB used self-censoring: forecaster scores its own recent error, emits prediction only when it's small enough.

Missing any one of these collapses the system: an ungated point forecast makes wrong calls often enough to worsen net p99 rather than improve it.
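How the primitives compose can be sketched end to end. Everything below is an illustrative assumption — the tier names, cost figures, linear CPU model, and error threshold are stand-ins (MongoDB's estimator is a boosted-decision-tree regressor, not a linear formula):

```python
# Candidate instance tiers: (name, relative capacity, hourly cost) — illustrative.
TIERS = [("M30", 1.0, 0.5), ("M40", 2.0, 1.0), ("M50", 4.0, 2.0), ("M60", 8.0, 4.0)]

CPU_CEILING = 0.75   # plan so forecasted demand stays under 75% CPU

def estimate_cpu(forecast_qps, capacity):
    """Stand-in demand→capacity estimator: CPU fraction linear in QPS
    per unit capacity. The real system learns this mapping from data."""
    return forecast_qps / (1000.0 * capacity)

def plan(forecast_qps):
    """Planner: cheapest tier that holds the forecast under the ceiling."""
    for name, capacity, cost in sorted(TIERS, key=lambda t: t[2]):
        if estimate_cpu(forecast_qps, capacity) <= CPU_CEILING:
            return name
    return TIERS[-1][0]   # forecast exceeds every tier: take the largest

def gated_plan(forecast_qps, recent_forecast_error, max_error=0.15):
    """Confidence gate: emit a plan only when the forecaster's recent
    relative error is small enough; otherwise self-censor."""
    if recent_forecast_error > max_error:
        return None   # stay silent — reactive scaling remains in charge
    return plan(forecast_qps)

print(plan(600))               # 600/1000 = 0.60 ≤ 0.75 → "M30"
print(plan(3000))              # needs 4x capacity: 0.75 → "M50"
print(gated_plan(3000, 0.40))  # None — forecast not trusted
```

Note the gate returns `None` rather than a hedged smaller plan: per the source's design, an untrusted forecast produces no predictive action at all.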

Cost & carbon framing

From the MongoDB source: "An underloaded server costs the customer more than necessary. An overloaded server is bad for performance." Both directions carry cost:

  • Over-scaled hours — customer billed more, MongoDB spends more on the cloud provider, infrastructure carbon higher.
  • Under-scaled hours — latency / timeout / availability regressions; customer-visible incident.

"If we could anticipate each customer's needs and perfectly scale their servers up and down… that would save our customers money and reduce our carbon emissions." Predictive scaling's theoretical optimum is the curve of "perfectly right-sized at every moment," raising the bar for elasticity from "the customer never thinks about it" to "the customer is never charged for idle capacity."

Asymmetric risk: scale-up vs scale-down

MongoDB's production predictive auto-scaler (November 2025 rollout) ships scale-up-only: "we rely on the existing reactive algorithm to scale them down afterward." The implicit risk framing:

  • Incorrect forecast → unnecessary scale-up → cost regression (the customer is billed slightly more; reactive scaler will scale down later).
  • Incorrect forecast → unnecessary scale-down → latency / overload regression (customer-visible; hard to refund the incident after the fact).

Asymmetric risk is a standard shape in control systems: take the cheaper-if-wrong action on forecasts, reserve the more-dangerous-if-wrong action for the reactive backstop that observes ground truth before acting.
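Under these assumptions, the asymmetry reduces to a single guard in the action path: predictive proposals may only move capacity up, and everything else is deferred to the reactive loop. The tier ladder below is an illustrative assumption:

```python
TIER_ORDER = ["M10", "M20", "M30", "M40", "M50", "M60"]  # illustrative ladder

def apply_predictive(current, proposed):
    """Scale-up-only policy: act on a forecast only when it asks for
    more capacity. Scale-downs wait for the reactive backstop, which
    observes ground truth before acting."""
    if TIER_ORDER.index(proposed) > TIER_ORDER.index(current):
        return proposed   # cheap-if-wrong: at worst a temporary cost regression
    return current        # dangerous-if-wrong: defer to reactive scaling

print(apply_predictive("M30", "M50"))   # forecasted spike → scale ahead
print(apply_predictive("M50", "M30"))   # hold; reactive will scale down later
```

A wrong forecast through this guard can only over-provision, which the reactive scaler later corrects; it can never cause a customer-visible under-capacity incident.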

Relationship to reactive auto-scaling

They're complementary, not alternatives:

  • Predictive handles predictable load changes with zero observed-latency cost.
  • Reactive handles unpredictable load changes and forecast failures.

MongoDB's production architecture explicitly runs both: "All customers who enabled auto-scaling (about a third) will soon have both predictive and reactive auto-scaling." The predictive layer acts when it's confident; the reactive layer is the backstop otherwise.

Exclusion set

Not every workload is predictive-scalable. MongoDB's 2023 prototype excluded 13% of replica sets from predictive scaling because the Estimator's CPU prediction was too inaccurate. The decision framework:

  • Low Estimator accuracy on this replica set → exclude (no basis for a plan).
  • Low Forecaster accuracy on this replica set (no seasonality, no useful short-term trend) → fall back to reactive only.
  • High accuracy on both → predictive scaling.

The exclusion is per-replica-set, not global — same framework, different replica sets get different treatment.
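The per-replica-set routing amounts to two accuracy checks in sequence. A minimal sketch, with threshold values that are illustrative assumptions (the source gives the framework, not the cutoffs):

```python
def scaling_mode(estimator_error, forecaster_error,
                 max_estimator_error=0.10, max_forecaster_error=0.15):
    """Route one replica set by recent relative error of its Estimator
    and Forecaster. Threshold values are assumptions, not MongoDB's."""
    if estimator_error > max_estimator_error:
        return "excluded"        # no basis for a capacity plan
    if forecaster_error > max_forecaster_error:
        return "reactive-only"   # no seasonality, no usable short-term trend
    return "predictive"

print(scaling_mode(0.25, 0.05))   # Estimator untrusted → "excluded"
print(scaling_mode(0.05, 0.40))   # Forecaster untrusted → "reactive-only"
print(scaling_mode(0.05, 0.05))   # both trusted → "predictive"
```

Because the routing runs per replica set, the same fleet-wide framework yields different treatment for different workloads, matching the reported 13% exclusion rate in the 2023 prototype.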
