MongoDB Predictive Auto-Scaling: An Experiment

Summary

MongoDB Engineering retrospective on the 2023 internal research prototype that explored whether a predictive auto-scaler could outperform MongoDB Atlas's then-existing reactive auto-scaler on a random sample of 10,000 replica sets. The reactive algorithm scales one tier at a time (M60 → M50, never M60 → M10 directly), reacts only after "a few minutes of overload or a few hours of underload," and pays a multi-minute scaling operation on top — so dramatic demand shifts take several scaling operations to settle. The prototype hypothesised that temporal patterns (daily / weekly cycles + short-term trends) in customer workloads are exploitable: scale up just before a forecasted spike, skip intermediate tiers, and save both money + p99 latency. Post is the research-side look-back now that a different, production-grade predictive auto-scaler (scale-up-only, complements the existing reactive scale-down) has shipped in MongoDB Atlas (rolled out starting November 2025; separate product announcement).

The prototype had three components forming a Forecaster + Estimator + Planner pipeline. Forecaster predicts each replica set's future customer-driven metrics (queries/sec, client connections, scanned-objects rate) — metrics chosen precisely because they're independent of instance size and scaling actions, avoiding the circular forecast that predicting CPU directly would produce. Estimator maps forecasted demand × candidate instance size → projected CPU%. Planner picks the cheapest instance the Estimator says can hold the next ~15 minutes of demand under a 75% CPU ceiling. The forecaster is further split into a Long-Term Forecaster (MSTL + ARIMA residuals on several weeks of history, captures daily / weekly seasonality) and a Short-Term Forecaster (trend interpolation on the last 1–2 hours, used when the long-term signal isn't trustworthy) — the two-forecaster shape.
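
The three-stage dataflow can be sketched in a few lines. The tier names and the 75% ceiling come from the post; the prices, capacities, stub Forecaster, and stub Estimator below are invented placeholders, not the prototype's actual models.

```python
# Hypothetical sketch of the Forecaster -> Estimator -> Planner dataflow.
# (name, $/hour, capacity in queries/sec) -- prices and capacities assumed.
TIERS = [("M10", 0.08, 500), ("M20", 0.20, 1000), ("M30", 0.54, 2000),
         ("M40", 1.04, 4000), ("M50", 2.00, 8000), ("M60", 3.95, 16000)]

def forecast_demand(history):
    """Forecaster stub: predict customer-driven demand ~15 min ahead.
    Placeholder: repeat the last observation."""
    return history[-1]

def estimate_cpu(demand_qps, capacity):
    """Estimator stub: map (demand, instance size) -> projected CPU fraction."""
    return demand_qps / capacity

def plan(history, ceiling=0.75):
    """Planner: cheapest tier whose projected CPU stays under the ceiling."""
    demand = forecast_demand(history)
    for name, price, capacity in TIERS:  # ordered cheapest-first
        if estimate_cpu(demand, capacity) <= ceiling:
            return name
    return TIERS[-1][0]  # demand exceeds every tier: take the largest

print(plan([900, 1100, 1400]))  # -> M30, skipping M10/M20 entirely
```

Note how the Planner jumps straight to the cheapest feasible tier rather than stepping one tier at a time, which is exactly the relaxation predictive scaling buys.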

Reported prototype numbers on the 10,000-replica-set test sample: ~25% of replica sets have weekly seasonality, most have daily seasonality, hourly seasonality is rare and useless for quarter-hour scaling decisions anyway. Short-term trend interpolation beats naïve-last-observation 68% of the time, 29% reduction in error. Estimator error rates: ~45% of replica sets under 7% CPU error, ~42% "somewhat less accurate but useful in extreme cases," remaining 13% excluded from predictive scaling. Headline savings claim: 9¢/hour/replica-set average, extrapolated as "millions of dollars a year if… enabled for all MongoDB Atlas users." Load-bearing self-censoring primitive: the Long-Term Forecaster scores its own confidence from recent accuracy and only emits a prediction when recent error has been small.

Explicit disclosure on the relationship between prototype and shipped product: "the production version of the algorithm is quite different from the prototype, and so far, it only scales replica sets up before a predicted load spike; we rely on the existing reactive algorithm to scale them down afterward." Experimental prototype → different production code — the prototype-before-production shape MongoDB used here is the "godparent, not parent" variant: the prototype's value was the learning, not the code.

Key takeaways

  1. Reactive auto-scaling has a structural latency floor ≈ reaction time + scaling operation time. MongoDB Atlas's reactive auto-scaler waits "a few minutes of overload / a few hours of underload" before triggering, then the scaling op itself takes "several minutes." An overloaded server "could interfere with the scaling operation itself." "To radically improve auto-scaling, we needed an algorithm that could see the future." Named force: scaling latency added to reaction latency bounds how responsive reactive scaling can ever be. (Source: sources/2026-04-07-mongodb-predictive-auto-scaling-an-experiment)

  2. One tier at a time is the structural binder on dramatic demand shifts under reactive scaling. "It only scales between adjacent tiers; for example, if an M60 replica set is underloaded, Atlas will scale it down to M50, but not directly to any tier smaller than that. If the customer's demand changes dramatically, it takes several scaling operations to reach the optimum server size." Predictive scaling's "scale directly to the right server size, skipping intermediate tiers" is the corresponding relaxation. Tiered instance sizing (M10 / M20 / … / M60 / …) is the Atlas abstraction the scaling operates over.

  3. Forecast customer-driven metrics, not CPU — otherwise the forecast self-invalidates. Load-bearing design choice: "we can't just train a model based on recent fluctuations of CPU, because that would create a circular dependency: if we predict a CPU spike and scale accordingly, we eliminate the spike, invalidating the forecast. Instead we forecast metrics unaffected by scaling, which we call 'customer-driven metrics' — e.g., queries per second, number of client connections, and the scanned-objects rate. We assume these are independent of instance size or scaling actions. (Sometimes this is false; a saturated server exerts backpressure on the customer's queries. But customer-driven metrics are normally exogenous.)" Canonical wiki instance of the self-invalidating forecast hazard and its forecast-the-exogenous-inputs remediation.

  4. Long-term forecast via MSTL + ARIMA on several weeks of data, predict a few hours ahead. "Our forecasting model, MSTL (multi-seasonal trend decomposition using LOESS), extracts components from the time series for each customer-driven metric for an individual replica set. It separates long-term trends (e.g., this replica set's query load is steadily growing) and 'seasonal' components (daily and weekly) while isolating residuals. We handle these residuals with a simple autoregressive model from the ARIMA family. … Despite the name, the Long-Term Forecaster doesn't project far into the future; it's trained on several weeks of data to capture patterns, then predicts a few hours ahead." Retraining cadence: "every few minutes, as new samples arrive."
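
A minimal stand-in for that decomposition, assuming hourly samples and only a daily season: per-phase seasonal means (where MSTL would use LOESS smoothing) plus an AR(1) fit on the residual (where the prototype used an ARIMA-family model). Everything below, including the synthetic workload, is a simplified illustration rather than the prototype's code.

```python
# Simplified seasonal-decomposition forecast: daily seasonal means + AR(1)
# on the residual, trained on weeks of hourly samples, predicting hours ahead.
PERIOD = 24  # hourly samples, daily seasonality

def decompose(series, period=PERIOD):
    """Split series into (per-phase seasonal means, residuals)."""
    phases = [[] for _ in range(period)]
    for t, y in enumerate(series):
        phases[t % period].append(y)
    seasonal = [sum(p) / len(p) for p in phases]
    resid = [y - seasonal[t % period] for t, y in enumerate(series)]
    return seasonal, resid

def fit_ar1(resid):
    """Least-squares AR(1) coefficient: r[t] ~ phi * r[t-1]."""
    num = sum(a * b for a, b in zip(resid[1:], resid[:-1]))
    den = sum(r * r for r in resid[:-1])
    return num / den if den else 0.0

def forecast(series, steps, period=PERIOD):
    seasonal, resid = decompose(series, period)
    phi, r = fit_ar1(resid), resid[-1]
    out = []
    for h in range(1, steps + 1):
        r *= phi  # AR(1) residual decays toward zero
        out.append(seasonal[(len(series) + h - 1) % period] + r)
    return out

# Synthetic two weeks of hourly QPS with a clean 9am-5pm daily cycle.
history = [100 + 50 * (9 <= t % 24 < 17) for t in range(24 * 14)]
print(forecast(history, 3))  # -> [100.0, 100.0, 100.0] (off-peak hours ahead)
```

Retraining "every few minutes, as new samples arrive" would simply re-run `decompose` and `fit_ar1` on the updated window.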

  5. Seasonality distribution matters: daily yes, weekly sometimes, hourly useless. "Most MongoDB Atlas replica sets have daily seasonality. About 25% have weekly seasonality. Generally, if a replica set has weekly seasonality, it also has daily seasonality. Hourly seasonality is rare, and anyway, it isn't helpful for planning a scaling operation that takes a quarter-hour." The ~15-minute scaling-operation horizon sets the floor on which seasonal components are worth forecasting.

  6. Self-censoring fallback: the Long-Term Forecaster scores its own confidence. "So we added a 'self-censoring' mechanism to our prototype: the Long-Term Forecaster scores its own confidence based on its recent accuracy, and only trusts its prediction if its recent error has been small." Uncertainty on the forecast itself gates whether it's used — sibling discipline to the calibration-as-gate seen in patterns/cheap-approximator-with-expensive-fallback. Here the fallback is a second model, not an expensive solver.
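
The gate can be as simple as a rolling window of recent relative errors; the window size and the 10% threshold below are invented, since the post doesn't specify the actual scoring rule.

```python
from collections import deque

class SelfCensoringForecaster:
    """Wraps a forecaster; emits predictions only while recent error is small."""

    def __init__(self, model, max_recent_error=0.10, window=12):
        self.model = model                  # callable: history -> prediction
        self.max_recent_error = max_recent_error
        self.errors = deque(maxlen=window)  # rolling relative errors

    def record(self, predicted, actual):
        """Score a past prediction once the actual value arrives."""
        self.errors.append(abs(predicted - actual) / max(abs(actual), 1e-9))

    def predict(self, history):
        """Return a prediction, or None to defer to the short-term fallback."""
        if self.errors and sum(self.errors) / len(self.errors) > self.max_recent_error:
            return None  # self-censor: recent accuracy too poor to trust
        return self.model(history)

f = SelfCensoringForecaster(model=lambda h: h[-1])
f.record(predicted=100, actual=200)  # 50% error on the last prediction
print(f.predict([100, 110]))         # -> None: censored, use the fallback
```

Returning `None` is the hand-off point to the Short-Term Forecaster in the two-forecaster shape.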

  7. Short-Term Forecaster handles non-seasonal replica sets via trend interpolation on 1–2 hours. "So we prototyped a 'Short-Term Forecaster'; this model uses only the last hour or two of data and does trend interpolation. We compared this to a naïve baseline Forecaster, which assumes the future will look like the last observation, and found that trend interpolation beats the baseline 68% of the time (29% reduction in error)." Two-forecaster architecture: long-term when trustworthy, short-term otherwise; "we didn't want to fall back to purely-reactive scaling; we can still do better than that."
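
Trend interpolation plausibly amounts to fitting a line through the recent window and extrapolating one step; a least-squares sketch (the window size and data are invented):

```python
def naive_forecast(window):
    """Baseline: the future looks like the last observation."""
    return window[-1]

def trend_forecast(window):
    """Least-squares line over the window, extrapolated one step ahead."""
    n = len(window)
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(window) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, window))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y + slope * (n - mean_x)  # extrapolate to x = n

window = [100, 120, 140, 160]  # steadily rising QPS over the last samples
print(naive_forecast(window))  # -> 160 (lags the trend)
print(trend_forecast(window))  # -> 180.0 (follows it)
```

On a steadily trending series like this, the naive baseline is always one slope-step behind, which is the gap the reported 29% error reduction exploits.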

  8. Estimator uses boosted decision trees to map demand + instance size → CPU. "Using a regression model based on boosted decision trees trained on millions of samples, we've achieved fairly accurate results. For around 45% of replica sets, our error rate is under 7%, allowing us to make precise scaling decisions. For another 42%, the model is somewhat less accurate but useful in extreme cases. We exclude the remaining 13% of replica sets with higher error rates from predictive scaling." Estimator retrained "rarely", only when new hardware or a more efficient server version becomes available — expected direction "to train an Estimator for each MongoDB version." Hard problem: "we can't see our customers' queries or their data."
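
A from-scratch miniature of gradient boosting with depth-1 regression stumps, standing in for the prototype's boosted-decision-tree Estimator; a production system would use a real library, and the synthetic data-generating rule below (capacity doubling per tier) is invented.

```python
# Miniature gradient boosting: features (operations/sec, instance-size index),
# target CPU fraction. Each round fits a stump to the current residuals.

def fit_stump(X, residuals):
    """Best single (feature, threshold) split minimizing squared error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            left = [r for x, r in zip(X, residuals) if x[f] <= thr]
            right = [r for x, r in zip(X, residuals) if x[f] > thr]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lv) ** 2 for r in left)
                   + sum((r - rv) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, f, thr, lv, rv)
    _, f, thr, lv, rv = best
    return lambda x: lv if x[f] <= thr else rv

def fit_boosted(X, y, rounds=50, lr=0.3):
    """Fit stumps to successive residuals; predict base + lr * sum(stumps)."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, resid)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, X)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

# Synthetic training set: cpu = ops / capacity, capacity doubling per tier.
samples = [((ops, tier), ops / (500 * 2 ** tier))
           for ops in (100, 300, 500, 700, 900) for tier in range(4)]
X = [x for x, _ in samples]
y = [cpu for _, cpu in samples]
cpu_model = fit_boosted(X, y)
print(round(cpu_model((700, 1)), 2))  # close to the true 0.7
```

The real Estimator was trained on 25 million such (demand, instance size, CPU) samples; the hard part the post names, inferring CPU cost without seeing customers' queries or data, is exactly what the 13% exclusion tier absorbs.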

  9. Planner target: cheapest tier that holds 15 minutes of demand under 75% CPU. "With both forecasts and CPU estimates, the Planner can choose the cheapest instance size that we guess can handle the next 15 minutes of customer demand without exceeding 75% CPU. Our experiment showed that this predictive scaler, compared to the reactive scaler in use during the test period, would've stayed closer to the CPU target and reduced over- and under-utilization." Three fixed parameters of the 2023 prototype: 15-minute horizon, 75% CPU ceiling, cheapest-that-fits objective.
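
Concretely, the Planner only has to cover the worst point of the 15-minute forecast; a sketch with invented tier prices/capacities and an invented per-minute demand forecast:

```python
# Cheapest-that-fits over a 15-minute horizon under a 75% CPU ceiling.
# Tier prices and capacities (queries/sec) are invented for illustration.
TIERS = [("M10", 0.08, 500), ("M20", 0.20, 1000),
         ("M30", 0.54, 2000), ("M40", 1.04, 4000)]

def plan_over_horizon(forecast_qps, ceiling=0.75):
    """Cheapest tier whose worst forecasted minute stays under the ceiling."""
    peak = max(forecast_qps)
    for name, price, capacity in TIERS:  # ordered cheapest-first
        if peak / capacity <= ceiling:
            return name
    return TIERS[-1][0]

minutes = [800 + 40 * m for m in range(15)]  # demand ramping toward a spike
print(plan_over_horizon(minutes))  # -> M30 (1360 qps peak / 2000 = 68% CPU)
```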

  10. Headline savings claim: 9¢/hour average replica set; extrapolates to millions/year if enabled fleet-wide. "For the average replica set it saved 9 cents an hour. That could translate to millions of dollars a year if the predictive scaler were enabled for all MongoDB Atlas users." Lower under-utilisation ⇒ lower customer bill + lower MongoDB cloud-compensation cost + lower elasticity-tax carbon footprint. "If we could anticipate each customer's needs and perfectly scale their servers up and down, according to their changing demands, that would save our customers money and reduce our carbon emissions."
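
The extrapolation is plain arithmetic: 9¢/hour is roughly $788 per always-on replica set per year, so "millions of dollars a year" implies savings across thousands of replica sets (fleet counts aren't disclosed in the post).

```python
savings_per_hour = 0.09                 # dollars per replica set (reported)
per_year = savings_per_hour * 24 * 365  # always-on hours in a year
print(round(per_year, 2))               # -> 788.4 dollars/replica-set/year
print(round(1_000_000 / per_year))      # -> 1268 replica sets per $1M/year
```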

  11. Production version is conservative: scale-up-only, reactive-scale-down retained. "In November 2025, we began rolling it out in MongoDB Atlas. All customers who enabled auto-scaling (about a third) will soon have both predictive and reactive auto-scaling. This first version is conservative, it only uses predictions to scale replica sets up. If load declines, the reactive auto-scaler will scale the replica set back down after a few hours." Asymmetric risk framing implicit: unnecessary scale-up is a cost regression (and a slow refund), unnecessary scale-down is a latency / overload regression with direct customer impact — the conservative asymmetry is that the predictive scaler only takes the lower-risk action while the fallback reactive scaler keeps the higher-risk action it already handled.

  12. Prototype → production: "godparent, not parent." "The algorithms are different, and the code is new — the experiment is more of a godparent to the product, rather than its parent." The 2023 prototype's deliverable was the learning: only some replica sets are predictable; short-term trends often beat daily / weekly cycles; forecast exogenous metrics not CPU; self-censor when the forecast isn't trustworthy. The prototype code itself wasn't productionised. Research-then-rewrite realisation of prototype before production at the infrastructure-algorithm layer.

Operational numbers

  • Test sample size: 10,000 random MongoDB Atlas replica sets, history split into training + testing periods in standard ML fashion.
  • Estimator training corpus: 25 million time-point samples from random replica sets. Each sample = (operations/sec, instance size, CPU utilization).
  • Planner horizon: 15 minutes ahead.
  • CPU target: 75% ceiling ("target range 50–75%" implied as the CPU-utilisation band the planner steers toward).
  • Scaling-operation duration: "several minutes" — sets the lower bound on predict-ahead horizon.
  • Reactive trigger latency: "a few minutes of overload" / "a few hours of underload".
  • Long-term forecaster training window: several weeks.
  • Long-term forecaster prediction horizon: a few hours.
  • Long-term forecaster retraining cadence: every few minutes.
  • Short-term forecaster training window: 1–2 hours.
  • Seasonality distribution: daily = majority of replica sets; weekly = ~25% (and weekly ⇒ daily in practice); hourly = rare and useless at ~15-minute scaling cadence.
  • Short-term beats naïve baseline: 68% of the time, 29% reduction in error.
  • Estimator quality tiers: ~45% of replica sets with <7% CPU error (precise decisions); ~42% less accurate but useful in extremes; ~13% excluded from predictive scaling.
  • Customer adoption ceiling: "about a third" of Atlas customers have auto-scaling enabled.
  • Reported per-replica-set savings: ~9¢/hour average.
  • Extrapolation: "millions of dollars a year" at fleet-wide adoption.
  • Production rollout start: November 2025 (scale-up-only predictive + existing reactive scale-down).
  • Prototype date: 2023.

Caveats

  • Prototype paper, not production deliverable. The "algorithms are different, and the code is new" — every numeric above ships with that caveat; production behaviour is not directly inferable.
  • Scale-up-only in production 2025. The prototype's scale-down-on-forecast capability did not ship — "we rely on the existing reactive algorithm to scale them down afterward." Asymmetric risk tolerance not quantified in the post.
  • 13% of replica sets excluded from predictive scaling — diverse workload shapes for which the Estimator isn't accurate enough. Which workload classes and why isn't decomposed in the post.
  • Customer-driven metrics assumed exogenous — explicitly disclosed as "sometimes false; a saturated server exerts backpressure on the customer's queries." Doesn't rise to a contradiction with the rest of the method, but the assumption is load-bearing and sometimes violated.
  • Estimator problem is hard: "this is a hard problem, since we can't see our customers' queries or their data, but we did our best." Per-MongoDB-version Estimators is the stated roadmap; within-version workload diversity remains.
  • Adoption penalty: only "about a third" of Atlas customers have auto-scaling enabled; the fleet-wide savings projection assumes universal enablement, but not all customers opt in and the scale-up-only production variant has different economics than the prototype.
  • No direct benchmark vs. cheap-approximator-with-expensive-fallback variant — the self-censoring forecaster falls back to a second model (the short-term forecaster) rather than to reactive scaling; the asymmetric-fallback framing is implicit, not directly compared.
  • Figure content not in captured raw. The raw markdown names nine figures but the image content / captions aren't captured; where the post relies on a figure (e.g. MSTL decomposition, Planner cartoon) the wiki captures only the prose claim.
  • No discussion of cold-start — how the forecaster handles a newly-created replica set with <1 week of history (hence no daily/weekly seasonality observable) isn't in the post.

Systems

  • systems/mongodb-atlas — MongoDB's managed cloud database where the reactive + predictive auto-scalers live. This source adds predictive auto-scaling as a key capability (rolled out November 2025, scale-up-only) alongside the existing reactive auto-scaler. Tier-based instance-sizing abstraction (M10 / M20 / … / M60 / …) is the granularity the scaler operates on.
  • systems/mongodb-server — the replica-set software whose CPU utilisation is the target metric. The server's standard per-replica-set performance-metrics history is the data substrate the research depended on ("Atlas keeps servers' past performance metrics").

Concepts

  • concepts/predictive-autoscaling — the primary concept: scale before load arrives by forecasting demand. MongoDB Atlas prototype 2023 + production 2025 is the canonical wiki instance at a managed-database layer.
  • concepts/reactive-autoscaling — the baseline against which predictive is pitched: scale after observed over/underload. MongoDB Atlas's pre-2025 auto-scaler is the comparison instance; reaction-latency + scaling-operation-latency + one-tier-at-a-time bounds.
  • concepts/customer-driven-metrics — metrics driven by the customer's workload (QPS, client connections, scanned-objects rate) that are independent of instance size and scaling actions, hence forecastable without self-invalidation. MongoDB 2026-04-07 named this concept precisely.
  • concepts/seasonality-daily-weekly — the temporal pattern class exploitable by the Long-Term Forecaster: daily + (sometimes) weekly cycles. Adoption distribution named explicitly (most daily, ~25% weekly, hourly rare/irrelevant).
  • concepts/self-censoring-forecast — the primitive of a model that scores its own recent accuracy and emits a prediction only when recent error has been small. Gatekeeper for the fallback-to-short-term shape.
  • concepts/self-invalidating-forecast — the hazard class: predicting a metric whose value is affected by the control action that consumes the prediction (CPU utilisation here). MongoDB names it "circular dependency" and remedies by forecasting exogenous inputs instead.
  • concepts/tier-based-instance-sizing — MongoDB Atlas's discrete-catalog instance-size abstraction (M10, M20, M30, …, M60). Underpins the "one tier at a time" reactive constraint and the "skip intermediate tiers" predictive benefit.
  • concepts/scaling-latency — reaction time + scaling-op time = the latency predictive scaling works around. MongoDB's numbers: "a few minutes of overload" + "several minutes" scaling operation.
  • concepts/elasticity — the property predictive scaling is trying to make predictive rather than reactive. The "imaginary perfect auto-scaling algorithm" framing is another articulation of elasticity as an architectural target.
  • concepts/performance-prediction — the Estimator's problem: predict system performance (CPU%) without running the workload. Close sibling of Google's RLM on Borg (cluster scheduler) but at a managed-database layer — boosted decision trees over (demand, instance size) inputs rather than encoder-decoder LM over YAML cluster state.
  • concepts/spiky-traffic — the traffic regime predictive scaling targets when bursts are forecastable (cyclical or trended). Extends the concept with a predictable-spiky sub-class where forecasting is the absorption mechanism instead of in-place batching.
  • concepts/uncertainty-quantification — the self-censoring mechanism is an uncertainty-quantification application: recent-error as a confidence proxy. Sibling of Google's RLM-on-Borg sampled-distribution-width approach but from validation-accuracy rather than from the model's own output distribution.
  • concepts/circular-dependency — the hazard name MongoDB uses for the forecast-self-invalidation problem. Pre-existing wiki page is about deployment-context circular dependency (dogfooding incidents); this source adds a sibling angle: forecast-context circular dependency (control action invalidates the forecast it's consuming).

Patterns

  • patterns/forecast-then-size-planner — canonical pattern this source introduces: Forecaster predicts customer demand, Estimator maps (demand × candidate size) → CPU, Planner picks cheapest-that-fits within a CPU ceiling and a horizon. Three-component pipeline; each component is independently retrainable. MongoDB Atlas 2023 prototype is the canonical instance.
  • patterns/short-plus-long-term-forecaster — two forecasters on the same metric at different timescales, selector based on long-term self-censoring signal. Long-term = MSTL + ARIMA on weeks → hours-ahead; short-term = trend interpolation on 1–2 hours → minutes-ahead. MongoDB Atlas prototype is the canonical instance.
  • patterns/cheap-approximator-with-expensive-fallback — the self-censoring Long-Term Forecaster → Short-Term Forecaster fallback shape is a sibling: uncertainty-gated selection between two predictors. MongoDB's variant has no expensive-solver fallback at the bottom (reactive scaling is the implicit ground truth), so the structural fit is partial — same calibrated-uncertainty-as-control-signal discipline, different fallback topology (two models + reactive backstop vs. model + slow authoritative solver).
  • patterns/prototype-before-production — MongoDB's explicit "godparent, not parent" framing: 2023 prototype on 10,000 replica sets → learnings (seasonality distribution, exogenous-metric choice, self-censoring necessity, Estimator scope) → 2025 production code that is new code with different algorithms. Research-rewrite variant of the pattern (prototype codebase is not the production codebase).

Source

  • sources/2026-04-07-mongodb-predictive-auto-scaling-an-experiment