CONCEPT Cited by 3 sources
Performance prediction¶
Performance prediction is the problem class of estimating a system's performance metric — throughput, latency, efficiency, resource cost — from a description of its state, without actually running the system (or running only a cheap proxy). The alternative is to run the authoritative solver / simulator / production workload every time a prediction is needed; at cluster scale that's prohibitive.
Why it matters at scale¶
Three recurring shapes motivate performance-prediction work:
- Scheduling / resource allocation. Schedulers want to evaluate many candidate placements per job. Running the combinatorial bin-packer against every candidate is too expensive; a predictor can short-circuit bad candidates.
- Capacity planning. Forecasting the efficiency of a hypothetical fleet configuration (hardware mix, workload mix, scaling knob) requires running the real system against the hypothetical — which doesn't exist yet.
- Counterfactual policy evaluation. Before rolling out a scheduler / pricing / routing change, you want to know what would have happened under the new policy on historical state — without re-running production.
Traditional approaches¶
- Analytical models. Hand-written queueing-theory / performance-model expressions. Accurate when the system is simple; break on heterogeneous, bursty, multi-tenant systems.
- Discrete-event simulators. Replay a workload trace against a simulated stack. Accurate but slow; the cost of the simulation is proportional to the cost of the original system.
- Tabular ML regression. Feature-engineer the system state into a fixed-length vector; train a GBM / MLP / linear model on (features, metric) pairs. Fast at inference but fragile to schema change; feature engineering dominates.
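The schema fragility is concrete: a fixed-length featurizer bakes the schema into code, so unknown fields are silently dropped and missing ones crash. A minimal sketch (field names hypothetical, not any real Borg schema):

```python
def featurize(state: dict) -> list[float]:
    # Fixed-length vector: the schema is baked in at training time.
    return [state["cpu_request"], state["mem_gb"], float(state["priority"])]

# A field the featurizer doesn't know about ("tpu_type") is silently
# ignored; a missing known field raises KeyError. Either way, every
# schema change is a code-and-retrain change.
vec = featurize({"cpu_request": 4.0, "mem_gb": 16.0,
                 "priority": 2, "tpu_type": "v5e"})
```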
Text-to-text regression as a general answer¶
The 2025-07-29 Google post positions text-to-text regression with language models as a general path to performance prediction that sidesteps the feature-engineering cost of tabular ML. The RLM (regression language model) reads the state as a string and emits the metric as a string; no feature engineering, no normalisation, no schema migration when new data types appear. The production demonstration is predicting MIPS per GCU on Borg — specifically, the numeric output of the bin-packing solver Google runs inside its digital-twin backtester (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
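The interface this buys is string-in / string-out. The sketch below shows the shape of the contract only; the serialisation format and decoding are hypothetical stand-ins, not Google's actual pipeline:

```python
import json

def serialize_state(state: dict) -> str:
    # The whole state, new fields included, becomes the model's input
    # text; there is no featurizer to migrate when the schema grows.
    return json.dumps(state, sort_keys=True)

def decode_metric(output_tokens: str) -> float:
    # The model emits the metric as text; parsing the string is the
    # only "decoding" step on the output side.
    return float(output_tokens.strip())
```

Note the asymmetry with the tabular approach: adding a field changes the input string, not the code.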
The cheap-approximator / expensive-fallback deployment¶
Performance predictors are typically deployed as cheap approximators with an expensive fallback: the fast ML model answers most queries, and the slow authoritative solver is invoked only when the approximator reports low confidence. This pattern requires calibrated uncertainty — the predictor must know when it doesn't know — so uncertainty quantification is load-bearing, not decorative.
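The gating logic of that pattern is small. A minimal sketch, assuming sampling-based uncertainty (`cheap_model` returns one sampled prediction per call; the threshold and sampling API are hypothetical):

```python
import statistics

def predict_with_fallback(state, cheap_model, expensive_solver,
                          n_samples=8, max_rel_spread=0.05):
    # Sample the cheap approximator several times; wide spread means
    # low confidence, which triggers the authoritative solver.
    samples = [cheap_model(state) for _ in range(n_samples)]
    mean = statistics.fmean(samples)
    spread = statistics.pstdev(samples)
    if spread / max(abs(mean), 1e-9) > max_rel_spread:
        # The predictor "knows it doesn't know": pay for the real answer.
        return expensive_solver(state)
    return mean
```

The whole pattern hinges on the `if`: uncalibrated spread makes the gate either always fire (no speedup) or never fire (silent errors).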
Canonical wiki instance¶
Google's 2025-07-29 RLM work on Borg / MIPS-per-GCU is the wiki's canonical production instance of performance prediction:
- Input (x): YAML/JSON serialisation of Borg cluster state (active jobs, execution traces, textual metadata, hardware descriptors).
- Target (y): MIPS per GCU — the output of a specialised bin-packing algorithm run inside the Borg digital twin.
- Predictor: 60M-param two-layer encoder-decoder RLM with an 8k-token context window.
- Uncertainty: recovered by sampling multiple decodes; correlates with residual squared error.
- Reported quality: "near-perfect" Spearman rank correlation across diverse Borg regression tasks; actual ρ values in the backing paper, not the blog post.
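The calibration claim above — sample spread correlates with residual squared error — is checkable: compute a rank correlation between per-example spread and error on a held-out set. A tie-free Spearman sketch (production code would use a library routine that handles ties):

```python
def spearman(xs, ys):
    # Spearman's rho = Pearson correlation on ranks.
    # Tie handling omitted for brevity.
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

Feeding it (spread, squared-error) pairs per example gives a single calibration number in [-1, 1]; high positive ρ is what makes the fallback gate trustworthy.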
Related framings elsewhere on the wiki¶
- Query-plan cost estimation — the classic OLAP variant of performance prediction; optimiser picks among query plans using a cost model. (Not yet ingested on the wiki.)
- Real-time decision systems share the "cheap approximator trading accuracy for latency" shape.
Open questions¶
- Wall-clock speedup of the RLM vs. the bin-packer it replaces is not disclosed in the 2025-07-29 post.
- How much of the cheap-approximator's accuracy is the LM architecture vs. the data — i.e. whether a well-tuned tabular GBM on engineered Borg features would match at 10× lower inference cost — is an open question the post does not address.
Second Google Research proof point: VM-lifetime prediction (2025-10-17)¶
The 2025-10-17 LAVA post opens a second performance-prediction angle on Borg-adjacent scheduling at a different layer. Where the RLM predicts the bin-packer's output (MIPS-per-GCU) so the scheduler's inner loop can skip the slow solver, the LAVA family predicts a different target — the remaining-lifetime distribution of individual VMs — and uses that prediction to augment the placement policy itself (Source: sources/2025-10-17-google-solving-virtual-machine-puzzles-lava).
Two load-bearing disciplines distinguish it as a subclass:
- Asymmetric decision cost. A wrong VM-lifetime prediction can "tie up an entire host for an extended period" — the cost is non-linear in the error and bounded below only by the VM's actual lifetime.
- Continuous reprediction — the LAVA family doesn't commit to a single prediction at VM creation; the estimate updates as the VM runs. This makes the predictor an online component of the scheduler's state, not an offline classifier.
The RLM (bin-packer-output prediction) and LAVA (VM-lifetime prediction) together illustrate performance prediction at two different insertion points on the same substrate — the pattern of ML-for-systems-with-production-proof-points recurs, but what is being predicted shifts.
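Continuous reprediction has a simple probabilistic core: condition the learned lifetime distribution on the age the VM has already survived. A toy sketch, with an empirical sample standing in for LAVA's learned distribution:

```python
def expected_remaining(lifetimes, age):
    # Condition the (here: empirical) lifetime distribution on survival
    # to `age`; the estimate is re-queried as the VM runs, so it is
    # scheduler state, not a one-shot classification at VM creation.
    survivors = [t for t in lifetimes if t > age]
    if not survivors:
        return 0.0
    return sum(t - age for t in survivors) / len(survivors)
```

With a short-VM-dominated fleet the expected remaining lifetime rises as a VM ages — precisely the signal that should update online if placement and rescheduling decisions depend on it.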
Seen in¶
- sources/2025-07-29-google-simulating-large-systems-with-regression-language-models — Borg MIPS-per-GCU prediction via text-to-text regression.
- sources/2025-10-17-google-solving-virtual-machine-puzzles-lava — VM-lifetime prediction with continuous reprediction as input to scheduler placement + rescheduling policy.
- sources/2026-04-07-mongodb-predictive-auto-scaling-an-experiment — MongoDB Atlas's Estimator is a performance-prediction component deployed inside a control loop: a boosted-decision-tree regressor trained on 25 M (demand, instance size, CPU) samples that maps forecasted demand + candidate tier → expected CPU %. Contrasts with Google's RLM-on-Borg along two axes: (1) input dimensionality — the RLM reads raw YAML cluster state; MongoDB's Estimator takes clean (demand, size) pairs, decoupled from the Forecaster by design; (2) uncertainty signal source — the RLM's sampled distribution width vs. MongoDB's recent-accuracy gate on the Forecaster (the Estimator itself emits point predictions). Same deployment pattern at both layers: cheap-approximator-with-expensive-fallback — act on the prediction when confident, fall back to the authoritative solver / reactive mechanism otherwise.
Related¶
- concepts/text-to-text-regression
- concepts/digital-twin-backtesting
- concepts/uncertainty-quantification
- concepts/bin-packing
- concepts/vm-lifetime-prediction
- concepts/continuous-reprediction
- concepts/learned-lifetime-distribution
- systems/borg
- systems/regression-language-model
- systems/lava-vm-scheduler
- patterns/cheap-approximator-with-expensive-fallback
- patterns/lifetime-aware-rescheduling
- patterns/learned-distribution-over-point-prediction
- patterns/token-limit-aware-feature-prioritization