
CONCEPT Cited by 3 sources

Performance prediction

Performance prediction is the problem class of estimating a system's performance metric — throughput, latency, efficiency, resource cost — from a description of its state, without actually running the system (or running only a cheap proxy). The alternative is to run the authoritative solver / simulator / production workload every time a prediction is needed; at cluster scale that's prohibitive.

Why it matters at scale

Three recurring shapes motivate performance-prediction work:

  • Scheduling / resource allocation. Schedulers want to evaluate many candidate placements per job. Running the combinatorial bin-packer against every candidate is too expensive; a predictor can short-circuit bad candidates.
  • Capacity planning. Forecasting the efficiency of a hypothetical fleet configuration (hardware mix, workload mix, scaling knob) requires running the real system against the hypothetical — which doesn't exist yet.
  • Counterfactual policy evaluation. Before rolling out a scheduler / pricing / routing change, you want to know what would have happened under the new policy on historical state — without re-running production.

Traditional approaches

  • Analytical models. Hand-written queueing-theory / performance-model expressions. Accurate when the system is simple; break on heterogeneous, bursty, multi-tenant systems.
  • Discrete-event simulators. Replay a workload trace against a simulated stack. Accurate but slow; simulation cost grows in proportion to the cost of running the original system.
  • Tabular ML regression. Feature-engineer the system state into a fixed-length vector; train a GBM / MLP / linear model on (features, metric) pairs. Fast at inference but fragile to schema change; feature engineering dominates.
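To ground the analytical-model bullet, the textbook case is an M/M/1 queue, whose mean response time has the closed form 1/(μ − λ). A minimal sketch (standard queueing theory, not from any source cited here):

```python
def mm1_mean_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an M/M/1 queue: 1 / (mu - lambda).

    Valid only when the queue is stable, i.e. arrival_rate < service_rate.
    """
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)

# A server handling 100 req/s with jobs arriving at 80 req/s:
# mean response time = 1 / (100 - 80) = 0.05 s
print(mm1_mean_response_time(80.0, 100.0))  # 0.05
```

The fragility the bullet describes is visible here: the closed form assumes Poisson arrivals and a single exponential server, exactly the assumptions that heterogeneous, bursty, multi-tenant systems violate.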

Text-to-text regression as a general answer

The 2025-07-29 Google post positions text-to-text regression with language models as a general path to performance prediction that sidesteps the feature-engineering cost of tabular ML. The RLM reads the state as a string and emits the metric as a string; no feature engineering, no normalisation, no schema migration when new data types appear. The production demonstration is predicting MIPS per GCU on Borg — specifically, the numeric output of the bin-packing solver Google runs inside its digital-twin backtester (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
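The interface the post describes reduces to two string boundaries: serialize structured state in, parse a number out. A minimal sketch, with illustrative field names (not Borg's actual schema) and the LM call stubbed out:

```python
import json

def serialize_state(state: dict) -> str:
    """Flatten a structured cluster-state snapshot into a single string.

    This is the entire 'feature engineering' step in text-to-text regression:
    no fixed-length vector, no normalisation; new fields just appear in the text.
    """
    return json.dumps(state, sort_keys=True)

def decode_metric(generated: str) -> float:
    """Parse the model's generated string back into a numeric metric."""
    return float(generated.strip())

# Hypothetical snapshot; field names are illustrative only.
state = {
    "jobs": [{"name": "websearch", "cpus": 12.0, "priority": 200}],
    "machine": {"platform": "gen5", "num_cores": 96},
}
x = serialize_state(state)    # prompt fed to the regression LM
y = decode_metric(" 7.31 ")   # stand-in for the LM's decoded output string
print(y)  # 7.31
```

Schema drift then costs nothing at the boundary: a new hardware descriptor becomes another key in the serialized text rather than a retrained feature pipeline.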

The cheap-approximator / expensive-fallback deployment

Performance predictors are typically deployed as cheap approximators with an expensive fallback: the fast ML model answers most queries, and the slow authoritative solver is invoked only when the approximator reports low confidence. This pattern requires calibrated uncertainty — the predictor must know when it doesn't know — so uncertainty quantification is load-bearing, not decorative.
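That routing logic can be sketched in a few lines; the function names, the sampling count, and the spread threshold are all illustrative assumptions, not details from the post:

```python
import statistics

def predict_with_fallback(approximator, solver, state, k=8, max_std=0.05):
    """Cheap-approximator / expensive-fallback routing.

    `approximator(state)` is assumed stochastic (e.g. one sampled decode per
    call); the spread of k samples stands in for calibrated uncertainty.
    When the spread exceeds `max_std`, defer to the authoritative `solver`.
    """
    samples = [approximator(state) for _ in range(k)]
    mean, std = statistics.fmean(samples), statistics.pstdev(samples)
    if std > max_std:
        return solver(state), "solver"    # low confidence: pay for ground truth
    return mean, "approximator"           # high confidence: cheap answer

# Deterministic stub for illustration: zero spread, so the approximator answers.
confident = lambda s: 3.0
print(predict_with_fallback(confident, lambda s: 3.1, None))  # (3.0, 'approximator')
```

The load-bearing part is the threshold test: if `std` is not calibrated (if the predictor is confidently wrong), the fallback never fires and the pattern silently degrades.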

Canonical wiki instance

Google's 2025-07-29 RLM work on Borg / MIPS-per-GCU is the wiki's canonical production instance of performance prediction:

  • Input (x): YAML/JSON serialisation of Borg cluster state (active jobs, execution traces, textual metadata, hardware descriptors).
  • Target (y): MIPS per GCU — the output of a specialised bin-packing algorithm run inside the Borg digital twin.
  • Predictor: 60M-param two-layer encoder-decoder RLM with an 8k-token context window.
  • Uncertainty: recovered by sampling multiple decodes; correlates with residual squared error.
  • Reported quality: "near-perfect" Spearman rank correlation across diverse Borg regression tasks; the actual ρ values are in the backing paper, not the blog post.
Adjacent instances:

  • Query-plan cost estimation — the classic OLAP variant of performance prediction; the optimiser picks among candidate query plans using a cost model. (Not yet ingested on the wiki.)
  • Real-time decision systems share the "cheap approximator trading accuracy for latency" shape.
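The quality metric quoted for the RLM, Spearman rank correlation, rewards getting the ordering right rather than the absolute values, which is what a scheduler comparing candidates actually needs. A minimal sketch (ties not handled):

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks.

    For n distinct values with no ties this reduces to
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i = rank(x_i) - rank(y_i).
    """
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Predictions that preserve the true ordering score rho = 1.0 even when the
# absolute values are far off.
print(spearman_rho([1.0, 2.0, 3.0], [10.0, 20.0, 90.0]))  # 1.0
```

This is also why a "near-perfect" ρ does not by itself bound absolute error; the open questions below still apply.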

Open questions

  • Wall-clock speedup of the RLM vs. the bin-packer it replaces is not disclosed in the 2025-07-29 post.
  • How much of the cheap-approximator's accuracy is the LM architecture vs. the data — i.e. whether a well-tuned tabular GBM on engineered Borg features would match at 10× lower inference cost — is an open question the post does not address.

Second Google Research proof point: VM-lifetime prediction (2025-10-17)

The 2025-10-17 LAVA post opens a second performance-prediction angle on Borg-adjacent scheduling at a different layer. Where the RLM predicts the bin-packer's output (MIPS-per-GCU) so the scheduler's inner loop can skip the slow solver, the LAVA family predicts a different target — the remaining-lifetime distribution of individual VMs — and uses that prediction to augment the placement policy itself (Source: sources/2025-10-17-google-solving-virtual-machine-puzzles-lava).

Two load-bearing disciplines distinguish VM-lifetime prediction as a subclass:

  • Asymmetric decision cost. A wrong VM-lifetime prediction can "tie up an entire host for an extended period" — the cost is non-linear in the error and bounded below only by the VM's actual lifetime.
  • Continuous reprediction — the LAVA family doesn't commit to a single prediction at VM creation; the estimate updates as the VM runs. This makes the predictor an online component of the scheduler's state, not an offline classifier.
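The continuous-reprediction discipline can be illustrated with a toy empirical lifetime distribution; LAVA's actual model is learned, so both the data and the method below are illustrative assumptions, not the production system:

```python
def expected_remaining(lifetimes, age):
    """E[remaining lifetime | the VM has already run for `age`].

    Conditioning on survival is what makes reprediction non-trivial: the
    estimate for a long-running VM grows as short-lived mass is ruled out,
    so a one-shot prediction made at creation goes stale.
    """
    survivors = [t - age for t in lifetimes if t > age]
    if not survivors:
        return 0.0
    return sum(survivors) / len(survivors)

# Illustrative mostly-short-lived population (hours); not real Borg data.
sample = [1, 1, 1, 2, 2, 100]
print(expected_remaining(sample, 0))  # ~17.8: most VMs die fast
print(expected_remaining(sample, 3))  # 97.0: a survivor is probably long-lived
```

The jump from ~18 to 97 expected remaining hours after only 3 hours of runtime is the whole argument for reprediction: the creation-time estimate and the running estimate diverge enough to change placement decisions.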

The RLM (bin-packer-output prediction) and LAVA (VM-lifetime prediction) together illustrate performance prediction at two different insertion points on the same substrate — the pattern of ML-for-systems with production proof points recurs, but what is being predicted shifts.
