
CONCEPT Cited by 3 sources

Performance prediction

Performance prediction is the problem class of estimating a system's performance metric — throughput, latency, efficiency, resource cost — from a description of its state, without actually running the system (or running only a cheap proxy). The alternative is to run the authoritative solver / simulator / production workload every time a prediction is needed; at cluster scale that's prohibitive.

Why it matters at scale

Three recurring shapes motivate performance-prediction work:

  • Scheduling / resource allocation. Schedulers want to evaluate many candidate placements per job. Running the combinatorial bin-packer against every candidate is too expensive; a predictor can short-circuit bad candidates.
  • Capacity planning. Forecasting the efficiency of a hypothetical fleet configuration (hardware mix, workload mix, scaling knob) requires running the real system against the hypothetical — which doesn't exist yet.
  • Counterfactual policy evaluation. Before rolling out a scheduler / pricing / routing change, you want to know what would have happened under the new policy on historical state — without re-running production.

Traditional approaches

  • Analytical models. Hand-written queueing-theory / performance-model expressions. Accurate when the system is simple; break on heterogeneous, bursty, multi-tenant systems.
  • Discrete-event simulators. Replay a workload trace against a simulated stack. Accurate but slow; simulation cost grows in proportion to the cost of running the original system.
  • Tabular ML regression. Feature-engineer the system state into a fixed-length vector; train a GBM / MLP / linear model on (features, metric) pairs. Fast at inference but fragile to schema change; feature engineering dominates.
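To ground the analytical-model bullet, the textbook case is an M/M/1 queue, whose mean response time has the closed form 1/(μ − λ). A minimal sketch (standard queueing theory, not from any source cited here):

```python
def mm1_mean_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an M/M/1 queue: 1 / (mu - lambda).

    Valid only when the queue is stable, i.e. arrival_rate < service_rate.
    """
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)

# A server handling 100 req/s with jobs arriving at 80 req/s:
# mean response time = 1 / (100 - 80) = 0.05 s
print(mm1_mean_response_time(80.0, 100.0))  # 0.05
```

The fragility the bullet describes is visible here: the closed form assumes Poisson arrivals and a single exponential server, exactly the assumptions that heterogeneous, bursty, multi-tenant systems violate.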

Text-to-text regression as a general answer

The 2025-07-29 Google post positions text-to-text regression with language models as a general path to performance prediction that sidesteps the feature-engineering cost of tabular ML. The RLM reads the state as a string and emits the metric as a string; no feature engineering, no normalisation, no schema migration when new data types appear. The production demonstration is predicting MIPS per GCU on Borg — specifically, the numeric output of the bin-packing solver Google runs inside its digital-twin backtester (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
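The interface the post describes reduces to two string boundaries: serialize structured state in, parse a number out. A minimal sketch, with illustrative field names (not Borg's actual schema) and the LM call stubbed out:

```python
import json

def serialize_state(state: dict) -> str:
    """Flatten a structured cluster-state snapshot into a single string.

    This is the entire 'feature engineering' step in text-to-text regression:
    no fixed-length vector, no normalisation; new fields just appear in the text.
    """
    return json.dumps(state, sort_keys=True)

def decode_metric(generated: str) -> float:
    """Parse the model's generated string back into a numeric metric."""
    return float(generated.strip())

# Hypothetical snapshot; field names are illustrative only.
state = {
    "jobs": [{"name": "websearch", "cpus": 12.0, "priority": 200}],
    "machine": {"platform": "gen5", "num_cores": 96},
}
x = serialize_state(state)    # prompt fed to the regression LM
y = decode_metric(" 7.31 ")   # stand-in for the LM's decoded output string
print(y)  # 7.31
```

Schema drift then costs nothing at the boundary: a new hardware descriptor becomes another key in the serialized text rather than a retrained feature pipeline.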

The cheap-approximator / expensive-fallback deployment

Performance predictors are typically deployed as cheap approximators with an expensive fallback: the fast ML model answers most queries, and the slow authoritative solver is invoked only when the approximator reports low confidence. This pattern requires calibrated uncertainty — the predictor must know when it doesn't know — so uncertainty quantification is load-bearing, not decorative.
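That routing logic can be sketched in a few lines; the function names, the sampling count, and the spread threshold are all illustrative assumptions, not details from the post:

```python
import statistics

def predict_with_fallback(approximator, solver, state, k=8, max_std=0.05):
    """Cheap-approximator / expensive-fallback routing.

    `approximator(state)` is assumed stochastic (e.g. one sampled decode per
    call); the spread of k samples stands in for calibrated uncertainty.
    When the spread exceeds `max_std`, defer to the authoritative `solver`.
    """
    samples = [approximator(state) for _ in range(k)]
    mean, std = statistics.fmean(samples), statistics.pstdev(samples)
    if std > max_std:
        return solver(state), "solver"    # low confidence: pay for ground truth
    return mean, "approximator"           # high confidence: cheap answer

# Deterministic stub for illustration: zero spread, so the approximator answers.
confident = lambda s: 3.0
print(predict_with_fallback(confident, lambda s: 3.1, None))  # (3.0, 'approximator')
```

The load-bearing part is the threshold test: if `std` is not calibrated (if the predictor is confidently wrong), the fallback never fires and the pattern silently degrades.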

Canonical wiki instance

Google's 2025-07-29 RLM work on Borg / MIPS-per-GCU is the wiki's canonical production instance of performance prediction:

  • Input (x): YAML/JSON serialisation of Borg cluster state (active jobs, execution traces, textual metadata, hardware descriptors).
  • Target (y): MIPS per GCU — the output of a specialised bin-packing algorithm run inside the Borg digital twin.
  • Predictor: 60M-param two-layer encoder-decoder RLM with an 8k-token context window.
  • Uncertainty: recovered by sampling multiple decodes; correlates with residual squared error.
  • Reported quality: "near-perfect" Spearman rank correlation across diverse Borg regression tasks; the actual ρ values are in the backing paper, not the blog post.
Adjacent instances:

  • Query-plan cost estimation — the classic OLAP variant of performance prediction; the optimiser picks among candidate query plans using a cost model. (Not yet ingested on the wiki.)
  • Real-time decision systems share the "cheap approximator trading accuracy for latency" shape.
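The quality metric quoted for the RLM, Spearman rank correlation, rewards getting the ordering right rather than the absolute values, which is what a scheduler comparing candidates actually needs. A minimal sketch (ties not handled):

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks.

    For n distinct values with no ties this reduces to
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i = rank(x_i) - rank(y_i).
    """
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Predictions that preserve the true ordering score rho = 1.0 even when the
# absolute values are far off.
print(spearman_rho([1.0, 2.0, 3.0], [10.0, 20.0, 90.0]))  # 1.0
```

This is also why a "near-perfect" ρ does not by itself bound absolute error; the open questions below still apply.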

Open questions

  • Wall-clock speedup of the RLM vs. the bin-packer it replaces is not disclosed in the 2025-07-29 post.
  • How much of the cheap-approximator's accuracy is the LM architecture vs. the data — i.e. whether a well-tuned tabular GBM on engineered Borg features would match at 10× lower inference cost — is an open question the post does not address.

Second Google Research proof point: VM-lifetime prediction (2025-10-17)

The 2025-10-17 LAVA post opens a second performance-prediction angle on Borg-adjacent scheduling at a different layer. Where the RLM predicts the bin-packer's output (MIPS-per-GCU) so the scheduler's inner loop can skip the slow solver, the LAVA family predicts a different target — the remaining-lifetime distribution of individual VMs — and uses that prediction to augment the placement policy itself (Source: sources/2025-10-17-google-solving-virtual-machine-puzzles-lava).

Two load-bearing disciplines distinguish VM-lifetime prediction as a subclass:

  • Asymmetric decision cost. A wrong VM-lifetime prediction can "tie up an entire host for an extended period" — the cost is non-linear in the error and bounded below only by the VM's actual lifetime.
  • Continuous reprediction — the LAVA family doesn't commit to a single prediction at VM creation; the estimate updates as the VM runs. This makes the predictor an online component of the scheduler's state, not an offline classifier.
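The continuous-reprediction discipline can be illustrated with a toy empirical lifetime distribution; LAVA's actual model is learned, so both the data and the method below are illustrative assumptions, not the production system:

```python
def expected_remaining(lifetimes, age):
    """E[remaining lifetime | the VM has already run for `age`].

    Conditioning on survival is what makes reprediction non-trivial: the
    estimate for a long-running VM grows as short-lived mass is ruled out,
    so a one-shot prediction made at creation goes stale.
    """
    survivors = [t - age for t in lifetimes if t > age]
    if not survivors:
        return 0.0
    return sum(survivors) / len(survivors)

# Illustrative mostly-short-lived population (hours); not real Borg data.
sample = [1, 1, 1, 2, 2, 100]
print(expected_remaining(sample, 0))  # ~17.8: most VMs die fast
print(expected_remaining(sample, 3))  # 97.0: a survivor is probably long-lived
```

The jump from ~18 to 97 expected remaining hours after only 3 hours of runtime is the whole argument for reprediction: the creation-time estimate and the running estimate diverge enough to change placement decisions.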

The RLM (bin-packer-output prediction) and LAVA (VM-lifetime prediction) together illustrate performance prediction at two different insertion points on the same substrate — the pattern of ML-for-systems with production proof points recurs, but what is being predicted shifts.
