Google Research — Simulating large systems with Regression Language Models
Summary

Google Research post (2025-07-29) proposing text-to-text regression with language models as a general, feature-engineering-free path to numeric prediction over complex, unstructured system state. The canonical production instance in the post is a Regression Language Model (RLM) predicting bin-packing-derived efficiency (MIPS per GCU) for Google's Borg cluster manager — i.e. a cheap ML approximator standing in for an expensive combinatorial resource-allocation simulator. The RLM is a 60M-parameter two-layer encoder-decoder that reads a YAML/JSON serialisation of a cluster's state (active jobs, execution traces, textual metadata) as a prompt and emits the predicted metric as a decoded text string — no feature engineering, no normalisation, no fixed-dimension tabular vector. Sampling multiple decodes recovers a full output distribution, so the same model gives point predictions, uncertainty estimates, and density estimates in one call. Prediction uncertainty correlates with residual squared error, enabling a cheap-approximator-with-expensive-fallback operating mode: trust the RLM when it is confident, fall back to the slow bin-packing simulator when it is not. The supporting regress-lm library is open-sourced.
⚠️ Raw-file scope note. The local raw scrape (raw/google/2025-07-29-simulating-large-systems-with-regression-language-models-f1350985.md, 15 lines) captures only the acknowledgements section, not the article body. The content summarised on this page was retrieved in-session by fetching the original URL (https://research.google/blog/simulating-large-systems-with-regression-language-models/). Numbers and claims cited below map to the fetched blog post; the backing paper (https://arxiv.org/abs/2506.21718) has richer benchmarks not summarised here.
Key takeaways
- Text-to-text regression collapses the feature-engineering step that dominates traditional tabular regression on complex systems. Traditional regressors require converting configuration files, system logs, and workload descriptors into fixed-length numeric vectors; when new data types appear (new hardware class, new workload, new config field), the pipeline restarts from scratch. An RLM reads the raw state as a string (YAML / JSON) and writes the numeric answer as a string — no normalisation, no feature selection, no fixed schema (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
- The production target is performance prediction for Borg, specifically bin-packing-derived MIPS-per-GCU. MIPS per GCU (Millions of Instructions Per Second per Google Compute Unit) is Google's efficiency metric for matching workloads to machines. The RLM is trained against the output of a bin-packing algorithm run inside Google's digital twin of Borg — a backtesting framework that replicates real cluster state — so the ML target is what the bin-packer would have decided, not raw hardware counters (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
- Model size is small by LLM standards: a 60M-parameter two-layer encoder-decoder. Google emphasises this is a "relatively simple" encoder-decoder — not a frontier LLM — and frames the RLM as a low-resource, efficient cheap approximator for the bin-packer. The model may be pre-trained or even randomly initialised; adapting to a new regression task is next-token prediction under cross-entropy loss with (x = state-as-string, y = metric-as-string) (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
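The (x = state-as-string, y = metric-as-string) framing can be sketched in a few lines. This is a hypothetical serialisation, not the regress-lm API: the state dict, the field names, and the metric formatting are all illustrative.

```python
import json

def make_training_pair(state: dict, metric: float) -> tuple[str, str]:
    """Hypothetical sketch: cluster state serialised as the prompt string,
    target metric serialised as the decode target. No feature vectors, no
    normalisation; the decoder is trained with ordinary next-token
    cross-entropy on y."""
    x = json.dumps(state, sort_keys=True)  # prompt: raw state-as-string
    y = f"{metric:.4g}"                    # target: metric-as-string
    return x, y

# Illustrative state; real Borg serialisations are YAML/JSON far larger than this.
x, y = make_training_pair(
    {"jobs": [{"cpu": 4.0, "priority": "prod"}], "cell": "cell-a"},
    metric=1.25,
)
```

The point of the sketch is that "adapting to a new task" is just producing more such string pairs; nothing about the model's input layer changes when a new config field appears.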
- Features are prioritised for truncation, not filtered or normalised. Each data point x carries up to 1M tokens of candidate features; the RLM has an 8k-token limit. Google's pre-processing step reorders features so the most important features appear first — when the string is truncated at the token limit, only less-important features are lost. This is the canonical shape of token-limit-aware feature prioritization: don't compress, don't summarise, just order-by-importance and let truncation be the filter (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
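A minimal sketch of the order-by-importance-then-truncate step. The importance scores, field names, and whitespace-based token counting are all assumptions for illustration; a real implementation would use the model's tokenizer.

```python
def serialize_with_priority(features: dict[str, str],
                            importance: dict[str, float],
                            token_budget: int = 8000) -> str:
    """Hypothetical token-limit-aware feature prioritisation: emit features
    in descending importance order so that truncation at the context limit
    drops only the least important tail."""
    ordered = sorted(features, key=lambda k: importance.get(k, 0.0), reverse=True)
    lines, used = [], 0
    for key in ordered:
        line = f"{key}: {features[key]}"
        cost = len(line.split())  # crude stand-in for a real tokenizer
        if used + cost > token_budget:
            break                 # truncation is the filter
        lines.append(line)
        used += cost
    return "\n".join(lines)

# Tiny budget to show the effect: the least important feature is dropped.
ctx = serialize_with_priority(
    {"job_priority": "prod", "cpu_histogram": "0.1 0.2 0.3", "debug_log": "a b c d e f"},
    {"job_priority": 3.0, "cpu_histogram": 2.0, "debug_log": 1.0},
    token_budget=6,
)
```

Note that nothing is compressed or summarised; the only lossy operation is the cut at the budget.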
- Multiple decoded samples from the same prompt recover the full output probability density. Because numbers are represented as text and decoding is stochastic, sampling N outputs gives an empirical distribution that approximates the underlying P(y | x), including tails. Google reports that these empirical densities track the target distribution across different time durations, visualised as regressor-density curves aligned with a kernel-density estimate of ground truth. This is how one model delivers point predictions, distributional predictions, and uncertainty at once (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
- Prediction uncertainty correlates with residual squared error. Broader predicted distributions signal lower confidence; narrower distributions signal higher confidence. Google names the two uncertainty flavours the RLM captures: aleatoric — inherent randomness of the system, e.g. stochastic load demand — and epistemic — stemming from limited observation or features. This calibration is what makes the fast-path/slow-path deployment legitimate: the model knows when it doesn't know, so the slow bin-packing simulation is only invoked on the genuinely uncertain cases (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
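The multi-sample-decoding mechanics above can be sketched as follows. `decode_fn` is a stand-in for one stochastic decode from the model (here faked with Gaussian noise); nothing about it reflects the real regress-lm interface.

```python
import random
import statistics

def predict_with_uncertainty(decode_fn, prompt: str, n: int = 64):
    """Hypothetical sketch of RLM multi-sample decoding: draw n stochastic
    decodes of the metric-as-text, parse them back to floats, and use the
    empirical distribution for both the point estimate and the uncertainty."""
    samples = [float(decode_fn(prompt)) for _ in range(n)]
    return statistics.mean(samples), statistics.stdev(samples), samples

# Stub decoder standing in for the real model: a noisy metric-as-string.
rng = random.Random(0)
fake_decode = lambda prompt: f"{1.0 + rng.gauss(0, 0.05):.4f}"

mean, stdev, samples = predict_with_uncertainty(fake_decode, "jobs: ...", n=256)
```

The same `samples` list also feeds density estimation (e.g. a histogram or KDE), which is how one call yields point, distributional, and uncertainty outputs.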
- Reported pointwise quality: near-perfect Spearman rank correlation across diverse tasks, with the model few-shot adapting to distinct servers and distinct regression tasks. The paper calls out the RLM as an adaptable, universal predictor for Borg. Only scatterplot figures with Spearman ρ in the legend are in the blog post; actual ρ values, task count, and server coverage are in the backing paper (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
- The framing extends beyond Borg. The post positions text-to-text regression as a general path to "universal system simulators and sophisticated reward mechanisms" — reward models for RL-trained LLMs that process raw operational data (system logs, configs, hardware/workload traces) rather than only human preference labels. Borg / MIPS-per-GCU is presented as the proof point, not the scope (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
Numbers (what the raw / URL verifiably report)
| Quantity | Value | Notes |
|---|---|---|
| Model parameters | ~60M | Two-layer encoder-decoder |
| Feature tokens per input (raw) | up to 1M | Before prioritisation + truncation |
| Model token limit | 8k | Forces the prioritisation step |
| Spearman ρ on diverse tasks | "near-perfect" | Specific values only in arXiv paper |
| Target metric | MIPS per GCU | Output of a bin-packing algorithm |
| Training target source | Borg digital twin | Backtesting framework |
Unreported in the captured URL: absolute MAE on MIPS-per-GCU, wall-clock speedup vs. the bin-packing simulator, how often the slow path is triggered in production, training/serving fleet size, pre-training corpus size, whether the 60M-param model is deployed in Borg-control-plane inference loops today, and how catastrophic forgetting is handled when few-shot adapting.
Systems touched
- systems/borg — new page. Google's planet-scale cluster manager; the production target for the RLM. The blog post treats Borg as the canonical "large system" whose performance we simulate; MIPS-per-GCU is its efficiency metric; the digital-twin backtester replays real cluster state for training-data generation.
- systems/regression-language-model — new page. The RLM itself: 60M-param two-layer encoder-decoder, text-in / text-out numeric prediction, open-sourced as the regress-lm library.
- systems/regress-lm — new page. Google DeepMind's open-source library (https://github.com/google-deepmind/regress-lm) providing the training + inference scaffolding for RLMs.
Concepts introduced / extended
- concepts/text-to-text-regression — new page. The core technique: LLM reads (x) as a string, writes (y) as a string, trained with cross-entropy next-token-prediction. Supersedes feature engineering. Generalises to OmniPred (the 2024 predecessor paper) and any numeric-prediction task where tabularisation is the bottleneck.
- concepts/performance-prediction — new page. The problem class of predicting a system's performance metric from its state, without actually running the system. At cluster scale, the alternative is running the scheduler's combinatorial solver (bin-packing) on every hypothetical — expensive.
- concepts/digital-twin-backtesting — new page. Replay real cluster state against a simulated stack to generate training data, evaluate changes before production, or backtest counterfactual policies. The Borg digital twin is Google's instance.
- concepts/uncertainty-quantification — new page. The discipline of producing a confidence estimate alongside a prediction. The RLM does this structurally via the distribution of decoded samples; the blog post names the aleatoric/epistemic split and correlates RLM uncertainty with residual squared error.
- concepts/bin-packing — new page. Combinatorial resource-allocation primitive behind most cluster schedulers (Borg, Kubernetes scheduler, Mesos, Nomad). Expensive to evaluate at scale; the RLM is trained to predict its output.
- concepts/aleatoric-uncertainty — new page. Inherent randomness in the system (stochastic load demand); irreducible no matter how much data you collect.
- concepts/density-estimation — stub acknowledgement; RLM multi-sample decoding is the canonical wiki technique. (Not creating a new page for this; already covered implicitly via concepts/uncertainty-quantification.)
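To make the bin-packing primitive concrete, here is first-fit-decreasing, a classic textbook heuristic. It is shown only to illustrate the kind of combinatorial solver the RLM learns to approximate; Google's actual Borg bin-packer is not public and certainly differs.

```python
def first_fit_decreasing(items: list[float], bin_capacity: float) -> list[list[float]]:
    """First-fit-decreasing bin packing: place each item (largest first)
    into the first bin with room, opening a new bin when none fits.
    Even this cheap heuristic is O(n * bins); exact bin packing is NP-hard,
    which is why a learned approximator of its *output* is attractive."""
    bins: list[list[float]] = []
    for item in sorted(items, reverse=True):
        for b in bins:
            if sum(b) + item <= bin_capacity:
                b.append(item)
                break
        else:
            bins.append([item])
    return bins

# Five workloads packed onto unit-capacity machines.
bins = first_fit_decreasing([0.5, 0.7, 0.3, 0.4, 0.1], bin_capacity=1.0)
```

The RLM's training target is a metric derived from decisions like these (MIPS per GCU), not the packing itself.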
Patterns introduced
- patterns/token-limit-aware-feature-prioritization — new page. When features exceed the model's context window, order them by importance and let truncation drop the tail rather than compressing or summarising. Google's pre-processing step for Borg-state serialisation is the canonical instance. Generalises to any LLM-over-large-context application: RAG context packing, trace-to-LLM, log-to-LLM.
- patterns/cheap-approximator-with-expensive-fallback — new page. A fast ML approximator serves most requests; requests where the approximator reports high uncertainty fall back to the slow ground-truth solver. The uncertainty estimate is load-bearing — if it isn't calibrated, the pattern degenerates to always-fast-but-wrong or always-slow. Generalises to query-plan cost estimation, routing cost estimation, ML-backed compilers, any control loop that can tolerate a sub-ms ML step but occasionally needs the real answer.
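The fallback pattern's control flow fits in a few lines. `fast_predict` and `slow_simulate` are hypothetical stand-ins (an RLM returning an estimate plus uncertainty, and the ground-truth bin-packing simulator, respectively), and the threshold value is illustrative.

```python
def route_request(state: str, fast_predict, slow_simulate,
                  uncertainty_threshold: float = 0.1):
    """Hypothetical fast-path/slow-path sketch: serve from the cheap
    approximator when its reported uncertainty is below the threshold,
    otherwise fall back to the expensive ground-truth solver.
    fast_predict(state) -> (estimate, uncertainty); slow_simulate(state)
    -> exact answer. The uncertainty estimate is load-bearing: if it is
    uncalibrated, every request takes the wrong path."""
    estimate, uncertainty = fast_predict(state)
    if uncertainty <= uncertainty_threshold:
        return estimate, "fast"
    return slow_simulate(state), "slow"  # calibrated doubt triggers fallback

# Confident prediction stays on the fast path; a wide distribution falls back.
v1, p1 = route_request("state-a", lambda s: (1.00, 0.02), lambda s: 1.07)
v2, p2 = route_request("state-b", lambda s: (1.00, 0.40), lambda s: 1.07)
```

In the RLM setting the uncertainty input would be the spread of the multi-sample decode distribution; any calibrated score works.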
Operational caveats
- Raw captured only acknowledgements. Extraction here is from the live URL fetched in-session, not from the local raw file. If the raw is ever re-scraped, the numbers on this page should be re-verified against it.
- No production deployment detail in the blog post. The post is a research summary. It does not disclose whether the RLM is in the Borg scheduler's live inference path, what the shadow / canary deployment looked like, or what failure modes were observed pre-rollout.
- "Near-perfect Spearman" is the tightest quantitative claim. MAE, RMSE, calibration curves, and task-breakdown ρ values are in the backing arXiv paper, not summarised here. Downstream readers should treat the RLM's accuracy claims as paper-mediated, not blog-mediated.
- Few-shot cross-task adaptation is asserted, not quantified in the blog. Learning curves, data-efficiency breakdowns, and catastrophic-forgetting analyses are paper-side.
- Applicability beyond Borg is narrative, not demonstrated in this post. The "universal simulator" framing is forward-looking; the empirical content is Borg-specific.
Source
- Original: https://research.google/blog/simulating-large-systems-with-regression-language-models/
- Raw markdown: raw/google/2025-07-29-simulating-large-systems-with-regression-language-models-f1350985.md
- Backing paper: Performance Prediction for Large Systems via Text-to-Text Regression (arXiv:2506.21718)
- Open-source library: google-deepmind/regress-lm
- Prior work: OmniPred (arXiv:2402.14547)
Related
- companies/google
- systems/borg
- systems/regression-language-model
- systems/regress-lm
- concepts/text-to-text-regression
- concepts/performance-prediction
- concepts/digital-twin-backtesting
- concepts/uncertainty-quantification
- concepts/bin-packing
- concepts/aleatoric-uncertainty
- patterns/token-limit-aware-feature-prioritization
- patterns/cheap-approximator-with-expensive-fallback