
GOOGLE 2025-07-29 Tier 1


Google Research — Simulating large systems with Regression Language Models

Summary

Google Research post (2025-07-29) proposing text-to-text regression with language models as a general, feature-engineering-free path to numeric prediction over complex, unstructured system state. The canonical production instance the post presents is a Regression Language Model (RLM) predicting bin-packing-derived efficiency (MIPS per GCU) for Google's Borg cluster manager — i.e. using a cheap ML approximator to stand in for an expensive combinatorial resource-allocation simulator. The RLM is a 60M-parameter two-layer encoder-decoder that reads a YAML/JSON serialisation of a cluster's state (active jobs, execution traces, textual metadata) as a prompt and emits the predicted metric as a decoded text string — no feature engineering, no normalisation, no fixed-dim tabular vector. Sampling multiple decodes recovers a full output distribution, so the same model gives point predictions, uncertainty estimates, and density estimates in one call. Prediction uncertainty correlates with residual squared error, enabling a cheap-approximator-with-expensive-fallback operating mode — trust the RLM when confident, fall back to the slow bin-packing simulator when not. The supporting regress-lm library is open-sourced.

⚠️ Raw-file scope note. The local raw scrape (raw/google/2025-07-29-simulating-large-systems-with-regression-language-models-f1350985.md, 15 lines) captures only the acknowledgements section — not the article body. The content summarised on this page was retrieved in-session by fetching the original URL (https://research.google/blog/simulating-large-systems-with-regression-language-models/). Numbers and claims cited below map to the fetched blog post; the backing paper (https://arxiv.org/abs/2506.21718) has richer benchmarks not summarised here.

Key takeaways

  1. Text-to-text regression collapses the feature-engineering step that dominates traditional tabular regression on complex systems. Traditional regressors require converting configuration files, system logs, and workload descriptors into fixed-length numeric vectors; when new data types appear (new hardware class, new workload, new config field), the pipeline restarts from scratch. An RLM reads the raw state as a string (YAML / JSON) and writes the numeric answer as a string — no normalisation, no feature selection, no fixed schema (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
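The input/output contract in this takeaway can be sketched in a few lines. The helper names below (`state_to_prompt`, `metric_to_target`) are hypothetical stand-ins for illustration, not the regress-lm API:

```python
import json

def state_to_prompt(state: dict) -> str:
    """Serialise raw system state as text -- no feature vectors, no fixed
    schema. (Hypothetical helper; the production serialisation is internal.)"""
    return json.dumps(state, indent=2, sort_keys=True)

def metric_to_target(value: float) -> str:
    """The regression target is also just a string the decoder must emit."""
    return f"{value:.4g}"

state = {
    "cell": "cell-a",
    "jobs": [{"name": "websearch-frontend", "cpus": 12.0, "priority": 200}],
    "trace_window_s": 300,
}
x = state_to_prompt(state)    # model input: raw state as a string
y = metric_to_target(0.87)   # model target: "0.87", decoded as text
```

New config fields or workload types simply become new keys in the serialised string; nothing in the pipeline has to be re-vectorised.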

  2. The production target is performance prediction for Borg, specifically bin-packing-derived MIPS-per-GCU. MIPS per GCU (Millions of Instructions Per Second per Google Compute Unit) is Google's efficiency metric for matching workloads to machines. The RLM is trained against the output of a bin-packing algorithm run inside Google's digital twin of Borg — a backtesting framework that replicates real cluster state — so the ML target is what the bin-packer would have decided, not raw hardware counters (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).

  3. Model size is small by LLM standards: a 60M-parameter two-layer encoder-decoder. Google emphasises this is a "relatively simple" encoder-decoder — not a frontier LLM — and frames the RLM as a low-resource, efficient cheap approximator for the bin-packer. The model may be pre-trained or even randomly initialised; adapting to a new regression task is next-token-prediction under cross-entropy loss with (x = state-as-string, y = metric-as-string) (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
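The (x = state-as-string, y = metric-as-string) setup reduces to ordinary next-token prediction. A minimal sketch of how a training example and its loss mask are built, using characters as a stand-in tokenizer and a hypothetical separator:

```python
def make_training_example(x: str, y: str, sep: str = "\n-> "):
    """A new regression task is just next-token prediction on x followed
    by y. Cross-entropy loss applies to the y tokens only; x tokens are
    context. Character-level positions stand in for a real tokenizer
    (an assumption for illustration)."""
    seq = x + sep + y
    loss_mask = [0] * (len(x) + len(sep)) + [1] * len(y)
    return seq, loss_mask

seq, mask = make_training_example("jobs: 512\ncell: a", "0.87")
# the mask selects exactly the target-metric characters
assert seq[-len("0.87"):] == "0.87"
```

Because the objective is plain language-model training, the encoder-decoder can start pre-trained or randomly initialised, as the post notes.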

  4. Features are prioritised for truncation, not filtered or normalised. Each (x) data point carries up to 1M tokens of candidate features; the RLM has an 8k-token limit. Google's pre-processing step reorders features so the most important features appear first — when the string is truncated at the token limit, only less-important features are lost. This is the canonical shape of token-limit-aware feature prioritization: don't compress, don't summarise, just order-by-importance and let truncation be the filter (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
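A minimal sketch of order-by-importance truncation, assuming a whitespace word count as a crude proxy for the real tokenizer and dropping whole features at the boundary for simplicity:

```python
def prioritise_and_truncate(features: dict, importance: dict,
                            token_limit: int) -> str:
    """Order features most-important-first, then hard-truncate at the
    token limit, so only the least-important tail is lost. Whitespace
    word count is a stand-in for a real tokenizer (assumption)."""
    ordered = sorted(features, key=lambda k: importance.get(k, 0.0),
                     reverse=True)
    out, used = [], 0
    for key in ordered:
        line = f"{key}: {features[key]}"
        cost = len(line.split())   # crude token proxy
        if used + cost > token_limit:
            break                  # truncation is the filter
        out.append(line)
        used += cost
    return "\n".join(out)

prompt = prioritise_and_truncate(
    {"jobs": "a b c", "traces": "t1 t2", "logs": "l1 l2 l3"},
    {"jobs": 3.0, "traces": 2.0, "logs": 1.0},
    token_limit=7)
```

With a budget of 7 proxy-tokens, `jobs` and `traces` fit and `logs` is silently dropped — the low-importance tail absorbs the loss.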

  5. Multiple decoded samples from the same prompt recover the full output probability density. Because numbers are represented as text and decoding is stochastic, sampling N outputs gives an empirical distribution that approximates the underlying P(y | x) — including tails. Google reports that these empirical densities track the target distribution across different time durations, visualised as regressor-density curves aligned with a kernel-density-estimate of ground truth. This is how one model delivers point predictions, distributional predictions, and uncertainty at once (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
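Recovering the empirical distribution from N decodes is mechanically simple. A sketch with hand-written sample strings standing in for stochastic model output:

```python
import statistics

def empirical_density(decoded: list) -> dict:
    """Parse N stochastic decodes of the same prompt into an empirical
    P(y | x): a point estimate, a spread, and tail quantiles."""
    ys = sorted(float(s) for s in decoded)
    n = len(ys)
    q = lambda p: ys[min(n - 1, int(p * n))]  # crude empirical quantile
    return {
        "point": statistics.median(ys),  # point prediction
        "stdev": statistics.stdev(ys),   # uncertainty signal
        "p05": q(0.05),                  # lower tail
        "p95": q(0.95),                  # upper tail
    }

samples = ["0.81", "0.84", "0.79", "0.83", "0.80", "0.85", "0.82", "0.78"]
d = empirical_density(samples)
```

One forward pass per sample, one model: point prediction, density, and uncertainty all fall out of the same decoded strings.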

  6. Prediction uncertainty correlates with residual squared error. Broader predicted distributions signal lower confidence; narrower distributions signal higher confidence. Google names the two uncertainty flavours the RLM captures: aleatoric — inherent randomness of the system, e.g. stochastic load demand — and epistemic — stemming from limited observations or incomplete features. This calibration is what makes the fast-path/slow-path deployment legitimate: the model knows when it doesn't know, so the slow bin-packing simulation is only invoked on the genuinely uncertain cases (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).

  7. Reported pointwise quality: near-perfect Spearman rank correlation across diverse tasks, with the model few-shot adapting to distinct servers and distinct regression tasks. The paper calls out the RLM as an adaptable, universal predictor for Borg. Only scatterplot figures with Spearman ρ in the legend are in the blog post; actual ρ values, task count, and server coverage are in the backing paper (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).

  8. The framing extends beyond Borg. The post positions text-to-text regression as a general path to "universal system simulators and sophisticated reward mechanisms" — reward models for RL-trained LLMs that process raw operational data (system logs, configs, hardware/workload traces) rather than only human preference labels. Borg / MIPS-per-GCU is presented as the proof point, not the scope (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).

Numbers (what the raw / URL verifiably report)

Quantity                         Value               Notes
Model parameters                 ~60M                Two-layer encoder-decoder
Feature tokens per input (raw)   up to 1M            Before prioritisation + truncation
Model token limit                8k                  Forces the prioritisation step
Spearman ρ on diverse tasks      "near-perfect"      Specific values only in arXiv paper
Target metric                    MIPS per GCU        Output of a bin-packing algorithm
Training target source           Borg digital twin   Backtesting framework

Unreported in the captured URL: absolute MAE on MIPS-per-GCU, wall-clock speedup vs. the bin-packing simulator, how often the slow path is triggered in production, training/serving fleet size, pre-training corpus size, whether the 60M-param model is deployed in Borg-control-plane inference loops today, and how catastrophic-forgetting is handled when few-shot adapting.

Systems touched

  • systems/borg (new page). Google's planet-scale cluster manager; the production target for the RLM. The blog post treats Borg as the canonical "large system" whose performance we simulate; MIPS-per-GCU is its efficiency metric; the digital-twin backtester replays real cluster state for training-data generation.
  • systems/regression-language-model (new page). The RLM itself: 60M-param two-layer encoder-decoder, text-in / text-out numeric prediction, open-sourced as the regress-lm library.
  • systems/regress-lm (new page). Google DeepMind's open-source library (https://github.com/google-deepmind/regress-lm) providing the training + inference scaffolding for RLMs.

Concepts introduced / extended

  • concepts/text-to-text-regression (new page). The core technique: the LLM reads (x) as a string, writes (y) as a string, trained with cross-entropy next-token-prediction. Supersedes feature engineering. Follows OmniPred (the 2024 predecessor paper) and generalises to any numeric-prediction task where tabularisation is the bottleneck.
  • concepts/performance-prediction (new page). The problem class of predicting a system's performance metric from its state, without actually running the system. At cluster scale, the alternative is running the scheduler's combinatorial solver (bin-packing) on every hypothetical — expensive.
  • concepts/digital-twin-backtesting (new page). Replay real cluster state against a simulated stack to generate training data, evaluate changes before production, or backtest counterfactual policies. The Borg digital twin is Google's instance.
  • concepts/uncertainty-quantification (new page). The discipline of producing a confidence estimate alongside a prediction. The RLM does this structurally via the distribution of decoded samples; the blog post names the aleatoric/epistemic split and correlates RLM uncertainty with residual squared error.
  • concepts/bin-packing (new page). The combinatorial resource-allocation primitive behind most cluster schedulers (Borg, Kubernetes scheduler, Mesos, Nomad). Expensive to evaluate at scale; the RLM is trained to predict its output.
  • concepts/aleatoric-uncertainty (new page). Inherent randomness in the system (stochastic load demand); irreducible no matter how much data is collected.
  • concepts/density-estimation — stub acknowledgement; RLM multi-sample decoding is the canonical wiki technique. (Not creating a new page for this; already covered implicitly via concepts/uncertainty-quantification.)
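The combinatorial primitive the RLM approximates can be illustrated with the textbook first-fit-decreasing heuristic — an illustrative stand-in, not Borg's actual algorithm:

```python
def first_fit_decreasing(jobs: list, capacity: float) -> list:
    """Classic first-fit-decreasing bin packing: place each job (largest
    first) into the first bin with room, opening a new bin otherwise.
    This is the kind of expensive combinatorial evaluation the RLM is
    trained to predict the outcome of. (Textbook heuristic; Borg's real
    scheduler is far more elaborate.)"""
    bins, loads = [], []
    for job in sorted(jobs, reverse=True):
        for i, load in enumerate(loads):
            if load + job <= capacity:
                bins[i].append(job)
                loads[i] += job
                break
        else:
            bins.append([job])
            loads.append(job)
    return bins

packed = first_fit_decreasing([0.5, 0.7, 0.5, 0.2, 0.4, 0.2, 0.5, 0.1], 1.0)
```

Even this toy heuristic is O(n²) per evaluation; running a production-grade solver over every hypothetical cluster state is what makes a learned approximator attractive.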

Patterns introduced

  • patterns/token-limit-aware-feature-prioritization (new page). When features exceed the model's context window, order them by importance and let truncation drop the tail rather than compressing or summarising. Google's pre-processing step for Borg-state serialisation is the canonical instance. Generalises to any LLM-over-large-context application: RAG context packing, trace-to-LLM, log-to-LLM.
  • patterns/cheap-approximator-with-expensive-fallback (new page). A fast ML approximator serves most requests; requests where the approximator reports high uncertainty fall back to the slow ground-truth solver. The uncertainty estimate is load-bearing — if it isn't calibrated, the pattern degenerates to always-fast-but-wrong or always-slow. Generalises to query-plan cost estimation, routing cost estimation, ML-backed compilers, and any control loop that can tolerate a sub-ms ML step but occasionally needs the real answer.
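A sketch of the fallback pattern; the relative-uncertainty threshold and the callable interfaces below are illustrative assumptions, not values from the post:

```python
def predict_efficiency(state: str,
                       fast_model,       # e.g. an RLM: returns (mean, stdev)
                       slow_simulator,   # ground-truth bin-packing run
                       rel_uncertainty: float = 0.05) -> float:
    """Serve the fast approximator when its predicted spread is narrow;
    invoke the slow ground-truth solver only on uncertain cases.
    (Pattern sketch; threshold and interfaces are assumptions.)"""
    mean, stdev = fast_model(state)
    if stdev <= rel_uncertainty * abs(mean):
        return mean                  # fast path: confident prediction
    return slow_simulator(state)     # slow path: genuinely uncertain case

# toy stand-ins for the two paths
confident = lambda s: (0.90, 0.01)   # narrow distribution -> trust it
unsure = lambda s: (0.90, 0.30)      # wide distribution -> fall back
truth = lambda s: 0.75               # expensive simulator's answer
```

The threshold is where calibration becomes load-bearing: if `stdev` did not actually track residual error, this gate would route wrong answers down the fast path.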

Operational caveats

  • Raw captured only acknowledgements. Extraction here is from the live URL fetched in-session, not from the local raw file. If the raw is ever re-scraped, the numbers on this page should be re-verified against it.
  • No production deployment detail in the blog post. The post is a research summary. It does not disclose whether the RLM is in the Borg scheduler's live inference path, what the shadow / canary deployment looked like, or what failure modes were observed pre-rollout.
  • "Near-perfect Spearman" is the tightest quantitative claim. MAE, RMSE, calibration curves, and task-breakdown ρ values are in the backing arXiv paper, not summarised here. Downstream readers should treat the RLM's accuracy claims as paper-mediated, not blog-mediated.
  • Few-shot cross-task adaptation is asserted, not quantified in the blog. Learning curves, data-efficiency breakdowns, and catastrophic-forgetting analyses are paper-side.
  • Applicability beyond Borg is narrative, not demonstrated in this post. The "universal simulator" framing is forward-looking; the empirical content is Borg-specific.

Source

sources/2025-07-29-google-simulating-large-systems-with-regression-language-models (https://research.google/blog/simulating-large-systems-with-regression-language-models/)