
Regression Language Model (RLM)

A Regression Language Model (RLM) is a language model trained to read a string representation of a system's state (x) and emit a numeric metric (y) as a decoded text string, treating regression as next-token prediction under cross-entropy loss. Introduced by Google Research / Google DeepMind in the 2025-07-29 post "Simulating large systems with Regression Language Models" (backing paper arXiv:2506.21718).

The RLM is the canonical wiki instance of text-to-text regression — the broader technique that sidesteps feature engineering by using an LM's tokenizer as the only "schema."
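The core idea can be sketched in a few lines: both the state x and the target y are rendered as plain strings, and training is ordinary next-token prediction on the y-string. The helper below is a hypothetical illustration (the field names and number formatting are assumptions, not the paper's exact serialisation):

```python
# Hypothetical sketch of text-to-text regression input/target construction.
# The tokenizer is the only "schema": both x and y are plain strings, and
# the model is trained to emit y's digit tokens under cross-entropy loss.

def make_example(state: dict, y: float) -> tuple[str, str]:
    """Serialize (x, y) as strings; no feature engineering, no normalisation."""
    x_text = "\n".join(f"{k}: {v}" for k, v in state.items())  # YAML-ish
    y_text = f"{y:.6g}"  # the number is kept as digit text, not binned
    return x_text, y_text

x_text, y_text = make_example({"cpu_util": 0.73, "jobs": 412}, 1532.8)
```

At train time the pair `(x_text, y_text)` is fed to the encoder-decoder exactly like any seq2seq example; at inference the decoded string is parsed back to a float.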

Architecture

Parameter Value
Architecture Encoder-decoder
Depth 2 layers
Parameters ~60M
Context window 8,192 tokens
Input format YAML or JSON serialisation of system state
Output format Decoded numeric string
Pre-training Optional; also works from random init
Fine-tuning Next-token prediction, cross-entropy loss

(Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models.)

Why it's small

Google emphasises the model's modest scale ("relatively simple two-layer encoder-decoder of 60 million parameters") as deliberate. The RLM is positioned as a cheap, low-latency approximator for an expensive combinatorial simulator, not as a frontier LLM. The payoff is latency and cost: the RLM runs fast enough to be called inside a scheduling control loop; the bin-packing simulator it replaces does not.

How numbers are handled

Numbers are represented as-is in text — no normalisation, no log-scaling, no binning. Because every number in the input and the target is a string of digit tokens, the model learns the numeric surface directly. Three downstream capabilities fall out of this choice:

  1. Point prediction. Greedy decode one y-string; parse it.
  2. Density estimation. Sample N decodes; the empirical distribution over parsed y-values tracks P(y | x) — including tails. Google reports this density tracks ground-truth KDE curves across time durations.
  3. Uncertainty quantification. Width of the sampled distribution is the confidence. Google reports predicted distribution width correlates with residual squared error.

Feature handling at the context boundary

Each input (x) can carry up to 1M tokens of candidate features (active jobs, execution traces, textual metadata, config). The RLM's context window is 8k tokens. Google's pre-processing step reorders features so the most important appear first; truncation to 8k drops the least important.

This is the canonical token-limit-aware feature prioritization pattern — don't compress, don't summarise, just order-by-importance-and-let-truncation-be-the-filter.
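A minimal sketch of the pattern, under assumed names (the `importance` scores and the character-count stand-in for a tokenizer are both illustrative):

```python
def pack_features(features: dict, importance: dict,
                  token_budget: int = 8192, count_tokens=len) -> str:
    """Order features by importance; let truncation drop the tail.

    `importance` maps feature name -> score; `count_tokens` is a stand-in
    tokenizer (here: character count). No compression, no summarisation —
    whatever doesn't fit in the budget is simply cut.
    """
    ranked = sorted(features.items(),
                    key=lambda kv: importance.get(kv[0], 0.0), reverse=True)
    out, used = [], 0
    for name, value in ranked:
        line = f"{name}: {value}"
        cost = count_tokens(line)
        if used + cost > token_budget:
            break  # truncation is the filter
        out.append(line)
        used += cost
    return "\n".join(out)
```

The most important features always survive, and the context stays valid text rather than a lossy summary.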

Production target: Borg / MIPS per GCU

The 2025-07-29 post's production-facing proof is Google's use of an RLM to predict MIPS per GCU (Millions of Instructions Per Second per Google Compute Unit) on Borg, Google's cluster manager. The target metric is the output of a specialised bin-packing algorithm run inside the digital twin of Borg — i.e. what the bin-packer would have decided, not raw hardware counters. Few-shot adaptation lets the same base RLM specialise to distinct servers and distinct regression tasks.

Reported quality: near-perfect Spearman rank correlation on diverse Borg regression tasks. Specific ρ values are in the paper, not the blog post.
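Spearman's ρ compares rankings rather than raw values: it is the Pearson correlation of the two rank vectors, so a predictor that orders servers correctly scores well even if its absolute values are off. A stdlib sketch (ties ignored for brevity):

```python
def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of rank vectors.

    Minimal illustration; assumes no tied values.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Any monotone relationship between prediction and ground truth yields ρ = 1, which is why rank correlation is the natural quality metric for a scheduler that only needs relative orderings.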

Lineage

The RLM extends Google's earlier universal-regression work OmniPred (arXiv:2402.14547), which first framed generic numeric prediction as text-to-text under a language model. The 2025 RLM post narrows to the large-systems / Borg production context and ships the open-source systems/regress-lm library.

Deployment pattern

Pairing the RLM with the slow bin-packing simulator follows the patterns/cheap-approximator-with-expensive-fallback shape: trust the fast ML approximator when its uncertainty is low, fall back to the authoritative combinatorial solver when not. The calibration claim (uncertainty correlates with residual squared error) is load-bearing — if uncertainty weren't calibrated, the fallback trigger would be unsafe.
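A minimal sketch of the fallback trigger (all names and the `max_spread` threshold are illustrative, not from the post):

```python
def schedule_estimate(x, rlm_predict, simulator, max_spread=0.05):
    """Cheap-approximator-with-expensive-fallback, sketched.

    rlm_predict(x) -> (point, spread): fast RLM estimate plus its sampled
    distribution width. simulator(x) -> exact y from the slow bin-packer.
    Trust the RLM when its uncertainty is narrow; otherwise fall back.
    """
    point, spread = rlm_predict(x)
    if spread <= max_spread:
        return point, "rlm"
    return simulator(x), "simulator"
```

The safety of this trigger rests entirely on the calibration claim above: `spread` must actually track expected error for the threshold to be meaningful.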

Open questions

  • Wall-clock speedup vs. the bin-packing simulator (not disclosed in the 2025-07-29 post).
  • How often the slow fallback fires in production (not disclosed).
  • Catastrophic forgetting under few-shot cross-task adaptation (not discussed).
  • Whether the 60M-param RLM is in Borg's scheduling control-loop today or in a shadow / advisory mode (not disclosed).
