Uncertainty quantification
Uncertainty quantification (UQ) is the discipline of producing a confidence estimate alongside a prediction — not just what the model thinks but how sure it is. In regression, the output is typically a distribution P(y | x) (or summaries such as variance, prediction intervals, or quantiles) rather than a single point estimate.
UQ matters operationally because most automated decisions that depend on ML predictions need to know when to trust the model. Without calibrated uncertainty, downstream systems either always trust the predictor (unsafe on out-of-distribution inputs) or never trust it (which defeats the purpose of using ML at all).
The two kinds of uncertainty
The standard decomposition, which Google names explicitly in the 2025-07-29 RLM post:
- Aleatoric uncertainty — inherent randomness in the system being modelled. Example: stochastic load demand on a cluster. Collecting more data doesn't reduce it; it's irreducible.
- Epistemic uncertainty — uncertainty from limited observation or features. The model has seen too few examples of this region of input space, or the features don't carry enough signal. Collecting more (relevant) data does reduce it.
Practical systems care about both but in different ways: aleatoric sets a noise floor for downstream planning; epistemic tells you where to gather more data or retrain.
How RLMs deliver UQ structurally
The 2025-07-29 post's core UQ claim is that text-to-text regression gives UQ for free:
- Sample multiple decodes from the same prompt.
- Parse each decoded string as a number.
- The empirical distribution over parsed numbers approximates P(y | x) — both the central tendency (point prediction = mean / mode) and the spread (uncertainty = width).
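This recipe can be sketched in a few lines, assuming decode strings have already been sampled from the model at non-zero temperature. The `uq_from_decodes` helper and the sample strings below are illustrative, not from the post:

```python
import re
import statistics

def uq_from_decodes(decodes: list[str]) -> tuple[float, float]:
    """Parse sampled decode strings and summarise the empirical
    distribution: mean as the point prediction, stdev as the spread."""
    values = []
    for text in decodes:
        match = re.search(r"-?\d+(?:\.\d+)?", text)  # first number in the decode
        if match:
            values.append(float(match.group()))
    return statistics.fmean(values), statistics.stdev(values)

# Hypothetical decodes sampled from the same prompt:
samples = ["y = 41.8", "y = 42.5", "y = 40.9", "y = 43.1"]
mean, spread = uq_from_decodes(samples)
```

Any decode that fails to parse is simply dropped; in practice the parse-failure rate is itself a signal worth monitoring.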
Google reports:
- RLM-sampled densities track ground-truth KDE curves across different time durations.
- Predicted-distribution width correlates with residual squared error — the model is genuinely calibrated, not merely producing a trivial artefact in which every interval is uniformly wide.
- Calibration lets the RLM be paired with the slow bin-packing simulator as a cheap approximator with an expensive fallback: the uncertainty is the legitimate trigger for the fallback path (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
Why calibration is load-bearing
Without calibration:
- Always-trust → high-uncertainty predictions are acted on as if they were confident; errors propagate.
- Always-fallback → the "fast path" is never taken; the ML approximator delivers no cost savings.
- Threshold-based triggering → whatever uncertainty threshold you pick is arbitrary; false-positive and false-negative fallback rates can't be tuned meaningfully.
Calibration — Pr(|y − ŷ| < δ) ≈ 1 − α, where δ is the predicted (1−α)-interval half-width — is what makes the fallback threshold a well-defined knob rather than a guess.
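One way to sanity-check this property is to measure empirical coverage on held-out data: the fraction of examples whose true value lands inside the predicted interval, which for a calibrated model should track the nominal level 1 − α. A minimal sketch (all numbers here are made up for illustration):

```python
def empirical_coverage(y_true, y_pred, half_widths):
    """Fraction of examples with |y - y_hat| < delta. For a calibrated
    model this should be close to the nominal level 1 - alpha."""
    hits = [abs(y - yp) < d for y, yp, d in zip(y_true, y_pred, half_widths)]
    return sum(hits) / len(hits)

# Hypothetical predictions with per-example interval half-widths:
y_true = [10.0, 12.0, 9.5, 11.0]
y_pred = [10.2, 11.5, 10.4, 11.1]
half_widths = [0.5, 1.0, 0.5, 0.3]
coverage = empirical_coverage(y_true, y_pred, half_widths)  # 3 of 4 covered
```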
Common UQ techniques (outside LMs)
- Bayesian models / MCMC — full posterior over parameters; rigorous but expensive.
- Bayesian neural networks / variational inference — approximate posterior, cheaper than MCMC.
- Deep ensembles — train N models, use disagreement as uncertainty. Simple and effective.
- Monte Carlo dropout — dropout at inference time treated as approximate Bayesian inference.
- Conformal prediction — distribution-free prediction intervals with coverage guarantees.
- Quantile regression — model quantiles directly; no distributional assumption.
The RLM's sampling-based approach falls in the "output distribution is native" camp — similar in spirit to BNNs and conformal, but the mechanism is decoding stochasticity rather than parameter uncertainty.
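For comparison, the deep-ensemble mechanism reduces to averaging member predictions and treating their disagreement as the uncertainty signal. In this sketch the ensemble members are hand-written stand-ins rather than trained networks:

```python
import statistics

# Stand-ins for N regressors trained with different seeds / data splits:
def member_a(x: float) -> float:
    return 2.0 * x + 0.1

def member_b(x: float) -> float:
    return 1.9 * x + 0.3

def member_c(x: float) -> float:
    return 2.1 * x - 0.2

ensemble = [member_a, member_b, member_c]

def ensemble_predict(x: float) -> tuple[float, float]:
    """Mean of member predictions is the point estimate; their
    disagreement (stdev) serves as the epistemic uncertainty signal."""
    preds = [f(x) for f in ensemble]
    return statistics.fmean(preds), statistics.stdev(preds)

mean, disagreement = ensemble_predict(3.0)
```

Far from the training distribution the members' extrapolations diverge, so disagreement grows — exactly the epistemic behaviour the decomposition above describes.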
Downstream uses
- Fast-path / slow-path routing. Trust the cheap ML predictor when confident; fall back to the authoritative solver when not.
- Active learning. Sample new training data where the model is most uncertain.
- Risk-aware decision making. Expected utility weighted by predicted-distribution width; discount confident gains against wide-interval risks.
- Anomaly / drift detection. Spikes in epistemic uncertainty on recent inputs flag distribution shift.
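The first of these, the routing pattern the post pairs with the bin-packing simulator, amounts to a small piece of glue code. A sketch with hypothetical stand-ins for the cheap predictor and the authoritative solver:

```python
def route(x, cheap_model, expensive_solver, threshold: float):
    """Trust the cheap predictor when its spread is below `threshold`;
    otherwise fall back to the authoritative (slow) solver."""
    mean, spread = cheap_model(x)
    if spread < threshold:
        return mean, "fast"
    return expensive_solver(x), "slow"

def cheap(x):
    # Confident (narrow spread) in-distribution, uncertain elsewhere.
    return 2.0 * x, (0.1 if x < 5 else 3.0)

def slow(x):
    # Stand-in for the expensive, authoritative simulator.
    return 2.05 * x

value, path = route(3.0, cheap, slow, threshold=0.5)    # fast path taken
value2, path2 = route(9.0, cheap, slow, threshold=0.5)  # falls back to solver
```

Because the spread is calibrated, `threshold` maps to an expected error tolerance rather than being an arbitrary knob.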
Seen in
- sources/2025-07-29-google-simulating-large-systems-with-regression-language-models — RLM recovers P(y | x) from sampled decodes; predicted-distribution width correlates with residual squared error; aleatoric / epistemic decomposition named; calibrated uncertainty enables the cheap-approximator-with-expensive-fallback pattern for Borg bin-packing.