PATTERN Cited by 2 sources
Learned distribution over point prediction¶
When downstream decisions are cost-asymmetric in the prediction error, emit a calibrated distribution from the predictor rather than a point estimate. The consumer picks the quantile matched to the asymmetry; tail queries become well-defined; uncertainty-gated actions become possible.
Intent¶
Point prediction (ŷ = f(x)) is appealing because it is simple and the consumer API is a single number. It fails when any of the following hold:
- The consumer's loss function is asymmetric in error sign (e.g. under-prediction costs more than over-prediction).
- The consumer's decision depends on a tail quantity (e.g. "is there a 10% chance this exceeds threshold T?").
- The consumer wants to gate expensive actions on prediction confidence (trigger fallback / migration / human review only when the predictor is uncertain).
In any of these regimes, point prediction destroys information the predictor could cheaply have provided. Emitting a distribution preserves the information at marginal cost; the consumer does the quantile / tail / confidence extraction it needs.
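The quantile-matching step can be made concrete. Under an asymmetric linear (pinball-style) loss, the expected-cost-minimising point is a specific quantile of the predicted distribution, so the consumer just reads it off. A minimal sketch; the costs and the lognormal distribution below are illustrative assumptions, not figures from the sources:

```python
import numpy as np

# Illustrative costs: under-prediction is 4x more expensive than over-prediction.
cost_under, cost_over = 4.0, 1.0

# For asymmetric linear loss, the expected-cost-minimising point prediction
# is the q-quantile with q = cost_under / (cost_under + cost_over).
q = cost_under / (cost_under + cost_over)   # 0.8

# If the predictor emits a distribution (here represented as samples),
# the consumer reads off that quantile instead of using the mean.
rng = np.random.default_rng(0)
predicted_samples = rng.lognormal(mean=1.0, sigma=0.5, size=10_000)
decision_value = float(np.quantile(predicted_samples, q))
```

With a point predictor trained on symmetric loss, this quantile is simply unavailable; the distribution makes it a one-line extraction.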
Mechanism¶
Pick a representation:
- Parametric. Train the predictor to emit parameters of a fixed family (Gaussian μ, σ; lognormal μ, σ; mixture of K Gaussians). Cheap and compact; brittle if the true distribution doesn't match the family.
- Quantile regression. Predict a fixed set of quantiles (P10, P50, P90, or denser) directly. Nonparametric in shape; costs one output head per quantile; calibration is per-quantile.
- Histogram / discretised output. Predict a probability distribution over bucketed target values. Maps well to classification-style training; bucketing loses precision.
- Sampling-based. Treat the predictor as a generative model; sample N outputs per input; distribution is the empirical sample distribution. Expensive at inference (N-sample cost); natural for LLM-style text-to-text regression (see RLM).
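For the quantile-regression representation, the training signal is the pinball loss, whose minimiser over constant predictions is the empirical q-quantile. A quick sanity check on synthetic data (the data and grid are illustrative, not from either source):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss: its minimiser is the q-quantile of y_true."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

# Sanity check: the constant prediction minimising pinball loss at q = 0.9
# over synthetic data lands near the empirical P90.
rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=5_000)
candidates = np.linspace(0.0, 10.0, 2_001)
best = candidates[np.argmin([pinball_loss(y, c, q=0.9) for c in candidates])]
```

Training one output head per target quantile against this loss is what makes the representation nonparametric in shape.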
Whichever representation is chosen, calibration is load-bearing: P(actual ≤ predicted q-quantile) should match q. Without calibration, the downstream quantile queries lie and the pattern degrades to decorative uncertainty.
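The calibration requirement can be checked empirically: for each target quantile q, measure the fraction of actuals falling at or below the predicted q-quantile. A minimal sketch with a synthetic, well-calibrated Gaussian head (the helper name and data are illustrative, not from the sources):

```python
import numpy as np
from statistics import NormalDist

def coverage(actuals, pred_quantiles, qs):
    """Empirical P(actual <= predicted q-quantile) per q; should match each q."""
    return [float((actuals <= pred_quantiles[:, i]).mean()) for i in range(len(qs))]

rng = np.random.default_rng(2)
n = 20_000
mu = rng.normal(size=n)              # per-example predicted mean
actual = mu + rng.normal(size=n)     # true outcome with unit noise

qs = [0.1, 0.5, 0.9]
# A calibrated Gaussian predictor's q-quantile is mu + z_q * sigma (sigma = 1 here).
preds = np.stack([mu + NormalDist().inv_cdf(q) for q in qs], axis=1)
cov = coverage(actual, preds, qs)    # each entry close to its q
```

A predictor whose coverage deviates systematically from the target quantiles needs recalibration (e.g. conformal adjustment) before its tails can be trusted downstream.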
Canonical wiki instances¶
- Learned lifetime distributions in the LAVA VM scheduler. Google's 2025-10-17 LAVA post is titled "Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions" — the pattern is in the headline (Source: sources/2025-10-17-google-solving-virtual-machine-puzzles-lava). The consumer (the scheduler, LARS) uses distribution tails to reason about empty-host preservation and migration-worth tests. A point estimate of "4h" can't answer "P(still running at T+24h)?"; the learned distribution can.
- RLM-sampled distributions for MIPS-per-GCU prediction on Borg. The 2025-07-29 Google RLM post recovers a distribution via sampled decoding — the predictor emits tokens; across many samples, the empirical sample distribution is the prediction distribution (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models). Used downstream to gate fallback to the slow bin-packer via cheap-approximator-with-expensive-fallback.
The two instances use different representations (parametric / quantile in LAVA's case, sampled empirical in RLM's case) for the same pattern: distributions make downstream risk-aware decisions tractable.
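The uncertainty-gated fallback shared by both instances can be sketched as follows. The width metric, threshold, and function names here are illustrative assumptions, not details taken from the posts:

```python
import numpy as np

def gated_predict(sample_fn, expensive_fn, x, n=64, width_threshold=0.5):
    """Sample the cheap predictor n times; fall back when relative spread is high."""
    samples = np.array([sample_fn(x) for _ in range(n)])
    p10, p50, p90 = np.quantile(samples, [0.1, 0.5, 0.9])
    if (p90 - p10) / p50 > width_threshold:
        return expensive_fn(x)    # uncertain: pay for the slow, accurate path
    return float(p50)             # confident: the cheap median suffices

rng = np.random.default_rng(3)
slow_binpacker = lambda x: -1.0   # stand-in for the expensive fallback path

# Tight predicted distribution -> cheap path; wide one -> fallback.
confident = gated_predict(lambda x: rng.normal(10.0, 0.1), slow_binpacker, None)
uncertain = gated_predict(lambda x: rng.normal(10.0, 8.0), slow_binpacker, None)
```

The gate is only meaningful when the sampled width is calibrated; a predictor that is confidently wrong never triggers the fallback.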
When it's the right shape¶
- Consumer's loss is asymmetric or tail-dependent.
- Consumer wants to gate expensive actions on confidence.
- Predictor can be calibrated cheaply.
- Representation cost (a few quantiles vs one number) is negligible relative to the information value.
When it's the wrong shape¶
- Consumer just needs a point and has no use for uncertainty.
- Calibration is infeasible for the chosen predictor class.
- Representation cost (e.g. N-sample LLM decoding per request) blows the inference budget.
- The target distribution is so tight (near-deterministic) that a point estimate is effectively lossless.
Adjacent patterns¶
- patterns/cheap-approximator-with-expensive-fallback — downstream consumer of distributions; uses distribution width (uncertainty) as the fallback gate.
- patterns/lifetime-aware-rescheduling — another downstream consumer; uses the tail of the distribution to decide whether to migrate.
Seen in¶
- sources/2025-10-17-google-solving-virtual-machine-puzzles-lava — canonical wiki instance at the VM-scheduling layer; LAVA's learned lifetime distributions drive both initial allocation and LARS-triggered rescheduling.
- sources/2025-07-29-google-simulating-large-systems-with-regression-language-models — RLM sampled-decoder recovers distribution over MIPS-per-GCU; calibrated width gates fallback to the slow bin-packer.
Related¶
- concepts/learned-lifetime-distribution
- concepts/uncertainty-quantification
- concepts/aleatoric-uncertainty
- concepts/performance-prediction
- concepts/vm-lifetime-prediction
- patterns/cheap-approximator-with-expensive-fallback
- patterns/lifetime-aware-rescheduling
- systems/lava-vm-scheduler
- systems/regression-language-model