
PATTERN Cited by 2 sources

Learned distribution over point prediction

When downstream decisions are cost-asymmetric in the prediction error, emit a calibrated distribution from the predictor rather than a point estimate. The consumer picks the quantile matched to the asymmetry; tail queries become well-defined; uncertainty-gated actions become possible.

Intent

Point prediction (ŷ = f(x)) is appealing because it's simple and the consumer API is a number. It fails when any of the following holds:

  • The consumer's loss function is asymmetric in error sign (e.g. under-prediction costs more than over-prediction).
  • The consumer's decision depends on a tail quantity (e.g. "is there a 10% chance this exceeds threshold T?").
  • The consumer wants to gate expensive actions on prediction confidence (trigger fallback / migration / human review only when the predictor is uncertain).

In any of these regimes, point prediction destroys information the predictor could cheaply have provided. Emitting a distribution preserves the information at marginal cost; the consumer does the quantile / tail / confidence extraction it needs.
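Concretely, when the per-unit costs of under- and over-prediction are known, the expected-cost-minimising point to act on is a specific quantile of the predictive distribution (the classic newsvendor quantile, q = c_under / (c_under + c_over)). A minimal sketch, with illustrative costs and a lognormal sample standing in for one input's calibrated predictive distribution:

```python
import numpy as np

# Illustrative costs (not from the source): under-predicting costs 9x
# more per unit than over-predicting, e.g. an outage vs. idle headroom.
cost_under, cost_over = 9.0, 1.0

# The expected-cost-minimising action is the q-quantile of the
# predictive distribution, with q = c_under / (c_under + c_over).
q = cost_under / (cost_under + cost_over)  # 0.9

# Samples standing in for one input's calibrated predictive distribution.
rng = np.random.default_rng(0)
predictive_samples = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)

decision = np.quantile(predictive_samples, q)   # what the consumer acts on
point_estimate = np.median(predictive_samples)  # what a point predictor emits
```

With a skewed distribution and asymmetric costs, `decision` sits well above the median; that gap is exactly the information a point estimate throws away.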

Mechanism

Pick a representation:

  1. Parametric. Train the predictor to emit parameters of a fixed family (Gaussian μ, σ; lognormal μ, σ; mixture of K Gaussians). Cheap and compact; brittle if the true distribution doesn't match the family.
  2. Quantile regression. Predict a fixed set of quantiles (P10, P50, P90, or denser) directly. Nonparametric in shape; costs one output head per quantile; calibration is per-quantile.
  3. Histogram / discretised output. Predict a probability distribution over bucketed target values. Maps well to classification-style training; bucketing loses precision.
  4. Sampling-based. Treat the predictor as a generative model; sample N outputs per input; distribution is the empirical sample distribution. Expensive at inference (N-sample cost); natural for LLM-style text-to-text regression (see RLM).
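As a sketch of option 2: the standard training objective for a quantile head is the pinball loss, whose expectation is minimised at the target q-quantile. Below, a brute-force fit of a constant model stands in for a gradient-trained head (a real predictor conditions on features x; all names here are illustrative):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss: its expectation is minimised when
    y_pred is the q-quantile of y_true's distribution."""
    err = y_true - y_pred
    return np.mean(np.maximum(q * err, (q - 1.0) * err))

# Brute-force fit of a constant model per quantile (a stand-in for a
# gradient-trained output head).
y = np.random.default_rng(1).normal(loc=10.0, scale=2.0, size=5_000)
grid = np.linspace(0.0, 20.0, 2001)
heads = {
    q: grid[int(np.argmin([pinball_loss(y, c, q) for c in grid]))]
    for q in (0.1, 0.5, 0.9)
}
# heads[0.1] < heads[0.5] < heads[0.9], with heads[0.5] near the median.
```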

Whichever representation you choose, calibration is load-bearing: P(actual ≤ predicted q-quantile) should match q. Without calibration the downstream quantile queries lie, and the pattern degrades to decorative uncertainty.
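A minimal calibration check under these assumptions: compute the empirical coverage of each predicted quantile on held-out data and compare it to the nominal level. The Gaussian predictor below is a stand-in, not the pattern's prescribed model:

```python
import numpy as np

def coverage(actuals, predicted_q, q):
    """Fraction of actuals at or below the predicted q-quantile;
    for a calibrated predictor this should be close to q."""
    return float(np.mean(actuals <= predicted_q))

# Stand-in predictor: emits a Gaussian (mu, sigma) per input.
rng = np.random.default_rng(2)
mu = rng.uniform(0.0, 100.0, size=20_000)
sigma = 5.0
actuals = rng.normal(mu, sigma)

# Standard-normal quantiles Phi^{-1}(q) for the levels being checked.
Z = {0.1: -1.2816, 0.5: 0.0, 0.9: 1.2816}
report = {q: coverage(actuals, mu + z * sigma, q) for q, z in Z.items()}
# Each report[q] should land near q; a large gap means the downstream
# quantile queries are lying.
```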

Canonical wiki instances

The two instances use different representations (parametric/quantile in LAVA, sampled empirical in RLM) but instantiate the same pattern: distributions make downstream risk-aware decisions tractable.

When it's the right shape

  • Consumer's loss is asymmetric or tail-dependent.
  • Consumer wants to gate expensive actions on confidence.
  • Predictor can be calibrated cheaply.
  • Representation cost (a few quantiles vs one number) is negligible relative to the information value.

When it's the wrong shape

  • Consumer just needs a point and has no use for uncertainty.
  • Calibration is infeasible for the chosen predictor class.
  • Representation cost (e.g. N-sample LLM decoding per request) blows the inference budget.
  • The target distribution is so tight (near-deterministic) that a point estimate is effectively lossless.

Adjacent patterns

Seen in
