PATTERN Cited by 5 sources

Cheap approximator with expensive fallback

Serve most queries with a fast, low-cost ML approximator; fall back to the slow authoritative solver only when the approximator reports high uncertainty. The calibration of the approximator's uncertainty is what makes the fallback trigger legitimate — without calibrated uncertainty, the pattern degenerates to always-fast-but-sometimes-wrong or always-slow.

Intent

Many production systems have an authoritative but expensive solver / simulator / model at the heart of a control loop:

  • Cluster schedulers running a combinatorial bin-packer.
  • Query optimisers running cost estimation.
  • Routing systems running shortest-path / min-cost-flow.
  • Compilers running cost-model-driven instruction selection.

Running the authoritative solver every time is too slow for the outer loop's latency budget, but running a naive approximator silently returns wrong answers on the cases where approximation breaks down.

The pattern is: build a fast approximator that also knows when it doesn't know (calibrated uncertainty), and gate the authoritative slow path on that uncertainty.

Mechanism

  1. Train an approximator f_fast(x) → (ŷ, uncertainty(ŷ)) against outputs of the authoritative solver f_slow(x).
  2. Serve queries through f_fast.
  3. If uncertainty(ŷ) < threshold, return ŷ.
  4. Otherwise, invoke f_slow(x) and return the authoritative answer.
  5. Log / feed back the high-uncertainty cases as new training data — next round's approximator should be able to handle them.
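The mechanism above fits in a few lines. A minimal sketch of the gating loop — `serve`, `f_fast`, `f_slow`, and `retrain_log` are all hypothetical names, not from the source:

```python
def serve(x, f_fast, f_slow, threshold, retrain_log):
    """Serve one query: fast path when confident, authoritative fallback otherwise."""
    y_hat, uncertainty = f_fast(x)
    if uncertainty < threshold:
        return y_hat              # confident: return the cheap approximation
    retrain_log.append(x)         # feed hard cases back as future training data
    return f_slow(x)              # uncertain: pay for the authoritative answer
```

The only tunable is `threshold`, which is exactly why the calibration requirement below matters: it is what gives that knob a stable operational meaning.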

The calibration requirement

Calibration is load-bearing. The pattern requires that the empirical miss probability P(|ŷ - y_true| > δ | uncertainty(ŷ)) actually track the reported uncertainty — predictions advertised as confident must miss rarely, and misses must concentrate where reported uncertainty is high — so the fallback threshold is a well-defined operational knob.

Without calibration:

  • If uncertainty is under-reported, dangerous predictions slip through the fast path.
  • If uncertainty is over-reported, the slow path fires too often; the cost savings evaporate.
  • Either way, tuning the threshold becomes guesswork.
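One way to turn calibration into a concrete threshold (a generic sketch on held-out data, not the Google method): sort predictions by reported uncertainty and pick the largest threshold at which the fast path's miss rate — the fraction of predictions with error above a tolerance δ — stays acceptable. All names here are illustrative:

```python
import numpy as np

def pick_threshold(uncertainties, abs_errors, delta, max_miss_rate):
    """Largest uncertainty threshold such that predictions below it
    have |error| > delta at most max_miss_rate of the time."""
    order = np.argsort(uncertainties)
    u = np.asarray(uncertainties, dtype=float)[order]
    e = np.asarray(abs_errors, dtype=float)[order]
    # running miss rate over the lowest-uncertainty prefix
    miss = np.cumsum(e > delta) / np.arange(1, len(e) + 1)
    ok = np.nonzero(miss <= max_miss_rate)[0]
    return float(u[ok[-1]]) if len(ok) else 0.0
```

If uncertainty is well calibrated, `miss` grows smoothly with the threshold and this procedure is stable; if it is not, `miss` jumps around and the returned threshold is the guesswork described above.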

The 2025-07-29 Google post's load-bearing empirical claim for this pattern is: "the RLM's prediction uncertainty is correlated with residual squared error, allowing us to quantify the model's confidence in its predictions" (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models). That correlation is what makes the fallback threshold meaningful.

Canonical wiki instance

Google's 2025-07-29 RLM deployment on Borg MIPS-per-GCU prediction:

  • Fast path: 60M-param RLM reading YAML/JSON cluster state, emitting predicted MIPS per GCU with sampled-distribution-width as uncertainty.
  • Slow path: Borg's specialised bin-packing algorithm running inside the digital-twin backtester — the authoritative answer.
  • Fallback trigger: high predicted-distribution width, calibrated against residual squared error.

Google frames this explicitly: "when uncertain, the predicted distribution is broader, signalling that the predictions should be treated with more caution. This enables us to know when to rely more heavily on the regressor, and when to potentially fall-back to slower but more accurate bin-packing simulations in managing the compute clusters." (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
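The sampled-distribution-width trigger generalises to any stochastic predictor: draw several samples per query and use their spread as the uncertainty signal. A generic sketch — the RLM's actual decoding details are not public in this form, and `sample_fn` is a hypothetical stand-in:

```python
import statistics

def predict_with_width(sample_fn, x, n_samples=8):
    """Draw n samples from a stochastic predictor; return (mean, spread).
    A wide spread signals low confidence and should trigger the fallback."""
    samples = [sample_fn(x) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)
```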

When it's the right shape

  • f_slow is correct but too slow for the inner loop.
  • A meaningful fraction of queries are "easy" — the approximator is within tolerance on them.
  • Uncertainty-calibrated approximators are feasible (ensembles, sampled decoders, quantile regression, conformal prediction).
  • The fast/slow cost ratio is large enough that even a conservative fallback threshold still pays off.
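The last point is simple arithmetic: with fast-path cost c_fast, slow-path cost c_slow, and fallback rate p, the expected per-query cost is c_fast + p·c_slow, which beats always-slow whenever p < 1 − c_fast/c_slow. The numbers below are illustrative, not from the source:

```python
def expected_cost(c_fast, c_slow, fallback_rate):
    """Expected per-query cost: the fast path always runs,
    the slow path runs on a fallback_rate fraction of queries."""
    return c_fast + fallback_rate * c_slow

# e.g. 1 ms fast path, 100 ms slow path: break-even at p = 0.99,
# so even a 50% fallback rate roughly halves cost vs always-slow
```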

When it's the wrong shape

  • Every query is equally "hard" — the fast path is rarely accurate; fallback fires on most queries anyway.
  • f_slow is itself close to real-time — no speedup to capture.
  • Downstream consumers can't tolerate any fallback latency variance — the system is better off running f_slow always.
  • Uncertainty calibration can't be achieved for the chosen approximator class.

Adjacent shapes on the wiki

  • patterns/approver-discarder-filter — similar two-tier structure (cheap filter + authoritative gate) in policy evaluation. Different signal source (rule classification, not calibrated uncertainty), same economics.
  • Speed-accuracy trade-off — the broader principle. This pattern is one answer: use the fast path on the confident regime, the slow path on the uncertain regime, so the operating curve is a min of the two rather than a fixed interpolation.

Contrast with teacher-student model compression

Both patterns address the same economic force — an expensive reference computation and a cheap runtime budget — and both use a small model to stand in for a large one. The distinguishing axis is whether the expensive computation is reachable at serving time.

| Axis | Cheap-approximator-with-fallback | Teacher-student model compression |
| --- | --- | --- |
| Expensive computation reachable at serving time | Yes | No |
| Small model has to cover full distribution | No — fallback covers OOD | Yes |
| Calibrated uncertainty load-bearing | Yes (gates fallback) | Usually not required |
| Canonical wiki instance | Google RLM / Borg bin-packer | YouTube real-time gen-AI effects (mobile) |

When the serving substrate is a datacentre (RLM / Borg), the expensive bin-packer sits one RPC away; calibrated uncertainty makes the runtime fallback legitimate. When the serving substrate is a phone (YouTube effects), the teacher is unreachable in real time; the student has to be good enough on the full distribution because there's no runtime escape hatch. Substrate drives the choice (Source: sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai).
