PATTERN Cited by 5 sources
Cheap approximator with expensive fallback¶
Serve most queries with a fast, low-cost ML approximator; fall back to the slow authoritative solver only when the approximator reports high uncertainty. The calibration of the approximator's uncertainty is what makes the fallback trigger legitimate — without calibrated uncertainty, the pattern degenerates to always-fast-but-sometimes-wrong or always-slow.
Intent¶
Many production systems have an authoritative but expensive solver / simulator / model at the heart of a control loop:
- Cluster schedulers running a combinatorial bin-packer.
- Query optimisers running cost estimation.
- Routing systems running shortest-path / min-cost-flow.
- Compilers running cost-model-driven instruction selection.
Running the authoritative solver every time is too slow for the outer loop's latency budget, but running a naive approximator silently returns wrong answers on the cases where approximation breaks down.
The pattern is: build a fast approximator that also knows when it doesn't know (calibrated uncertainty), and gate the authoritative slow path on that uncertainty.
Mechanism¶
- Train an approximator `f_fast(x) → (ŷ, uncertainty(ŷ))` against outputs of the authoritative solver `f_slow(x)`.
- Serve queries through `f_fast`.
- If `uncertainty(ŷ) < threshold`, return `ŷ`.
- Otherwise, invoke `f_slow(x)` and return the authoritative answer.
- Log / feed back the high-uncertainty cases as new training data — next round's approximator should be able to handle them.
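The loop above can be sketched in a few lines. This is a minimal illustration, not any particular system's implementation; the names `FallbackServer`, `f_fast`, `f_slow`, and `threshold` are taken from the mechanism description, and the retrain queue stands in for whatever logging pipeline feeds the next training round:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

@dataclass
class FallbackServer:
    """Serve via the cheap approximator; escalate on high uncertainty."""
    f_fast: Callable[[Any], Tuple[float, float]]  # x -> (y_hat, uncertainty)
    f_slow: Callable[[Any], float]                # authoritative solver
    threshold: float
    retrain_queue: List[Any] = field(default_factory=list)

    def serve(self, x):
        y_hat, u = self.f_fast(x)
        if u < self.threshold:
            return y_hat                # fast path: confident prediction
        self.retrain_queue.append(x)    # log the hard case for the next round
        return self.f_slow(x)           # slow path: authoritative answer
```

Note that the slow path pays for both calls (`f_fast` runs on every query), which is why the fast/slow cost ratio has to be large for the pattern to pay off.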
The calibration requirement¶
Calibration is load-bearing. The pattern requires that `P(|ŷ - y_true| > δ | uncertainty(ŷ))` matches the advertised interval width, so the fallback threshold is a well-defined operational knob.
Without calibration:
- If uncertainty is under-reported, dangerous predictions slip through the fast path.
- If uncertainty is over-reported, the slow path fires too often; the cost savings evaporate.
- Either way, tuning the threshold becomes guesswork.
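The two failure modes above can be checked empirically. A minimal sketch, assuming the approximator advertises a Gaussian-style `±z·σ` interval (the function name and the Gaussian assumption are illustrative, not from the source):

```python
import numpy as np

def coverage_gap(y_hat, sigma, y_true, z=1.96):
    """Nominal minus empirical coverage of the advertised +/- z*sigma interval.

    A well-calibrated approximator covers ~95% of residuals at z=1.96.
    A large positive gap means uncertainty is under-reported (dangerous
    predictions slip through the fast path); a large negative gap means
    it is over-reported (the slow path fires too often).
    """
    inside = np.abs(np.asarray(y_true) - np.asarray(y_hat)) <= z * np.asarray(sigma)
    nominal = 0.95  # two-sided Gaussian coverage at z = 1.96
    return nominal - inside.mean()
```

Running this on a held-out slice of `f_slow` outputs, bucketed by advertised uncertainty, is the basic sanity check before trusting the threshold as an operational knob.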
The 2025-07-29 Google post's load-bearing empirical claim for this pattern is: "the RLM's prediction uncertainty is correlated with residual squared error, allowing us to quantify the model's confidence in its predictions" (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models). That correlation is what makes the fallback threshold meaningful.
Canonical wiki instance¶
Google's 2025-07-29 RLM deployment on Borg MIPS-per-GCU prediction:
- Fast path: 60M-param RLM reading YAML/JSON cluster state, emitting predicted MIPS per GCU with sampled-distribution-width as uncertainty.
- Slow path: Borg's specialised bin-packing algorithm running inside the digital-twin backtester — the authoritative answer.
- Fallback trigger: high predicted-distribution width, calibrated against residual squared error.
Google frames this explicitly: "when uncertain, the predicted distribution is broader, signalling that the predictions should be treated with more caution. This enables us to know when to rely more heavily on the regressor, and when to potentially fall-back to slower but more accurate bin-packing simulations in managing the compute clusters." (Source: sources/2025-07-29-google-simulating-large-systems-with-regression-language-models).
When it's the right shape¶
- `f_slow` is correct but too slow for the inner loop.
- A meaningful fraction of queries are "easy" — the approximator is within tolerance on them.
- Uncertainty-calibrated approximators are feasible (ensembles, sampled decoders, quantile regression, conformal prediction).
- The fast/slow cost ratio is large enough that even a conservative fallback threshold still pays off.
When it's the wrong shape¶
- Every query is equally "hard" — the fast path is rarely accurate; fallback fires on most queries anyway.
- `f_slow` is itself close to real-time — no speedup to capture.
- Downstream consumers can't tolerate any fallback latency variance — the system is better off running `f_slow` always.
- Uncertainty calibration can't be achieved for the chosen approximator class.
Adjacent shapes on the wiki¶
- patterns/approver-discarder-filter — similar two-tier structure (cheap filter + authoritative gate) in policy evaluation. Different signal source (rule classification, not calibrated uncertainty), same economics.
- Speed-accuracy trade-off — the broader principle. This pattern is one answer: use the fast path on the confident regime, the slow path on the uncertain regime, so the operating curve is a min of the two rather than a fixed interpolation.
Contrast with teacher-student model compression¶
Both patterns address the same economic force — an expensive reference computation and a cheap runtime budget — and both use a small model to stand in for a large one. The distinguishing axis is whether the expensive computation is reachable at serving time.
| Axis | Cheap-approximator-with-fallback | Teacher-student model compression |
|---|---|---|
| Expensive computation reachable at serving time | Yes | No |
| Small model has to cover full distribution | No — fallback covers OOD | Yes |
| Calibrated uncertainty load-bearing | Yes (gates fallback) | Usually not required |
| Canonical wiki instance | Google RLM / Borg bin-packer | YouTube real-time gen-AI effects (mobile) |
When the serving substrate is a datacentre (RLM / Borg), the expensive bin-packer sits one RPC away; calibrated uncertainty makes the runtime fallback legitimate. When the serving substrate is a phone (YouTube effects), the teacher is unreachable in real time; the student has to be good enough on the full distribution because there's no runtime escape hatch. Substrate drives the choice (Source: sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai).
Seen in¶
- sources/2025-07-29-google-simulating-large-systems-with-regression-language-models — RLM + Borg bin-packing simulator; calibrated-uncertainty-gated fallback.
- sources/2025-08-21-google-from-massive-models-to-mobile-magic-tech-behind-youtube-real-time-generative-ai — sibling-but-different instance: YouTube's mobile generative AI effects use the same cheap-stand-in-for-expensive shape via knowledge distillation, but have no runtime fallback to the teacher because the phone can't reach the teacher at camera-frame rate. Sharpens the pattern's substrate-dependence argument.
- sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference — per-token granularity cousin. LLM speculative decoding / speculative cascades apply the same "cheap generator, expensive verifier" economics at the token boundary via draft-verify inference; the acceptance rule per position takes the place of per-query uncertainty calibration.
- sources/2025-10-17-google-solving-virtual-machine-puzzles-lava — sibling pattern: LAVA's LARS uses calibrated learned lifetime distributions as control signal, but the consequence of high uncertainty is "wait for the prediction to update via continuous reprediction and possibly migrate later", not "call the slow solver now". Same calibrated-uncertainty-as-load-bearing discipline, different fallback action; clarifies this pattern's "expensive fallback" axis is really "any uncertainty-gated expensive action", not specifically "run the slow solver".
- sources/2026-04-07-mongodb-predictive-auto-scaling-an-experiment — control-loop sibling at managed-database-tier autoscaling. MongoDB Atlas's predictive auto-scaler uses the self-censoring Long-Term Forecaster as the "cheap approximator"; when its recent-accuracy gate fails, the fallback is another predictor (the Short-Term Forecaster — patterns/short-plus-long-term-forecaster) rather than an expensive authoritative solver. Reactive auto-scaling is the ultimate backstop, sitting below both predictors. Variant topology (two predictors + reactive backstop vs. predictor + slow solver) — same calibration-as-gate discipline, richer fallback hierarchy. Reinforces the "the pattern's axis is uncertainty-gated expensive action" reading from the LAVA instance above: in MongoDB's case the "expensive action" is acting on the prediction at all, and the uncertainty gate decides between tiers of decreasingly-trusted predictors.
Related¶
- concepts/uncertainty-quantification
- concepts/aleatoric-uncertainty
- concepts/performance-prediction
- concepts/bin-packing
- concepts/digital-twin-backtesting
- systems/regression-language-model
- systems/borg
- concepts/speed-accuracy-tradeoff
- patterns/approver-discarder-filter
- patterns/teacher-student-model-compression — sibling pattern without a runtime fallback; use when the serving substrate can't reach the expensive reference at request time.
- concepts/knowledge-distillation
- patterns/draft-verify-inference — per-token-granularity cousin for LLM serving; uses a per-position acceptance rule in place of per-query uncertainty.
- patterns/lifetime-aware-rescheduling — sibling pattern at the cluster-scheduler layer; uses calibrated lifetime distributions as control signal, but the "expensive action" is migrating VMs, not calling a slow solver.
- patterns/learned-distribution-over-point-prediction — the representation-side pattern this one often consumes.
- concepts/continuous-reprediction
- systems/lava-vm-scheduler
- concepts/speculative-decoding
- systems/speculative-cascades