Cascades (LLM inference)¶
Cascades (in the LLM-inference-serving sense) is a latency/cost-optimization technique that routes each request through a small, fast drafter model first and escalates to a large, powerful expert model only when the drafter's confidence in its own answer falls below a threshold. The drafter answers most of the traffic cheaply; the expert is reserved for the requests where the drafter admits it's out of its depth.
Mechanism¶
- Request arrives; drafter generates a candidate response.
- Drafter computes a confidence signal on the candidate (e.g. per-token log-probabilities, sequence-level self-assessment, a calibrated head).
- If confidence ≥ threshold → return the drafter's response to the user.
- Otherwise → discard the drafter's response and invoke the expert on the original prompt from scratch; return the expert's response.
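The four steps above can be sketched in a few lines. This is a minimal illustration, not an implementation from the source: the `drafter`/`expert` stand-ins, the threshold value, and the geometric-mean-token-probability confidence signal are all assumptions (the source only says the signal could be per-token log-probabilities, a sequence-level self-assessment, or a calibrated head).

```python
import math

def drafter(prompt):
    # Hypothetical small model: returns a candidate response plus
    # per-token log-probabilities for the confidence computation.
    return "drafted answer", [-0.1, -0.2, -0.05]

def expert(prompt):
    # Hypothetical large model: authoritative but expensive,
    # invoked on the original prompt from scratch.
    return "expert answer"

def cascade(prompt, threshold=0.8):
    text, token_logprobs = drafter(prompt)
    # One possible confidence signal: geometric mean of token probabilities.
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    if confidence >= threshold:
        return text          # fast path: drafter's answer returned directly
    return expert(prompt)    # slow path: drafter's work discarded
```

Note that on the slow path the drafter's output is simply thrown away, which is exactly the structural cost discussed next.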
The structural cost — sequential fallback¶
The fast path (high confidence) is very fast: only the drafter runs. The slow path (low confidence) pays both the drafter's full forward pass and then the expert's full forward pass from scratch, because the drafter's work doesn't carry forward to the expert: they're two independent model calls. This sequential wait is the structural limitation Google Research's 2025-09-11 post names as the motivation for the hybrid speculative-cascades design:
"This sequential 'wait-and-see' approach is a fundamental bottleneck." (Source: sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference)
In contrast, speculative decoding uses a single parallel verify pass instead of a re-run-from-scratch, and speculative cascades keep the parallel-verification primitive while generalising the accept / reject rule.
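The cost asymmetry can be made concrete with a toy expected-latency model. The traffic split and per-model latencies below are illustrative assumptions, not figures from the source:

```python
def expected_latency(p_easy, t_drafter, t_expert):
    """Average cascade latency: the fast path pays only the drafter;
    the slow path pays the drafter AND then the expert, sequentially."""
    return p_easy * t_drafter + (1 - p_easy) * (t_drafter + t_expert)

# Assumed numbers: 80% easy traffic, a 50 ms drafter, a 500 ms expert.
# Average: 0.8 * 50 + 0.2 * (50 + 500) = 150 ms vs 500 ms expert-only,
# but each individual slow-path request waits 550 ms, i.e. strictly
# longer than just calling the expert directly.
```

The averaged saving is real, but the tail gets worse: every escalated request pays the wasted drafter pass on top of the expert's full latency.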
When cascades work well¶
- Drafter's confidence is well-calibrated. If the drafter is over-confident on wrong answers, users see wrong answers on the fast path; if it's under-confident on right answers, the expert fires too often and the cost saving evaporates. This is the same calibration requirement that governs cheap-approximator-with-expensive-fallback at the per-query granularity.
- Traffic distribution is dominated by easy queries. If most real production traffic is within the drafter's competence, the amortised latency and cost drop toward the drafter's alone. If the traffic is adversarially hard, the cascade spends most of its time on the slow path plus a useless drafter pass.
- Quality ceiling is not set by the expert. Cascades can only match the expert on the slow path; the fast path caps at the drafter's quality. If the drafter is good enough for most users most of the time, this is fine; if not, the cost saving comes at a quality cost.
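The interaction between calibration and the quality ceiling can be sketched with a toy quality/cost model. All the numbers and the simplifying assumption that the expert is always correct are hypothetical, chosen only to show the direction of the trade-off:

```python
def cascade_stats(p_accept, p_correct_given_accept, c_drafter, c_expert):
    """Toy model: accepted traffic gets drafter-level quality;
    escalated traffic pays both models but gets (assumed-correct)
    expert output. Costs are in arbitrary units per request."""
    quality = p_accept * p_correct_given_accept + (1 - p_accept) * 1.0
    cost = c_drafter + (1 - p_accept) * c_expert
    return quality, cost

# Well-calibrated drafter: accepts 80% of traffic, right on 98% of it.
well = cascade_stats(0.80, 0.98, 1, 10)   # roughly (0.984, 3.0)
# Over-confident drafter: accepts 95% but is right on only 85% of it.
over = cascade_stats(0.95, 0.85, 1, 10)   # roughly (0.858, 1.5)
```

The over-confident drafter looks cheaper, but the saving is bought with wrong answers served on the fast path; an under-confident drafter moves the other way, pushing cost back toward expert-only.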
When cascades fall short¶
- Sequential wait on slow path. As above: the drafter's full forward pass is pure waste whenever the expert has to re-run from scratch.
- Coarse granularity. Cascades commit at the whole-response boundary. A prompt where the drafter is right for the first 20 tokens and wrong on the 21st still gets re-run end-to-end under the expert.
Relationship to other wiki primitives¶
- concepts/speculative-decoding — the parallel-verification cousin. Both sit on the same drafter-expert split, but speculative decoding verifies each drafter token in a single parallel expert pass rather than running the expert from scratch.
- systems/speculative-cascades — Google Research's 2025-09-11 hybrid: it combines cascades' confidence-driven accept rule with speculative decoding's parallel verifier, inheriting the speed of the latter and the flexibility of the former.
- patterns/cheap-approximator-with-expensive-fallback — the same "cheap-then-authoritative" shape at the per-query granularity across ML-for-systems workloads; cascades are the per-query variant specialised to two LLMs on the same serving stack.
- patterns/draft-verify-inference — generalised "cheap generator, expensive verifier" pattern at the LLM-token level.
Seen in¶
- sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference — cascades introduced as one of the two baseline latency-optimization techniques; the post frames the sequential-wait bottleneck as the motivation for speculative cascades' hybrid design.