CONCEPT Cited by 3 sources
Speculative decoding¶
Speculative decoding is an LLM-inference latency-optimisation technique: a small, fast drafter model proposes the next N tokens autoregressively, and a large, powerful expert model verifies all N in a single parallel forward pass. The expert accepts the prefix up to the first token that doesn't match its own preferred output and continues generation from there. The large model's wall-clock cost drops from N sequential single-token forwards to one parallel forward over the draft; when the drafter is right on a long run of tokens, the expert's compute is amortised across many accepted tokens per verify pass.
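The loop can be sketched end to end with toy stand-in models. The `drafter` and `expert_batch` callables below are illustrative assumptions (deterministic counters over a 10-token vocabulary), not real model APIs; the control flow is the point:

```python
# Minimal greedy speculative-decoding loop over a toy 10-token vocabulary.
# `drafter` and `expert_batch` are illustrative stand-ins, not real models.

def drafter(prefix):
    # Toy small model: predicts (last + 1) % 10, except it is wrong
    # whenever the last token is 7 (it predicts 0 instead of 8).
    last = prefix[-1]
    return 0 if last == 7 else (last + 1) % 10

def expert_batch(prefix, draft):
    # Toy large model: its own rule is always (last + 1) % 10. Verifying
    # teacher-forced on the draft mimics one parallel forward pass:
    # its preferred token at position i is conditioned on prefix + draft[:i].
    ctx = list(prefix)
    preferred = []
    for tok in draft:
        preferred.append((ctx[-1] + 1) % 10)
        ctx.append(tok)
    return preferred

def speculative_step(prefix, n_draft):
    # 1) Drafter proposes n_draft tokens autoregressively.
    ctx = list(prefix)
    draft = []
    for _ in range(n_draft):
        tok = drafter(ctx)
        draft.append(tok)
        ctx.append(tok)
    # 2) Expert verifies all of them in one (simulated) parallel pass.
    preferred = expert_batch(prefix, draft)
    # 3) Accept up to the first mismatch; on a mismatch the expert's own
    # token replaces it. (On full acceptance a real system would also
    # sample one bonus token from the expert's final-position logits.)
    accepted = []
    for d, p in zip(draft, preferred):
        if d != p:
            accepted.append(p)
            break
        accepted.append(d)
    return prefix + accepted

seq = [0]
while len(seq) < 12:
    seq = speculative_step(seq, n_draft=4)
print(seq[:12])
```

Each `speculative_step` call stands for one expert forward pass; the sequence grows by up to `n_draft` tokens per call instead of exactly one.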
Why parallel verification is cheap¶
LLM decoding is dominated by the cost of streaming the expert's weights out of GPU/TPU HBM on every forward pass (memory-bandwidth-bound at decode-time batch sizes) and by the quadratic attention cost over the prefix. Verifying N drafter tokens in one pass is cheaper than N sequential passes because:
- The weights are loaded once, not N times.
- The KV-cache entries for all N draft positions are written in one pass, not one position per step.
- The attention matmul is batched over N positions rather than running N separate one-position forwards.
The throughput win is bounded by the drafter's acceptance rate — if the expert rejects on token 1 nearly always, the pattern degenerates to running the expert anyway plus paying the drafter's wasted forward pass.
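A back-of-envelope memory-traffic model makes the first bullet concrete. The weight size and bandwidth figures below are illustrative assumptions (a ~70B-parameter fp16 expert on H100-class HBM), not measurements, and per-token activation/KV traffic is ignored:

```python
# Why one verify pass over N positions beats N sequential decode steps,
# in terms of weight traffic alone. All numbers are assumptions.

GB = 1e9
weight_bytes = 140 * GB   # assumed fp16 weights of a ~70B expert
hbm_bw = 3.35e12          # assumed HBM bandwidth in bytes/s (H100-class)

def decode_time_s(n_passes):
    # Each forward pass must stream the full weight set out of HBM at
    # least once; extra per-position compute in a parallel pass is small
    # by comparison and ignored in this model.
    return n_passes * weight_bytes / hbm_bw

N = 4
sequential = decode_time_s(N)   # N single-token expert forwards
verify = decode_time_s(1)       # one parallel forward over the N-token draft
print(sequential / verify)      # weights streamed once instead of N times
```

Under this model the ceiling on the speedup is exactly N; the acceptance rate determines how much of that ceiling is realised.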
The token-exact rejection rule (the canonical form)¶
In the canonical form of speculative decoding, rejection is token-exact: the expert compares each drafter token to its own next-token argmax (or samples under a rejection-sampling rule that preserves the expert's output distribution) and cuts the draft at the first mismatch. Google Research's 2025-09-11 post notes the structural consequence: the small model can produce a factually correct, semantically equivalent answer and still have its draft rejected because the expert's preferred first token is different. The post's "Who is Buzz Aldrin?" worked example makes this concrete: the small model's answer starts "Buzz...", the large model's starts "Edwin...", so Buzz ≠ Edwin at position 0 and the full draft is discarded (Source: sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference).
"Even though the small model produced a good answer, the requirement to match the large model token-by-token forces a rejection. We lose the speed benefit and end up with an answer that is not necessarily superior." (Source: sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference)
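The rule itself is a few lines. The token sequences below are hypothetical stand-ins for the post's example (the real answers are longer and tokenised differently), used only to show the position-0 cut:

```python
# Token-exact acceptance: keep the draft prefix up to (excluding) the
# first position where the drafter's token differs from the expert's
# own preferred (argmax) token at that position.

def token_exact_accept(draft, expert_preferred):
    accepted = []
    for d, e in zip(draft, expert_preferred):
        if d != e:
            break
        accepted.append(d)
    return accepted

# Hypothetical token sequences for "Who is Buzz Aldrin?":
small = ["Buzz", "Aldrin", "is", "an", "astronaut"]    # factually fine
large = ["Edwin", "Buzz", "Aldrin", "is", "an"]        # expert's argmax path

print(token_exact_accept(small, large))  # mismatch at position 0: nothing kept
```

Semantic equivalence never enters the predicate; only token identity does, which is exactly the blindness the quoted passage describes.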
Probabilistic-match rules — the generalisation¶
The same post notes that the formal paper defines a probabilistic-match rule that relaxes token-exact rejection to a distribution-aware acceptance predicate — roughly, accept when the drafter's token has high enough likelihood under the expert's distribution, not only when it's the argmax. This is the rule speculative cascades builds on. The raw markdown captured for this source references the paper for the formal statement but does not reproduce it; the wiki does not reconstruct the rule from external sources.
Relationship to other wiki primitives¶
- concepts/drafter-expert-split — the two-model architectural substrate speculative decoding sits on.
- concepts/cascades-llm-inference — the sequential-defer cousin; cascades decide at whole-response granularity on small-model confidence, while speculative decoding decides at token granularity on expert verification.
- concepts/token-verification — the specific primitive: expert evaluates a draft of N tokens in one parallel forward pass and emits an accept / reject per position.
- concepts/kv-cache — the structural reason parallel verification is cheaper than sequential decoding.
- systems/speculative-cascades — the hybrid that keeps parallel verification but swaps the rejection rule.
- patterns/draft-verify-inference — the generalised "cheap generator proposes, expensive verifier confirms" pattern at the LLM-token granularity.
Failure modes¶
- Low acceptance rate — drafter and expert disagree frequently; the verify pass rejects on token 1 most of the time; the drafter's forward-pass compute is net waste.
- Throughput cliff on long drafts — setting N too large makes the verify pass cost like a larger batch while most of the draft is discarded on early rejection; optimal N depends on the empirical acceptance distribution.
- Token-exact rejection throws away equivalents — the failure mode that motivates speculative cascades.
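The first two failure modes can be quantified with the standard expected-tokens-per-pass expression from the speculative sampling literature, under the simplifying assumption of an i.i.d. per-token acceptance probability `alpha` (real acceptance is position- and content-dependent, so this is a sketch, not the tuning procedure any particular system uses):

```python
# Expected tokens generated per verify pass when each draft token is
# accepted independently with probability alpha. Each pass yields the
# accepted run plus one token the expert supplies itself (the correction
# on a mismatch, or a bonus token on full acceptance), giving the closed
# form (1 - alpha**(n+1)) / (1 - alpha).

def expected_tokens_per_pass(alpha, n):
    return (1 - alpha ** (n + 1)) / (1 - alpha)

for alpha in (0.3, 0.8):
    curve = [round(expected_tokens_per_pass(alpha, n), 2) for n in (1, 4, 16)]
    print(alpha, curve)
# At alpha = 0.3 the curve flattens almost immediately: drafting more than
# a couple of tokens mostly produces work the expert discards. At
# alpha = 0.8, longer drafts keep paying off before flattening near 5.
```

The low-acceptance case is the degenerate pattern named above (barely more than one token per expert pass, drafter compute wasted), and the flattening of the curve is the throughput cliff that bounds useful N.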
Agentic workloads — where speculative decoding shines¶
Cloudflare's 2026-04-16 post on Workers AI's extra-large-model serving names agentic workloads as the class where speculative decoding pays off the most:
"In agentic use cases, speculative decoding really shines because of the volume of tool calls and structured outputs that models need to generate. A tool call is largely predictable — you know there will be a name, description, and it's wrapped in a JSON envelope." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
The drafter's acceptance rate is high on structurally predictable generations: JSON envelopes, tool-call schemas, MCP response formats. Each accepted draft token is a sequential expert forward pass avoided, so high-acceptance regions compound into large end-to-end speedups. For Kimi K2.5, Cloudflare uses NVIDIA's EAGLE-3 (nvidia/Kimi-K2.5-Thinking-Eagle3) as the drafter. The principal tuning dial is N, the number of future tokens to draft per verify pass; the optimal N depends on the drafter/expert pair's empirical acceptance distribution.
Seen in¶
- sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference — pedagogical treatment of speculative decoding as one of the two baseline techniques speculative cascades improves on. Introduces the "Buzz Aldrin" worked example that illustrates the token-exact rejection's semantic-blindness failure mode.
- sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers — the SLED post frames speculative decoding as the canonical latency-optimising instance of decoding-step modification and positions factuality decoding (SLED, DoLa) as the factuality-optimising sibling category: "There have been many famous improvements to the decoding process, such as speculative decoding, which improves the speed at which LLMs generate text. Similarly, it should be possible to employ an analogous method of 'factuality decoding'." Same architectural insertion point, different objectives; both training-free, both serving-time.
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — canonical wiki instance of speculative decoding in production agentic serving: Cloudflare Workers AI uses NVIDIA EAGLE-3 as the drafter for Kimi K2.5; "tool call is largely predictable — you know there will be a name, description, and it's wrapped in a JSON envelope" motivates the acceptance-rate-on-structured-output win. Token-exact rejection rule (canonical form) in use, not the generalised probabilistic-match rule of systems/speculative-cascades.
Related¶
- concepts/cascades-llm-inference
- concepts/drafter-expert-split
- concepts/token-verification
- concepts/kv-cache
- systems/speculative-cascades
- systems/eagle-3 — canonical drafter model class in industrial use.
- systems/kimi-k2-5 — canonical target model + EAGLE-3 pair in Cloudflare's stack.
- systems/workers-ai — production agentic-serving deployment.
- patterns/draft-verify-inference
- patterns/cheap-approximator-with-expensive-fallback — sibling "cheap-then-authoritative" pattern at the per-query granularity.
- concepts/llm-decoding-step — the architectural insertion point shared with factuality decoding.
- concepts/factuality-decoding — sibling decode-time category optimising for factuality instead of latency.
- systems/sled — factuality-decoding sibling at the same insertion point.