CONCEPT Cited by 4 sources
Speculative decoding¶
Speculative decoding is an LLM-inference latency-optimization technique: a small, fast drafter model proposes the next N tokens autoregressively, and a large, powerful expert model verifies all N in a single parallel forward pass, accepting the prefix up to the first token that doesn't match its own preferred output and continuing generation from there. The large model's wall-clock cost drops from N sequential single-token forwards to one parallel forward over the draft, so if the drafter is right on a long run of tokens the expert's compute is amortised across many accepted tokens per verify-pass.
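The draft-then-verify loop can be sketched as follows. This is a minimal illustration of the greedy (token-exact) variant, assuming toy next-token functions in place of real models; `speculative_generate`, `toy_expert`, and `toy_drafter` are hypothetical names, and the list comprehension in step 2 only stands in for what a real system would run as one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_generate(expert: NextToken, drafter: NextToken,
                         prompt: List[Token], n_draft: int,
                         max_new: int) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (new tokens, expert verify passes)."""
    out: List[Token] = []
    verify_passes = 0
    while len(out) < max_new:
        ctx = prompt + out
        # 1. Drafter proposes n_draft tokens autoregressively.
        draft: List[Token] = []
        for _ in range(n_draft):
            draft.append(drafter(ctx + draft))
        # 2. Expert scores every draft position (one parallel pass in a real
        #    system; simulated here position by position).
        verify_passes += 1
        expert_choice = [expert(ctx + draft[:i]) for i in range(n_draft + 1)]
        # 3. Accept the longest prefix that matches the expert's own choice.
        accepted = 0
        while accepted < n_draft and draft[accepted] == expert_choice[accepted]:
            accepted += 1
        # 4. Keep the accepted prefix plus one token from the expert itself:
        #    the correction at the mismatch, or a bonus token if all matched.
        out.extend(draft[:accepted] + [expert_choice[accepted]])
    return out[:max_new], verify_passes


def toy_expert(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10  # counts upward modulo 10


def toy_drafter(ctx: List[Token]) -> Token:
    # Agrees with the expert except right after a 4, where it guesses 0.
    return 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10


tokens, passes = speculative_generate(toy_expert, toy_drafter, [0], n_draft=5, max_new=12)
```

With these toy models the output is identical to what the expert would produce alone, but it takes 3 verify passes instead of 12 single-token expert passes.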
Why parallel verification is cheap¶
LLM decoding at small batch sizes is typically memory-bandwidth-bound: every forward pass streams the expert's weights out of GPU/TPU HBM, and on long contexts it also reads the KV cache for the whole prefix. Verifying N drafter tokens in one pass is cheaper than N sequential passes because:
- The weights are loaded once, not N times.
- The prefix's KV cache is read once, and the K/V entries for all N draft positions are computed in the same pass rather than one position per pass.
- The attention matmul is batched over N positions rather than running N separate one-position forwards.
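The weight-traffic argument can be made concrete with back-of-envelope arithmetic. The sketch below assumes decoding time is dominated by streaming weights from HBM and ignores compute and KV-cache reads; the parameter count, precision, and bandwidth are hypothetical numbers chosen only for illustration.

```python
def weight_read_seconds(num_params: float, bytes_per_param: float,
                        hbm_bytes_per_s: float, passes: int) -> float:
    """Time spent streaming weights from HBM across `passes` forward passes."""
    return passes * num_params * bytes_per_param / hbm_bytes_per_s


# Hypothetical figures, not taken from any cited source.
PARAMS = 70e9        # expert parameter count
BYTES = 2            # 16-bit weights
BANDWIDTH = 2e12     # aggregate HBM bandwidth, 2 TB/s
N = 4                # draft length

sequential = weight_read_seconds(PARAMS, BYTES, BANDWIDTH, passes=N)  # N single-token passes
verify_once = weight_read_seconds(PARAMS, BYTES, BANDWIDTH, passes=1)  # one verify pass
```

Under these assumptions the sequential path spends N times as long on weight reads as a single verify pass over the same N positions.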
The speedup is bounded by the drafter's acceptance rate: if the expert nearly always rejects the first draft token, the scheme degenerates to running the expert one token at a time while also paying for the drafter's wasted forward passes.
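The dependence on acceptance rate can be sketched with a simple cost model. This is a standard simplification from the speculative-decoding literature rather than something the cited posts state, and it rests on three assumptions: each draft token is accepted independently with probability `alpha`, a verify pass costs the same as one ordinary expert forward pass, and each drafter token costs `drafter_cost_ratio` times an expert pass. The function names are hypothetical.

```python
def expected_tokens_per_pass(alpha: float, n_draft: int) -> float:
    """Expected tokens emitted per verify pass, counting the expert's own
    correction/bonus token: 1 + alpha + alpha^2 + ... + alpha^n_draft."""
    if alpha >= 1.0:
        return n_draft + 1.0
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, n_draft: int, drafter_cost_ratio: float) -> float:
    """Tokens per unit of expert-pass cost, relative to plain decoding (1.0)."""
    cost_per_pass = n_draft * drafter_cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, n_draft) / cost_per_pass


good = estimated_speedup(alpha=0.8, n_draft=4, drafter_cost_ratio=0.05)
poor = estimated_speedup(alpha=0.1, n_draft=4, drafter_cost_ratio=0.05)
```

In this model a high acceptance rate yields a speedup well above 1, while a low one drops below 1, meaning the drafter's overhead makes decoding slower than running the expert alone.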
The token-exact rejection rule (the canonical form)¶
In the canonical form of speculative decoding, rejection is token-exact: the expert compares each drafter token to its own next-token argmax (or samples under a rejection-sampling rule that preserves the expert's output distribution) and cuts the draft at the first mismatch. Google Research's 2025-09-11 post notes the structural consequence: the small model can produce a factually correct, semantically equivalent answer and still have its draft rejected because the expert's preferred first token is different. The post's "Who is Buzz Aldrin?" worked example shows this: the small model's answer starts "Buzz...", the large model's starts "Edwin...", so Buzz ≠ Edwin at position 0 and the full draft is discarded (Source: sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference).
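The cut-at-first-mismatch behaviour on the Buzz Aldrin example can be sketched as below. Only the first tokens ("Buzz" and "Edwin") come from the post; the remaining tokens, the second example, and the helper name `accepted_prefix` are hypothetical. `expert_choice[i]` is assumed to be the expert's argmax given the prompt plus the first `i` draft tokens.

```python
from typing import List, Sequence


def accepted_prefix(draft: Sequence[str], expert_choice: Sequence[str]) -> List[str]:
    """Return the draft tokens kept under the token-exact rule."""
    kept: List[str] = []
    for drafted, preferred in zip(draft, expert_choice):
        if drafted != preferred:
            break  # first mismatch: this token and everything after it is dropped
        kept.append(drafted)
    return kept


# Mismatch at position 0: nothing survives, however good the rest is.
buzz = accepted_prefix(
    draft=["Buzz", "Aldrin", "is", "an", "astronaut"],
    expert_choice=["Edwin", "Aldrin", "is", "an", "astronaut"],
)

# Mismatch at position 2: the two matching tokens are kept.
partial = accepted_prefix(
    draft=["The", "cat", "sat"],
    expert_choice=["The", "cat", "slept"],
)
```

Here `buzz` is empty even though four of the five positions agree, which is the semantic blindness the post describes.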
"Even though the small model produced a good answer, the requirement to match the large model token-by-token forces a rejection. We lose the speed benefit and end up with an answer that is not necessarily superior." (Source: sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference)
Probabilistic-match rules — the generalisation¶
The same post notes that the formal paper defines a probabilistic-match rule that relaxes token-exact rejection to a distribution-aware acceptance predicate — roughly, accept when the drafter's token has high enough likelihood under the expert's distribution, not only when it's the argmax. This is the rule speculative cascades builds on. The raw markdown captured for this source references the paper for the formal statement but does not reproduce it; the wiki does not reconstruct the rule from external sources.
Relationship to other wiki primitives¶
- concepts/drafter-expert-split — the two-model architectural substrate speculative decoding sits on.
- concepts/cascades-llm-inference — the sequential-defer cousin; cascades decide at the whole-response granularity on small-model confidence, where speculative decoding decides at the token granularity on expert-verification.
- concepts/token-verification — the specific primitive: expert evaluates a draft of N tokens in one parallel forward pass and emits an accept / reject per position.
- concepts/kv-cache — the structural reason parallel verification is cheaper than sequential decoding.
- systems/speculative-cascades — the hybrid that keeps parallel verification but swaps the rejection rule.
- patterns/draft-verify-inference — the generalised "cheap generator proposes, expensive verifier confirms" pattern at the LLM-token granularity.
Failure modes¶
- Low acceptance rate — drafter and expert disagree frequently; the verify pass rejects on token 1 most of the time; the drafter's forward-pass compute is net waste.
- Throughput cliff on long drafts — N too large means the verify pass costs like a bigger batch, but most of it is discarded on early rejection; optimal N depends on the empirical acceptance distribution.
- Token-exact rejection throws away equivalents — the failure mode that motivates speculative cascades.
Agentic workloads — where speculative decoding shines¶
Cloudflare's 2026-04-16 post on Workers AI's extra-large-model serving names agentic workloads as the class where speculative decoding pays off the most:
"In agentic use cases, speculative decoding really shines because of the volume of tool calls and structured outputs that models need to generate. A tool call is largely predictable — you know there will be a name, description, and it's wrapped in a JSON envelope." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
The drafter's acceptance rate is high on structurally-predictable generations: JSON envelopes, tool-call schemas, MCP response formats. Each accepted token is a full expert-pass avoided, so high-acceptance regions compound into large end-to-end speedups. For Kimi K2.5, Cloudflare uses NVIDIA's EAGLE-3 (nvidia/Kimi-K2.5-Thinking-Eagle3) as the drafter. The principal tuning dial is N = number of future tokens to draft per verify-pass; optimal N depends on the drafter/expert pair's empirical acceptance distribution.
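One way to pick N from an empirical acceptance distribution is sketched below. It assumes logged accepted-prefix lengths collected at some maximum draft length, a prefix-acceptance rule (so a shorter draft of length n would have kept `min(a, n)` tokens), and a drafter whose per-token cost is a fixed fraction of an expert pass. The logs, the cost ratio, and the function names are hypothetical and are not Cloudflare's data or tooling.

```python
from typing import Sequence


def tokens_per_cost(accepted_lengths: Sequence[int], n_draft: int,
                    drafter_cost_ratio: float) -> float:
    """Mean tokens emitted per unit of expert-pass cost at draft length n_draft."""
    # Each verify pass emits the accepted prefix plus one token from the expert.
    emitted = sum(min(a, n_draft) + 1 for a in accepted_lengths)
    cost = len(accepted_lengths) * (n_draft * drafter_cost_ratio + 1.0)
    return emitted / cost


def best_draft_length(accepted_lengths: Sequence[int], max_n: int,
                      drafter_cost_ratio: float) -> int:
    """Draft length in 1..max_n that maximises tokens per unit cost."""
    return max(range(1, max_n + 1),
               key=lambda n: tokens_per_cost(accepted_lengths, n, drafter_cost_ratio))


# Hypothetical logs gathered with drafts of length 8.
tool_call_log = [8, 8, 7, 8, 6, 8, 8, 8]   # structured output: long accepted runs
free_prose_log = [0, 1, 0, 1, 0, 1, 0, 0]  # open-ended text: early rejections

n_structured = best_draft_length(tool_call_log, max_n=8, drafter_cost_ratio=0.1)
n_prose = best_draft_length(free_prose_log, max_n=8, drafter_cost_ratio=0.1)
```

On these made-up logs the structured workload favours the longest draft while the prose workload favours the shortest, which is the sense in which N is workload-dependent.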
Google's TPU-codesigned extensions — block verification + tree drafting¶
The 2026-05-28 Google Research I/O 2026 roundup post names two production-deployed Google extensions to speculative decoding, both implemented with TPU architecture-specific optimization:
- Block verification (arXiv:2403.10444) — modifies the verifier granularity from per-token to per-block. The expert accepts or rejects a block of N drafted tokens jointly rather than evaluating each position independently. Increases the expected accepted-token count per verifier pass by relaxing the per-token-mismatch failure-mode of the canonical rule.
- Tree-structured drafting — modifies the drafter's output topology from a sequence of N tokens to a tree of candidate continuations. "Intelligently explores multiple candidate continuations at once and accepts more tokens per step." The verifier processes the tree's shared prefixes in one parallel pass exploiting the KV-cache, so drafter-side breadth is bought at near-constant verifier-side cost.
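One common way to verify a token tree in a single pass is an attention mask in which each draft node attends only to itself and its ancestors, so branches that share a prefix reuse it without seeing each other. The sketch below illustrates that general idea only; the post does not describe Google's implementation, and the function name, tree encoding, and example tokens are hypothetical.

```python
from typing import List, Optional


def ancestor_mask(parents: List[Optional[int]]) -> List[List[bool]]:
    """mask[i][j] is True when draft node i may attend to draft node j.

    parents[i] is the index of node i's parent, or None when the node hangs
    directly off the already-cached prefix.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j: Optional[int] = i
        while j is not None:
            mask[i][j] = True  # a node sees itself and every ancestor
            j = parents[j]
    return mask


# Hypothetical draft tree with two branches sharing the first token:
#   0: "get" -> 1: "_weather" -> 3: "("
#            -> 2: "_time"    -> 4: "("
parents = [None, 0, 0, 1, 2]
mask = ancestor_mask(parents)
```

All five nodes can then be scored in one verifier pass: node 3 sees nodes 0, 1, and 3 but not the sibling branch through node 2.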
Together with the 2025-09-11 speculative cascades hybrid (which modifies the verifier's acceptance rule from token-exact to probabilistic-match), the wiki now has three distinct Google Research speculative-decoding extensions, each modifying a different axis:
| Extension | Source | Axis modified |
|---|---|---|
| systems/speculative-cascades | 2025-09-11 | Verifier acceptance rule (token-exact → probabilistic-match) |
| concepts/block-verification | 2026-05-28 | Verifier granularity (token → block) |
| concepts/tree-structured-drafting | 2026-05-28 | Drafter output topology (sequence → tree) |
All three are algorithmically composable. The 2026-05-28 post attributes the current speed of Gemini 3.5 Flash (powering Antigravity and AI Studio in addition to the Gemini consumer surface) to block verification + tree-structured drafting on TPU (Source: sources/2026-05-28-google-a-new-era-of-innovation-google-research-at-io-2026). This makes Gemini 3.5 Flash the wiki's first canonical production-LLM instance of speculative decoding deployed at hyperscale on Google's first-party stack — sibling to Workers AI's deployment of EAGLE-3 + Kimi K2.5 on a third-party stack.
The post is also explicit that the implementation is "highly optimized for Google's TPU architecture, maximizing hardware utilization to deliver substantially faster responses with no loss in quality" — making this the wiki's canonical hardware/software codesign instance at the LLM-serving layer.
Seen in¶
- sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference — pedagogical treatment of speculative decoding as one of the two baseline techniques speculative cascades improves on. Introduces the "Buzz Aldrin" worked example that illustrates the token-exact rejection's semantic-blindness failure mode.
- sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers — the SLED post frames speculative decoding as the canonical latency-optimising instance of decoding-step modification and positions factuality decoding (SLED, DoLa) as the factuality-optimising sibling category: "There have been many famous improvements to the decoding process, such as speculative decoding, which improves the speed at which LLMs generate text. Similarly, it should be possible to employ an analogous method of 'factuality decoding'." Same architectural insertion point, different objectives; both training-free, both serving-time.
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — canonical wiki instance of speculative decoding in production agentic serving: Cloudflare Workers AI uses NVIDIA EAGLE-3 as the drafter for Kimi K2.5; "tool call is largely predictable — you know there will be a name, description, and it's wrapped in a JSON envelope" motivates the acceptance-rate-on-structured-output win. Token-exact rejection rule (canonical form) in use, not the generalised probabilistic-match rule of systems/speculative-cascades.
- sources/2026-05-28-google-a-new-era-of-innovation-google-research-at-io-2026 — canonical wiki instance of speculative decoding in first-party Google production: Gemini 3.5 Flash (also powering Antigravity and AI Studio) accelerated by block verification + tree-structured drafting, TPU-codesigned. "Substantially faster responses with no loss in quality"; no benchmark numbers in the raw capture.
Related¶
- concepts/cascades-llm-inference
- concepts/drafter-expert-split
- concepts/token-verification
- concepts/kv-cache
- systems/speculative-cascades
- systems/eagle-3 — canonical drafter model class in industrial use.
- systems/kimi-k2-5 — canonical target model + EAGLE-3 pair in Cloudflare's stack.
- systems/workers-ai — production agentic-serving deployment.
- patterns/draft-verify-inference
- patterns/cheap-approximator-with-expensive-fallback — sibling "cheap-then-authoritative" pattern at the per-query granularity.
- concepts/llm-decoding-step — the architectural insertion point shared with factuality decoding.
- concepts/factuality-decoding — sibling decode-time category optimising for factuality instead of latency.
- systems/sled — factuality-decoding sibling at the same insertion point.
- concepts/block-verification — Google Research extension on the verifier-granularity axis (token → block); production-deployed on TPU for Gemini 3.5 Flash.
- concepts/tree-structured-drafting — Google Research extension on the drafter-output-topology axis (sequence → tree); production-deployed on TPU for Gemini 3.5 Flash.
- systems/gemini-3-5-flash — first-party Google production speculative-decoding deployment.
- systems/google-tpu — substrate co-designed with Google's speculative-decoding extensions.
- patterns/hardware-software-codesign-for-ml-serving — the codesign-with-substrate framing.