CONCEPT Cited by 4 sources

Speculative decoding

Speculative decoding is an LLM-inference latency-optimization technique. A small, fast drafter model proposes the next N tokens autoregressively, and a large, powerful expert model verifies all N in a single parallel forward pass. The expert accepts the draft up to the first token that doesn't match its own preferred output, then continues generation from that point. The expert's wall-clock cost for those tokens drops from N sequential single-token forward passes to one parallel forward pass over the draft, so when the drafter is right on a long run of tokens, the expert's compute is amortised across many accepted tokens per verify pass.
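As a minimal sketch of the draft-then-verify loop described above, the following Python uses toy lookup-table "models" in place of real networks. Every name here (`table_model`, `speculative_step`, `generate`) is hypothetical and for illustration only. The sketch covers greedy decoding only, and the Python loop that collects the expert's choices stands in for what a real system would compute in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = str
# Hypothetical stand-in for a model: maps a prefix to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def table_model(sequence: List[Token]) -> NextToken:
    """Toy model that always continues along one fixed token sequence."""
    def next_token(prefix: List[Token]) -> Token:
        return sequence[len(prefix)] if len(prefix) < len(sequence) else "<eos>"
    return next_token


def speculative_step(prefix: List[Token], drafter: NextToken,
                     expert: NextToken, n: int) -> List[Token]:
    """One draft-and-verify step; returns the tokens to append to the prefix."""
    # 1. The drafter proposes n tokens autoregressively.
    draft: List[Token] = []
    for _ in range(n):
        draft.append(drafter(prefix + draft))

    # 2. The expert's preferred token at each of the n + 1 positions.
    #    (Simulated with a loop; a real expert does this in ONE forward pass.)
    expert_choice = [expert(prefix + draft[:i]) for i in range(n + 1)]

    # 3. Accept the longest prefix of the draft that matches the expert.
    accepted = 0
    while accepted < n and draft[accepted] == expert_choice[accepted]:
        accepted += 1

    # 4. Append the expert's own token at the cut (or after a full accept),
    #    so every verify pass yields at least one token.
    return draft[:accepted] + [expert_choice[accepted]]


def generate(drafter: NextToken, expert: NextToken, n: int,
             max_tokens: int) -> Tuple[List[Token], int]:
    """Run speculative steps until max_tokens or <eos>; also count verify passes."""
    out: List[Token] = []
    passes = 0
    while len(out) < max_tokens and (not out or out[-1] != "<eos>"):
        out.extend(speculative_step(out, drafter, expert, n))
        passes += 1
    return out[:max_tokens], passes


if __name__ == "__main__":
    expert = table_model(["the", "cat", "sat", "on", "the", "mat"])
    drafter = table_model(["the", "cat", "sat", "in", "the", "mat"])
    print(generate(drafter, expert, n=4, max_tokens=6))
```

Under greedy matching the generated text is identical to what the expert would produce alone; only the number of expert passes changes.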

Why parallel verification is cheap

LLM decoding is dominated by the cost of reading the expert's weights out of GPU/TPU HBM on every forward pass, which makes single-token decode memory-bandwidth-bound, and by attention over the prefix, whose per-token cost grows with context length (quadratic over a full sequence). Verifying N drafter tokens in one pass is cheaper than N sequential passes because:

  • The weights are loaded once, not N times.
  • The KV-cache entries for all N draft positions are written in one pass on top of the already-cached prefix, not one position at a time.
  • The attention matmul is batched over N positions rather than running N separate one-position forwards.

The throughput win is bounded by the drafter's acceptance rate — if the expert rejects on token 1 nearly always, the pattern degenerates to running the expert anyway plus paying the drafter's wasted forward pass.
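To make the dependence on acceptance rate concrete, here is a small sketch under a simplifying assumption that is not from the sources above: each draft token is accepted independently with the same probability `alpha`. Under that assumption the expected number of tokens produced per verify pass (accepted draft tokens plus the one token the expert supplies itself) is the geometric sum 1 + alpha + ... + alpha^N. The function name is hypothetical.

```python
def expected_tokens_per_pass(alpha: float, n: int) -> float:
    """Expected tokens emitted per verify pass for draft length n.

    Assumes (as a simplification) i.i.d. per-token acceptance probability alpha.
    Counts the accepted draft tokens plus the expert's own token at the cut.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    if alpha == 1.0:
        return float(n + 1)  # every draft token accepted, plus one bonus token
    # Closed form of 1 + alpha + alpha**2 + ... + alpha**n
    return (1.0 - alpha ** (n + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 0.8, 0.95):
        print(alpha, [round(expected_tokens_per_pass(alpha, n), 2) for n in (1, 4, 8)])
```

At `alpha = 0` the function returns 1 for every N: the expert still emits one token per pass, and all drafter work is wasted, which is the degenerate case described above.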

The token-exact rejection rule (the canonical form)

In the canonical form of speculative decoding, rejection is token-exact: the expert compares each drafter token to its own next-token argmax (or samples under a rejection-sampling rule that preserves the expert's output distribution) and cuts the draft at the first mismatch. Google Research's 2025-09-11 post notes the structural consequence: the small model can produce a factually correct, semantically equivalent answer and still have its draft rejected because the expert's preferred first token is different. In the post's "Who is Buzz Aldrin?" worked example, the small model's answer starts "Buzz..." and the large model's starts "Edwin...", so the two disagree at the very first token and the full draft is discarded (Source: sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference).
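For the sampling case, the rejection-sampling rule usually given in the speculative decoding literature can be sketched as follows: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the expert's distribution and q the drafter's, and on rejection resample from the normalised positive part of p - q. This is a single-position sketch, not a reproduction of any cited source's code; the function names and the toy "Buzz"/"Edwin" probabilities are assumptions made for illustration.

```python
import random
from typing import Dict, Tuple

Dist = Dict[str, float]  # token -> probability; assumed to sum to 1


def verify_sampled_token(token: str, p_expert: Dist, q_drafter: Dist,
                         rng: random.Random) -> Tuple[str, bool]:
    """Accept or replace one drafted token; returns (token, was_accepted)."""
    p = p_expert.get(token, 0.0)
    q = q_drafter[token]  # > 0, because the token was sampled from q
    # Accept with probability min(1, p/q).
    if rng.random() < min(1.0, p / q):
        return token, True
    # Rejected: resample from the residual distribution max(0, p - q).
    # Its total mass is positive whenever a rejection is possible.
    residual = {t: max(0.0, pt - q_drafter.get(t, 0.0)) for t, pt in p_expert.items()}
    tokens = list(residual)
    replacement = rng.choices(tokens, weights=[residual[t] for t in tokens], k=1)[0]
    return replacement, False


def draft_and_verify(p_expert: Dist, q_drafter: Dist, rng: random.Random) -> str:
    """Sample one token from the drafter, then run the verification rule."""
    tokens = list(q_drafter)
    drafted = rng.choices(tokens, weights=[q_drafter[t] for t in tokens], k=1)[0]
    return verify_sampled_token(drafted, p_expert, q_drafter, rng)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    p = {"Buzz": 0.3, "Edwin": 0.7}  # toy expert distribution (assumed)
    q = {"Buzz": 0.8, "Edwin": 0.2}  # toy drafter distribution (assumed)
    n = 20_000
    share = sum(draft_and_verify(p, q, rng) == "Edwin" for _ in range(n)) / n
    print(f"empirical share of 'Edwin': {share:.3f} (expert probability 0.7)")
```

The point of the rule is that the emitted token follows the expert's distribution even though the proposal came from the drafter; the price is that a drafter token the expert considers less likely than the drafter does is rejected part of the time.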

"Even though the small model produced a good answer, the requirement to match the large model token-by-token forces a rejection. We lose the speed benefit and end up with an answer that is not necessarily superior." (Source: sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference)

Probabilistic-match rules — the generalisation

The same post notes that the formal paper defines a probabilistic-match rule that relaxes token-exact rejection to a distribution-aware acceptance predicate — roughly, accept when the drafter's token has high enough likelihood under the expert's distribution, not only when it's the argmax. This is the rule speculative cascades builds on. The raw markdown captured for this source references the paper for the formal statement but does not reproduce it; the wiki does not reconstruct the rule from external sources.
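Purely to illustrate the rough description above, and explicitly not the rule from the paper, the sketch below contrasts a token-exact predicate with a hypothetical likelihood-threshold predicate. The function names, the `min_prob` parameter, its default value, and the toy probabilities are all assumptions made for this example.

```python
from typing import Dict


def accepts_token_exact(token: str, p_expert: Dict[str, float]) -> bool:
    """Canonical greedy rule: accept only the expert's argmax token."""
    return token == max(p_expert, key=p_expert.get)


def accepts_with_threshold(token: str, p_expert: Dict[str, float],
                           min_prob: float = 0.25) -> bool:
    """Hypothetical relaxed rule: accept any token the expert finds likely enough.

    Illustrative only; NOT the probabilistic-match rule defined in the paper.
    """
    return p_expert.get(token, 0.0) >= min_prob


if __name__ == "__main__":
    # Toy expert distribution for the first answer token (assumed values).
    p_expert = {"Edwin": 0.55, "Buzz": 0.40, "The": 0.05}
    for token in p_expert:
        print(token, accepts_token_exact(token, p_expert),
              accepts_with_threshold(token, p_expert))
```

With these toy numbers the token-exact rule rejects "Buzz" while the threshold rule accepts it. A relaxed predicate of this kind no longer guarantees output identical to the expert's, which is the trade-off any such rule has to manage.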

Relationship to other wiki primitives

  • concepts/drafter-expert-split — the two-model architectural substrate speculative decoding sits on.
  • concepts/cascades-llm-inference — the sequential-defer cousin; cascades decide at the whole-response granularity on small-model confidence, where speculative decoding decides at the token granularity on expert-verification.
  • concepts/token-verification — the specific primitive: expert evaluates a draft of N tokens in one parallel forward pass and emits an accept / reject per position.
  • concepts/kv-cache — the structural reason parallel verification is cheaper than sequential decoding.
  • systems/speculative-cascades — the hybrid that keeps parallel verification but swaps the rejection rule.
  • patterns/draft-verify-inference — the generalised "cheap generator proposes, expensive verifier confirms" pattern at the LLM-token granularity.

Failure modes

  • Low acceptance rate: the drafter and expert disagree frequently, so the verify pass rejects at the first token most of the time and the drafter's forward-pass compute is net waste.
  • Throughput cliff on long drafts: when N is too large, the verify pass costs as much as a larger batch, yet most of that work is discarded on early rejection; optimal N depends on the empirical acceptance distribution.
  • Token-exact rejection throws away equivalents: semantically equivalent drafts are discarded on a first-token mismatch, the failure mode that motivates speculative cascades.
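The trade-offs in the list above can be explored offline with a simple cost model. The sketch below assumes a hypothetical log of `run_lengths`: for each decoding step, how many consecutive draft tokens the expert would have accepted had the draft been long. Costs are expressed in units of one single-token expert forward pass, and the `draft_cost` and `verify_cost_per_token` parameters are assumptions of this model, not measured values from any cited source.

```python
from typing import Iterable, Sequence


def tokens_per_unit_cost(run_lengths: Sequence[int], n: int, draft_cost: float,
                         verify_cost_per_token: float = 0.0) -> float:
    """Mean tokens emitted per unit of cost for draft length n.

    A value above 1.0 beats plain expert decoding (1 token per unit cost).
    """
    # With draft length n, a step keeps min(run, n) draft tokens plus the
    # expert's own token at the cut.
    mean_tokens = sum(min(run, n) + 1 for run in run_lengths) / len(run_lengths)
    # One verify pass costs 1, plus per-draft-token drafter and verifier overhead.
    step_cost = 1.0 + n * (draft_cost + verify_cost_per_token)
    return mean_tokens / step_cost


def best_draft_length(run_lengths: Sequence[int], candidates: Iterable[int],
                      draft_cost: float, verify_cost_per_token: float = 0.0) -> int:
    """Pick the candidate n with the highest tokens-per-cost under this model."""
    return max(candidates, key=lambda n: tokens_per_unit_cost(
        run_lengths, n, draft_cost, verify_cost_per_token))


if __name__ == "__main__":
    always_rejected = [0, 0, 0, 0]   # low acceptance rate
    long_runs = [8, 8, 8, 8]         # highly predictable output
    print(best_draft_length(always_rejected, range(0, 13), draft_cost=0.1))
    print(best_draft_length(long_runs, range(0, 13), draft_cost=0.1))
```

When every draft is rejected at the first token the model prefers n = 0 (no speculation), and when runs are long it prefers a draft length matching the typical run; drafting past that length only adds cost.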

Agentic workloads — where speculative decoding shines

Cloudflare's 2026-04-16 post on Workers AI's extra-large-model serving names agentic workloads as the class where speculative decoding pays off the most:

"In agentic use cases, speculative decoding really shines because of the volume of tool calls and structured outputs that models need to generate. A tool call is largely predictable — you know there will be a name, description, and it's wrapped in a JSON envelope." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

The drafter's acceptance rate is high on structurally-predictable generations: JSON envelopes, tool-call schemas, MCP response formats. Each accepted token is a full expert-pass avoided, so high-acceptance regions compound into large end-to-end speedups. For Kimi K2.5, Cloudflare uses NVIDIA's EAGLE-3 (nvidia/Kimi-K2.5-Thinking-Eagle3) as the drafter. The principal tuning dial is N = number of future tokens to draft per verify-pass; optimal N depends on the drafter/expert pair's empirical acceptance distribution.

Google's TPU-codesigned extensions — block verification + tree drafting

The 2026-05-28 Google Research I/O 2026 roundup post names two production-deployed Google extensions to speculative decoding, both implemented with TPU architecture-specific optimization:

  • Block verification (arXiv:2403.10444) — modifies the verifier granularity from per-token to per-block. The expert accepts or rejects a block of N drafted tokens jointly rather than evaluating each position independently. Increases the expected accepted-token count per verifier pass by relaxing the per-token-mismatch failure-mode of the canonical rule.
  • Tree-structured drafting — modifies the drafter's output topology from a sequence of N tokens to a tree of candidate continuations. "Intelligently explores multiple candidate continuations at once and accepts more tokens per step." The verifier processes the tree's shared prefixes in one parallel pass exploiting the KV-cache, so drafter-side breadth is bought at near-constant verifier-side cost.
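The post gives no implementation detail, so the following is only a sketch of how tree verification is commonly described in published tree-based speculative decoding work: draft nodes are flattened into one batch, an ancestor mask lets each node attend only to its own path, and the longest path matching the expert is accepted. It assumes greedy matching, and all names (`ancestor_mask`, `accept_from_tree`, `expert_root`, `expert_after`) are hypothetical; the expert's per-node predictions are passed in as plain lists in place of a real forward pass.

```python
from typing import List, Optional


def ancestor_mask(parents: List[Optional[int]]) -> List[List[bool]]:
    """mask[i][j] is True when draft node i may attend to draft node j.

    parents[i] is the index of node i's parent, or None when node i directly
    follows the committed prefix. Parents must precede their children.
    """
    size = len(parents)
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        j: Optional[int] = i
        while j is not None:  # walk from the node up to the root
            mask[i][j] = True
            j = parents[j]
    return mask


def accept_from_tree(tokens: List[str], parents: List[Optional[int]],
                     expert_root: str, expert_after: List[str]) -> List[str]:
    """Return the longest expert-matching path plus the expert's next token.

    expert_root: the expert's greedy token right after the committed prefix.
    expert_after[i]: the expert's greedy token after the path ending at node i.
    """
    accepted = [False] * len(tokens)
    depth = [0] * len(tokens)
    best: Optional[int] = None
    for i, parent in enumerate(parents):
        if parent is None:
            accepted[i], depth[i] = tokens[i] == expert_root, 1
        else:
            accepted[i] = accepted[parent] and tokens[i] == expert_after[parent]
            depth[i] = depth[parent] + 1
        if accepted[i] and (best is None or depth[i] > depth[best]):
            best = i
    # Rebuild the accepted path by walking parent pointers back to the root.
    path: List[str] = []
    node = best
    while node is not None:
        path.append(tokens[node])
        node = parents[node]
    path.reverse()
    return path + [expert_root if best is None else expert_after[best]]


if __name__ == "__main__":
    # Two candidate first tokens, each continued; toy values only.
    tokens = ["Buzz", "Edwin", "Aldrin", "Aldrin", "was"]
    parents = [None, None, 0, 1, 3]
    expert_after = ["Aldrin", "Aldrin", "is", "was", "an"]
    print(ancestor_mask(parents)[4])
    print(accept_from_tree(tokens, parents, "Edwin", expert_after))
```

In this toy tree a single-sequence draft starting with "Buzz" would have been rejected outright, whereas the tree also carries the "Edwin" branch, so three drafted tokens survive one verify pass.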

Together with the 2025-09-11 speculative cascades hybrid (which modifies the verifier's acceptance rule from token-exact to probabilistic-match), the wiki now has three distinct Google Research speculative-decoding extensions, each modifying a different axis:

| Extension | Source | Axis modified |
|---|---|---|
| systems/speculative-cascades | 2025-09-11 | Verifier acceptance rule (token-exact → probabilistic-match) |
| concepts/block-verification | 2026-05-28 | Verifier granularity (token → block) |
| concepts/tree-structured-drafting | 2026-05-28 | Drafter output topology (sequence → tree) |

All three are algorithmically composable. The 2026-05-28 post attributes the current speed of Gemini 3.5 Flash (powering Antigravity and AI Studio in addition to the Gemini consumer surface) to block verification + tree-structured drafting on TPU (Source: sources/2026-05-28-google-a-new-era-of-innovation-google-research-at-io-2026). This makes Gemini 3.5 Flash the wiki's first canonical production-LLM instance of speculative decoding deployed at hyperscale on Google's first-party stack — sibling to Workers AI's deployment of EAGLE-3 + Kimi K2.5 on a third-party stack.

The post is also explicit that the implementation is "highly optimized for Google's TPU architecture, maximizing hardware utilization to deliver substantially faster responses with no loss in quality" — making this the wiki's canonical hardware/software codesign instance at the LLM-serving layer.

