
CONCEPT Cited by 1 source

Token verification

Token verification is the serving-side primitive that speculative decoding and speculative cascades rely on. Given an N-token draft produced by a small drafter model, the large expert model processes the entire draft in one parallel forward pass and emits, per position, a decision to either accept the drafter's token or replace it with the expert's own preferred token. The primitive is load-bearing because it converts N sequential single-token expert forwards into one parallel expert forward over an N-token prefix, which is strictly cheaper on LLM-serving hardware: weights are loaded once, the KV cache is populated once, and the attention matmuls are batched.

The canonical acceptance rule — token-exact match

In the canonical form of speculative decoding, the expert's accept / reject decision at each position t is a token-exact argmax match: accept iff the drafter's token at t equals the expert's argmax given the accepted prefix draft[0..t-1]. On the first t where they differ, the expert cuts the draft, substitutes its own token at position t, and the loop continues from there.
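The argmax rule can be sketched in a few lines. This is a minimal illustration, not the post's own code: `verify_greedy` is a hypothetical name, and `expert_logits` stands for the per-position logit vectors the expert emits in its single parallel pass over the draft.

```python
import numpy as np

def verify_greedy(draft_tokens, expert_logits):
    """Token-exact verification. expert_logits[t] is the expert's logit
    vector at draft position t, computed in one parallel forward pass.
    Accept draft_tokens[t] iff it equals the expert's argmax; on the
    first mismatch, substitute the expert's token and stop."""
    accepted = []
    for t, tok in enumerate(draft_tokens):
        expert_tok = int(np.argmax(expert_logits[t]))
        if tok == expert_tok:
            accepted.append(tok)          # draft token matches: keep it
        else:
            accepted.append(expert_tok)   # expert's correction at position t
            break                         # draft is cut here
    return accepted
```

The loop then resumes drafting from the position after the last emitted token, exactly as described above.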

The 2025-09-11 Google Research post illustrates this rule's failure mode: the drafter can produce a semantically equivalent alternative that the expert rejects because its own argmax is a different token (the "Buzz Aldrin" vs "Edwin Aldrin" worked example). Token-exact match is rigid, but it has one redeeming property: under standard rejection-sampling discipline, it preserves the expert's output distribution, so quality is provably at least as good as running the expert alone (Source: sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference).
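The distribution-preserving discipline referenced above is the standard speculative-sampling acceptance test. The sketch below shows it for a single position, assuming the drafter's distribution `q` and the expert's distribution `p` over the vocabulary are both in hand; the function name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_one(token, q, p):
    """One position of the standard speculative-sampling acceptance test.
    q: drafter's distribution over the vocab, p: expert's distribution.
    Accept the drafted token with probability min(1, p[token]/q[token]);
    on rejection, resample from the residual max(0, p - q), renormalised.
    Either way, the emitted token is distributed exactly according to p."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True                               # draft accepted
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False    # expert substitute
```

Applying this test position by position along the draft, and stopping at the first rejection, yields a verification loop whose output is indistinguishable from sampling the expert alone.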

Generalising to probabilistic-match rules

Speculative cascades relax the acceptance rule to a probabilistic match: accept the drafter's token when it is sufficiently likely under the expert's distribution, not only when it is the expert's argmax. The intent is to keep semantically equivalent drafts that the argmax rule would throw away. The Google Research post mentions this rule but defers its formal specification to the linked paper, which is not present in the scraped raw source.
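Since the post defers the formal rule to the paper, the following is only an illustrative threshold variant, not the paper's criterion; the function name and the `tau` parameter are assumptions. The point it demonstrates is that a draft token can clear the bar even when it is not the expert's argmax.

```python
import numpy as np

def verify_threshold(draft_tokens, expert_probs, tau=0.3):
    """Illustrative probabilistic-match rule (an assumed stand-in for the
    paper's criterion): accept draft token t when its probability under
    the expert's distribution is at least tau; otherwise substitute the
    expert's argmax and cut the draft there."""
    accepted = []
    for t, tok in enumerate(draft_tokens):
        if expert_probs[t][tok] >= tau:
            accepted.append(tok)                          # likely enough: keep
        else:
            accepted.append(int(np.argmax(expert_probs[t])))
            break
    return accepted
```

With `tau=0.3`, a drafted token holding 40% expert probability survives even when the expert's argmax holds 50%, which is exactly the "Buzz Aldrin" vs "Edwin Aldrin" case the argmax rule loses.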

Why parallel verification is cheap — the KV-cache angle

The expert's forward pass over an N-token prefix populates its KV cache for all N positions in one pass. If the verifier accepts all N, the expert has produced those N tokens' attention state for free, and generation can continue from position N+1 with the KV cache fully warm. If it rejects at position k, the cached K/V for positions 0..k−1 are still reusable (they matched the expert's own computation path), and the expert resumes generation from k with most of its attention work already done. This is why speculative decoding's per-token amortised cost can be strictly lower than the expert alone even at modest acceptance rates.
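A toy cost model (an assumption for illustration, not from the post) makes the amortisation concrete. If each position is accepted independently with probability alpha, a draft-then-verify round emits the accepted prefix plus one expert token, all for a single parallel expert forward:

```python
def expected_tokens_per_round(alpha, n):
    """Expected tokens emitted per verification round under an i.i.d.
    acceptance model. The emitted count is (accepted prefix length) + 1,
    and P(first k positions all accepted) = alpha**k, so the expectation
    telescopes to (1 - alpha**(n+1)) / (1 - alpha).
    alpha: per-position acceptance probability; n: draft length."""
    if alpha == 1.0:
        return float(n + 1)
    return (1 - alpha ** (n + 1)) / (1 - alpha)
```

Since each round costs one parallel expert forward, the expert's amortised cost per emitted token is roughly `1 / expected_tokens_per_round(alpha, n)`, which drops below one expert forward per token as soon as alpha exceeds zero.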

