CONCEPT Cited by 1 source
Token verification¶
Token verification is the serving-side primitive that speculative decoding and speculative cascades rely on: given an N-token draft produced by a small drafter model, the large expert model processes the entire draft in one parallel forward pass and emits, per position, a decision to accept the drafter's token or replace it with the expert's own preferred token. The primitive is load-bearing because it converts N sequential single-token expert forwards into one parallel expert forward over an N-token prefix, which is strictly cheaper on LLM-serving hardware: weights are loaded once, the KV cache is populated once, and the attention matmuls are batched.
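A back-of-envelope latency model makes the "strictly cheaper" claim concrete. The constants below are illustrative assumptions (not measurements from the source): single-token decode is memory-bound, so each expert forward costs roughly one full weight load plus a small per-position compute term.

```python
# Illustrative latency model for memory-bound LLM decoding.
# Both constants are assumed numbers for the sketch, not measured values.
WEIGHT_LOAD_MS = 20.0        # assumed time to stream expert weights once
PER_TOKEN_COMPUTE_MS = 0.5   # assumed marginal compute per position

def sequential_cost_ms(n: int) -> float:
    """n single-token expert forwards: weights reloaded every step."""
    return n * (WEIGHT_LOAD_MS + PER_TOKEN_COMPUTE_MS)

def parallel_verify_cost_ms(n: int) -> float:
    """One expert forward over an n-token draft: weights loaded once,
    the n positions share the pass."""
    return WEIGHT_LOAD_MS + n * PER_TOKEN_COMPUTE_MS
```

Under these assumptions, verifying an 8-token draft costs about 24 ms against 164 ms for eight sequential expert steps; the gap is the weight-load term paid once instead of N times.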
The canonical acceptance rule — token-exact match¶
In the canonical form of speculative decoding, the expert's accept/reject decision at each position t is a token-exact argmax match: accept iff the drafter's token at t equals the expert's argmax given the accepted prefix draft[0..t-1]. At the first t where they differ, the expert cuts the draft, substitutes its own token at position t, and the loop continues from there.
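The rule above can be sketched as a single parallel expert forward followed by a greedy scan. The toy `expert_logits` below is a stand-in for a real causal LM (a hash-seeded random projection, my own illustration), chosen only because a causal model's logits at position t depend on tokens 0..t and nothing later:

```python
import numpy as np

VOCAB = 50

def expert_logits(tokens):
    """Stand-in for one parallel expert forward: logits for every
    position of the input at once. Mimics causality: the logits at
    position t are a function of tokens[0..t] only."""
    out = []
    for t in range(len(tokens)):
        r = np.random.default_rng(hash(tuple(tokens[: t + 1])) % (2**32))
        out.append(r.standard_normal(VOCAB))
    return np.stack(out)

def verify_token_exact(prefix, draft):
    """Canonical token-exact verification: one expert forward over
    prefix + draft, then accept greedily until the first argmax
    mismatch. Returns (accepted tokens, correction token or None)."""
    logits = expert_logits(prefix + draft)      # one parallel pass
    accepted = []
    for i, tok in enumerate(draft):
        # logits at position len(prefix)-1+i predict draft[i]
        expert_tok = int(np.argmax(logits[len(prefix) - 1 + i]))
        if tok == expert_tok:
            accepted.append(tok)
        else:
            return accepted, expert_tok         # cut draft, substitute
    return accepted, None                       # whole draft kept
```

A draft generated by following the expert's own argmax is accepted in full; perturbing one token makes verification cut at that position and emit the expert's substitute.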
The 2025-09-11 Google Research post illustrates this rule's failure mode: the drafter can produce a semantically equivalent alternative that the expert rejects because its argmax is a different token (the "Buzz Aldrin" vs "Edwin Aldrin" worked example). Token-exact match is rigid but has one redeeming property: under standard rejection-sampling discipline it preserves the expert's output distribution, so quality is provably at least as good as running the expert alone (Source: sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference).
Generalising to probabilistic-match rules¶
Speculative cascades relaxes the acceptance rule to a probabilistic match: the drafter's token is accepted when it has high enough likelihood under the expert's distribution, not only when it is the argmax. The intent is to keep semantically equivalent drafts that the argmax rule would throw away. The Google Research post mentions this rule but defers its formal specification to the linked paper, which is not in the scraped raw.
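Since the post does not give the formal rule, the sketch below uses a hypothetical threshold criterion purely to show the shape of the relaxation: the drafter's token passes if the expert assigns it at least some minimum probability, even when it is not the argmax.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def accept_probabilistic(draft_token, expert_logits_t, threshold=0.1):
    """Hypothetical probabilistic-match rule (the actual rule is in the
    linked paper, not the post): accept the drafter's token iff the
    expert gives it at least `threshold` probability mass."""
    p = softmax(expert_logits_t)
    return p[draft_token] >= threshold
```

A token sitting just below the argmax in the expert's distribution, which token-exact matching would discard, survives under this kind of rule; a genuinely unlikely token is still rejected.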
Why parallel verification is cheap — the KV-cache angle¶
The expert's forward pass over an N-token prefix populates its KV cache for all N positions in one pass. If the verifier accepts all N, the expert has produced those N tokens' attention state for free, and generation can continue from position N+1 with the KV cache fully warm. If it rejects at position k, the cached K/V for positions 0..k−1 are still reusable (they matched the expert's own computation path), and the expert resumes generation from k with most of its attention work already done. This is why speculative decoding's per-token amortised cost can be strictly lower than running the expert alone, even at modest acceptance rates.
Relationship to other wiki primitives¶
- concepts/speculative-decoding — the consumer of the primitive; its canonical token-exact rule is one acceptance policy.
- concepts/drafter-expert-split — the architectural substrate token verification runs on.
- systems/speculative-cascades — consumes the same primitive with a generalised (probabilistic-match) acceptance rule.
- concepts/kv-cache — the per-layer, per-token K/V store whose parallel-population property is what makes parallel verification cheap.
- patterns/draft-verify-inference — the generalised pattern, at the LLM-token granularity.
Seen in¶
- sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference — canonical wiki articulation of parallel token verification as a reusable primitive, the token-exact canonical rejection rule, and its generalisation via probabilistic matching in speculative cascades.