
PATTERN Cited by 1 source

Draft-verify inference

Draft-verify inference is the generalised pattern behind speculative decoding and speculative cascades: a cheap generator proposes a block of output, an expensive verifier checks the proposal in a single parallel pass, and an acceptance rule decides how much of the proposal to keep. The generator pays for most of the tokens; the verifier intervenes only on the positions the acceptance rule rejects. At LLM-token granularity this is the drafter-expert shape; the same pattern shows up in compilers, databases, and ML-for-systems work where a cheap oracle produces candidate outputs and an authoritative one validates them.

Intent

The verifier's cost usually dominates the system (large LLM expert, authoritative solver, expensive physical simulator). Running it autoregressively or per-query makes the total cost proportional to the verifier's work. The pattern replaces that with the generator's cost plus verifier work that scales with the rejection rate: if the generator is right most of the time, the amortised cost collapses toward the generator's alone, and the verifier's compute is reserved for the minority of positions where the generator is actually wrong.
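A back-of-envelope model makes the amortisation concrete. All symbols here (`c_gen`, `c_ver_pass`, `accept_rate`, the numbers in the example) are illustrative, not measurements from any real system:

```python
def cost_per_token(n_draft, c_gen, c_ver_pass, accept_rate):
    """Expected cost per emitted token for one draft-verify round.

    One round drafts n_draft tokens (n_draft * c_gen), verifies them in a
    single parallel pass (c_ver_pass), and emits roughly
    accept_rate * n_draft accepted tokens plus one verifier correction.
    """
    emitted = accept_rate * n_draft + 1  # accepted prefix + spliced token
    return (n_draft * c_gen + c_ver_pass) / emitted

# Baseline: running the verifier autoregressively costs one full pass
# per token (here 10.0 units).  With a cheap drafter (1.0 per token)
# and an 80% per-token acceptance rate, the hybrid pays ~2.4 per token.
baseline = 10.0
hybrid = cost_per_token(n_draft=8, c_gen=1.0, c_ver_pass=10.0, accept_rate=0.8)
print(hybrid, baseline)
```

As `accept_rate` falls toward zero the formula degenerates to paying the full verifier pass for a single emitted token, which is the "when it doesn't fit" regime below.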

Mechanism

  1. Cheap generator produces an N-item proposal (N tokens, N instructions, N rows, N routing decisions).
  2. Verifier processes the whole proposal in one parallel pass, emitting a per-position accept/reject decision against its own preferred output or its own cost model.
  3. Keep the accepted prefix; splice in the verifier's preferred output at the first rejection; continue.
  4. Optional: log the rejected positions as training signal to improve the generator for next round.
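The four steps above can be sketched as a loop. `draft` and `verify_parallel` are stand-ins (random toys, not any particular model or library API); the acceptance rule here is the exact-match one for simplicity:

```python
import random

random.seed(0)
VOCAB = list(range(50))

def draft(prefix, n):
    """Step 1 — cheap generator: propose n next items (stand-in)."""
    return [random.choice(VOCAB) for _ in range(n)]

def verify_parallel(prefix, proposal):
    """Step 2 — expensive verifier: one parallel pass over the whole
    proposal, returning its preferred item at every position (stand-in)."""
    return [random.choice(VOCAB) for _ in proposal]

def draft_verify_step(prefix, n=4, rejected_log=None):
    proposal = draft(prefix, n)
    preferred = verify_parallel(prefix, proposal)
    out = list(prefix)
    for got, want in zip(proposal, preferred):
        if got == want:          # step 3: accept while generator matches
            out.append(got)
        else:                    # first reject: splice verifier's choice, stop
            out.append(want)
            if rejected_log is not None:
                rejected_log.append((len(out) - 1, got, want))  # step 4
            break
    return out
```

One call emits between 1 and `n` new items; the caller loops `draft_verify_step` until an end condition, which is the "continue" in step 3.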

Why parallel verification is the load-bearing primitive

Re-running the verifier sequentially discards the generator's work on every call (the failure mode of cascades that defer on low confidence). The pattern requires a verifier that can process N positions in parallel with cost closer to one pass than N passes:

  • For LLMs, this is the KV-cache parallel-populate property — one prefix scan produces the attention state for every position.
  • For compilers, this is parallel dead-code / typecheck passes over a generated IR block.
  • For query optimisers, this is parallel cost-estimation over N candidate plans.

Without this property, the pattern degenerates to the sequential cascade and loses its structural advantage.
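A minimal illustration of why one pass suffices: teacher-forcing the verifier on the draft gives it, at position i, exactly the context it would have had after accepting positions 0..i-1. The deterministic toy `preferred_next` below stands in for a model; a real LLM computes all positions in a single forward pass rather than the loop shown:

```python
def preferred_next(context):
    """Stand-in deterministic 'model': prefers sum of context mod 10."""
    return sum(context) % 10

def verifier_pass(prefix, draft):
    """One 'parallel' pass: the verifier's preferred item at every draft
    position, conditioned on the draft's own earlier positions.  A real
    transformer gets all of these from one teacher-forced forward pass
    (the KV-cache parallel-populate property); the toy recomputes them
    to show what that pass must produce."""
    return [preferred_next(prefix + draft[:i]) for i in range(len(draft))]

draft = [3, 4]
preferred = verifier_pass([1, 2], draft)   # [3, 6]
accepted = 0
while accepted < len(draft) and draft[accepted] == preferred[accepted]:
    accepted += 1
print(accepted, preferred)  # 1 [3, 6]: keep draft[0], splice preferred[1]
```

The key point is that the rejected tail (`preferred[1]` onward) was computed speculatively but costs nothing extra, because the pass was paid for once.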

The acceptance rule is the tuning knob

Two reference points from the 2025-09-11 Google Research post:

  • Token-exact match (canonical speculative decoding): accept iff the generator's output equals the verifier's argmax. Has a distributional-equivalence property (the output distribution matches what the verifier alone would produce) but discards semantically-equivalent drafts.
  • Probabilistic match (speculative cascades, specification in the paper): accept when the generator's output is likely enough under the verifier's distribution. Recovers the semantic-equivalence cases at the cost of giving up exact distribution preservation (Source: sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference).

The rule is where the pattern's speed/quality trade-off lives. Stricter rule → fewer accepts → slower and more expensive; looser rule → more accepts → faster and cheaper, but more quality risk.
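The two reference rules can be written side by side. The toy distribution, the token names, and the `threshold` value are all illustrative; the probabilistic rule is a threshold-style sketch of "likely enough under the verifier's distribution", not the paper's exact specification:

```python
def accept_exact(draft_tok, verifier_dist):
    """Token-exact match: keep the draft item only if it equals the
    verifier's argmax."""
    return draft_tok == max(verifier_dist, key=verifier_dist.get)

def accept_probabilistic(draft_tok, verifier_dist, threshold=0.3):
    """Probabilistic match (sketch): keep the draft item if the verifier
    assigns it enough probability, even when it is not the argmax.
    `threshold` is an illustrative knob, not a value from the paper."""
    return verifier_dist.get(draft_tok, 0.0) >= threshold

# Two semantically-close candidates under the verifier's distribution:
dist = {"sofa": 0.55, "couch": 0.40, "table": 0.05}
print(accept_exact("couch", dist))          # False: not the argmax
print(accept_probabilistic("couch", dist))  # True: likely enough
```

The "couch" case is the semantic-equivalence recovery the post motivates: token-exact rejects a draft the verifier itself considers nearly as good, paying a verifier splice for no quality gain.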

Canonical wiki instance

Google Research's speculative cascades (2025-09-11) — drafter LLM proposes N tokens, expert LLM verifies them in one parallel forward pass, a probabilistic-match acceptance rule keeps semantically-equivalent drafts that token-exact speculative decoding would reject, all on the same drafter-expert split. The Google post walks through the failure modes of the two baseline techniques (sequential cascades + token-exact speculative decoding) and positions the hybrid as strictly dominating both on the speed/flexibility frontier (Source: sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference).

Adjacent patterns

  • patterns/cheap-approximator-with-expensive-fallback — the per-query-granularity sibling. Cheap ML approximator serves; the expensive authoritative solver takes over on high-uncertainty queries. Draft-verify is the per-token variant of the same economic shape, with a per-position acceptance rule instead of a whole-query fallback switch.
  • patterns/teacher-student-model-compression — different two-model shape. Teacher offline, student on the serving substrate, no runtime fallback because the substrate (phone, browser) can't reach the teacher at request time. Draft-verify is the opposite deployment envelope: both generator and verifier are online.
  • patterns/post-inference-verification — more general "generator produces, verifier checks" shape, applied at whole-output-correctness granularity rather than per-token acceptance; the verifier is an automated-reasoning engine in that pattern, an LLM in this one.

When it fits

  • Cheap generator exists on the same substrate as the expensive verifier (same tokenizer, same deployment).
  • Verifier admits a parallel-over-N-positions pass (KV-cache-shaped, or a SIMD/GPU-friendly cost-model pass).
  • Acceptance rate is high on realistic traffic — generator and verifier agree often enough for the amortised saving to be real.

When it doesn't

  • Adversarial distributions where the generator and verifier disagree on most positions → verifier fires on every token anyway + generator's compute is wasted.
  • Verifier without parallel-pass property — the pattern collapses to a plain cascade.
  • Latency-insensitive batch jobs — the whole motivation is wall-clock latency; throughput-bound offline jobs don't need the hybrid.

Seen in
