
CONCEPT Cited by 2 sources

Drafter-expert split

Drafter-expert split is the architectural primitive under both speculative decoding and cascades. Two models are co-hosted on the same LLM-serving stack: a small, fast drafter that produces candidate outputs quickly on its own compute budget, and a large, powerful expert that either verifies the drafter's output (speculative decoding / speculative cascades) or takes over from it (cascades). The roles do not depend on how the drafter is obtained (independently trained, distilled from the expert, or a pruned/quantised variant of the expert); what matters is that drafter compute ≪ expert compute and that the two share a tokenizer and vocabulary, so the expert can interpret the drafter's output without translation.
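The two expert roles can be sketched side by side. This is a minimal illustration, not any production implementation: the `NextToken` and answer-function interfaces are hypothetical, verification is greedy (exact-match) rather than the probabilistic accept/reject rule used with sampling, and the expert is called in a loop where a real system would score all draft positions in one batched pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
# Hypothetical interface: a model maps a token prefix to its greedy next token.
NextToken = Callable[[Sequence[Token]], Token]


def speculative_step(drafter: NextToken, expert: NextToken,
                     prefix: List[Token], draft_len: int) -> List[Token]:
    """Expert as verifier: one greedy speculative-decoding step."""
    # The drafter proposes draft_len tokens autoregressively.
    draft: List[Token] = []
    for _ in range(draft_len):
        draft.append(drafter(prefix + draft))

    # The expert checks each drafted position in order.
    accepted: List[Token] = []
    for tok in draft:
        expert_tok = expert(prefix + accepted)
        if expert_tok != tok:
            # First mismatch: keep the expert's token and discard the rest.
            accepted.append(expert_tok)
            return accepted
        accepted.append(tok)

    # Every drafted token matched, so the expert contributes one extra token.
    accepted.append(expert(prefix + accepted))
    return accepted


def cascade(drafter_answer: Callable[[str], Tuple[str, float]],
            expert_answer: Callable[[str], str],
            prompt: str, threshold: float) -> str:
    """Expert as fallback: defer the whole request on low drafter confidence."""
    answer, confidence = drafter_answer(prompt)
    if confidence >= threshold:
        return answer
    # Deferral: the expert re-runs the request from scratch.
    return expert_answer(prompt)
```

With greedy verification, the tokens returned by `speculative_step` are exactly what the expert would have produced on its own, which is why the drafter only affects speed in that mode. In `cascade`, by contrast, a confident drafter answer is returned without the expert ever seeing it.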

Why the split exists

LLM decoding at production scale is dominated by expert compute: per-token latency, GPU/TPU memory residency, and energy cost all scale with the expert's parameter count and context length. The drafter-expert split moves most of that work off the expert: the drafter does most of the traffic's arithmetic, and the expert intervenes on only a fraction of tokens or a fraction of requests. The saving is bounded by the expert's intervention rate, that is, how often it has to verify-and-reject (speculative decoding) or re-run from scratch (cascades).
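A back-of-the-envelope model makes the dependence on the intervention rate concrete. The sketch below is illustrative only: the function names and cost units are hypothetical, the cascade model assumes every request pays the drafter cost and deferred requests additionally pay the full expert cost, and the speculative-decoding estimate assumes each drafted token is accepted independently with the same probability.

```python
def cascade_cost(drafter_cost: float, expert_cost: float,
                 defer_rate: float) -> float:
    """Expected cost per request when deferred requests re-run on the expert."""
    return drafter_cost + defer_rate * expert_cost


def expected_tokens_per_expert_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per expert verification pass.

    Assumes i.i.d. acceptance with probability accept_rate per drafted token;
    this is the geometric sum 1 + a + a^2 + ... + a^draft_len.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Hypothetical numbers: the expert costs 10x the drafter per request.
print(cascade_cost(1.0, 10.0, defer_rate=0.2))   # cheaper than expert-only (10.0)
print(cascade_cost(1.0, 10.0, defer_rate=0.9))   # break-even with expert-only
print(expected_tokens_per_expert_pass(0.8, draft_len=4))
```

Under this model a cascade stops paying off once the defer rate reaches 1 − drafter_cost / expert_cost, and speculative decoding yields close to one token per expert pass when the acceptance rate is low, which is the "bounded by the intervention rate" point in numerical form.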

What's shared between drafter and expert

For the split to work operationally:

  • Tokenizer / vocabulary. The expert must be able to read the drafter's tokens directly; a tokenizer mismatch forces re-tokenisation, which destroys the per-token parallelism (speculative decoding) or the defer-from-scratch semantics (cascades).
  • Task alignment. The drafter should be trained or prompted to approximate the expert's outputs on the target workload; otherwise the intervention rate is high and the split loses its throughput win.
  • Deployment co-location. Both models need to be resident on the serving stack at the same time, so the drafter's weights are an additional GPU/TPU memory cost. On long-context workloads this trades against KV cache capacity.
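The co-location trade-off against KV cache capacity can be estimated with simple arithmetic. The sketch below is a rough approximation under stated assumptions: all model dimensions and memory sizes are hypothetical, activations and framework overhead are ignored, and the per-token KV size uses the usual "keys plus values, per layer, per KV head" accounting.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies (keys and values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_token_capacity(hbm_bytes: int, expert_weight_bytes: int,
                      drafter_weight_bytes: int, expert_kv_per_token: int,
                      drafter_kv_per_token: int = 0) -> int:
    """Tokens of KV cache that fit after both models' weights are resident."""
    free = hbm_bytes - expert_weight_bytes - drafter_weight_bytes
    if free <= 0:
        return 0
    return free // (expert_kv_per_token + drafter_kv_per_token)


# Hypothetical expert: 32 layers, 8 KV heads, head_dim 128, 16-bit values.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Hypothetical accelerator with 80 GiB, 16 GiB of expert weights.
without_drafter = kv_token_capacity(80 * GIB, 16 * GIB, 0, per_token)
with_drafter = kv_token_capacity(80 * GIB, 16 * GIB, 2 * GIB, per_token)
print(per_token, without_drafter, with_drafter)
```

In this toy configuration, 2 GiB of drafter weights removes room for roughly 16 thousand cached tokens; passing a non-zero `drafter_kv_per_token` would also account for the drafter keeping its own KV cache.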

What the post does not disclose

Google Research's 2025-09-11 speculative-cascades post uses the drafter-expert split as the backdrop for both baseline techniques but does not name:

Relationship to other wiki primitives

  • concepts/speculative-decoding — one of the two baseline techniques built on this split; the expert verifies a drafter-produced N-token draft in parallel.
  • concepts/cascades-llm-inference — the other baseline; the expert takes over from the drafter on low drafter confidence.
  • systems/speculative-cascades — Google Research's 2025-09-11 hybrid of the two, still on the drafter-expert substrate.
  • concepts/token-verification — the parallel-expert-over-drafter-tokens primitive used by speculative decoding and speculative cascades.
  • concepts/knowledge-distillation — a common way to produce the drafter (distil from the expert); distinct from the inference-time role the drafter plays in this split.
  • patterns/teacher-student-model-compression — the adjacent deployment pattern where distillation is used to ship a single student to a resource-constrained substrate (phone, browser) with no runtime fallback to the teacher; drafter-expert split is the opposite — both models are online.

Seen in

  • sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference — drafter and expert as the two roles in the "Who is Buzz Aldrin?" walkthrough motivating both speculative decoding and cascades, with speculative cascades keeping the same split and changing only the accept/reject rule.

  • sources/2026-02-19-lyft-scaling-localization-with-ai: task-layer instance (not inference-layer). Lyft's AI localization pipeline applies the same two-model architectural primitive at the translation-task layer: a fast non-reasoning Drafter generates N=3 candidate translations; a reasoning-focused Evaluator grades them on a 4-dimension rubric. The roles, model-tier reasoning, and savings argument ("drafter compute ≪ expert compute") are identical: the drafter does most tokens, and the expert intervenes where it counts. What differs from the speculative-decoding form is that the expert is a judge, not a verifier: it emits a rubric grade plus critique text that can feed refinement, rather than a token-level accept/reject. Lyft also articulates the self-approval-bias argument for the separation, which is implicit in the inference-layer literature but rarely named.
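The task-layer form (draft N candidates, then have a separate judge grade them) can be sketched as follows. This is a hypothetical illustration and not Lyft's implementation: the `Grade` type, the callable signatures, and the choice to pick the candidate with the highest summed rubric score are all assumptions, and the rubric dimension names are left to the caller because the source does not list them here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Grade:
    scores: Dict[str, float]  # one score per rubric dimension (names assumed)
    critique: str             # free-text feedback that could feed refinement

    @property
    def total(self) -> float:
        return sum(self.scores.values())


def draft_and_judge(source_text: str,
                    drafter: Callable[[str, int], List[str]],
                    evaluator: Callable[[str, str], Grade],
                    num_candidates: int = 3) -> Tuple[str, Grade]:
    """Drafter proposes candidates; a separate evaluator grades each one."""
    candidates = drafter(source_text, num_candidates)
    graded = [(cand, evaluator(source_text, cand)) for cand in candidates]
    # Assumption: select the candidate with the highest summed rubric score.
    return max(graded, key=lambda pair: pair[1].total)
```

Keeping `drafter` and `evaluator` as separate callables mirrors the separation argument: the model that writes a candidate is never the one that approves it, and the returned critique is available to a refinement pass if the best grade is still too low.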
