
Drafter-expert split

Drafter-expert split is the architectural primitive under both speculative decoding and cascades: two models are co-hosted on the same LLM-serving stack. A small, fast drafter produces candidate outputs quickly on its own compute budget; a large, powerful expert either verifies the drafter (speculative decoding / speculative cascades) or takes over from it (cascades). The two roles are orthogonal to how the drafter is obtained (independently trained, distilled from the expert, or a pruned/quantised variant of it); what matters is that drafter compute ≪ expert compute and that the two share a tokenizer and vocabulary, so the expert can interpret the drafter's output without translation.
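
The shared-vocabulary requirement can be illustrated in a few lines. Everything here is a toy stand-in, not anything from the post: the two "models" are stub functions, and greedy token-id agreement stands in for real scoring.

```python
# Toy illustration of the drafter-expert split (all names hypothetical).
# Both "models" emit ids from ONE shared vocabulary, so the expert can
# consume the drafter's output directly, with no re-tokenisation step.

VOCAB = {"the": 0, "cat": 1, "sat": 2, "mat": 3, "<eos>": 4}

def drafter(prefix):
    """Small, fast model: proposes a block of candidate token ids."""
    return [VOCAB["cat"], VOCAB["sat"]]   # cheap guess

def expert(prefix):
    """Large model: the output the drafter tries to approximate."""
    return [VOCAB["cat"], VOCAB["mat"]]   # authoritative output

draft = drafter([VOCAB["the"]])
target = expert([VOCAB["the"]])
# Shared vocabulary: the expert's output is comparable token-by-token.
agree = [d == t for d, t in zip(draft, target)]
print(agree)  # -> [True, False]: first token usable, second is not
```

With a tokenizer mismatch, the comparison above would first require detokenising and re-tokenising the draft, which is exactly the translation step the split is designed to avoid.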

Why the split exists

At production scale, LLM decoding is dominated by expert compute: per-token latency, GPU/TPU memory residency, and energy cost all scale with the expert's parameter count and context length. The drafter-expert split factors that scarcity out: most of the traffic's arithmetic runs on the drafter, and the expert intervenes only on a fraction of tokens or a fraction of requests. The saving is bounded by the expert's intervention rate, i.e. how often it has to verify-and-reject (speculative decoding) or re-run from scratch (cascades).
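
The bound can be made concrete with back-of-envelope numbers. All figures below are hypothetical, and the model matches the cascade case (the expert re-runs from scratch on intervention); speculative decoding's accounting differs because verification is batched in parallel.

```python
# Back-of-envelope cost model (all numbers hypothetical).
# Effective per-token cost ≈ drafter cost + intervention_rate * expert cost.
drafter_cost = 1.0    # arbitrary units per token
expert_cost = 20.0    # drafter compute << expert compute

def effective_cost(intervention_rate):
    return drafter_cost + intervention_rate * expert_cost

baseline = expert_cost  # the expert decoding every token itself
for rate in (0.1, 0.5, 1.0):
    speedup = baseline / effective_cost(rate)
    print(f"intervention rate {rate:.0%}: speedup {speedup:.2f}x")
# At 100% intervention the split is strictly slower than the expert
# alone (you paid for the drafter and got nothing), which is why
# drafter/expert task alignment matters.
```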

What's shared between drafter and expert

For the split to work operationally:

  • Tokenizer / vocabulary. The expert must be able to read the drafter's tokens directly; a tokenizer mismatch forces re-tokenisation, which destroys the per-token parallelism (speculative decoding) or the defer-from-scratch semantics (cascades).
  • Task alignment. The drafter should be trained or prompted to approximate the expert's outputs on the target workload; otherwise the intervention rate is high and the split loses its throughput win.
  • Deployment co-location. Both models need to be resident on the serving stack at the same time; the drafter's weights are additional GPU/TPU memory cost. On long-context workloads this trades against KV cache capacity.
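
The memory trade in the last bullet can be sketched numerically. Every size and shape below is illustrative (an assumed 80 GiB accelerator, an assumed transformer geometry for the expert's KV cache), not data from the post:

```python
# Hypothetical co-location budget: drafter weights vs KV-cache capacity.
GIB = 1024 ** 3
hbm_budget = 80 * GIB        # one accelerator's memory (illustrative)
expert_weights = 60 * GIB    # large expert, already resident
drafter_weights = 4 * GIB    # e.g. a ~2B-parameter drafter in fp16

# Per-token KV-cache cost for the expert (illustrative transformer shape).
layers, kv_heads, head_dim, bytes_per = 80, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per  # K and V

free_with_drafter = hbm_budget - expert_weights - drafter_weights
tokens_lost = drafter_weights // kv_per_token
print(f"free HBM with drafter resident: {free_with_drafter / GIB:.0f} GiB")
print(f"KV-cache tokens displaced by drafter weights: {tokens_lost:,}")
```

The longer the contexts, the more those displaced KV-cache tokens matter, which is the long-context tension the bullet points at.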

What the post does not disclose

Google Research's 2025-09-11 speculative-cascades post uses the drafter-expert split as the backdrop for both baseline techniques but does not name:

Relationship to other wiki primitives

  • concepts/speculative-decoding — one of the two baseline techniques built on this split; the expert verifies a drafter-produced N-token draft in parallel.
  • concepts/cascades-llm-inference — the other baseline; the expert takes over from the drafter on low drafter confidence.
  • systems/speculative-cascades — Google Research's 2025-09-11 hybrid of the two, still on the drafter-expert substrate.
  • concepts/token-verification — the primitive of running the expert in parallel over drafter-produced tokens; used by speculative decoding and speculative cascades.
  • concepts/knowledge-distillation — a common way to produce the drafter (distil from the expert); distinct from the inference-time role the drafter plays in this split.
  • patterns/teacher-student-model-compression — the adjacent deployment pattern where distillation is used to ship a single student to a resource-constrained substrate (phone, browser) with no runtime fallback to the teacher; drafter-expert split is the opposite — both models are online.
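
The two roles the expert can play over one drafter can be contrasted in a few lines. This is a toy sketch: greedy prefix agreement stands in for speculative decoding's probabilistic acceptance rule, and the confidence threshold is arbitrary.

```python
# Toy contrast of the expert's two roles over one drafter (illustrative).

def verify(draft_tokens, expert_tokens):
    """Speculative decoding: the expert scores the whole draft in
    parallel; accept the longest prefix where the drafter matches."""
    accepted = []
    for d, e in zip(draft_tokens, expert_tokens):
        if d != e:
            break
        accepted.append(d)
    return accepted

def cascade(draft_tokens, confidences, run_expert, threshold=0.8):
    """Cascade: keep the drafter's whole answer if it is confident,
    otherwise discard it and let the expert take over from scratch."""
    if min(confidences) >= threshold:
        return draft_tokens
    return run_expert()

draft = [1, 2, 3]
print(verify(draft, [1, 2, 4]))                         # -> [1, 2]
print(cascade(draft, [0.9, 0.5, 0.9], lambda: [7, 8]))  # -> [7, 8]
```

In both modes the drafter's token ids flow into the expert-side decision unchanged, which is the shared-vocabulary property the split depends on.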
