Skip to content

CONCEPT Cited by 1 source

Tree-structured drafting

Tree-structured drafting is a speculative-decoding drafter-side primitive in which the drafter emits a tree of candidate continuations rather than a single linear sequence of N drafted tokens. Multiple candidate prefix- paths share a common prefix in the verifier's parallel forward pass, so the verifier effectively evaluates several candidate sequences for the cost of (roughly) the longest path through the tree. "Intelligently explores multiple candidate continuations at once and accepts more tokens per step" (Source: sources/2026-05-28-google-a-new-era-of-innovation-google-research-at-io-2026).

Why a tree, not a sequence

Canonical speculative decoding has the drafter produce one sequence of N tokens, and the verifier accepts the longest prefix that matches its own preferences. The structural cost of being wrong about the very first token is the entire draft is discarded. Tree drafting addresses this by hedging across multiple candidate first tokens (and onward continuations from each), so the verifier's accept-decision picks the best path through the tree rather than committing to a single drafted line.

The throughput-economics argument:

  • The drafter's compute cost grows roughly as the total node count of the tree (each node is one drafter-forward).
  • The verifier's compute cost grows roughly as the maximum depth of the tree, because the verifier evaluates the tree in one parallel pass exploiting the KV cache over the shared prefixes.
  • Net: the drafter pays for the breadth (more candidate continuations explored), the verifier pays only for the depth (still one pass). When the drafter is small and the verifier is large, this is a favourable trade: spend a little more drafter compute to make the verifier-pass much more likely to accept a long prefix.

The Google I/O 2026 post pairs tree-structured drafting with block verification as the two extensions powering Gemini 3.5 Flash's current speed.

Architectural insertion point

Tree-structured drafting sits on the drafter side of the drafter-expert split. It modifies:

  • The drafter's output topology — tree-of-tokens instead of sequence-of-tokens.
  • The verifier's input shape — the verifier must process a tree-shaped prefix, with shared-prefix attention handling so the parallel pass is cheap.

It does not modify:

  • The verifier acceptance rule (token-exact or probabilistic- match — both compose with tree drafts).
  • The block-vs-token granularity ( block verification composes with tree drafts).
  • The substrate split — drafter and verifier remain two distinct models on the same inference stack.

Implementation considerations

The post does not specify the tree structure in detail (branching factor, depth, beam-width-vs-tree-width trade-off), but the literature on tree-shaped drafting has converged on a few design levers:

  • Branching factor — how many candidate continuations to explore per node. High branching favours acceptance rate; low branching favours drafter compute.
  • Tree depth — total speculation horizon per verifier pass. Deeper trees amortise the verifier's pass cost over more potentially-accepted tokens.
  • Pruning policy — drop unlikely continuations early to keep the drafter's compute bounded.
  • Verifier attention sharing — exploiting the tree's common prefixes so the verifier's KV-cache is populated once per shared subpath rather than per node.

The Google I/O 2026 raw capture doesn't decompose any of these; the implementation detail lives in the underlying papers and TPU-specific kernels.

TPU codesign

The post is explicit that the implementation is "highly optimized for Google's TPU architecture, maximizing hardware utilization." Tree-structured drafting interacts with the TPU's parallel-compute substrate in two ways:

  • Tree-shaped attention is well-suited to TPU's matrix-multiply hardware when the tree is regular (uniform branching) — the verifier's pass can be expressed as one big matmul over the tree's internal structure rather than many small matmuls.
  • Memory locality — the shared prefix in the KV-cache is reused across siblings, reducing HBM bandwidth pressure relative to verifying N independent sequences.

The codesign claim makes this a hardware/software codesign story: tree-shape choices are tuned to the TPU's compute/memory profile, not chosen purely on theoretical acceptance-rate grounds.

Composition with the Google Research family

Extension Source Axis modified
systems/speculative-cascades 2025-09-11 Verifier acceptance rule (token-exact → probabilistic-match)
concepts/block-verification 2026-05-28 Verifier granularity (token → block)
Tree-structured drafting (this page) 2026-05-28 Drafter output topology (sequence → tree)

All three are algorithmically composable. Tree drafts can be verified at block granularity with probabilistic-match rules, in principle.

Seen in

Last updated · 542 distilled / 1,571 read