CONCEPT Cited by 1 source

Block verification

Block verification is a speculative-decoding verification primitive in which the expert model accepts or rejects a block of N drafted tokens jointly, rather than evaluating each token's accept/reject decision independently. The decision is taken at block granularity; the formal rule is given in arXiv:2403.10444, the Google Research block-verification algorithm that extends the speculative-decoding family.

Why it matters

In canonical speculative decoding, rejection is per-position: the verifier scans the drafted block left to right and truncates it at the first token that fails the per-token test (an argmax mismatch under greedy decoding, or a failed rejection-sampling draw under sampling). The structural failure mode is that even when the block as a whole is a fine continuation, the per-token rule discards every token after the first mismatch. Block verification reframes the acceptance question as "is this whole block, as a unit, a good continuation under the expert's distribution?" The answer can be "yes" even when an interior token would have been rejected by the per-token rule, which increases the expected number of accepted tokens per verifier pass.
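As a point of reference, the per-position baseline can be sketched in a few lines. This is a minimal illustration of the standard speculative-sampling acceptance test (accept each drafted token with probability min(1, p/q) and stop at the first rejection), not the block rule from arXiv:2403.10444. The function name, the dict-based probability tables, and the omission of the corrective resampling step after a rejection are all assumptions made to keep the sketch short.

```python
import random


def verify_per_token(draft, p_expert, q_drafter, rng=random):
    """Count how many drafted tokens the per-position rule accepts.

    draft     : list of drafted token ids
    p_expert  : p_expert[i][tok] is the expert probability of tok at position i
    q_drafter : q_drafter[i][tok] is the drafter probability of tok at position i

    Hypothetical helper for illustration; the resampling of a replacement
    token after a rejection is intentionally left out.
    """
    accepted = 0
    for i, tok in enumerate(draft):
        p = p_expert[i].get(tok, 0.0)
        q = q_drafter[i][tok]  # positive, because the drafter sampled tok
        # Accept with probability min(1, p / q); the first rejection ends the scan.
        if rng.random() < min(1.0, p / q):
            accepted += 1
        else:
            break
    return accepted


# Position 1 has expert probability 0, so the scan stops there even though
# position 2 on its own would have passed the test.
draft = [1, 2, 3]
p = [{1: 0.9}, {2: 0.0}, {3: 0.9}]
q = [{1: 0.5}, {2: 0.5}, {3: 0.5}]
print(verify_per_token(draft, p, q))  # prints 1
```

The example makes the suffix-discard behaviour concrete: the third token is never examined once the second is rejected, which is the loss a block-granularity rule is meant to reduce.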

The Google I/O 2026 post pairs block verification with tree-structured drafting as the two extensions powering Gemini 3.5 Flash's current speed:

"Our research teams developed new techniques building on [speculative decoding] — including block verification and tree-structured drafting, which intelligently explores multiple candidate continuations at once and accepts more tokens per step." (Source: sources/2026-05-28-google-a-new-era-of-innovation-google-research-at-io-2026)

Architectural insertion point

Block verification sits at the same decoding-step insertion point as the rest of the speculative-decoding family: it modifies the verifier's accept/reject logic over a parallel-evaluated draft. It does not modify:

The drafter-expert split — the two-model substrate is unchanged.
The drafter's draft-generation algorithm — block verification is drafter-agnostic; it can compose with linear drafters or tree drafters.
The KV-cache population pattern — the expert's prefix is still populated in one parallel pass.

What changes is the acceptance contract: from per-position to per-block. The block-acceptance rule is parameterised on the block size N and the matching predicate (token-exact within-block? distribution-match within-block? expected-quality threshold?). The formal specification is given in the cited paper, not reproduced in the raw post.

Composition with the Google Research family

Three Google Research speculative-decoding extensions are in the wiki, each modifying a different axis:

Extension	Source	Axis modified
systems/speculative-cascades	2025-09-11	Verifier acceptance rule (token-exact → probabilistic-match)
Block verification (this page)	2026-05-28	Verifier granularity (token → block)
concepts/tree-structured-drafting	2026-05-28	Drafter output topology (sequence → tree)

The three are algorithmically composable — block-level acceptance against probabilistic-match rules over tree-shaped drafts is a coherent design point, even if the wiki has no canonical instance specifying this full composition.
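One way to picture the three axes is as independent fields of a configuration record. The sketch below is purely illustrative: the class and member names are hypothetical, do not correspond to any Google API, and say nothing about how any production system is actually configured.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class AcceptanceRule(Enum):  # axis modified by speculative cascades
    TOKEN_EXACT = "token-exact"
    PROBABILISTIC_MATCH = "probabilistic-match"


class Granularity(Enum):  # axis modified by block verification
    TOKEN = "token"
    BLOCK = "block"


class DraftTopology(Enum):  # axis modified by tree-structured drafting
    SEQUENCE = "sequence"
    TREE = "tree"


@dataclass(frozen=True)
class DecodingConfig:
    """Hypothetical record: one choice per axis."""
    rule: AcceptanceRule
    granularity: Granularity
    topology: DraftTopology


# Every combination of the three axes is a distinct design point.
ALL_DESIGN_POINTS = [
    DecodingConfig(r, g, t)
    for r, g, t in product(AcceptanceRule, Granularity, DraftTopology)
]

CANONICAL = DecodingConfig(
    AcceptanceRule.TOKEN_EXACT, Granularity.TOKEN, DraftTopology.SEQUENCE
)
FULL_COMPOSITION = DecodingConfig(
    AcceptanceRule.PROBABILISTIC_MATCH, Granularity.BLOCK, DraftTopology.TREE
)
```

Because the axes are independent fields, the full composition is just one of the eight enumerated points. Whether a given point is practical to implement is a separate question that this sketch does not address.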

TPU codesign

The Google I/O post is explicit that the implementation is "highly optimized for Google's TPU architecture, maximizing hardware utilization." This makes block verification a hardware/software codesign story: the algorithmic choice (block-level acceptance) is co-tuned to the substrate's parallel-compute and on-chip-memory characteristics, rather than chosen purely on theoretical acceptance-rate grounds. The same algorithm on a different substrate would likely have a different optimal block size, so the "substantially faster" claim is specific to this implementation and substrate.

Failure modes (general)

The raw capture doesn't enumerate failure modes specifically for block verification, but the speculative-decoding-family failure modes apply with the obvious modifications:

Block size too large — the verify-pass is more expensive per pass; if the block-acceptance rule rejects, more drafter work is wasted.
Block size too small — the per-token-vs-per-block distinction collapses; the win over canonical speculative decoding shrinks.
Acceptance-rate-distribution-dependent — like all speculative-decoding variants, the throughput win depends on how often the drafter and expert agree at the block-acceptance granularity; agentic / structured-output workloads (where the drafter's acceptance rate is high — see the canonical-form discussion) benefit most.
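The block-size trade-off can be made concrete with a back-of-envelope model. The sketch below uses the standard expected-length formula for canonical per-token speculative decoding, (1 - α^(N+1)) / (1 - α), under the simplifying assumption that each drafted token is accepted independently with the same probability α. It does not model the block-verification rule itself, and the function names are hypothetical.

```python
def expected_tokens_per_pass(alpha, n):
    """Expected tokens emitted per verifier pass for a draft of n tokens.

    Assumes i.i.d. per-token acceptance with probability alpha under the
    canonical per-token rule; the count includes the one token the expert
    contributes on every pass.
    """
    if alpha == 1.0:
        return n + 1
    return (1 - alpha ** (n + 1)) / (1 - alpha)


def expected_wasted_draft(alpha, n):
    """Expected number of drafted tokens discarded per pass (same assumptions)."""
    accepted_draft = expected_tokens_per_pass(alpha, n) - 1
    return n - accepted_draft


if __name__ == "__main__":
    alpha = 0.8  # illustrative acceptance rate, not a measured value
    for n in (1, 2, 4, 8, 16):
        emitted = expected_tokens_per_pass(alpha, n)
        wasted = expected_wasted_draft(alpha, n)
        print(f"N={n:2d}  emitted/pass={emitted:.3f}  wasted draft={wasted:.3f}")
```

Under this model the emitted tokens per pass saturate at 1 / (1 - α) while wasted drafter work keeps growing with N, which is the "too large" failure mode in miniature. At small N the headroom any acceptance rule can recover is correspondingly small.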

Seen in

sources/2026-05-28-google-a-new-era-of-innovation-google-research-at-io-2026 — Google Research I/O 2026 roundup post names block verification (with arXiv:2403.10444 link) as one of two speculative-decoding extensions powering Gemini 3.5 Flash's current speed. Implementation "highly optimized for Google's TPU architecture." No throughput numbers, no benchmark deltas in the raw capture.

concepts/speculative-decoding — the parent technique.
concepts/tree-structured-drafting — sibling Google Research extension on the drafter axis.
concepts/drafter-expert-split — the substrate.
concepts/token-verification — the per-token primitive that block verification generalises.
concepts/llm-decoding-step — the shared architectural insertion point.
concepts/kv-cache — the structural reason parallel verification is cheap.
systems/speculative-cascades — sibling Google Research extension on the verifier-rule axis.
systems/gemini-3-5-flash — the production serving target.
systems/google-tpu — the substrate the implementation is co-designed with.
patterns/draft-verify-inference — the generalised cheap-generator / expensive-verifier pattern.