Skip to content

PATTERN Cited by 1 source

Hybrid row decoding

Pattern

For structured outputs with variable-length segments (e.g., rows of recommendations), autoregressively decode only the first few high-value positions, then score all remaining candidates in a single forward pass. This captures the quality benefits of autoregressive conditioning where it matters most (visible/high-attention positions) while avoiding the latency cost of per-token decoding for the remainder.

Mechanism

  1. For each row, autoregressively generate the first K entities (K is small, tuned for quality vs. latency)
  2. After K entities, perform one forward pass conditioned on the generated prefix to obtain logits over all eligible remaining entities
  3. Select top-scoring entities from the batch logits, subject to the same constrained-decoding business rules
  4. Append selected entities to the row and proceed to the next row

Why it works

User attention is heavily concentrated on the first few positions of each row (left-most items on the homepage). These positions benefit most from full autoregressive context. Later positions still get quality from being conditioned on the row prefix but don't need per-step decoding.

Latency impact

Combined with domain-specific tokenization, hybrid row decoding contributed to Netflix GenPage's 20% end-to-end latency reduction vs. the multi-stage baseline (Source: sources/2026-06-29-netflix-genpage-generative-homepage-construction).

Seen in

Last updated ยท 560 distilled / 1,653 read