PATTERN Cited by 1 source
Hybrid row decoding¶
Pattern¶
For structured outputs with variable-length segments (e.g., rows of recommendations), autoregressively decode only the first few high-value positions, then score all remaining candidates in a single forward pass. This captures the quality benefits of autoregressive conditioning where it matters most (visible/high-attention positions) while avoiding the latency cost of per-token decoding for the remainder.
Mechanism¶
- For each row, autoregressively generate the first K entities (K is small, tuned for quality vs. latency)
- After K entities, perform one forward pass conditioned on the generated prefix to obtain logits over all eligible remaining entities
- Select top-scoring entities from the batch logits, subject to the same constrained-decoding business rules
- Append selected entities to the row and proceed to the next row
Why it works¶
User attention is heavily concentrated on the first few positions of each row (left-most items on the homepage). These positions benefit most from full autoregressive context. Later positions still get quality from being conditioned on the row prefix but don't need per-step decoding.
Latency impact¶
Combined with domain-specific tokenization, hybrid row decoding contributed to Netflix GenPage's 20% end-to-end latency reduction vs. the multi-stage baseline (Source: sources/2026-06-29-netflix-genpage-generative-homepage-construction).