PATTERN Cited by 1 source
All-layer ensemble decoding¶
Pattern¶
At the LLM decoding step, apply the transformer's final projection matrix (LM-head) to every layer's hidden state — not just the final layer's — obtaining one vocabulary distribution per layer. Combine the per-layer distributions (weighted average, contrast, or another ensemble rule) into a single decoding distribution, then sample the next token from that. No new parameters, no fine-tuning, no retrieval — purely a modified decoding function (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
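A minimal sketch of the decoding rule in plain Python. All names (`hidden_states`, `lm_head`, `weights`) are illustrative assumptions, not the SLED API; a real implementation would do the projections as one batched matmul:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def all_layer_ensemble_distribution(hidden_states, lm_head, weights):
    """Project every layer's hidden state through the shared LM head,
    softmax each into a vocabulary distribution, and weighted-average
    the distributions into one decoding distribution.

    hidden_states: L hidden vectors (one per layer), each of length d.
    lm_head: vocab_size x d matrix -- the model's final projection, reused.
    weights: L per-layer ensemble weights summing to 1.
    """
    vocab_size = len(lm_head)
    ensemble = [0.0] * vocab_size
    for h, w in zip(hidden_states, weights):
        # Reuse the final projection matrix on this layer's hidden state.
        logits = [sum(r * x for r, x in zip(row, h)) for row in lm_head]
        for i, p in enumerate(softmax(logits)):
            ensemble[i] += w * p
    return ensemble  # sample the next token from this
```

The ensemble rule here is the weighted average; swapping in a contrast rule (as DoLa does) changes only the combination step.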
When to apply¶
- The model exhibits "popular but wrong" hallucinations. The correct answer is known to the model but gets out-weighted at the final layer by training-data-frequent alternatives (LLM hallucination — Vancouver instead of Victoria, "A × B = C" instead of the discount- preserving "A × B ×").
- You have open weights and intermediate hidden states. The pattern requires access to every layer's hidden state and to the LM-head matrix. API-only / closed-weights models can't use it.
- You want factuality without retraining. The pattern is training-free; it costs only the extra per-layer LM-head projections at decode time (~4% overhead for SLED relative to DoLa; the blog doesn't publish overhead relative to the base decoder).
- You want composability. Because the pattern modifies only the decoding function, it stacks with other decoding-time interventions (temperature, top-p, other factuality decoders, and in principle speculative decoding).
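Composability follows from the ensemble producing an ordinary next-token distribution, so standard sampling knobs apply afterwards unchanged. A sketch (function and parameter names are illustrative; temperature is applied as probability re-scaling, which is equivalent to dividing logits by T since each probability is proportional to exp(logit)):

```python
def apply_temperature_and_top_p(dist, temperature=0.7, top_p=0.9):
    """Post-process an ensembled next-token distribution with the usual
    sampling knobs. dist is a list of probabilities summing to 1."""
    # Temperature: p_i ** (1/T), renormalised, equals softmax(logits / T).
    scaled = [p ** (1.0 / temperature) for p in dist]
    total = sum(scaled)
    scaled = [p / total for p in scaled]
    # Top-p (nucleus): keep the smallest high-probability token set whose
    # cumulative mass reaches top_p, then renormalise over that set.
    order = sorted(range(len(scaled)), key=lambda i: -scaled[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += scaled[i]
        if mass >= top_p:
            break
    norm = sum(scaled[i] for i in kept)
    out = [0.0] * len(scaled)
    for i in kept:
        out[i] = scaled[i] / norm
    return out
```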
When NOT to apply¶
- Hallucinations driven by corpus gaps. If the model genuinely doesn't know, no amount of layer mixing produces correct output — RAG or fine-tuning is the right lever.
- Latency-sensitive single-token serving where even 4% decode-step overhead matters (interactive chat typing-sound feedback, streaming TTS pipelines with tight deadlines).
- Throughput-sensitive serving at very large `vocab_size`, where allocating `L × vocab_size` per-layer distributions per step is memory-prohibitive. The blog doesn't quantify the memory cost; worth measuring.
- Closed-weights models. The pattern needs intermediate-layer access.
Canonical instances¶
- SLED (Google Research, NeurIPS 2024) — weighted-average across all layer distributions. "Giving more importance to some layers than others." Generalises DoLa from pairwise-contrast to full-ensemble. Up to +16 percentage points accuracy vs base / DoLa on "two challenging datasets", ~4% decoding-time overhead vs DoLa, validated on Gemma 3 / GPT-OSS / Mistral (IT + base).
- DoLa (2023, pre-SLED SOTA) — pairwise-contrast between one mature layer and one premature layer. A degenerate two-layer case of the all-layer ensemble family.
Both instances share:
- Reuse of the final projection matrix on intermediate hidden states → no new parameters.
- Training-free → frozen weights.
- Composable with each other and with other decoding-time interventions.
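DoLa's rule can be read as the two-layer special case of the same family: instead of averaging all layer distributions, it contrasts the mature layer against a premature one. A sketch of the contrast step only (names are illustrative; the actual DoLa method also selects the premature layer dynamically and applies a plausibility constraint to mask low-probability tokens):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def dola_contrast(premature_logits, mature_logits):
    """Pairwise contrast: decode from softmax(log p_mature - log p_premature).
    Tokens whose probability grows between the premature and mature layers
    get boosted relative to guesses already present early in the stack."""
    p_pre = softmax(premature_logits)
    p_mat = softmax(mature_logits)
    contrast = [math.log(pm) - math.log(pp) for pm, pp in zip(p_mat, p_pre)]
    return softmax(contrast)
```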
Related patterns¶
- Per-query two-model patterns (patterns/cheap-approximator-with-expensive-fallback, patterns/teacher-student-model-compression) — also two-scale, but at per-query granularity and with a separate second model. All-layer ensemble decoding is a single-model internal ensemble; it uses the same model's own layers as the voters.
- patterns/draft-verify-inference — per-token draft-then-verify at the model-pair granularity; orthogonal to all-layer ensemble which operates within a single model's forward pass.
Why it works (structural argument)¶
The Google Research SLED post frames the structural premise explicitly: transformer layers are specialised, and the final layer's output distribution reflects training-data frequency in a way that can overwrite context-driven correctness present in intermediate layers. The British Columbia / arithmetic worked examples illustrate the mechanism: the correct signal is in the intermediate layers, the final layer has smoothed toward the high-co-occurrence answer, and a weighted average across layers recovers the correct signal (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
This is a single-model instance of the ensemble intuition — averaging multiple "voters" reduces the variance of training-data-idiosyncratic biases while preserving shared factual signal — but implemented by reading the transformer's own intermediate layers as the voters, rather than training multiple independent models.
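A toy numeric illustration of the voter intuition — the probabilities below are made up for illustration, not taken from the blog. The final layer alone picks the frequent-but-wrong token; an unweighted average across layers recovers the correct one:

```python
# Hypothetical per-layer next-token distributions over two candidates:
# index 0 = popular-but-wrong token, index 1 = context-correct token.
layer_dists = [
    [0.30, 0.70],  # intermediate layer: context signal intact
    [0.25, 0.75],  # intermediate layer: context signal intact
    [0.60, 0.40],  # final layer: smoothed toward the frequent answer
]
weights = [1 / 3, 1 / 3, 1 / 3]  # uniform here; SLED weights layers non-uniformly

ensemble = [
    sum(w * d[i] for w, d in zip(weights, layer_dists))
    for i in range(2)
]
# Greedy decoding from the final layer alone picks token 0 (wrong);
# the ensemble's argmax flips to token 1 (correct).
```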
Seen in¶
- sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers — the canonical wiki instance. SLED is the all-layer weighted-average decoder; DoLa is the pairwise-contrast degenerate case. The post motivates the pattern via the "popular but wrong" failure mode and reports up to +16pp accuracy gain at ~4% decoding-time overhead vs DoLa.