PATTERN Cited by 1 source

All-layer ensemble decoding

Pattern

At the LLM decoding step, apply the transformer's final projection matrix (LM-head) to every layer's hidden state — not just the final layer's — obtaining one vocabulary distribution per layer. Combine the per-layer distributions (weighted average, contrast, or another ensemble rule) into a single decoding distribution, then sample the next token from that. No new parameters, no fine-tuning, no retrieval — purely a modified decoding function (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
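A minimal, framework-free sketch of the weighted-average variant (toy sizes; the layer weights, hidden states, and LM-head values here are hypothetical — a real implementation would read each layer's hidden state from the model's forward pass and reuse its actual LM-head matrix):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(W, h):
    # Project a hidden state h through the LM-head rows W -> vocab logits.
    return [sum(w_i * h_i for w_i, h_i in zip(row, h)) for row in W]

def ensemble_decode(hidden_states, lm_head, layer_weights):
    """One vocabulary distribution per layer, combined by weighted average."""
    dists = [softmax(matvec(lm_head, h)) for h in hidden_states]
    vocab = len(lm_head)
    return [sum(w * d[t] for w, d in zip(layer_weights, dists))
            for t in range(vocab)]

# Toy setup: 3 layers, hidden size 2, vocab size 3 (all numbers hypothetical).
lm_head = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # vocab x hidden, shared by all layers
hiddens = [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]   # one hidden state per layer
weights = [0.2, 0.4, 0.4]                         # per-layer weights, sum to 1
dist = ensemble_decode(hiddens, lm_head, weights)  # sample the next token from this
```

The only moving part is the ensemble rule on the last line of `ensemble_decode`; swapping the weighted sum for a pairwise contrast recovers DoLa-style decoding.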

When to apply

  • The model exhibits "popular but wrong" hallucinations. The correct answer is known to the model but gets out-weighted at the final layer by training-data-frequent alternatives (Vancouver instead of Victoria, "A × B = C" instead of the discount-preserving "A × B ×").
  • You have open weights and intermediate hidden states. The pattern requires access to every layer's hidden state and to the LM-head matrix. API-only / closed-weights models can't use it.
  • You want factuality without retraining. The pattern is training-free; it costs only the extra per-layer LM-head projections at decode time (~4% for SLED vs DoLa; the blog doesn't publish the overhead vs the base decoder).
  • You want composability. Because the pattern modifies only the decoding function, it stacks with other decoding-time interventions (temperature, top-p, other factuality decoders, and in principle speculative decoding).

When NOT to apply

  • Hallucinations driven by corpus gaps. If the model genuinely doesn't know, no amount of layer mixing produces correct output — RAG or fine-tuning is the right lever.
  • Latency-sensitive single-token serving where even a ~4% decode-step overhead matters (interactive chat with per-token streaming feedback, streaming TTS pipelines with tight deadlines).
  • Throughput-sensitive serving at very large vocab_size, where allocating L × vocab_size per-layer distributions per step may be memory-prohibitive. The blog doesn't quantify the memory cost; worth measuring.
  • Closed-weights models. Pattern needs intermediate-layer access.
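The L × vocab_size memory concern above is easy to bound with back-of-envelope arithmetic (the model dimensions below are hypothetical, chosen only to show the order of magnitude):

```python
def per_step_distribution_bytes(num_layers, vocab_size, bytes_per_float=4):
    """Memory to hold one full-precision vocab distribution per layer,
    per decode step (before any reduction into the ensemble)."""
    return num_layers * vocab_size * bytes_per_float

# Hypothetical mid-size model: 32 layers, 256k vocabulary, fp32 distributions.
mib = per_step_distribution_bytes(32, 256_000) / (1024 ** 2)  # ~31 MiB per step
```

Per-sequence this is modest, but at high batch sizes it multiplies; an implementation can also reduce layers into a running weighted sum instead of materialising all L distributions at once.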

Canonical instances

  • SLED (Google Research, NeurIPS 2024) — weighted-average across all layer distributions. "Giving more importance to some layers than others." Generalises DoLa from pairwise-contrast to full-ensemble. Up to +16 percentage points accuracy vs base / DoLa on "two challenging datasets", ~4% decoding-time overhead vs DoLa, validated on Gemma 3 / GPT-OSS / Mistral (IT + base).
  • DoLa (2023, pre-SLED SOTA) — pairwise-contrast between one mature layer and one premature layer. A degenerate two-layer case of the all-layer ensemble family.
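The DoLa-style pairwise contrast can be sketched in the same toy setting (this is a sketch of the contrast rule only — boost tokens whose probability grew from the premature to the mature layer — not a faithful reimplementation of the paper, and the example distributions are hypothetical):

```python
import math

def contrast_decode(mature_dist, premature_dist, eps=1e-12):
    """Pairwise contrast between one mature and one premature layer:
    score each token by log p_mature - log p_premature, then renormalise."""
    scores = [math.log(m + eps) - math.log(p + eps)
              for m, p in zip(mature_dist, premature_dist)]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical distributions over a 3-token vocab.
premature = [0.5, 0.3, 0.2]
mature    = [0.4, 0.5, 0.1]
dist = contrast_decode(mature, premature)  # token 1 gained the most, so it wins
```

Viewed this way, DoLa is the two-layer, contrast-rule member of the same family in which SLED is the all-layer, weighted-average member.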

Both instances share:

  • Reuse of the final projection matrix on intermediate hidden states → no new parameters.
  • Training-free → frozen weights.
  • Composable with each other and with other decoding-time interventions.

Why it works (structural argument)

The Google Research SLED post frames the structural premise explicitly: transformer layers are specialised, and the final layer's output distribution reflects training-data frequency in a way that can overwrite context-driven correctness present in intermediate layers. The British Columbia / arithmetic worked examples illustrate the mechanism: the correct signal is in the intermediate layers, the final layer has smoothed toward the high-co-occurrence answer, and a weighted average across layers recovers the correct signal (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

This is a single-model instance of the ensemble intuition — averaging multiple "voters" reduces the variance of training-data-idiosyncratic biases while preserving shared factual signal — but implemented by reading the transformer's own intermediate layers as the voters, rather than training multiple independent models.
