PATTERN Cited by 1 source

All-layer ensemble decoding

Pattern

At the LLM decoding step, apply the transformer's final projection matrix (LM-head) to every layer's hidden state — not just the final layer's — obtaining one vocabulary distribution per layer. Combine the per-layer distributions (weighted average, contrast, or another ensemble rule) into a single decoding distribution, then sample the next token from that. No new parameters, no fine-tuning, no retrieval — purely a modified decoding function (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
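A minimal, framework-free sketch of the weighted-average variant (toy sizes; the layer weights, hidden states, and LM-head values here are hypothetical — a real implementation would read each layer's hidden state from the model's forward pass and reuse its actual LM-head matrix):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(W, h):
    # Project a hidden state h through the LM-head rows W -> vocab logits.
    return [sum(w_i * h_i for w_i, h_i in zip(row, h)) for row in W]

def ensemble_decode(hidden_states, lm_head, layer_weights):
    """One vocabulary distribution per layer, combined by weighted average."""
    dists = [softmax(matvec(lm_head, h)) for h in hidden_states]
    vocab = len(lm_head)
    return [sum(w * d[t] for w, d in zip(layer_weights, dists))
            for t in range(vocab)]

# Toy setup: 3 layers, hidden size 2, vocab size 3 (all numbers hypothetical).
lm_head = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # vocab x hidden, shared by all layers
hiddens = [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]   # one hidden state per layer
weights = [0.2, 0.4, 0.4]                         # per-layer weights, sum to 1
dist = ensemble_decode(hiddens, lm_head, weights)  # sample the next token from this
```

The only moving part is the ensemble rule on the last line of `ensemble_decode`; swapping the weighted sum for a pairwise contrast recovers DoLa-style decoding.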

When to apply

  • The model exhibits "popular but wrong" hallucinations. The correct answer is known to the model but gets out-weighted at the final layer by training-data-frequent alternatives (Vancouver instead of Victoria, "A × B = C" instead of the discount-preserving "A × B ×").
  • You have open weights and intermediate hidden states. The pattern requires access to every layer's hidden state and to the LM-head matrix. API-only / closed-weights models can't use it.
  • You want factuality without retraining. The pattern is training-free; it costs only the extra per-layer LM-head projections at decode time (~4% for SLED vs DoLa; the blog doesn't publish the overhead vs the base decoder).
  • You want composability. Because the pattern modifies only the decoding function, it stacks with other decoding-time interventions (temperature, top-p, other factuality decoders, and in principle speculative decoding).

When NOT to apply

  • Hallucinations driven by corpus gaps. If the model genuinely doesn't know, no amount of layer mixing produces correct output — RAG or fine-tuning is the right lever.
  • Latency-sensitive single-token serving where even a ~4% decode-step overhead matters (interactive chat with per-token streaming feedback, streaming TTS pipelines with tight deadlines).
  • Throughput-sensitive serving at very large vocab_size, where allocating L × vocab_size per-layer distributions per step may be memory-prohibitive. The blog doesn't quantify the memory cost; worth measuring.
  • Closed-weights models. Pattern needs intermediate-layer access.
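The L × vocab_size memory concern above is easy to bound with back-of-envelope arithmetic (the model dimensions below are hypothetical, chosen only to show the order of magnitude):

```python
def per_step_distribution_bytes(num_layers, vocab_size, bytes_per_float=4):
    """Memory to hold one full-precision vocab distribution per layer,
    per decode step (before any reduction into the ensemble)."""
    return num_layers * vocab_size * bytes_per_float

# Hypothetical mid-size model: 32 layers, 256k vocabulary, fp32 distributions.
mib = per_step_distribution_bytes(32, 256_000) / (1024 ** 2)  # ~31 MiB per step
```

Per-sequence this is modest, but at high batch sizes it multiplies; an implementation can also reduce layers into a running weighted sum instead of materialising all L distributions at once.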

Canonical instances

  • SLED (Google Research, NeurIPS 2024) — weighted-average across all layer distributions. "Giving more importance to some layers than others." Generalises DoLa from pairwise-contrast to full-ensemble. Up to +16 percentage points accuracy vs base / DoLa on "two challenging datasets", ~4% decoding-time overhead vs DoLa, validated on Gemma 3 / GPT-OSS / Mistral (IT + base).
  • DoLa (2023, pre-SLED SOTA) — pairwise-contrast between one mature layer and one premature layer. A degenerate two-layer case of the all-layer ensemble family.
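The DoLa-style pairwise contrast can be sketched in the same toy setting (this is a sketch of the contrast rule only — boost tokens whose probability grew from the premature to the mature layer — not a faithful reimplementation of the paper, and the example distributions are hypothetical):

```python
import math

def contrast_decode(mature_dist, premature_dist, eps=1e-12):
    """Pairwise contrast between one mature and one premature layer:
    score each token by log p_mature - log p_premature, then renormalise."""
    scores = [math.log(m + eps) - math.log(p + eps)
              for m, p in zip(mature_dist, premature_dist)]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical distributions over a 3-token vocab.
premature = [0.5, 0.3, 0.2]
mature    = [0.4, 0.5, 0.1]
dist = contrast_decode(mature, premature)  # token 1 gained the most, so it wins
```

Viewed this way, DoLa is the two-layer, contrast-rule member of the same family in which SLED is the all-layer, weighted-average member.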

Both instances share:

  • Reuse of the final projection matrix on intermediate hidden states → no new parameters.
  • Training-free → frozen weights.
  • Composable with each other and with other decoding-time interventions.

Why it works (structural argument)

The Google Research SLED post frames the structural premise explicitly: transformer layers are specialised, and the final layer's output distribution reflects training-data frequency in a way that can overwrite context-driven correctness present in intermediate layers. The British Columbia / arithmetic worked examples illustrate the mechanism: the correct signal is in the intermediate layers, the final layer has smoothed toward the high-co-occurrence answer, and a weighted average across layers recovers the correct signal (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

This is a single-model instance of the ensemble intuition — averaging multiple "voters" reduces the variance of training-data-idiosyncratic biases while preserving shared factual signal — but implemented by reading the transformer's own intermediate layers as the voters, rather than training multiple independent models.
