Google Research — Making LLMs more accurate by using all of their layers (SLED)¶
Summary¶
Google Research introduces SLED (Self Logits Evolution Decoding) — a factuality decoding method that improves LLM accuracy by using the early-exit logits from every transformer layer instead of relying solely on the final layer. The mechanism: reuse the transformer's final projection matrix (the weight matrix that maps final-layer hidden states to vocabulary logits) on the hidden states produced by each intermediate layer, producing one next-token distribution per layer over the same vocabulary. SLED then takes a weighted average across all per-layer distributions to pick the next token, rather than reading the token off the last layer alone. The authors frame this as a sibling of speculative decoding — both are modifications of the LLM's decoding step (the final phase of text generation) and both are applied at serving time without retraining. Where speculative decoding attacks latency, SLED attacks factuality (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
Structurally the post is a research walkthrough, not a production retrospective. It motivates the design with two worked examples (the "capital of British Columbia" multiple-choice case — final layer prefers the popular-but-wrong Vancouver, intermediate layers prefer the correct Victoria; and a multi-step arithmetic word problem — final layer prefers the common "A × B =" completion, intermediate layers prefer "A × B ×" that keeps the discount factor on track), names benchmarks (DoLa as the prior best factuality-decoding baseline; FACTOR and TruthfulQA MC1/MC2/MC3 for multiple-choice; TruthfulQA generation for free response), and reports headline numbers: up to 16% accuracy improvement over base models and DoLa on two challenging datasets, ~4% decoding-time overhead vs DoLa, tested across Gemma 3, GPT-OSS, and Mistral (both instruction-tuned and base). SLED does not require an external knowledge base, retrieval-augmented generation, or fine-tuning, and can be stacked with other factuality-decoding methods. Code is open-source at https://github.com/JayZhang42/SLED; paper at NeurIPS 2024 (arXiv:2411.02433) (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
Key takeaways¶
- Factuality decoding is a sibling category to latency decoding. The post explicitly places SLED alongside speculative decoding as a "modification to the decoding process" — the final step of text generation where internal representations become tokens. Speculative decoding reshapes how fast the expert generates; SLED reshapes which token the model picks. Same architectural insertion point, different objective (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
- All transformer layers already "know" the answer — SLED just listens to them. Standard decoding reads the next token off the final layer's logits. SLED observes that intermediate layers, when projected through the same final projection matrix, produce early-exit logits that contain signal the final layer sometimes overrides in favour of the "popular but wrong" completion. The canonical worked example: "What is the capital of British Columbia?" — the final layer assigns high probability to Vancouver (well-known, high co-occurrence with British Columbia in training data); intermediate layers assign higher probability to Victoria (the actual capital, less frequent in training text). A weighted average across layers surfaces Victoria (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
- Mechanism: reuse the final projection matrix on every layer's hidden state. The transformer's LM-head (a single linear layer mapping hidden_dim → vocab_size) is applied to the hidden states of every layer, not just the last. Each layer produces its own vocabulary distribution; SLED takes a weighted average of all distributions, weighting some layers more than others, and samples from the resulting evolved logits. No new parameters, no retraining (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
- Worked arithmetic example surfaces the structural failure mode. Word problem: "Ash buys 6 toys at 10 tokens each; 4+ toys get 10% off." The final layer continues "6 × 10 =" (the common arithmetic pattern A × B = C dominates training data). Intermediate layers instead favour "6 × 10 ×", keeping the discount multiplication alive. SLED's weighted-average decoding picks the "×" continuation, arriving at "6 × 10 × 0.9 = 54". Training-data frequency biases the final layer toward the common completion; intermediate layers carry the discount context the problem supplies but the final layer has smoothed over (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
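The mechanism can be sketched in a few lines of plain Python. Everything here is a toy stand-in: the hidden states and LM-head matrix are random, and the layer weights are a placeholder, since the blog does not specify SLED's actual weight-assignment rule.

```python
import math
import random

random.seed(0)
num_layers, hidden_dim, vocab_size = 4, 8, 5

# Toy stand-ins: one hidden state per layer, plus the shared LM-head matrix
# (hidden_dim → vocab_size) that a real transformer applies only at the end.
hidden_states = [[random.gauss(0, 1) for _ in range(hidden_dim)]
                 for _ in range(num_layers)]
lm_head = [[random.gauss(0, 1) for _ in range(vocab_size)]
           for _ in range(hidden_dim)]

def project(h):
    """Early-exit logits: reuse the final projection matrix on any layer."""
    return [sum(h[i] * lm_head[i][v] for i in range(hidden_dim))
            for v in range(vocab_size)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One next-token distribution per layer, all over the same vocabulary.
per_layer_probs = [softmax(project(h)) for h in hidden_states]

# Placeholder layer weights -- the blog only says SLED "gives more
# importance to some layers than others", not how weights are chosen.
layer_weights = softmax([float(i) for i in range(num_layers)])

# Evolved distribution: weighted average across per-layer distributions.
evolved = [sum(w * p[v] for w, p in zip(layer_weights, per_layer_probs))
           for v in range(vocab_size)]
next_token = max(range(vocab_size), key=lambda v: evolved[v])
```

Note that no new parameters appear anywhere: the only weight matrix used is the LM-head the model already has, which is why this is a pure serving-time intervention.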
- Up to 16% accuracy improvement; ~4% decoding-time overhead. On "two challenging datasets" (unspecified which; the post defers to the paper) SLED beats both the base model and DoLa — the prior best factuality-decoding method — by up to 16 percentage points. The wall-clock cost is a roughly 4% increase in decoding time vs DoLa, attributed to the extra forward-projection of each intermediate layer's hidden state through the LM-head. Framed as "minimal" tradeoff (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
- Cross-family applicability: Gemma 3, GPT-OSS, Mistral (IT and base). SLED is a decoding-time intervention that only requires access to intermediate-layer hidden states and the final projection matrix — both available in any open-weights transformer. The paper validates this on three unrelated model families and on both instruction-tuned and base variants, with SLED consistently outperforming DoLa across the matrix (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
- Composable with other factuality-decoding methods. The post claims (and the paper validates) that SLED can be stacked on top of other factuality-decoding methods rather than replacing them — the weighted-average-across-layers intervention is orthogonal to, say, contrastive decoding or logit filtering, so they compose (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
- No external knowledge, no fine-tuning, no retrieval. The post positions SLED against hallucination remediation techniques that require heavier machinery — retrieval-augmented generation needs a retriever + a knowledge base + a reranker; fine-tuning needs labelled data + compute + a retraining pipeline. SLED is a pure serving-time intervention: load the same weights, change the decoding function. The pitch is low integration cost for a measurable factuality win (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
Systems, concepts, patterns extracted¶
- Systems
- systems/sled — Self Logits Evolution Decoding itself; NeurIPS 2024; open-source at https://github.com/JayZhang42/SLED; paper arXiv:2411.02433.
- systems/dola — the prior-best factuality-decoding baseline (arXiv:2309.03883, voidism/DoLa); referenced as the ~4%-overhead and accuracy-beaten comparator.
- Concepts
- concepts/factuality-decoding — the category name; decode-time interventions that improve LLM factuality without retraining or retrieval. SLED and DoLa are the wiki's current instances.
- concepts/logits — pre-softmax prediction scores each transformer layer emits over the vocabulary; the primitive SLED operates on.
- concepts/early-exit-logits — logits derived by applying the final projection matrix to an intermediate layer's hidden state; SLED's lever.
- concepts/llm-hallucination — the factuality failure mode SLED (and factuality decoding generally) targets.
- concepts/llm-decoding-step — the final phase of LLM text generation, where internal representations become tokens; the architectural insertion point for both speculative decoding and SLED.
- Patterns
- patterns/all-layer-ensemble-decoding — the pattern SLED instantiates: reuse the final projection matrix across every layer's hidden state, weight-average the resulting per-layer distributions, decode from the mixture. Orthogonal to other decoding-time interventions (composable) and training-free (no new parameters).
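The composability claim in the pattern above amounts to function composition over next-token distributions: SLED's layer mixture produces an ordinary probability vector, so any other logit-level intervention can run on its output. The sketch below is illustrative only; the function names and the toy top-k filter are assumptions, not the paper's API.

```python
from typing import List

Dist = List[float]  # probability distribution over the vocabulary

def sled_mixture(per_layer_dists: List[Dist], weights: List[float]) -> Dist:
    """All-layer ensemble: weighted average of per-layer distributions."""
    vocab = len(per_layer_dists[0])
    return [sum(w * d[v] for w, d in zip(weights, per_layer_dists))
            for v in range(vocab)]

def top_k_filter(dist: Dist, k: int) -> Dist:
    """A toy second decoding-time intervention, composed after SLED."""
    keep = sorted(range(len(dist)), key=lambda v: dist[v], reverse=True)[:k]
    filtered = [d if v in keep else 0.0 for v, d in enumerate(dist)]
    total = sum(filtered)
    return [f / total for f in filtered]

# SLED mixes layers; any other logit-level method can consume its output.
per_layer = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1]]
weights = [0.2, 0.3, 0.5]
evolved = sled_mixture(per_layer, weights)
decoded = top_k_filter(evolved, k=2)
```

Because both stages take and return a vocabulary distribution, the SLED stage is orthogonal to whatever runs after it, which is the structural reason the pattern stacks with other factuality-decoding methods.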
Operational numbers¶
- Accuracy improvement: up to +16 percentage points over the base model and over DoLa, on "two challenging datasets" (paper-mediated; the blog does not name which) (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
- Decoding-time overhead: ~4% vs DoLa, attributed to additional per-layer LM-head projections; no absolute latency or throughput figure published in the blog (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
- Model families validated: Gemma 3, GPT-OSS (20B), Mistral (Mixtral-8x7B-v0.1), both IT and base variants (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
- Benchmarks: FACTOR, TruthfulQA MC1/MC2/MC3, TruthfulQA generation; no raw scores per benchmark published in the blog (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
Caveats¶
- Raw-scope caveat. The locally saved raw file contains only the "Experiments" narrative fragment — benchmarks, toy problems, TruthfulQA chartreuse / lit-fireplace examples. The core SLED mechanism description, the worked British Columbia / arithmetic examples, the results chart, and the conclusion live in the rest of the blog post, retrieved from the URL in-session. Wiki pages extend up to what the full blog verifiably contains and flag paper-only detail.
- "Two challenging datasets" — the blog's up-to-16% claim names the improvement magnitude but not the specific datasets on which it holds. The paper's results table is authoritative; the blog summarises.
- Weighted-average specifics — the blog says SLED "takes a weighted average" and "gives more importance to some layers than others", but the weight-assignment rule (learned? fixed? heuristic? per-position?) is not specified in the post. Paper-mediated.
- No production deployment named. The post is research output, not a productionised launch — no Gemini / Bard / AI Overviews / Vertex AI integration is claimed. The pitch is "you can use SLED on any open-source LLM", positioning the GitHub release as the deliverable.
- No throughput / tokens-per-second numbers. "Slightly longer than normal" and "only about 4% higher than DoLa" are the only latency statements; no absolute tokens/sec, no batch-size sensitivity, no KV-cache interaction detail.
- No memory-overhead numbers. Projecting every layer's hidden state through the LM-head allocates one vocab_size-dimensional distribution per layer; whether this is computed lazily, cached, or folded into the final logits isn't discussed in the blog.
- Composability claim is qualitative. "Can be flexibly integrated with other factuality decoding methods to further reduce model hallucinations" — no ablation table showing SLED+X combinations in the blog itself. Paper-mediated.
- No comparison to RAG / fine-tuning on the same benchmarks. The post dismisses RAG and fine-tuning as "complicated" / "requires a system" / "requires data" but doesn't show a head-to-head accuracy comparison. Scope is factuality decoding vs factuality decoding, not factuality decoding vs retrieval.
Source¶
- Original: https://research.google/blog/making-llms-more-accurate-by-using-all-of-their-layers/
- Raw markdown: raw/google/2025-09-17-making-llms-more-accurate-by-using-all-of-their-layers-6b4a1826.md
- Paper (NeurIPS 2024): arXiv:2411.02433
- Code: github.com/JayZhang42/SLED
Related¶
- companies/google
- systems/sled
- systems/dola
- concepts/factuality-decoding
- concepts/logits
- concepts/early-exit-logits
- concepts/llm-hallucination
- concepts/llm-decoding-step
- patterns/all-layer-ensemble-decoding
- concepts/speculative-decoding — sibling decode-time intervention at the same architectural insertion point, optimising for latency rather than factuality.
- systems/speculative-cascades — sibling Google Research LLM-serving primitive (2025-09-11); together with SLED these form the "LLM serving-infra latency/factuality primitives" recurring shape on the Google company page.