SLED (Self Logits Evolution Decoding)¶
SLED is a Google Research factuality-decoding method that improves the factual accuracy of any open-weights transformer LLM by using the early-exit logits from every transformer layer — not just the final layer — when choosing the next token. Presented at NeurIPS 2024 (arXiv:2411.02433); open-source at https://github.com/JayZhang42/SLED; covered on the Google Research blog on 2025-09-17.
SLED is a pure serving-time intervention: it doesn't add parameters, doesn't fine-tune, doesn't retrieve external knowledge. It modifies the LLM decoding step — the final phase where hidden states become tokens. At that phase it substitutes a weighted average of per-layer distributions for the canonical "read the final layer's logits and sample" rule (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
Why the final layer isn't always right¶
The Google Research blog frames the motivation around "popular but wrong" completions:
- British Columbia example. Question: "What is the capital of British Columbia?" The final layer assigns highest probability to Vancouver — the larger, better-known city that co-occurs frequently with "British Columbia" in training text. Intermediate layers assign higher probability to Victoria — the actual capital, less frequent in training data. The final layer has smoothed toward the high-co-occurrence answer; the intermediate layers still carry the factual signal (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
- Arithmetic example. Word problem: "Ash buys 6 toys at 10 tokens each; 4+ toys get 10% off." Final layer continues "6 × 10 =" (the common A × B = C training-data pattern). Intermediate layers continue "6 × 10 ×" (preserving the discount-multiplication context). The final layer's decision is driven by pattern frequency; the intermediate layers by context the problem supplies (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
The structural insight: the model already contains factually correct signal in its intermediate layers; the standard decoding rule throws that signal away when the final layer has been biased toward a high-frequency alternative by training-data statistics.
Mechanism¶
The core operation is straightforward:
- Forward pass. Run the transformer normally; retain the hidden state `h_i` at every layer `i ∈ {1, ..., L}`.
- Early-exit projection. Apply the transformer's final projection matrix (the LM head `W_lm`) to every layer's hidden state, producing per-layer logits `z_i = W_lm · h_i ∈ ℝ^{vocab_size}`. The final layer's logits `z_L` are what standard decoding would use.
- Per-layer distributions. Softmax each `z_i` to get per-layer next-token distributions `p_i = softmax(z_i)`.
- Weighted-average evolved logits. SLED takes a weighted average of the per-layer distributions — "giving more importance to some layers than others" — producing the evolved logits / evolved distribution. The exact weight-assignment rule is paper-specified, not detailed in the blog.
- Decode. Sample or argmax the next token from the evolved distribution.
No new parameters: `W_lm` is reused from the model's existing LM head. No fine-tuning: the weight-averaging rule is applied on top of frozen weights at inference time (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
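The four-step mechanism above can be sketched in a few lines of numpy. This is an illustrative simplification, not the reference implementation: SLED's actual per-layer weighting rule is specified in the paper (arXiv:2411.02433), so uniform weights here are a placeholder assumption.

```python
# Hypothetical sketch of all-layer weighted-average decoding in the
# spirit of SLED. Uniform layer weights are a placeholder; the real
# weight-assignment rule is paper-specified.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def evolved_distribution(hidden_states, W_lm, layer_weights=None):
    """hidden_states: (L, d) per-layer hidden states at the current position.
    W_lm: (vocab, d) shared LM-head projection, reused on every layer.
    layer_weights: (L,) mixing weights; uniform if None (assumption)."""
    L = hidden_states.shape[0]
    if layer_weights is None:
        layer_weights = np.full(L, 1.0 / L)
    z = hidden_states @ W_lm.T              # (L, vocab) early-exit logits
    p = softmax(z)                          # (L, vocab) per-layer distributions
    return layer_weights @ p                # (vocab,) evolved distribution

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))                 # toy model: 4 layers, d = 8
W = rng.normal(size=(16, 8))                # toy vocabulary of 16 tokens
p_evolved = evolved_distribution(h, W)
next_token = int(p_evolved.argmax())        # greedy decode from the mixture
```

Note that standard decoding is the degenerate case `layer_weights = [0, ..., 0, 1]`: all mass on the final layer.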
What the blog post discloses — numbers¶
- Up to +16 percentage points accuracy improvement over the base model and over DoLa (the prior best factuality-decoding baseline) on "two challenging datasets" (not individually named in the blog) (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
- ~4% decoding-time overhead vs DoLa, attributed to the extra per-layer LM-head projections. No absolute latency, tokens-per-second, or batch-size-dependence numbers in the blog (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
- Model families validated: Gemma 3, GPT-OSS (20B), Mistral (Mixtral-8x7B-v0.1), both instruction-tuned (IT) and base variants (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
- Benchmarks: FACTOR, TruthfulQA MC1 / MC2 / MC3 (multiple-choice), TruthfulQA generation (free response). Per-benchmark scores paper-mediated (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
Composability¶
SLED can be stacked with other factuality-decoding methods. The weighted-average-across-layers intervention is orthogonal to, for example, contrastive decoding or DoLa-style contrastive-layer selection — they operate on different aspects of the logits and compose. Ablation showing the compositions is paper-mediated; the blog states the property without the table (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
Relationship to other wiki primitives¶
- concepts/factuality-decoding — the category SLED instantiates. Decode-time interventions that improve LLM factual accuracy without retraining or retrieval. DoLa and SLED are the wiki's current instances; SLED is SOTA as of NeurIPS 2024.
- concepts/early-exit-logits — the lever. Reusing the final projection matrix on intermediate hidden states produces vocabulary distributions from every layer, not just the last.
- concepts/logits — the primitive SLED averages over.
- concepts/llm-hallucination — the problem SLED targets; the blog explicitly frames factuality decoding as a remediation path alongside RAG and fine-tuning.
- concepts/llm-decoding-step — the architectural insertion point. Both SLED and speculative decoding live here; same place in the stack, different objectives.
- concepts/speculative-decoding — sibling decode-time intervention. Speculative decoding optimises latency via draft-then-verify; SLED optimises factuality via all-layer ensemble. Both are training-free, both are modifications of the same final generation phase. The Google Research blog makes this parallel explicit.
- systems/speculative-cascades — a Google Research sibling (2025-09-11, one week before SLED's blog). Together with SLED they populate the "LLM serving-infra latency/factuality primitives" recurring shape on the Google company page — Google publishing the serving-side primitives themselves, not just the models that run on them.
- patterns/all-layer-ensemble-decoding — the generalised pattern SLED instantiates: reuse the final projection matrix on every layer's hidden state, weight-average the resulting distributions, decode from the mixture.
- systems/dola — the comparator. DoLa contrasts a mature layer's distribution against an earlier-layer distribution to amplify the factual signal; SLED generalises "use some early exits" to "weighted-average all of them".
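The structural difference between the two comparators can be shown on toy per-layer distributions. Both the layer-selection rule sketched here for DoLa and the uniform SLED weights are illustrative simplifications; the real rules are paper-specified.

```python
# Toy contrast: DoLa-style layer contrast vs SLED-style all-layer average.
# Layer choices and weights below are illustrative assumptions.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Per-layer next-token distributions for a 3-token toy vocabulary.
p_layers = np.array([
    softmax(np.array([2.0, 1.0, 0.0])),   # early layer
    softmax(np.array([1.5, 1.5, 0.0])),   # middle layer
    softmax(np.array([0.5, 2.5, 0.0])),   # final (mature) layer
])

# DoLa-style: contrast the mature layer against one premature layer,
# scoring tokens by how much the final layer diverges upward.
dola_scores = np.log(p_layers[-1]) - np.log(p_layers[0])

# SLED-style: weighted average over ALL layers (uniform placeholder).
sled_dist = p_layers.mean(axis=0)
```

DoLa emits contrastive scores from two layers; SLED emits a proper mixture distribution over all of them, which is why the blog describes it as generalising "use some early exits" to "use all of them".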
What the raw does not disclose¶
- Per-layer weighting rule. "Giving more importance to some layers than others" is all the blog says. The paper specifies the rule (learned? fixed? adaptive?) — not reconstructed here.
- Which two datasets the up-to-16-point claim is anchored to; the paper's tables are authoritative.
- Memory overhead. Projecting every layer's hidden state through the LM head materialises `L × vocab_size` floats per decoding step (before the weighted average collapses it back to one distribution). How this interacts with the KV cache, how much of the LM-head compute is amortised across layers, and the batch-size sensitivity are not discussed.
- Training-free claim's edge cases. Whether SLED works on closed-weights / API-only models where intermediate hidden states aren't exposed — the implicit answer is "no", since the method fundamentally needs per-layer hidden states, but the blog doesn't call this out.
- Interaction with speculative decoding. Blog positions both as "modifications to decoding" but doesn't discuss composing SLED's weighted-average with speculative decoding's draft-verify cycle in the same serving path. The two operate at different token granularities (SLED per-token, speculative decoding per N-token draft) and could in principle compose, but this isn't claimed.
- No production deployment. SLED is published as a research method + open-source library, not as a productionised Google inference stack. No claim that Gemini / AI Overviews / Vertex AI uses SLED.
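The undisclosed memory overhead can at least be bounded by back-of-envelope arithmetic. The model shape below (32 layers, 256k vocabulary, bf16) is an illustrative assumption, not a figure from the blog or paper.

```python
# Back-of-envelope for the per-layer-logits buffer: L x vocab_size
# floats per decoding position. All sizes are assumptions for scale.
layers = 32
vocab_size = 256_000
bytes_per_float = 2                     # fp16 / bf16

per_step_bytes = layers * vocab_size * bytes_per_float
print(f"{per_step_bytes / 2**20:.1f} MiB of per-layer logits per position")
# -> 15.6 MiB (transient; collapses to one vocab-sized distribution)
```

Even at this scale the buffer is small next to the KV cache of a long context, which is consistent with the blog reporting only a ~4% time overhead and no memory figure.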
Seen in¶
- sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers — the introductory blog post announcing SLED; walks the British Columbia + arithmetic examples; reports cross-family results (Gemma 3 / GPT-OSS / Mistral, IT + base); up to +16% accuracy vs DoLa; ~4% decoding-time overhead; composable with other factuality decoders; open-source release pointer.
Source¶
- Original blog: https://research.google/blog/making-llms-more-accurate-by-using-all-of-their-layers/
- Paper (NeurIPS 2024): arXiv:2411.02433
- Code: github.com/JayZhang42/SLED