
SLED (Self Logits Evolution Decoding)

SLED is a Google Research factuality-decoding method that improves the accuracy of any open-weights transformer LLM by using the early-exit logits from every transformer layer, not just the final layer, when choosing the next token. Presented at NeurIPS 2024 (arXiv:2411.02433); open-sourced at https://github.com/JayZhang42/SLED; covered on the Google Research blog on 2025-09-17.

SLED is a pure serving-time intervention: it doesn't add parameters, doesn't fine-tune, doesn't retrieve external knowledge. It modifies the LLM decoding step — the final phase where hidden states become tokens. At that phase it substitutes a weighted average of per-layer distributions for the canonical "read the final layer's logits and sample" rule (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

Why the final layer isn't always right

The Google Research blog frames the motivation around "popular but wrong" completions:

  • British Columbia example. Question: "What is the capital of British Columbia?" The final layer assigns highest probability to Vancouver — the larger, better-known city that co-occurs frequently with "British Columbia" in training text. Intermediate layers assign higher probability to Victoria — the actual capital, less frequent in training data. The final layer has smoothed toward the high-co-occurrence answer; the intermediate layers still carry the factual signal (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
  • Arithmetic example. Word problem: "Ash buys 6 toys at 10 tokens each; 4+ toys get 10% off." Final layer continues "6 × 10 =" (the common A × B = C training-data pattern). Intermediate layers continue "6 × 10 ×" (preserving the discount-multiplication context). The final layer's decision is driven by pattern frequency; the intermediate layers by context the problem supplies (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

The structural insight: the model already contains factually correct signal in its intermediate layers; the standard decoding rule throws that signal away when the final layer has been biased toward a high-frequency alternative by training-data statistics.

Mechanism

The core operation is straightforward:

  1. Forward pass. Run the transformer normally; retain the hidden state h_i at every layer i ∈ {1, ..., L}.
  2. Early-exit projection. Apply the transformer's final projection matrix (the LM-head W_lm) to every layer's hidden state, producing per-layer logits z_i = W_lm · h_i ∈ ℝ^{vocab_size}. The final layer's logits z_L are what standard decoding would use.
  3. Per-layer distributions. Softmax each z_i to get per-layer next-token distributions p_i = softmax(z_i).
  4. Weighted-average evolved logits. SLED takes a weighted average of the per-layer distributions — "giving more importance to some layers than others" — producing the evolved logits / evolved distribution. The exact weight-assignment rule is paper-specified, not detailed in the blog.
  5. Decode. Sample or argmax the next token from the evolved distribution.

No new parameters; W_lm is reused from the model's existing LM head. No fine-tuning; the weight-averaging rule is applied on top of frozen weights at inference time (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
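The five steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the per-layer weights are assumed to be given, whereas SLED's actual weight-assignment rule is paper-specified.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sled_decode_step(hidden_states, W_lm, layer_weights):
    """One SLED-style decoding step (sketch).

    hidden_states: list of L arrays, each (d_model,), the current
        position's hidden state at every layer.
    W_lm: (vocab_size, d_model), the model's existing LM head, reused.
    layer_weights: length-L mixing weights (assumed given; the real
        assignment rule is specified in the paper).
    """
    # Steps 1-2: early-exit projection, z_i = W_lm @ h_i per layer.
    logits = np.stack([W_lm @ h for h in hidden_states])  # (L, vocab)
    # Step 3: per-layer next-token distributions p_i = softmax(z_i).
    dists = softmax(logits)                               # (L, vocab)
    # Step 4: weighted average -> the evolved distribution.
    w = np.asarray(layer_weights, dtype=float)
    w = w / w.sum()
    evolved = w @ dists                                   # (vocab,)
    # Step 5: decode (greedy here; sampling also works).
    return int(np.argmax(evolved)), evolved
```

With all weight on the final layer the sketch degenerates to standard decoding, which makes the "final layer only" rule a special case of the mixture.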

What the blog post discloses — numbers

The headline figure: factual-accuracy improvements of up to 16%. Which two datasets anchor that figure is paper-level detail (see below).

Composability

SLED can be stacked with other factuality-decoding methods. The weighted-average-across-layers intervention is orthogonal to, for example, contrastive decoding or DoLa-style contrastive-layer selection: they operate on different aspects of the logits and so compose. The ablation demonstrating these compositions is in the paper; the blog states the property without the table (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
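One way the stacking could look, sketched on toy numbers: evolve each model's distribution with SLED's averaging, then apply a contrastive-decoding-style expert/amateur contrast on top. The models, distributions, and weights here are all illustrative assumptions, not the blog's or paper's configuration.

```python
import numpy as np

def sled_mix(dists, w):
    """Weighted average of per-layer distributions (SLED's core op)."""
    w = np.asarray(w, dtype=float) / np.sum(w)
    return w @ np.asarray(dists)

# Toy per-layer distributions for an "expert" and an "amateur" model
# (contrastive decoding's two models; 2 layers, 3-token vocabulary).
expert_layers  = [[0.1, 0.7, 0.2], [0.2, 0.6, 0.2]]
amateur_layers = [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]]
w = [0.4, 0.6]  # assumed layer weights, not the paper's rule

# Stack: SLED evolves each model's distribution first, then the
# contrastive step scores tokens by expert-vs-amateur log-ratio.
p_expert  = sled_mix(expert_layers, w)
p_amateur = sled_mix(amateur_layers, w)
contrast = np.log(p_expert) - np.log(p_amateur)
next_token = int(np.argmax(contrast))
```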

Relationship to other wiki primitives

  • concepts/factuality-decoding — the category SLED instantiates. Decode-time interventions that improve LLM factual accuracy without retraining or retrieval. DoLa and SLED are the wiki's current instances; SLED is SOTA as of NeurIPS 2024.
  • concepts/early-exit-logits — the lever. Reusing the final projection matrix on intermediate hidden states produces vocabulary distributions from every layer, not just the last.
  • concepts/logits — the primitive SLED averages over.
  • concepts/llm-hallucination — the problem SLED targets; the blog explicitly frames factuality decoding as a remediation path alongside RAG and fine-tuning.
  • concepts/llm-decoding-step — the architectural insertion point. Both SLED and speculative decoding live here; same place in the stack, different objectives.
  • concepts/speculative-decoding — sibling decode-time intervention. Speculative decoding optimises latency via draft-then-verify; SLED optimises factuality via all-layer ensemble. Both are training-free, both are modifications of the same final generation phase. The Google Research blog makes this parallel explicit.
  • systems/speculative-cascades — a Google Research sibling (2025-09-11, one week before SLED's blog). Together with SLED they populate the "LLM serving-infra latency/factuality primitives" recurring shape on the Google company page — Google publishing the serving-side primitives themselves, not just the models that run on them.
  • patterns/all-layer-ensemble-decoding — the generalised pattern SLED instantiates: reuse the final projection matrix on every layer's hidden state, weight-average the resulting distributions, decode from the mixture.
  • systems/dola — the comparator. DoLa contrasts a mature layer's distribution against an earlier-layer distribution to amplify the factual signal; SLED generalises "use some early exits" to "weighted-average all of them".
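The DoLa-vs-SLED distinction in the last bullet can be made concrete on toy numbers. This is schematic: DoLa's actual premature-layer selection is dynamic, and SLED's layer weights below are assumed, not the paper's rule.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy per-layer next-token distributions from early-exit projection
# (3 layers, 4-token vocabulary; all numbers illustrative).
p = np.array([
    [0.10, 0.60, 0.20, 0.10],   # early layer
    [0.15, 0.50, 0.25, 0.10],   # middle layer
    [0.40, 0.30, 0.20, 0.10],   # final (mature) layer
])

# DoLa-style: contrast the mature layer against one premature layer.
dola = softmax(np.log(p[-1]) - np.log(p[0]))

# SLED-style: weighted average over ALL layers.
w = np.array([0.2, 0.3, 0.5])
sled = w @ p
```

On these numbers the two methods can even rank tokens differently, which is the point: a two-layer contrast and an all-layer mixture extract different signals from the same early-exit logits.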

What the raw source does not disclose

  • Per-layer weighting rule. "Giving more importance to some layers than others" is all the blog says. The paper specifies the rule (learned? fixed? adaptive?) — not reconstructed here.
  • Which two datasets the up-to-16% claim is anchored to. The paper's tables are authoritative for the exact ranges.
  • Memory overhead. Projecting every layer's hidden state through the LM-head allocates L × vocab_size floats per decoding step (before the weighted average collapses it back to one distribution). How this interacts with the KV cache, how much of the LM-head compute is amortised across layers, and the batch-size sensitivity are not discussed.
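The L × vocab_size figure in the memory bullet is easy to put numbers on. The model dimensions below are assumed for illustration (a 32-layer model with a 128k-entry vocabulary in fp32), not taken from the blog.

```python
# Back-of-envelope per-step logits memory for all-layer early exit.
n_layers = 32            # assumed layer count
vocab_size = 128_000     # assumed vocabulary size
bytes_per_float = 4      # fp32

all_layer = n_layers * vocab_size * bytes_per_float  # L x vocab floats
final_only = vocab_size * bytes_per_float            # standard decoding

print(all_layer / 2**20)        # 15.625 MiB before averaging collapses it
print(all_layer // final_only)  # 32x the final-layer-only footprint
```

That transient footprint is per decoding step and per sequence, which is why batch-size sensitivity is one of the undisclosed questions.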
  • Training-free claim's edge cases. Whether SLED works on closed-weights / API-only models where intermediate hidden states aren't exposed — the implicit answer is "no" since the method fundamentally needs per-layer hidden states, but the blog doesn't call this out.
  • Interaction with speculative decoding. Blog positions both as "modifications to decoding" but doesn't discuss composing SLED's weighted-average with speculative decoding's draft-verify cycle in the same serving path. The two operate at different token granularities (SLED per-token, speculative decoding per N-token draft) and could in principle compose, but this isn't claimed.
  • No production deployment. SLED is published as a research method + open-source library, not as a productionised Google inference stack. No claim that Gemini / AI Overviews / Vertex AI uses SLED.
