Early-exit logits

Definition

Early-exit logits are the vocabulary-space logit vectors obtained by applying the transformer's final projection matrix (LM-head W_lm) to the hidden state of an intermediate layer rather than the final layer. For a transformer with L layers and hidden states h_1, h_2, ..., h_L, the standard decoding logits are z_L = W_lm · h_L; early-exit logits are the family z_i = W_lm · h_i for i < L. Each z_i ∈ ℝ^V is a full logit vector over the vocabulary — after softmax, the distribution layer i would emit if generation stopped there (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
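In code, the primitive is one matrix multiply per layer with a single shared projection. A minimal NumPy sketch (dimensions and names are illustrative, not from the source):

```python
import numpy as np

def early_exit_logits(hidden_states, W_lm):
    """Apply the shared LM head to every layer's hidden state.

    hidden_states: list of L arrays, each of shape (d,)  -- h_1 .. h_L
    W_lm:          array of shape (V, d)                 -- the LM head
    Returns the family z_i = W_lm @ h_i, each of shape (V,).
    """
    return [W_lm @ h for h in hidden_states]

# Toy dimensions: 4 layers, hidden size 8, vocabulary size 16.
rng = np.random.default_rng(0)
L, d, V = 4, 8, 16
hs = [rng.normal(size=d) for _ in range(L)]
W_lm = rng.normal(size=(V, d))

zs = early_exit_logits(hs, W_lm)
# zs[-1] is the standard decoding logit vector z_L; zs[:-1] are early exits.
assert len(zs) == L and all(z.shape == (V,) for z in zs)
```

Note the shape discipline: because W_lm is a fixed linear map from hidden-state space to vocabulary space, the same call works at every layer.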

The name comes from the "early exit" line of transformer research — letting the model stop computing as soon as an intermediate layer is confident enough — but the factuality-decoding use case flips the motivation: early-exit logits aren't read as "the model's cheap guess"; they are treated as additional signal the final layer has access to but sometimes ignores (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

Why they carry useful signal

The Google Research SLED post's framing:

"Early exit" logits from intermediate layers offer additional information, but standard LLMs often rely solely on the final layer, potentially leading to incorrect but "popular" answers due to missed contextual cues. (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers)

The structural observation is that a transformer's layers specialise: lower layers encode more syntactic / surface features, higher layers encode more semantic / task-level features, and the final layer produces the decoding distribution. Training biases the final layer toward completions that are frequent in the training data; intermediate layers can carry the contextually correct signal that the final layer has smoothed over.

Worked examples from the SLED post:

  • "What is the capital of British Columbia?" — final layer prefers Vancouver (popular); intermediate layers prefer Victoria (correct).
  • "6 toys × 10 tokens, 10% off if ≥4 toys" — final layer continues "6 × 10 =" (common A × B = C pattern); intermediate layers continue "6 × 10 ×" (preserves discount context).

In both, "listen to every layer's opinion" produces the correct answer (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
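The effect can be reproduced with synthetic numbers. A toy sketch of the British Columbia example (the per-layer probabilities below are invented for illustration, not measured from any model): the final layer narrowly prefers the popular token, intermediate layers prefer the contextual one, and a plain average across layers recovers the correct answer.

```python
import numpy as np

vocab = ["Victoria", "Vancouver", "other"]

# Invented per-layer next-token probabilities for
# "What is the capital of British Columbia?".
# Rows are layers (last row = final layer); columns follow `vocab`.
layer_probs = np.array([
    [0.50, 0.30, 0.20],   # intermediate layer: prefers Victoria (correct)
    [0.55, 0.35, 0.10],   # intermediate layer: prefers Victoria
    [0.40, 0.45, 0.15],   # final layer: narrowly prefers Vancouver (popular)
])

final_pick = vocab[int(layer_probs[-1].argmax())]          # "Vancouver"
ensemble_pick = vocab[int(layer_probs.mean(axis=0).argmax())]  # "Victoria"
```

Decoding from the final row alone picks the popular answer; averaging every row's opinion picks the correct one.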

Mechanism — LM-head reuse

The key implementation detail is that early-exit logits reuse the same LM-head W_lm as the final layer — no per-layer head is trained, no new parameters are added. W_lm is a fixed linear map from hidden-state space to vocabulary space; applying it to any layer's hidden state gives a well-typed logit vector over the vocabulary, even though the model was never trained to produce final-quality outputs at intermediate layers.

This is what makes factuality decoders like SLED and DoLa training-free — everything needed to extract per-layer distributions is already in the model (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

How factuality decoders consume them

  • DoLa — contrasts a mature-layer distribution against a premature-layer distribution and decodes from the contrast.
  • SLED — weighted-averages across all layer distributions and decodes from the average. Generalises pairwise contrast to full-ensemble mixing.

Both operate on the same primitive; SLED makes a stronger claim about how much useful signal is distributed across intermediate layers (not just in one pair).
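The two consumption patterns can be sketched over the same per-layer logits. This is a schematic simplification: the real DoLa and SLED methods add layer selection and principled weighting not shown here, so treat the functions below as illustrations of the primitive only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dola_style_contrast(z_mature, z_premature):
    """DoLa-style primitive: decode from the log-prob difference
    between a mature (late) layer and a premature (early) layer."""
    return np.log(softmax(z_mature)) - np.log(softmax(z_premature))

def sled_style_mixture(layer_logits, weights):
    """SLED-style primitive: weighted average over every layer's
    softmax distribution, then decode from the mixture."""
    probs = np.stack([softmax(z) for z in layer_logits])
    w = np.asarray(weights, dtype=float)
    return (w / w.sum()) @ probs

rng = np.random.default_rng(1)
zs = [rng.normal(size=5) for _ in range(4)]     # 4 layers, vocab of 5

contrast = dola_style_contrast(zs[-1], zs[0])   # one late/early pair
mix = sled_style_mixture(zs, weights=[1, 1, 1, 1])  # all layers, uniform
assert mix.shape == (5,) and np.isclose(mix.sum(), 1.0)
```

The structural difference is visible in the signatures: the contrast consumes exactly two layers' logits, while the mixture consumes all of them — which is the sense in which SLED generalises the pairwise contrast to full-ensemble mixing.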
