CONCEPT
Logits (LLM / transformer)¶
Definition¶
Logits are the pre-softmax prediction scores a transformer
LLM emits over the vocabulary at each generation step. For a
vocabulary of size V, the logit vector z ∈ ℝ^V is produced by
applying the model's LM-head (the final linear projection
matrix W_lm: hidden_dim → V) to a layer's hidden state h:
z = W_lm · h (pre-softmax, real-valued, any sign)
p = softmax(z) (post-softmax, probability distribution over V)
The decoder then picks the next token from p: argmax for greedy
decoding, sampling for stochastic decoding, or more elaborate rules
(top-p, top-k, beam search) that trim and renormalise the
distribution before sampling
(Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
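The definition above reduces to a linear map followed by softmax. A minimal plain-Python sketch (the weight matrix, hidden state, and dimensions are illustrative toy values, not from the source):

```python
import math
import random

def lm_head(W_lm, h):
    """Project a hidden state h (dim d) to logits z (dim V): z = W_lm . h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W_lm]

def softmax(z):
    """Convert logits to a probability distribution over the vocabulary."""
    m = max(z)                              # subtract max(z) for stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# Toy setup: vocabulary V = 4, hidden dim d = 3 (illustrative values).
W_lm = [[0.2, -1.0, 0.5],
        [1.5,  0.3, -0.2],
        [-0.7, 0.8, 0.1],
        [0.0,  0.4, 0.9]]
h = [1.0, -0.5, 2.0]

z = lm_head(W_lm, h)   # pre-softmax: real-valued, any sign
p = softmax(z)         # post-softmax: non-negative, sums to 1

greedy_token = max(range(len(p)), key=p.__getitem__)          # argmax decoding
sampled_token = random.choices(range(len(p)), weights=p)[0]   # stochastic decoding
```

The same `p` feeds top-k, top-p, or beam search; those rules only differ in how they trim and renormalise before picking.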
Why logits, not probabilities¶
Logits are the canonical intermediate representation for decoding because:
- Linear composition. Logits combine additively: averaging logits before softmax is not the same as averaging probabilities, but logit-space arithmetic is what every decoding paper specifies. SLED's weighted-average-across-layers operation is a logit-space (or logit-derived-distribution-space) computation.
- Numerical stability. Softmax exponentiates raw scores, so naive implementations overflow on large logits; the standard fix subtracts max(z) before exponentiating, and working in log-space keeps arithmetic well-conditioned over a long tail of negative scores.
- Temperature scaling. softmax(z / T) is the canonical knob for sharpening (T < 1) or flattening (T > 1) the distribution: a transform defined on logits, not probabilities.
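The temperature knob can be sketched in a few lines of plain Python (logit values illustrative):

```python
import math

def softmax(z, T=1.0):
    """Temperature-scaled softmax: T < 1 sharpens, T > 1 flattens."""
    m = max(z)
    exps = [math.exp((v - m) / T) for v in z]   # max-subtraction avoids overflow
    s = sum(exps)
    return [e / s for e in exps]

z = [2.0, 1.0, -3.0]
p_sharp = softmax(z, T=0.5)   # sharpened: mass concentrates on the argmax
p_base  = softmax(z, T=1.0)
p_flat  = softmax(z, T=2.0)   # flattened: closer to uniform
# The top token's probability grows as T shrinks:
# p_flat[0] < p_base[0] < p_sharp[0]
```

Note that dividing by T before softmax changes the distribution, whereas dividing probabilities by T after softmax would be meaningless; this is why the knob is defined on logits.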
Per-layer logits vs final-layer logits¶
Standard decoding reads logits off the final layer only.
Every transformer layer, however, produces a hidden state,
and applying the same LM-head W_lm to each layer's hidden state
yields per-layer logit vectors — the
"early-exit logits" in
factuality-decoding terminology. These intermediate logits contain
information that the final layer sometimes overrides in favour of
training-data-frequency-biased alternatives
(Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
SLED and DoLa both exploit this: SLED weighted-averages across all per-layer logits; DoLa contrasts a pair of them.
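The per-layer idea can be sketched by re-applying one shared LM head to each layer's hidden state. The combination steps below are simplifications: a plain weighted average stands in for SLED's weighting scheme, and a raw logit difference stands in for DoLa's contrast; all values are illustrative.

```python
def lm_head(W_lm, h):
    """Shared final projection, applied to any layer's hidden state."""
    return [sum(w * x for w, x in zip(row, h)) for row in W_lm]

# Toy setup: L = 3 layers, hidden dim d = 2, vocabulary V = 3.
W_lm = [[1.0, 0.0],
        [0.0, 1.0],
        [0.5, 0.5]]
layer_hiddens = [[0.2, 0.9],   # early layer
                 [0.6, 0.4],   # middle layer
                 [1.0, 0.1]]   # final layer

# "Early-exit" logits: the same W_lm applied at every layer.
per_layer_logits = [lm_head(W_lm, h) for h in layer_hiddens]

# SLED-style combination, sketched as a fixed weighted average over layers
# (SLED derives its own weights; these are placeholders).
weights = [0.2, 0.3, 0.5]
V = len(W_lm)
avg_logits = [sum(w * z[i] for w, z in zip(weights, per_layer_logits))
              for i in range(V)]

# DoLa-style contrast, sketched as final-layer minus early-layer logits.
contrast = [a - b for a, b in zip(per_layer_logits[-1], per_layer_logits[0])]
```

Either combination yields a logit vector that then goes through the usual softmax-and-pick decoding step.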
Role in speculative decoding¶
Speculative decoding's verifier consumes logits the same way: the expert model runs one parallel forward pass over an N-token draft, emitting per-position logits, and token verification compares each drafter token against the expert's argmax or distribution at that position. The "probabilistic match" rule that speculative cascades introduce is a statement about logit-derived distributions, not raw token ids (Source: sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference).
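The per-position check can be sketched as a greedy-verification loop: accept drafter tokens while they match the expert's argmax at each position, and stop at the first mismatch. This is the argmax simplification, not the probabilistic-match rule; logits and tokens are illustrative.

```python
def argmax(z):
    return max(range(len(z)), key=z.__getitem__)

def verify_greedy(draft_tokens, expert_logits_per_pos):
    """Accept drafter tokens up to the first position where the expert's
    argmax disagrees; return (accepted_prefix, correction_or_None)."""
    accepted = []
    for tok, logits in zip(draft_tokens, expert_logits_per_pos):
        expert_tok = argmax(logits)
        if tok == expert_tok:
            accepted.append(tok)
        else:
            return accepted, expert_tok   # expert's token replaces the reject
    return accepted, None

# Illustrative: drafter proposed [3, 1, 2]; one expert logit row per position.
expert_logits = [[0.1, 0.0, 0.2, 0.9],   # argmax = 3 -> accept
                 [0.0, 0.8, 0.1, 0.3],   # argmax = 1 -> accept
                 [0.9, 0.1, 0.2, 0.0]]   # argmax = 0 -> reject drafted 2
accepted, correction = verify_greedy([3, 1, 2], expert_logits)
# accepted == [3, 1], correction == 0
```

A probabilistic-match verifier would instead compare the drafter's and expert's softmax distributions at each position, accepting stochastically, but the inputs are the same per-position logits.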
Seen in¶
- sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers — primary source. SLED operates directly on per-layer logits obtained by re-applying the final projection matrix across every layer; the post defines "early-exit logits" on the way to motivating the all-layer weighted average.
Related¶
- concepts/early-exit-logits — logits from intermediate transformer layers.
- concepts/llm-decoding-step — where logits become tokens.
- concepts/factuality-decoding — the category that operates on per-layer logits.
- systems/sled — weighted-average-across-layers decoder.
- concepts/speculative-decoding — verifier consumes expert logits per-position.
- concepts/token-verification — per-position accept/reject rule defined on logits.