CONCEPT
Logits (LLM / transformer)¶
Definition¶
Logits are the pre-softmax prediction scores a transformer
LLM emits over the vocabulary at each generation step. For a
vocabulary of size V, the logit vector z ∈ ℝ^V is produced by
applying the model's LM-head (the final linear projection
matrix W_lm: hidden_dim → V) to a layer's hidden state h:
z = W_lm · h (pre-softmax, real-valued, any sign)
p = softmax(z) (post-softmax, probability distribution over V)
The decoder then picks the next token from p: argmax for greedy
decoding, sampling for stochastic decoding, or more elaborate rules
(top-p, top-k, beam search) that trim and renormalise the
distribution before sampling
(Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
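The definition above reduces to a linear map followed by softmax. A minimal plain-Python sketch (the weight matrix, hidden state, and dimensions are illustrative toy values, not from the source):

```python
import math
import random

def lm_head(W_lm, h):
    """Project a hidden state h (dim d) to logits z (dim V): z = W_lm . h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W_lm]

def softmax(z):
    """Convert logits to a probability distribution over the vocabulary."""
    m = max(z)                              # subtract max(z) for stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# Toy setup: vocabulary V = 4, hidden dim d = 3 (illustrative values).
W_lm = [[0.2, -1.0, 0.5],
        [1.5,  0.3, -0.2],
        [-0.7, 0.8, 0.1],
        [0.0,  0.4, 0.9]]
h = [1.0, -0.5, 2.0]

z = lm_head(W_lm, h)   # pre-softmax: real-valued, any sign
p = softmax(z)         # post-softmax: non-negative, sums to 1

greedy_token = max(range(len(p)), key=p.__getitem__)          # argmax decoding
sampled_token = random.choices(range(len(p)), weights=p)[0]   # stochastic decoding
```

The same `p` feeds top-k, top-p, or beam search; those rules only differ in how they trim and renormalise before picking.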
Why logits, not probabilities¶
Logits are the canonical intermediate representation for decoding because:
- Linear composition. Logits combine additively: averaging logits before softmax is not the same as averaging probabilities, but logit-space arithmetic is what every decoding paper specifies. SLED's weighted-average-across-layers operation is a logit-space (or logit-derived-distribution-space) computation.
- Numerical stability. Softmax exponentiates raw scores, so naive implementations overflow on large logits; the standard fix subtracts max(z) before exponentiating, and working in log-space keeps arithmetic well-conditioned over a long tail of negative scores.
- Temperature scaling. softmax(z / T) is the canonical knob for sharpening (T < 1) or flattening (T > 1) the distribution: a transform defined on logits, not probabilities.
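The temperature knob can be sketched in a few lines of plain Python (logit values illustrative):

```python
import math

def softmax(z, T=1.0):
    """Temperature-scaled softmax: T < 1 sharpens, T > 1 flattens."""
    m = max(z)
    exps = [math.exp((v - m) / T) for v in z]   # max-subtraction avoids overflow
    s = sum(exps)
    return [e / s for e in exps]

z = [2.0, 1.0, -3.0]
p_sharp = softmax(z, T=0.5)   # sharpened: mass concentrates on the argmax
p_base  = softmax(z, T=1.0)
p_flat  = softmax(z, T=2.0)   # flattened: closer to uniform
# The top token's probability grows as T shrinks:
# p_flat[0] < p_base[0] < p_sharp[0]
```

Note that dividing by T before softmax changes the distribution, whereas dividing probabilities by T after softmax would be meaningless; this is why the knob is defined on logits.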
Per-layer logits vs final-layer logits¶
Standard decoding reads logits off the final layer only.
Every transformer layer, however, produces a hidden state,
and applying the same LM-head W_lm to each layer's hidden state
yields per-layer logit vectors — the
"early-exit logits" in
factuality-decoding terminology. These intermediate logits contain
information that the final layer sometimes overrides in favour of
training-data-frequency-biased alternatives
(Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
SLED and DoLa both exploit this: SLED weighted-averages across all per-layer logits; DoLa contrasts a pair of them.
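The per-layer idea can be sketched by re-applying one shared LM head to each layer's hidden state. The combination steps below are simplifications: a plain weighted average stands in for SLED's weighting scheme, and a raw logit difference stands in for DoLa's contrast; all values are illustrative.

```python
def lm_head(W_lm, h):
    """Shared final projection, applied to any layer's hidden state."""
    return [sum(w * x for w, x in zip(row, h)) for row in W_lm]

# Toy setup: L = 3 layers, hidden dim d = 2, vocabulary V = 3.
W_lm = [[1.0, 0.0],
        [0.0, 1.0],
        [0.5, 0.5]]
layer_hiddens = [[0.2, 0.9],   # early layer
                 [0.6, 0.4],   # middle layer
                 [1.0, 0.1]]   # final layer

# "Early-exit" logits: the same W_lm applied at every layer.
per_layer_logits = [lm_head(W_lm, h) for h in layer_hiddens]

# SLED-style combination, sketched as a fixed weighted average over layers
# (SLED derives its own weights; these are placeholders).
weights = [0.2, 0.3, 0.5]
V = len(W_lm)
avg_logits = [sum(w * z[i] for w, z in zip(weights, per_layer_logits))
              for i in range(V)]

# DoLa-style contrast, sketched as final-layer minus early-layer logits.
contrast = [a - b for a, b in zip(per_layer_logits[-1], per_layer_logits[0])]
```

Either combination yields a logit vector that then goes through the usual softmax-and-pick decoding step.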
Role in speculative decoding¶
Speculative decoding's verifier consumes logits the same way: the expert model runs one parallel forward pass over an N-token draft, emitting per-position logits, and token verification compares each drafter token against the expert's argmax or distribution at that position. The "probabilistic match" rule that speculative cascades introduce is a statement about logit-derived distributions, not raw token ids (Source: sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference).
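The per-position check can be sketched as a greedy-verification loop: accept drafter tokens while they match the expert's argmax at each position, and stop at the first mismatch. This is the argmax simplification, not the probabilistic-match rule; logits and tokens are illustrative.

```python
def argmax(z):
    return max(range(len(z)), key=z.__getitem__)

def verify_greedy(draft_tokens, expert_logits_per_pos):
    """Accept drafter tokens up to the first position where the expert's
    argmax disagrees; return (accepted_prefix, correction_or_None)."""
    accepted = []
    for tok, logits in zip(draft_tokens, expert_logits_per_pos):
        expert_tok = argmax(logits)
        if tok == expert_tok:
            accepted.append(tok)
        else:
            return accepted, expert_tok   # expert's token replaces the reject
    return accepted, None

# Illustrative: drafter proposed [3, 1, 2]; one expert logit row per position.
expert_logits = [[0.1, 0.0, 0.2, 0.9],   # argmax = 3 -> accept
                 [0.0, 0.8, 0.1, 0.3],   # argmax = 1 -> accept
                 [0.9, 0.1, 0.2, 0.0]]   # argmax = 0 -> reject drafted 2
accepted, correction = verify_greedy([3, 1, 2], expert_logits)
# accepted == [3, 1], correction == 0
```

A probabilistic-match verifier would instead compare the drafter's and expert's softmax distributions at each position, accepting stochastically, but the inputs are the same per-position logits.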
Seen in¶
- sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers — primary source. SLED operates directly on per-layer logits obtained by re-applying the final projection matrix across every layer; the post defines "early-exit logits" on the way to motivating the all-layer weighted average.
Related¶
- concepts/early-exit-logits — logits from intermediate transformer layers.
- concepts/llm-decoding-step — where logits become tokens.
- concepts/factuality-decoding — the category that operates on per-layer logits.
- systems/sled — weighted-average-across-layers decoder.
- concepts/speculative-decoding — verifier consumes expert logits per-position.
- concepts/token-verification — per-position accept/reject rule defined on logits.