CONCEPT

Logits (LLM / transformer)

Definition

Logits are the pre-softmax prediction scores a transformer LLM emits over the vocabulary at each generation step. For a vocabulary of size V, the logit vector z ∈ ℝ^V is produced by applying the model's LM-head (the final linear projection matrix W_lm: hidden_dim → V) to a layer's hidden state h:

z = W_lm · h        (pre-softmax, real-valued, any sign)
p = softmax(z)      (post-softmax, probability distribution over V)

The decoder then picks the next token from p — argmax for greedy decoding, sampling for stochastic decoding, or more elaborate rules (top-p, top-k, beam search) that trim and renormalise before sampling (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
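The full path from hidden state to chosen token can be sketched in a few lines of NumPy. The sizes, `W_lm`, and `h` below are toy placeholders (real models use e.g. hidden_dim ≈ 4096 and V ≈ 128k):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, V = 8, 16                    # toy sizes, placeholders only

W_lm = rng.normal(size=(V, hidden_dim))  # LM-head projection (random stand-in)
h = rng.normal(size=hidden_dim)          # hidden state at one position

z = W_lm @ h                             # logits: real-valued, any sign
p = np.exp(z - z.max())                  # softmax (subtract max for stability)
p /= p.sum()                             # probability distribution over V

greedy_token = int(np.argmax(p))         # greedy decoding
sampled_token = int(rng.choice(V, p=p))  # stochastic decoding
```

Because softmax is monotonic, `argmax(p)` always equals `argmax(z)`, so greedy decoding never actually needs the normalisation step.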

Why logits, not probabilities

Logits are the canonical intermediate representation for decoding because:

  • Linear composition. Logits combine additively: a weighted average of logits before softmax is not the same as a weighted average of probabilities, and logit-space arithmetic is what decoding papers specify. SLED's weighted-average-across-layers operation is a logit-space (or logit-derived-distribution-space) computation.
  • Numerical stability. Stable softmax implementations subtract max(z) before exponentiating, so a long tail of large-magnitude negative scores cannot overflow; working in log-space keeps the arithmetic well-conditioned.
  • Temperature scaling. softmax(z / T) is the canonical knob for sharpening (T < 1) or flattening (T > 1) the distribution — a transform defined on logits, not probabilities.
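The temperature knob above is a one-liner; a minimal sketch with a toy logit vector (the values are illustrative only):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtract max(z): shift cancels, avoids overflow
    return e / e.sum()

z = np.array([2.0, 1.0, 0.0, -1.0])

p_sharp = softmax(z / 0.5)       # T < 1: mass concentrates on the argmax
p_base  = softmax(z)             # T = 1: unscaled softmax
p_flat  = softmax(z / 2.0)       # T > 1: distribution flattens toward uniform
```

Note that dividing logits by T changes the distribution, whereas dividing probabilities by a constant and renormalising would change nothing; this is why temperature is defined on logits.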

Per-layer logits vs final-layer logits

Standard decoding reads logits off the final layer only. The transformer, however, produces a hidden state at every layer, and applying the same LM-head W_lm to each layer's hidden state yields per-layer logit vectors — the "early-exit logits" in factuality-decoding terminology. These intermediate logits contain information that the final layer sometimes overrides in favour of training-data-frequency-biased alternatives (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

SLED and DoLa both exploit this: SLED weight-averages across all per-layer logits; DoLa contrasts a pair of them.
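A toy sketch of the shared mechanics: apply one LM-head to every layer's hidden state, then weight-average the resulting logit vectors. The uniform layer weights here are placeholders, not SLED's actual weighting scheme, and all tensors are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, hidden_dim, V = 4, 8, 16                 # toy sizes

W_lm = rng.normal(size=(V, hidden_dim))            # one shared LM-head
hiddens = rng.normal(size=(n_layers, hidden_dim))  # hidden state per layer

per_layer_logits = hiddens @ W_lm.T                # (n_layers, V) early-exit logits
w = np.full(n_layers, 1.0 / n_layers)              # placeholder uniform weights
fused = w @ per_layer_logits                       # logit-space weighted average
```

With uniform weights the fused vector is just the layer-wise mean; SLED's contribution is in how the per-layer weights are chosen, which this sketch deliberately leaves out.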

Role in speculative decoding

Speculative decoding's verifier consumes logits the same way: the expert model runs one parallel forward pass over an N-token draft, emitting per-position logits; token verification compares each drafter token against the expert's argmax or distribution at that position. The "probabilistic match" rule that speculative cascades introduce is a statement about logit-derived distributions, not raw token ids (Source: sources/2025-09-11-google-speculative-cascades-hybrid-approach-llm-inference).
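A greedy-verification sketch under the description above (argmax matching only; the probabilistic acceptance rule from speculative cascades is not modelled here, and the logit values are made up for illustration):

```python
import numpy as np

def verify_greedy(draft_tokens, expert_logits):
    """Accept the longest draft prefix whose tokens match the
    expert's per-position argmax; return the count accepted."""
    accepted = 0
    for tok, z in zip(draft_tokens, expert_logits):
        if tok != int(np.argmax(z)):   # first mismatch stops acceptance
            break
        accepted += 1
    return accepted

# toy example: 3 draft positions, vocabulary of 5
expert_logits = np.array([
    [0.1, 2.0, 0.0, -1.0, 0.3],   # expert argmax = 1
    [1.5, 0.2, 0.1,  0.0, 0.4],   # expert argmax = 0
    [0.0, 0.1, 0.2,  3.0, 0.1],   # expert argmax = 3
])
n_ok = verify_greedy([1, 0, 2], expert_logits)   # third draft token mismatches
```

The expensive part is the single batched expert forward pass that produces `expert_logits`; the acceptance loop itself is trivial, which is the point of the scheme.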
