CONCEPT
LLM decoding step¶
Definition¶
The decoding step is the final phase of LLM text generation where the model transforms its internal representations into human-readable tokens. At each step the model:
- Runs a forward pass over the prefix (prompt + tokens generated so far) to produce per-layer hidden states.
- Applies the LM-head to the final layer's hidden state to get next-token logits over the vocabulary.
- Selects the next token via a decoding rule (argmax for greedy, sample-from-softmax for stochastic, top-p / top-k for truncated sampling, beam search for multi-hypothesis).
- Appends the token and loops.
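The four steps above can be sketched as a loop. This is a minimal illustration, not any production decoder: `model_logits` is a hypothetical stand-in for the forward pass plus LM-head, and the toy model at the bottom is made up for demonstration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(model_logits, prompt, max_new_tokens, strategy="greedy", k=2, seed=0):
    """Generic decode loop: forward pass -> logits -> decoding rule -> append."""
    tokens = list(prompt)
    rng = random.Random(seed)
    for _ in range(max_new_tokens):
        logits = model_logits(tokens)  # forward pass over the current prefix
        if strategy == "greedy":       # argmax
            nxt = max(range(len(logits)), key=logits.__getitem__)
        else:                          # top-k: trim to the k best, sample from softmax
            top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
            nxt = rng.choices(top, weights=softmax([logits[i] for i in top]))[0]
        tokens.append(nxt)             # append the chosen token and loop
    return tokens

# Toy "model": always prefers the token after the last one, over a 5-token vocab.
toy = lambda toks: [1.0 if t == (toks[-1] + 1) % 5 else 0.0 for t in range(5)]
```

Under greedy decoding, `decode(toy, [0], 3)` yields `[0, 1, 2, 3]`. Every intervention discussed below changes only how `nxt` is chosen from (or how cheaply `logits` is produced for) this loop.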
The decoding step is the architectural insertion point for serving-time interventions that change which token the model emits without changing its weights. It is one of the most active areas of LLM-serving research: both latency optimisations (speculative decoding, draft-verify) and factuality optimisations (SLED, DoLa) live here (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
Why this is a category of its own¶
The 2025-09-17 Google Research SLED post articulates the positioning directly:
"A potential target to mitigate hallucinations is the decoding process, which is the final step in LLM text generation. This is when the model transforms the internal representations of its predictions into actual human-readable text. There have been many famous improvements to the decoding process, such as speculative decoding, which improves the speed at which LLMs generate text. Similarly, it should be possible to employ an analogous method of 'factuality decoding'..." (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers)
This positions latency decoding and factuality decoding as sibling categories — same insertion point, different objectives:
| Category | Optimises | Wiki instances |
|---|---|---|
| Quality decoding | Fluency / diversity | top-p, top-k, temperature, beam search |
| Latency decoding | Throughput, wall-clock | concepts/speculative-decoding, systems/speculative-cascades |
| Factuality decoding | Factual correctness | systems/sled, systems/dola |
All three categories modify the decoding step and nothing else. None require retraining. All are composable with each other in principle (SLED's weighted-average over layers can be applied to a draft produced by a speculative drafter, for example).
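The composition point for the factuality side can be made concrete. The sketch below is schematic only: SLED derives its layer weights quite differently, and the logits and weights here are made-up values chosen so the effect is visible.

```python
def average_layer_logits(per_layer_logits, weights):
    """Combine early-exit logits from each layer into one logit vector
    (schematic; SLED's real weighting scheme is more involved)."""
    vocab = len(per_layer_logits[0])
    return [
        sum(w * layer[i] for w, layer in zip(weights, per_layer_logits))
        for i in range(vocab)
    ]

# Final layer alone would pick token 0; mixing in an earlier layer flips it to 1.
layers = [
    [0.0, 4.0, 0.0],  # an earlier layer's early-exit logits (hypothetical)
    [2.0, 1.0, 0.0],  # the final layer's logits (hypothetical)
]
mixed = average_layer_logits(layers, weights=[0.5, 0.5])
```

Because the output is just another logit vector, any decoding rule from the quality or latency rows can consume it unchanged, which is the composability claim above.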
Why it's load-bearing for serving-infra¶
The decoding step happens on every generated token. At production scale an LLM might emit millions of tokens per second across a fleet; any per-token overhead multiplies. This is why:
- Speculative decoding is worth engineering: amortising the expert's compute across N accepted tokens per parallel-verify pass is a multiplicative throughput win.
- SLED's ~4% overhead vs DoLa is worth calling out in the blog: decoding-step overhead is where per-token cost accumulates, and 4% of every generated token at fleet scale is real compute.
- The KV cache exists for exactly this phase: caching each prefix token's attention keys and values so each decoding step only computes the new token's forward pass (attention still reads the cached prefix) instead of re-running the whole prefix from scratch.
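The KV-cache point can be shown with a toy single-head attention step. This is a deliberately scalar sketch (keys, values, and queries are plain floats, not vectors) just to show what is cached versus recomputed; it is not a real attention implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

class KVCache:
    """Per-request cache of attention keys/values (schematic, single head,
    scalar 'vectors' for brevity)."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Only the NEW token's key/value are computed and appended;
        # the prefix's keys/values are read back from the cache.
        self.keys.append(k)
        self.values.append(v)
        scores = softmax([q * ki for ki in self.keys])
        return sum(s * vi for s, vi in zip(scores, self.values))
```

Each `step` touches the whole cache when scoring (attention stays linear in prefix length) but never re-runs earlier tokens' projections, which is the saving the bullet describes.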
What sits at the decoding step¶
- concepts/logits — the primitive the decoding rule consumes.
- concepts/early-exit-logits — the intermediate-layer logits some decoders pull in alongside the final-layer logits.
- concepts/token-verification — the per-position accept/reject primitive in speculative decoding.
- KV cache — the per-request memory structure the decoder reads from and writes to.
- systems/sled, systems/dola — factuality-decoding implementations.
- concepts/speculative-decoding, systems/speculative-cascades — latency-decoding implementations.
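The token-verification primitive listed above can be sketched in its simplest greedy form. This is an illustration under stated assumptions: real speculative decoding uses probabilistic rejection sampling against the target model's distribution, and `target_argmax` here is a hypothetical toy, not a model.

```python
def verify_greedy(draft_tokens, target_argmax, prefix):
    """Per-position accept/reject: keep drafted tokens while each matches the
    target model's argmax at the same position (greedy variant only; real
    speculative decoding accepts/rejects via rejection sampling)."""
    accepted = []
    ctx = list(prefix)
    for tok in draft_tokens:
        expected = target_argmax(ctx)  # in practice, one parallel verify pass
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Toy target model: always continues with (last token + 1) mod 5.
target = lambda ctx: (ctx[-1] + 1) % 5
```

A draft of `[1, 2, 9]` after prefix `[0]` yields `[1, 2, 3]`: two accepted tokens plus the target's correction, which is the amortisation win described above.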
Seen in¶
- sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers — names the decoding step explicitly ("the final step in LLM text generation") and uses it as the framing device to place factuality decoding alongside speculative decoding as siblings.