
LLM decoding step

Definition

The decoding step is the final phase of LLM text generation where the model transforms its internal representations into human-readable tokens. At each step the model:

  1. Runs a forward pass over the prefix (prompt + tokens generated so far) to produce per-layer hidden states.
  2. Applies the LM-head to the final layer's hidden state to get next-token logits over the vocabulary.
  3. Selects the next token via a decoding rule (argmax for greedy decoding, sampling from the softmax for stochastic decoding, top-p / top-k for truncated sampling, beam search for multi-hypothesis search).
  4. Appends the token and loops.
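The four steps above can be sketched as a minimal greedy-decoding loop. This is a toy illustration, not any particular model's API: `forward`, `VOCAB`, and `EOS` are hypothetical stand-ins (the "model" here is just deterministic noise keyed on the prefix).

```python
import numpy as np

VOCAB = 16  # toy vocabulary size (hypothetical)
EOS = 0     # hypothetical end-of-sequence token id

def forward(prefix):
    """Stand-in for steps 1-2: forward pass over the prefix plus the
    LM head, returning next-token logits over the vocabulary.
    Toy implementation: random logits, deterministic in the prefix."""
    seed = hash(tuple(prefix)) % (2**32)
    return np.random.default_rng(seed).standard_normal(VOCAB)

def decode(prompt, max_new_tokens=8):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = forward(tokens)           # steps 1-2: hidden states -> logits
        next_tok = int(np.argmax(logits))  # step 3: decoding rule (greedy argmax)
        tokens.append(next_tok)            # step 4: append and loop
        if next_tok == EOS:
            break
    return tokens

print(decode([3, 1, 4]))
```

Swapping the argmax in step 3 for sampling, truncated sampling, or beam search changes the decoding rule without touching the forward pass, which is the sense in which the decoding step is a clean insertion point.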

The decoding step is the architectural insertion point for serving-time interventions that change which token the model emits without changing its weights. It is one of the most active areas of LLM-serving research: both latency optimisations (speculative decoding, draft-verify) and factuality optimisations (SLED, DoLa) live here (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

Why this is a category of its own

The 2025-09-17 Google Research SLED post articulates the positioning directly:

"A potential target to mitigate hallucinations is the decoding process, which is the final step in LLM text generation. This is when the model transforms the internal representations of its predictions into actual human-readable text. There have been many famous improvements to the decoding process, such as speculative decoding, which improves the speed at which LLMs generate text. Similarly, it should be possible to employ an analogous method of 'factuality decoding'..." (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers)

This positions latency decoding and factuality decoding as sibling categories — same insertion point, different objectives:

| Category | Optimises | Wiki instances |
| --- | --- | --- |
| Quality decoding | Fluency / diversity | top-p, top-k, temperature, beam search |
| Latency decoding | Throughput, wall-clock | concepts/speculative-decoding, systems/speculative-cascades |
| Factuality decoding | Factual correctness | systems/sled, systems/dola |

All three categories modify the decoding step and nothing else. None require retraining. All are composable with each other in principle (SLED's weighted-average over layers can be applied to a draft produced by a speculative drafter, for example).
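A minimal sketch of why the categories compose: a sampling rule only ever sees a logits vector, so it does not care where that vector came from. The `sample_next` helper below is hypothetical (not from any library), implementing one quality-decoding rule (temperature plus optional top-k truncation) that could equally be applied to LM-head logits, a SLED-style layer-weighted average, or draft logits in a speculative scheme.

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    """One decoding-step rule: temperature scaling + optional top-k
    truncation. `logits` can come from any upstream source, which is
    what makes decoding-step interventions composable in principle."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        # mask out everything below the k-th largest logit
        cutoff = np.sort(z)[-top_k]
        z = np.where(z >= cutoff, z, -np.inf)
    p = np.exp(z - z.max())   # softmax, numerically stable
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```

With `top_k=1` this collapses to greedy argmax; with `top_k=None` and `temperature=1.0` it is plain ancestral sampling.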

Why it's load-bearing for serving-infra

The decoding step happens on every generated token. At production scale an LLM might emit millions of tokens per second across a fleet; any per-token overhead multiplies. This is why:

  • Speculative decoding is worth engineering: amortising the large target model's compute across N accepted tokens per parallel verification pass is a multiplicative throughput win.
  • SLED's ~4% overhead vs DoLa is worth calling out in the blog: decoding-step overhead is where per-token cost accumulates, and 4% of every generated token at fleet scale is real compute.
  • The KV cache exists for exactly this phase: caching each token's attention keys and values so a decoding step processes only the newly generated token, instead of re-running the forward pass over the entire prefix.
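The amortisation argument behind speculative decoding can be sketched with toy models. This is the greedy variant only (real speculative sampling uses a rejection-sampling acceptance rule), and `draft_forward` / `target_forward` are hypothetical stand-ins; in a real system the target scores all proposed positions in one parallel forward pass, where the loop below checks them sequentially for clarity.

```python
import numpy as np

def greedy(logits):
    return int(np.argmax(logits))

def speculative_step(prefix, draft_forward, target_forward, k=4):
    """One draft-verify pass (greedy variant): a cheap draft model
    proposes k tokens; the expensive target model checks them and the
    longest agreeing prefix is accepted. One target pass can thus yield
    up to k tokens, plus the target's correction on a mismatch."""
    # Draft phase: propose k tokens autoregressively with the cheap model.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = greedy(draft_forward(ctx))
        proposed.append(tok)
        ctx.append(tok)
    # Verify phase: accept while the target agrees; on the first
    # disagreement, emit the target's own token and stop.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        want = greedy(target_forward(ctx))
        if want != tok:
            accepted.append(want)
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted
```

When the draft agrees often, each target pass is amortised over several emitted tokens, which is the multiplicative throughput win referred to above.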

What sits at the decoding step
