
CONCEPT

Factuality decoding

Definition

Factuality decoding is a category of decode-time interventions that improve the factual accuracy of an LLM without retraining, fine-tuning, or retrieval. The intervention sits at the decoding step — the final phase of generation where the model's internal representations become tokens — and modifies which token the model picks rather than what weights produce the representations.

Factuality decoding exists as a named category alongside latency decoding (speculative decoding, drafter/expert serving) and quality decoding (top-p / top-k / temperature / beam search): same architectural insertion point, different optimisation objective. Where speculative decoding asks "how do I emit the same token faster?" factuality decoding asks "how do I emit a more factually accurate token with the same weights?" (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

Why it's a category

Three properties define it:

  1. Training-free. No new parameters. No fine-tuning. No gradient updates. The intervention is purely at inference time; weights are frozen.
  2. Retrieval-free. No external knowledge base, no retriever, no reranker. The intervention uses only information already present in the LLM. This distinguishes factuality decoding from retrieval-augmented generation (RAG), which adds a retrieval subsystem outside the model.
  3. Serving-time. The intervention is a change to the decoding function called during generation — a code-level modification of the serving loop, not of the training pipeline. Composable with other serving-time levers (sampling parameters, other decoders, speculative-decoding-style latency primitives) (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

The structural bet: the LLM already contains the factually correct signal; standard decoding throws it away. Factuality decoding recovers it.
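Because the intervention is purely a change to the decoding function, the insertion point can be sketched as a pluggable step in the serving loop. A minimal illustration (all names hypothetical; this is not any particular library's API): the model step yields one logits vector per layer, and only the function mapping those vectors to a token changes — weights stay frozen.

```python
# Hypothetical sketch of the decoding-step insertion point. The model is
# assumed to expose per-layer logits; only the decode function differs.

def standard_decode(layer_logits):
    # Standard decoding: only the final layer's logits drive the token.
    final = layer_logits[-1]
    return max(range(len(final)), key=final.__getitem__)

def factuality_decode(layer_logits, combine):
    # Factuality decoding: same inputs, same weights, but all layers are
    # combined (e.g. a DoLa-style contrast or a SLED-style average) first.
    combined = combine(layer_logits)
    return max(range(len(combined)), key=combined.__getitem__)

def generate(model_step, prompt, decode_fn, max_new=8):
    # model_step(tokens) -> list of per-layer logits vectors (hypothetical).
    tokens = list(prompt)
    for _ in range(max_new):
        layer_logits = model_step(tokens)
        tokens.append(decode_fn(layer_logits))
    return tokens
```

The composability claim falls out of this shape: `decode_fn` is just another serving-time lever next to sampling parameters.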

Canonical mechanism: all-layer logit evolution

The two-layer and all-layer factuality decoders — systems/dola and systems/sled respectively — share a substrate. Transformer LLMs emit logits at every layer; in standard decoding only the final layer's logits drive the next token. Factuality decoders instead project intermediate layers' hidden states through the same LM-head to get early-exit logits, then combine the per-layer distributions before decoding.

  • DoLa pairwise-contrasts a mature layer against a premature layer.
  • SLED takes a weighted average across all layers.

The weighted average generalises pairwise contrast: contrast picks two layers and subtracts, weighted-average picks all layers and sums. Both operate on the same lever (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
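The two combination rules can be sketched side by side. This is a simplified illustration, not the papers' full methods — in particular, real DoLa selects the premature layer dynamically and applies additional filtering, and SLED's layer weights are derived rather than supplied by hand.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a logits vector.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def dola_style_contrast(mature_logits, premature_logits):
    # Pairwise contrast: log p_mature - log p_premature (simplified DoLa).
    pm, pp = softmax(mature_logits), softmax(premature_logits)
    return [math.log(a) - math.log(b) for a, b in zip(pm, pp)]

def sled_style_average(layer_logits, weights):
    # Weighted average of per-layer distributions (simplified SLED).
    dists = [softmax(l) for l in layer_logits]
    vocab = len(layer_logits[0])
    return [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(vocab)]
```

Setting all weight on two layers (one positive, one negative, in log space) recovers the contrast from the average, which is the sense in which the weighted average generalises the pairwise scheme.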

Why the final layer can be wrong

The Google Research SLED post makes this explicit with two worked examples:

  • Popular-but-wrong named entity. "What is the capital of British Columbia?" → final layer prefers Vancouver (the larger city, high co-occurrence in training); intermediate layers prefer Victoria (the actual capital). Training-data frequency biases the final layer toward the salient answer even when the model has the factually correct alternative lower-weighted in intermediate layers (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
  • Pattern-completion overrides context. Word problem: "6 toys at 10 tokens each; 4+ toys → 10% off." Final layer continues "6 × 10 =" (the generic "A × B = C" training-data pattern); intermediate layers continue "6 × 10 ×" (preserving the discount context the problem specified). The final layer smooths toward the common completion; factuality decoding surfaces the context-faithful path (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
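The named-entity failure can be made concrete with invented numbers (these logits are illustrative only, not taken from the post): the final layer's popularity bias picks the bigger city, while averaging in the intermediate layers flips the argmax back to the actual capital.

```python
import math

# Toy 2-token vocabulary; logits are illustrative, not measured.
vocab = ["Victoria", "Vancouver"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

layer_logits = [
    [2.0, 0.0],   # early layer: prefers Victoria (the capital)
    [1.5, 0.5],   # middle layer: still prefers Victoria
    [0.0, 1.0],   # final layer: prefers Vancouver (popularity bias)
]

final = softmax(layer_logits[-1])
dists = [softmax(l) for l in layer_logits]
avg = [sum(d[i] for d in dists) / len(dists) for i in range(len(vocab))]

final_pick = vocab[max(range(len(vocab)), key=final.__getitem__)]
avg_pick = vocab[max(range(len(vocab)), key=avg.__getitem__)]
print(final_pick)  # Vancouver — standard decoding's answer
print(avg_pick)    # Victoria — the layer-averaged answer
```

The correct signal was already in the model's intermediate layers; uniform averaging is enough to recover it in this toy case.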

Tradeoffs

  • Wall-clock overhead. Projecting every intermediate layer through the LM-head costs extra compute per decoding step. SLED adds roughly 4% decoding-time overhead relative to DoLa, and somewhat more relative to the base model. Framed as "minimal" in the post but real for throughput-sensitive serving.
  • Memory overhead. Per-layer distributions are vocab_size-dimensional; storing L of them per generation step is non-trivial on large models with vocab_size ≥ 32k. The blog post doesn't quantify this.
  • Not a replacement for RAG. RAG remedies knowledge the model doesn't have; factuality decoding remedies knowledge the model has but doesn't emit. Orthogonal failure-mode class.
  • Open-weights only in the general case. Requires access to intermediate hidden states — unavailable on closed-weights API models.
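A back-of-envelope for the unquantified memory bullet, under assumed numbers (a 32-layer model, 32k vocabulary, fp32 distributions — none of these figures come from the post):

```python
# Assumed figures for illustration: 32 layers, 32k vocab, fp32.
num_layers = 32
vocab_size = 32_000
bytes_per_float = 4

# One vocab-sized distribution per layer, per decoding step, per sequence.
per_step_bytes = num_layers * vocab_size * bytes_per_float
print(per_step_bytes / 2**20)  # ≈ 3.9 MiB per step per sequence
```

Small next to a KV cache for a single sequence, but it scales with batch size and layer count, which is why it matters for throughput-oriented serving.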
