
CONCEPT

Factuality decoding

Definition

Factuality decoding is a category of decode-time interventions that improve the factual accuracy of an LLM without retraining, fine-tuning, or retrieval. The intervention sits at the decoding step — the final phase of generation where the model's internal representations become tokens — and modifies which token the model picks rather than what weights produce the representations.

Factuality decoding exists as a named category alongside latency decoding (speculative decoding, drafter/expert serving) and quality decoding (top-p / top-k / temperature / beam search): same architectural insertion point, different optimisation objective. Where speculative decoding asks "how do I emit the same token faster?" factuality decoding asks "how do I emit a more factually accurate token with the same weights?" (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

Why it's a category

Three properties define it:

  1. Training-free. No new parameters. No fine-tuning. No gradient updates. The intervention is purely at inference time; weights are frozen.
  2. Retrieval-free. No external knowledge base, no retriever, no reranker. The intervention uses only information already present in the LLM. This distinguishes factuality decoding from retrieval-augmented generation (RAG), which adds a retrieval subsystem outside the model.
  3. Serving-time. The intervention is a change to the decoding function called during generation — a code-level modification of the serving loop, not of the training pipeline. Composable with other serving-time levers (sampling parameters, other decoders, speculative-decoding-style latency primitives) (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

The structural bet: the LLM already contains the factually correct signal; standard decoding throws it away. Factuality decoding recovers it.
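Because the intervention is purely a change to the decoding function, the insertion point can be sketched as a pluggable step in the serving loop. A minimal illustration (all names hypothetical; this is not any particular library's API): the model step yields one logits vector per layer, and only the function mapping those vectors to a token changes — weights stay frozen.

```python
# Hypothetical sketch of the decoding-step insertion point. The model is
# assumed to expose per-layer logits; only the decode function differs.

def standard_decode(layer_logits):
    # Standard decoding: only the final layer's logits drive the token.
    final = layer_logits[-1]
    return max(range(len(final)), key=final.__getitem__)

def factuality_decode(layer_logits, combine):
    # Factuality decoding: same inputs, same weights, but all layers are
    # combined (e.g. a DoLa-style contrast or a SLED-style average) first.
    combined = combine(layer_logits)
    return max(range(len(combined)), key=combined.__getitem__)

def generate(model_step, prompt, decode_fn, max_new=8):
    # model_step(tokens) -> list of per-layer logits vectors (hypothetical).
    tokens = list(prompt)
    for _ in range(max_new):
        layer_logits = model_step(tokens)
        tokens.append(decode_fn(layer_logits))
    return tokens
```

The composability claim falls out of this shape: `decode_fn` is just another serving-time lever next to sampling parameters.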

Canonical mechanism: all-layer logit evolution

The two-layer and all-layer factuality decoders — systems/dola and systems/sled respectively — share a substrate. Transformer LLMs emit logits at every layer; in standard decoding only the final layer's logits drive the next token. Factuality decoders instead project intermediate layers' hidden states through the same LM-head to get early-exit logits, then combine the per-layer distributions before decoding.

  • DoLa pairwise-contrasts a mature layer against a premature layer.
  • SLED takes a weighted average across all layers.

The weighted average generalises pairwise contrast: contrast picks two layers and subtracts, weighted-average picks all layers and sums. Both operate on the same lever (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
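The two combination rules can be sketched side by side. This is a simplified illustration, not the papers' full methods — in particular, real DoLa selects the premature layer dynamically and applies additional filtering, and SLED's layer weights are derived rather than supplied by hand.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a logits vector.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def dola_style_contrast(mature_logits, premature_logits):
    # Pairwise contrast: log p_mature - log p_premature (simplified DoLa).
    pm, pp = softmax(mature_logits), softmax(premature_logits)
    return [math.log(a) - math.log(b) for a, b in zip(pm, pp)]

def sled_style_average(layer_logits, weights):
    # Weighted average of per-layer distributions (simplified SLED).
    dists = [softmax(l) for l in layer_logits]
    vocab = len(layer_logits[0])
    return [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(vocab)]
```

Setting all weight on two layers (one positive, one negative, in log space) recovers the contrast from the average, which is the sense in which the weighted average generalises the pairwise scheme.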

Why the final layer can be wrong

The Google Research SLED post makes this explicit with two worked examples:

  • Popular-but-wrong named entity. "What is the capital of British Columbia?" → final layer prefers Vancouver (the larger city, high co-occurrence in training); intermediate layers prefer Victoria (the actual capital). Training-data frequency biases the final layer toward the salient answer even when the model has the factually correct alternative lower-weighted in intermediate layers (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
  • Pattern-completion overrides context. Word problem: "6 toys at 10 tokens each; 4+ toys → 10% off." Final layer continues "6 × 10 =" (the generic "A × B = C" training-data pattern); intermediate layers continue "6 × 10 ×" (preserving the discount context the problem specified). The final layer smooths toward the common completion; factuality decoding surfaces the context-faithful path (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
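The named-entity failure can be made concrete with invented numbers (these logits are illustrative only, not taken from the post): the final layer's popularity bias picks the bigger city, while averaging in the intermediate layers flips the argmax back to the actual capital.

```python
import math

# Toy 2-token vocabulary; logits are illustrative, not measured.
vocab = ["Victoria", "Vancouver"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

layer_logits = [
    [2.0, 0.0],   # early layer: prefers Victoria (the capital)
    [1.5, 0.5],   # middle layer: still prefers Victoria
    [0.0, 1.0],   # final layer: prefers Vancouver (popularity bias)
]

final = softmax(layer_logits[-1])
dists = [softmax(l) for l in layer_logits]
avg = [sum(d[i] for d in dists) / len(dists) for i in range(len(vocab))]

final_pick = vocab[max(range(len(vocab)), key=final.__getitem__)]
avg_pick = vocab[max(range(len(vocab)), key=avg.__getitem__)]
print(final_pick)  # Vancouver — standard decoding's answer
print(avg_pick)    # Victoria — the layer-averaged answer
```

The correct signal was already in the model's intermediate layers; uniform averaging is enough to recover it in this toy case.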

Tradeoffs

  • Wall-clock overhead. Projecting every intermediate layer through the LM-head costs extra compute per decoding step. SLED adds roughly 4% decoding-time overhead relative to DoLa, and somewhat more relative to the base model. Framed as "minimal" in the post but real for throughput-sensitive serving.
  • Memory overhead. Per-layer distributions are vocab_size-dimensional; storing L of them per generation step is non-trivial on large models with vocab_size ≥ 32k. The blog post doesn't quantify this.
  • Not a replacement for RAG. RAG remedies knowledge the model doesn't have; factuality decoding remedies knowledge the model has but doesn't emit. Orthogonal failure-mode class.
  • Open-weights only in the general case. Requires access to intermediate hidden states — unavailable on closed-weights API models.
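A back-of-envelope for the unquantified memory bullet, under assumed numbers (a 32-layer model, 32k vocabulary, fp32 distributions — none of these figures come from the post):

```python
# Assumed figures for illustration: 32 layers, 32k vocab, fp32.
num_layers = 32
vocab_size = 32_000
bytes_per_float = 4

# One vocab-sized distribution per layer, per decoding step, per sequence.
per_step_bytes = num_layers * vocab_size * bytes_per_float
print(per_step_bytes / 2**20)  # ≈ 3.9 MiB per step per sequence
```

Small next to a KV cache for a single sequence, but it scales with batch size and layer count, which is why it matters for throughput-oriented serving.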
