CONCEPT Cited by 7 sources

LLM hallucination¶

Definition¶

LLM hallucination is the failure mode where a language model "confidently makes claims that are incorrect" — it generates output that is linguistically plausible and high-confidence under the model's own distribution but factually wrong relative to real-world knowledge (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

Factuality, by contrast, is the ability to generate content consistent with real-world knowledge. Hallucination is a failure of factuality: the model's outputs are internally coherent but externally wrong.

Root causes (per the SLED post's framing)¶

The 2025-09-17 Google Research SLED blog post enumerates the standard causes:

Incomplete, inaccurate, or biased training data — the model absorbs whatever the corpus contains; gaps in the corpus become gaps in its knowledge that the model fills with plausible confabulation rather than refusing.
Overfitting or underfitting — either extreme produces fabrication: overfitting memorises noise, underfitting smooths away relevant context.
Lack of real-world experience — the model has no grounding outside its corpus; predictions about the physical world, recent events, or unstated context can only be inferred from text.
Ambiguous questions — the model resolves ambiguity toward the most frequent interpretation in training data, which may not be the intended one (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

The SLED post adds a fifth, structural cause specific to how decoding works: training-data frequency biases the final layer of the transformer toward "popular" completions even when the model has the factually correct alternative somewhere in its stack. This is the failure mode factuality decoding targets directly: the correct answer is present in intermediate early-exit logits, but the final-layer-only decoding rule throws it away (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

Canonical worked examples (SLED post)¶

Popular-but-wrong named entity. "What is the capital of British Columbia?" The model answers Vancouver (a bigger, better-known BC city) rather than Victoria (the actual capital). The model has seen "British Columbia" and "Vancouver" together orders of magnitude more often than "British Columbia" and "Victoria" in training text; the final layer's distribution reflects that frequency (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
Teleportation hallucination. "What happens if you step into a lit fireplace and state a location?" The unwanted hallucinated answer: "This action could be interpreted as a form of teleportation magic, where stating a location while stepping into the fire would magically transport you to that place." The model resolves the ambiguity fiction-ward despite no fictional framing in the prompt. The wanted answer: "You will be injured" or "You may suffer from severe burns" (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
Pattern-completion hallucination. Word problem with a discount: "6 toys × 10 tokens each, 10% off if ≥4 toys." The model continues "6 × 10 = 60" following the generic "A × B = C" pattern from training data, dropping the discount the problem specified. Contextual correctness loses to pattern frequency (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
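The arithmetic in the discount example is small enough to check directly. The following is a minimal sketch, assuming the 10% discount applies to the whole order once the quantity threshold is met (the example as quoted does not spell this out). The function and parameter names are illustrative and do not come from the SLED post.

```python
def order_total(quantity: int, unit_price: int,
                discount_pct: int = 10, min_quantity: int = 4) -> int:
    """Total in tokens, applying the percentage discount once the threshold is met."""
    subtotal = quantity * unit_price  # the step the model stops at: 6 * 10 = 60
    if quantity >= min_quantity:
        # Integer arithmetic keeps the result exact for whole-token prices.
        return subtotal * (100 - discount_pct) // 100
    return subtotal


pattern_completed = 6 * 10            # what the pattern-following continuation reports
context_correct = order_total(6, 10)  # what the problem statement actually asks for
print(pattern_completed, context_correct)  # 60 54
```

The gap between 60 and 54 is the whole hallucination: the generic "A × B = C" continuation is valid arithmetic, but it answers a simpler problem than the one posed.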

Remediation classes¶

The SLED post names three categories of hallucination remediation and positions factuality decoding as the cheapest of them:

Retrieval-augmented generation (RAG). Retrieve relevant documents from an external knowledge base at inference time and condition the model on them. "Requires a more complicated system to identify and retrieve relevant data, and even then, LLMs may still hallucinate."
Fine-tuning. Update the model's weights on curated correct-answer data. Requires labelled data, compute, and a retraining pipeline.
Factuality decoding. Modify the decoding function only; no new data, no new weights. SLED and DoLa are the named instances. Pitched as "no external knowledge base or data fine-tuning" — low integration cost, measurable factuality gain (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

These are not mutually exclusive. Factuality decoding composes with RAG (RAG retrieves external facts, factuality decoding surfaces the model's internal ones) and with fine-tuning (tune for a domain, then decode with SLED on top). The SLED post names composability with other factuality decoders explicitly; RAG + factuality-decoder stacks are implicit.

Why factuality decoding exists as a category¶

The structural premise behind factuality decoding is that the LLM already contains the correct answer in its intermediate-layer representations; the final-layer-only decoding rule is what throws that answer away. If the premise holds, then fixing hallucination doesn't require new knowledge at all — it requires a different rule for reading the model's existing state (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).

This is not a universal remedy — hallucinations driven by corpus gaps (the model genuinely doesn't know) can't be fixed by different decoding. But for the "popular but wrong" failure mode — where the correct signal is present but drowned out by frequency bias — factuality decoding is measurably effective (up to +16 percentage points vs base in SLED's reported case).
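The premise can be illustrated with a toy example. The sketch below is not SLED's or DoLa's decoding rule: it swaps in a plain average of per-layer probabilities purely to show how a signal that is present in earlier layers can be lost when only the final layer is read. The logit values are invented for illustration and are not measurements from any model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


CANDIDATES = ["Vancouver", "Victoria"]

# Hypothetical early-exit logits, one row per layer (earliest first, final last).
LAYER_LOGITS = [
    [1.0, 2.5],
    [1.2, 2.4],
    [1.5, 2.0],
    [2.6, 2.0],  # final layer: frequency bias tips it toward Vancouver
]


def final_layer_choice(layer_logits):
    """Standard rule: read only the last layer."""
    final = layer_logits[-1]
    return CANDIDATES[final.index(max(final))]


def all_layer_choice(layer_logits):
    """Illustrative alternative: average the per-layer distributions, then pick."""
    per_layer = [softmax(row) for row in layer_logits]
    mean = [sum(column) / len(per_layer) for column in zip(*per_layer)]
    return CANDIDATES[mean.index(max(mean))]


print(final_layer_choice(LAYER_LOGITS))  # Vancouver
print(all_layer_choice(LAYER_LOGITS))    # Victoria
```

With these made-up numbers, three of the four layers prefer Victoria, so any rule that consults them recovers it. If no layer prefers the correct answer (the corpus-gap case), no such rule can help, which is the limitation the paragraph above describes.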

Seen in¶

sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers — the SLED post defines hallucination, enumerates causes, names the three remediation categories, and positions SLED as the factuality-decoding instance that dominates DoLa.
sources/2026-01-13-redpanda-the-convergence-of-ai-and-data-streaming-part-1-the-coming-brick-walls — Corless 2026-01 Redpanda post uses the "d20 test" image-generation failure mode as a running example of frontier-AI hallucination at the "if we can't get it to draw a realistic d20, why would we spend a million dollars a year on it?" altitude. Explicitly names hallucination as orthogonal to scaling: "it doesn't fix fundamental issues with transformer models like hallucinations" — more data + more parameters do not close the hallucination gap. Companion to concepts/llm-model-drift — snowflakes-melt-over-time is the temporal failure mode sibling of hallucination's within-a-single-output failure mode.
sources/2024-06-19-slack-ai-powered-conversion-from-enzyme-to-react-testing-library — Slack's 2024-06 retrospective frames hallucination as the central failure mode of pure-prompt LLM code conversion at scale: Claude 2.1 prompted to convert Enzyme tests to React Testing Library produced 40-60% success with wild variance by task complexity. Canonical datum that prompt engineering alone cannot mitigate hallucination for deterministic code-transformation tasks — "our attempts to refine prompts had limited success… possibly perplexing the AI model rather than aiding it". The structural fix is AST + LLM hybrid conversion + in-code annotations + runtime-context injection; the composed pipeline lifted Slack's conversion quality to ~80% on evaluation files. This instantiates concepts/llm-conversion-hallucination-control as a distinct sub-class of hallucination where correctness is binary (tests pass or they don't) rather than graded, as factuality is.
sources/2026-04-13-slack-managing-context-in-long-run-agentic-applications — Slack's 2026-04 second post on the Spear multi-agent security-investigation service canonicalises multi-agent hallucination control via two stacked filters: (1) per-finding credibility scoring against a 5-level rubric (disclosed distribution over 170,000 findings: 25.8% sub-plausibility rate — roughly one in four Expert findings fails the plausibility threshold), (2) narrative-coherence scoring that prunes findings inconsistent with the broader story. Canonical load-bearing claim: "A hallucination can only survive this process if it is more coherent with the body of evidence than any real observation it competes with." Also canonicalises three stacked mitigations against Critic self-hallucination: stronger model tier (+ bounded token scope), narrow instructions ("only make a judgement on the submitted findings"), and the downstream Timeline task as coherence check. New structural framing: methodology audit via tool-call introspection (patterns/critic-tool-call-introspection-suite) as a hallucination-control primitive — not just auditing what the Expert claimed but how it got there.
sources/2026-01-08-vercel-how-we-made-v0-an-effective-coding-agent — Vercel's 2026-01 v0 retrospective canonicalises two new structural-drift hallucination sub-classes with named mechanism fixes: (1) training-cutoff dynamism gap — the gap between frozen parametric knowledge and a frequently-releasing library; paradigmatic example is the Vercel AI SDK shipping major/minor releases that the model's training cutoff can't track; fix is dynamic knowledge-injection prompt + co-maintained curated example fs. (2) LLM icon hallucination — the specialised failure mode where the LLM references symbols from churning library namespaces (weekly updates for systems/lucide-react) that "no longer exist or never existed"; fix is embedding-based name resolution inside streaming output rewrite (LLM Suspense), <100 ms per substitution with no further model calls, Triangle-as-VercelLogo worked example. Also canonicalises the web-search telephone game as a named failure mode of RAG-via-web-search where "a smaller model summarize[s] the results of web search, which in turn becomes a bad game of telephone between the small model and parent model" that "may hallucinate, misquote something, or omit important information." Overall quantified aggregate: ~10 % baseline code-generation error rate reduced by a "double-digit" percentage-point margin via the composite pipeline (patterns/composite-model-pipeline). Distinct from the SLED factuality-decoding remediation axis and from the Slack hybrid-pipeline / multi-agent-critic axes: here the root cause is parametric-knowledge staleness (not popular-but-wrong frequency bias and not per-step plausibility), and the mitigation is out-of-band knowledge injection + deterministic post-hoc rewrite (not alternative decoding and not scoring rubrics).
sources/2025-09-24-zalando-dead-ends-or-data-goldmines-ai-powered-postmortem-analysis — two new structural sub-classes at SRE-corpus altitude, from Zalando's 2025-09-24 datastore-team postmortem-analysis-pipeline post. (1) Lost-in-the-middle effect — the large-context failure mode: "details in the middle of long inputs are often overlooked or distorted" — observed empirically on NotebookLM over thousands of postmortems, producing "severe hallucinations and loss of the incident context." Motivated the shift to a multi-stage pipeline over a single large-context prompt. (2) Surface attribution error — the causal-reasoning failure mode: "the model makes a bias to prominent keywords staging on the surface-level instead of reasoning through context to identify the actual causal factor… it could offer a well-structured and authoritative explanation regarding the contribution of AWS S3 to an incident, even if 'S3' is merely mentioned without being causally linked." Quantified at ~10% on Claude Sonnet 4 — canonical wiki datum that attribution error is a structural reasoning limitation rather than a scale problem; it does not disappear with frontier capability, it has to be engineered around. Rate evolution across model tiers, also disclosed: small open-source 3B–12B at up to 40% hallucination → prompt-hardened + curated to < 15% → Claude Sonnet 4 "negligible", with the ~10% surface-attribution tail as the residual. Mitigation stack: patterns/negative-example-prompting for the Classification stage, TELeR-maximal refuse-on-ambiguity prompts (concepts/teler-prompt-framework) for the Summarization stage, per-stage human curation of the 3–5-sentence digests ("the pivotal role of digests allowed humans to observe all incidents as a whole and precisely validate and curate the reports produced by LLMs"), and proofreading of the Patterns-stage one-pager output even at 10–20% sampling maturity. Fine-tuning is named as the unshipped roadmap specifically for Zalando-internal technology surface-attribution (named example: Skipper, where base-model public-corpus training produces "unacceptable" attributions). Canonical wiki production instance of hallucination control via multi-stage pipeline + human curation at an SRE-corpus, rather than code-transformation or multi-agent-critic, altitude.
sources/2026-05-27-yelp-beyond-the-menu-tree-how-yelp-built-a-smarter-customer-success-chatbot — new sub-class at customer-support-RAG altitude: LLM hyperlink hallucination. Yelp identifies URL fabrication as "one of the most notable unexpected challenges" in their LLM-Assisted CS Chatbot — "the tendency of Large Language Models to hallucinate hyperlinks frequently. Since our knowledge base articles contain numerous hyperlinks, and we intended for the LLM-generated responses to include accurate links, this required a dedicated solution." Mitigation is structural, not prompt-engineering: extract URLs from retrieved-context articles into a per-response allowlist, validate every URL in LLM output against the allowlist, strip/reject anything else (patterns/hyperlink-allowlist-validation-on-llm-output). "This verification process ensures that any link included in the final response genuinely originates from one of the retrieved Support Center articles and is not invented by the LLM." The pattern is one of three output-validation gate axes (trust & safety / valid URL / character limit) that run after LLM generation and before delivery to the user. Binary-correctness symbolic-space hallucination sibling to Vercel v0's icon hallucination — both have a known-finite valid set, and both are fixed by deterministic post-hoc rewrite rather than prompt engineering. The mechanisms differ: Vercel uses embedding similarity (icons live in a meaningful semantic space); Yelp uses exact allowlist match (URL strings have no useful similarity space — /help/refund ≠ /help/refunds is binary). No quantitative residual rate is disclosed; the "frequently" qualitative framing is the only signal of pre-mitigation incidence rate.
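The allowlist check described in the Yelp entry above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Yelp's implementation: the URL regex, the function names, the `[link removed]` placeholder, and the example.com URLs are all hypothetical. The source states only that URLs are extracted from retrieved articles into a per-response allowlist and that anything else is stripped or rejected.

```python
import re

# Simple URL matcher; an assumption for this sketch, not a full URL grammar.
URL_PATTERN = re.compile(r"https?://[^\s<>)\]]+")
TRAILING_PUNCTUATION = ".,;:!?"


def extract_urls(text):
    """Return the set of URLs in text, without sentence-final punctuation."""
    return {url.rstrip(TRAILING_PUNCTUATION) for url in URL_PATTERN.findall(text)}


def build_allowlist(retrieved_articles):
    """Per-response allowlist: every URL that appears in the retrieved context."""
    allowlist = set()
    for article in retrieved_articles:
        allowlist |= extract_urls(article)
    return allowlist


def strip_unlisted_urls(response, allowlist, placeholder="[link removed]"):
    """Keep URLs that match the allowlist exactly; replace every other URL."""
    def check(match):
        raw = match.group(0)
        url = raw.rstrip(TRAILING_PUNCTUATION)
        trailing = raw[len(url):]  # punctuation that belongs to the sentence
        return (url if url in allowlist else placeholder) + trailing
    return URL_PATTERN.sub(check, response)


articles = ["Refund policy: see https://example.com/help/refund for details."]
draft = "Visit https://example.com/help/refunds or https://example.com/help/refund."
print(strip_unlisted_urls(draft, build_allowlist(articles)))
# Visit [link removed] or https://example.com/help/refund.
```

The match is exact string equality on purpose: as the entry notes, `/help/refund` and `/help/refunds` are simply different URLs, so a similarity-based lookup of the kind Vercel uses for icon names would be the wrong tool here.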

concepts/factuality-decoding — the remediation category this failure mode motivates.
concepts/llm-conversion-hallucination-control — the code-transformation-specific sub-class of the problem.
concepts/credibility-scoring-rubric — the multi-agent-critic-scoring primitive for findings.
concepts/narrative-coherence-as-hallucination-filter — the multi-agent-critic-scoring primitive for stories.
concepts/weakly-adversarial-critic — the critic stance that makes these primitives effective.
patterns/ast-plus-llm-hybrid-conversion — the hybrid-pipeline mitigation pattern.
patterns/in-code-annotation-as-llm-guidance — the attention-shaping mechanism inside the hybrid pattern.
patterns/critic-tool-call-introspection-suite — the methodology-audit-tool primitive.
patterns/three-channel-context-architecture — the multi-agent-context primitive.
patterns/timeline-assembly-from-scored-findings — the narrative-coherence-filter primitive.
systems/sled — the SOTA (NeurIPS 2024) factuality decoder.
systems/dola — the prior-SOTA factuality decoder.
systems/slack-spear — canonical multi-agent hallucination-control reference system.
concepts/early-exit-logits — the signal factuality decoders use to dodge the "popular but wrong" failure mode.
concepts/logits — the primitive.
concepts/llm-decoding-step — the remediation insertion point.
concepts/llm-icon-hallucination — the structural-drift sub-class for churning library namespaces.
concepts/llm-hyperlink-hallucination — the URL-fabrication sub-class (Yelp CS Chatbot, 2026-05-27); binary-correctness symbolic-space hallucination mitigated by deterministic per-response-allowlist validation (patterns/hyperlink-allowlist-validation-on-llm-output), not prompt engineering. Sibling on the wiki to icon hallucination (also binary-correctness, fixed by embedding-based name resolution).
concepts/training-cutoff-dynamism-gap — the parametric-knowledge-staleness driver.
concepts/web-search-telephone-game — the summariser-mediated RAG failure mode.
concepts/llm-code-generation-error-rate — the aggregate metric.
patterns/composite-model-pipeline — the pipeline-wide mitigation architecture.
patterns/streaming-output-rewrite — token-stream rewrite layer (LLM Suspense).
patterns/embedding-based-name-resolution — the symbol-space-hallucination fix.
patterns/deterministic-plus-model-autofixer — the post-stream AST-based remediation layer.
systems/vercel-v0 — canonical composite-pipeline reference system.
concepts/surface-attribution-error — the causal-reasoning sub-class where the model attributes to entities whose names appear without being causally involved (Zalando, ~10% on Claude Sonnet 4).
concepts/lost-in-the-middle-effect — the large-context failure mode motivating multi-stage pipeline architectures.
concepts/map-fold-llm-pipeline — the functional-composition primitive used to avoid lost-in-the-middle.
concepts/teler-prompt-framework — the structured-prompting framework used at the refuse-on-ambiguity gate.
patterns/multi-stage-llm-pipeline-over-large-context — the architectural pattern motivated by lost-in-the-middle.
patterns/negative-example-prompting — the prompt-hardening technique targeted at surface-attribution errors.
systems/zalando-postmortem-analysis-pipeline — canonical production instance at SRE-corpus altitude.
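The map-fold shape referenced in the pipeline entries above can be sketched generically. This is an assumption-laden illustration, not Zalando's pipeline: `map_fold`, `first_sentence`, and `append_line` are hypothetical names, the two helper functions stand in for what would be LLM calls in practice, and the sample postmortem strings are invented.

```python
from functools import reduce
from typing import Callable, Iterable


def map_fold(
    documents: Iterable[str],
    digest: Callable[[str], str],
    merge: Callable[[str, str], str],
    initial: str = "",
) -> str:
    """Digest each document on its own (map), then combine the digests (fold)."""
    # Map: each call sees one document, so nothing sits in the middle of a long input.
    digests = [digest(doc) for doc in documents]
    # Fold: combine the short digests one at a time into a single report.
    return reduce(merge, digests, initial)


# Stand-ins so the sketch runs without a model.
def first_sentence(doc: str) -> str:
    return doc.split(".")[0].strip() + "."


def append_line(report: str, item: str) -> str:
    return f"{report}\n- {item}" if report else f"- {item}"


postmortems = [
    "Cache eviction storm caused timeouts. Details follow.",
    "Expired certificate blocked deploys. Details follow.",
]
report = map_fold(postmortems, first_sentence, append_line)
print(report)
```

Keeping the digests as an explicit intermediate list is also what makes a human-curation step possible between the two stages, since each digest can be reviewed before it is folded into the report.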