CONCEPT Cited by 1 source

Source-of-truth disambiguation¶

Source-of-truth disambiguation is the data-agent reasoning capability of determining the most authoritative information when multiple sources (table metadata, company documents, internal messages, dashboards) are "often outdated, contradictory, or superseded". Named in the 2026-05-08 Databricks post on Genie as the second of the three unique challenges of data agents.

The verbatim framing: "Answering business questions needs deep, specific knowledge drawn from many sources (e.g., table metadata, company documents, internal messages) that are often outdated, contradictory, or superseded, forcing the agent to determine the most authoritative information."

Why this is uniquely a data-agent problem¶

A coding agent largely avoids this problem. The latest commit on the canonical branch is the source of truth by definition. A README comment from 2019 that contradicts the current code does not create a real conflict: the code wins.

For a data agent answering business questions, multiple sources have overlapping authority:

Source type	Authority	Common failure mode
Production tables	Operationally authoritative	May not reflect business semantics correctly
Dashboards	Stakeholder-facing canonical view	Two dashboards can encode contradictory definitions (concepts/measure-proliferation)
Wiki documents	Stated business definitions	Often stale; superseded by operational changes
Slack threads	Recent decisions / overrides	Ephemeral; not authoritative on their own
Notebook code	Working analyst logic	One analyst's view, not necessarily authoritative
Governance metadata (Unity Catalog)	Curated authority signals	Only as good as governance discipline

A correct answer to a business question requires ranking among these when they conflict.

Failure modes if the agent doesn't disambiguate¶

Confident wrong answer — agent picks the first plausible source and reports it as fact, ignoring the contradiction.
Garbage-in garbage-out — agent dutifully answers from a stale source, reproducing the staleness.
Inconsistency across sessions — same question gets different answers depending on which source the agent retrieved that time.
Compounding errors — wrong source feeds wrong intermediate result feeds wrong final conclusion.

Disambiguation signals¶

Not disclosed exhaustively in the source post, but plausible signals the agent can use:

Signal	What it tells the agent
Recency	Newer over older when both purport to be canonical
Tier / governance label	Production-tier table over experimental
Lineage authority	Upstream of a dashboard ranks higher than the dashboard itself
Owner / steward	Documents owned by data-platform team rank higher than ad-hoc
Cross-corroboration	Multiple sources agreeing rank higher than one
Explicit override markers	"This document supersedes X" signals explicit ordering
Usage frequency	Heavily-queried tables more likely canonical
Data-completeness	Sources with fuller coverage rank higher for full questions
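
To make the table concrete, here is a minimal sketch of how a few of these signals might be combined into a single ranking. Everything in it is an assumption for illustration: the `Candidate` fields, the weights, and the `authority_score` and `rank` functions are hypothetical and are not disclosed in the source post.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    """Hypothetical description of one candidate source."""
    name: str
    last_updated: date
    production_tier: bool        # tier / governance label
    platform_owned: bool         # owner / steward signal
    corroborating_sources: int   # how many other sources agree with it
    supersedes: frozenset = frozenset()  # names it explicitly overrides


# Illustrative weights only; a real system would have to tune or learn these.
WEIGHTS = {"recency": 1.0, "tier": 2.0, "owner": 1.5, "corroboration": 0.5}


def authority_score(c: Candidate, today: date) -> float:
    age_years = (today - c.last_updated).days / 365.0
    recency = 1.0 / (1.0 + age_years)  # newer sources score closer to 1
    return (
        WEIGHTS["recency"] * recency
        + WEIGHTS["tier"] * c.production_tier
        + WEIGHTS["owner"] * c.platform_owned
        # Cap corroboration so sheer volume cannot outweigh governance.
        + WEIGHTS["corroboration"] * min(c.corroborating_sources, 3)
    )


def rank(candidates: list[Candidate], today: date) -> list[Candidate]:
    # Explicit override markers win outright: drop anything superseded.
    superseded = set().union(*(c.supersedes for c in candidates))
    live = [c for c in candidates if c.name not in superseded]
    # Sort by score, then by name, so ties resolve identically on every run.
    return sorted(live, key=lambda c: (-authority_score(c, today), c.name))
```

The deterministic tie-break on name is a deliberate choice in this sketch: it addresses the "inconsistency across sessions" failure mode by ensuring the same candidate set always yields the same ordering.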

Composes with semantic context¶

concepts/specialized-knowledge-search surfaces candidate sources; source-of-truth disambiguation ranks among them. The two are complementary stages of the agent's discovery pipeline:

Query
  │
  ▼
Specialised knowledge search
  │  (returns candidate set)
  ▼
Source-of-truth disambiguation
  │  (ranks / picks authoritative subset)
  ▼
Ground reasoning in the chosen sources

If specialised search returns 50 plausibly relevant tables, source-of-truth disambiguation must pick the 1–3 that are actually authoritative.
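
The two-stage shape can be sketched in a few lines. This is a toy under stated assumptions: the `Source` dictionary shape, the keyword-overlap search, and the precomputed `authority` field are all hypothetical stand-ins, not Genie's actual retrieval or ranking.

```python
Source = dict  # hypothetical shape: {"name": str, "text": str, "authority": float}


def knowledge_search(query: str, corpus: list[Source]) -> list[Source]:
    """Stage 1 (toy): return every source sharing a word with the query."""
    terms = set(query.lower().split())
    return [s for s in corpus if terms & set(s["text"].lower().split())]


def disambiguate(candidates: list[Source], k: int = 3) -> list[Source]:
    """Stage 2: keep the k candidates with the highest authority."""
    ranked = sorted(candidates, key=lambda s: (-s["authority"], s["name"]))
    return ranked[:k]


def grounded_sources(query: str, corpus: list[Source], k: int = 3) -> list[Source]:
    """Run both stages; the result is what the agent would ground reasoning in."""
    return disambiguate(knowledge_search(query, corpus), k)
```

Keeping the stages separate means recall (finding every plausible source) and precision (choosing the authoritative few) can be evaluated and improved independently.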

Composes with self-correction¶

When source-of-truth disambiguation is wrong (the agent committed to a source that turns out to be non-authoritative), self-correction lets the agent revise the choice mid-trajectory. The two compose:

Disambiguation makes the initial authoritative-source choice.
Self-correction revises when intermediate evidence contradicts.

Without self-correction, a wrong initial source-of-truth choice would cascade into the rest of the reasoning. With it, the agent can backtrack.
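
A minimal sketch of that backtracking, assuming the agent already holds an authority-ordered list of sources. The `compute` and `is_consistent` callbacks are hypothetical placeholders for "run the analysis on this source" and "does intermediate evidence contradict the result".

```python
from typing import Callable, Optional


def answer_with_backtracking(
    ranked_sources: list[str],
    compute: Callable[[str], float],
    is_consistent: Callable[[str, float], bool],
) -> tuple[Optional[str], Optional[float], list[str]]:
    """Try sources in authority order; abandon one when its result fails a check.

    Returns (chosen source, result, list of rejected sources).
    """
    rejected: list[str] = []
    for source in ranked_sources:
        result = compute(source)
        if is_consistent(source, result):
            return source, result, rejected
        # Self-correction: back off this source and try the next candidate.
        rejected.append(source)
    # Every candidate failed its check; nothing trustworthy to report.
    return None, None, rejected
```

Returning the rejected list alongside the answer lets the agent explain which sources it tried and discarded, instead of silently switching.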

Composes with upstream governance discipline¶

Source-of-truth disambiguation is only as good as the upstream data layer's governance discipline allows. The Trinity Industries case study (Source: sources/2026-04-29-databricks-companies-winning-with-ai-built-the-data-layer-first) illustrates this empirically:

Pre-migration state: 600 conflicting measure variants — no canonical source-of-truth.
Post-migration state: consolidated canonical measure layer — Genie can disambiguate.
Architectural insight: "Genie cannot disambiguate 600 conflicting measure variants" — i.e., the agent's disambiguation reasoning is bounded by the existence of an authoritative source somewhere in the workspace.

This is the load-bearing reason that upstream measure-consolidation / patterns/transform-upstream-to-collapse-measures / Medallion discipline is a prerequisite for Genie effectiveness.
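
How badly a workspace has forked can be estimated before any agent is involved. The sketch below groups measure definitions by name and reports those with more than one distinct expression. The `(measure_name, expression)` input format and the `conflicting_measures` helper are assumptions for illustration; the source does not describe how Trinity Industries counted its variants.

```python
from collections import defaultdict


def conflicting_measures(definitions: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Map each measure name to its distinct definitions, keeping only conflicts.

    `definitions` is a hypothetical list of (measure_name, expression) pairs,
    for example harvested from dashboards and notebooks.
    """
    variants: dict[str, set[str]] = defaultdict(set)
    for name, expr in definitions:
        # Crude normalisation (case and whitespace only); real SQL needs a parser.
        variants[name.strip().lower()].add(" ".join(expr.lower().split()))
    return {name: exprs for name, exprs in variants.items() if len(exprs) > 1}
```

A non-empty result signals that consolidation work remains upstream; no amount of agent-side ranking substitutes for it.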

When this fits / doesn't¶

Fits (agent disambiguation can succeed):

Workspace has at least one authoritative source per business concept; the agent's job is to find it among the noise.
Governance metadata (Unity Catalog tier, ownership, lineage) provides ranking signals.
Recency / authority signals are reliable.

Doesn't fit / breaks down:

No authoritative source exists — every variant is a fork. Disambiguation cannot manufacture authority. Upstream consolidation required.
Authority signals contradict: the most recent source is owned by a junior engineer, while the most authoritative owner holds the stale doc. The agent must surface the ambiguity rather than pick wrongly.
No metadata — workspace is a data swamp; agent must guess.

Surfacing irreconcilable ambiguity¶

When sources cannot be disambiguated (truly contradictory authoritative signals), the agent's correct response is to surface the ambiguity to the user, not to pick arbitrarily. This connects back to the unanswerability sub-property: some queries should be answered with "the data is inconsistent on this point; here's the discrepancy" rather than a confident wrong answer.
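
One way to encode "surface rather than pick" is to refuse to name a winner unless it clearly beats the runner-up. This is a sketch under assumptions: the per-source authority scores, the `margin` threshold, and the `Resolution` type are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resolution:
    status: str          # "resolved" or "ambiguous"
    sources: list[str]   # the winner, or every source in contention
    message: str


def resolve(scores: dict[str, float], margin: float = 0.1) -> Resolution:
    """Pick a winner only when it clearly beats the runner-up.

    `scores` maps source name to an authority score (higher is better).
    `margin` is an assumed threshold below which two sources count as tied.
    """
    if not scores:
        return Resolution("ambiguous", [], "No candidate sources were found.")
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    best_name, best_score = ranked[0]
    # Everything within `margin` of the best score is still in contention.
    contenders = [name for name, score in ranked if best_score - score < margin]
    if len(contenders) == 1:
        return Resolution("resolved", contenders, f"Using {best_name}.")
    return Resolution(
        "ambiguous",
        contenders,
        "The data is inconsistent on this point; sources in conflict: "
        + ", ".join(contenders),
    )
```

The ambiguous branch returns every contender so the user sees the discrepancy itself, not just a refusal.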

concepts/data-agent-unique-challenges lists the three challenges; this is a deep dive on the second.
concepts/specialized-knowledge-search finds candidate sources; this concept ranks among them.
concepts/single-source-of-truth-dashboard is the upstream discipline that makes disambiguation tractable.
concepts/measure-proliferation is the upstream failure mode that makes disambiguation impossible.
concepts/agent-self-correction-loop is the mid-trajectory revision mechanism that fires when the initial disambiguation proves wrong.

Seen in¶

sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie — canonical first wiki naming of source-of-truth disambiguation as a data-agent reasoning capability. Verbatim challenge: sources are "often outdated, contradictory, or superseded, forcing the agent to determine the most authoritative information." Positioned as the second of three data-agent unique challenges.
sources/2026-04-29-databricks-companies-winning-with-ai-built-the-data-layer-first — earlier wiki disclosure of the upstream-discipline dependency. Trinity Industries' 600-measure-variants problem framed as the load-bearing precondition for Genie's source-of-truth disambiguation to succeed.