CONCEPT Cited by 1 source
Source-of-truth disambiguation¶
Source-of-truth disambiguation is the data-agent reasoning capability of determining the most authoritative information when multiple sources (table metadata, company documents, internal messages, dashboards) are "often outdated, contradictory, or superseded". The 2026-05-08 Databricks post on Genie names it as the second of the three unique challenges of data agents.
The verbatim framing: "Answering business questions needs deep, specific knowledge drawn from many sources (e.g., table metadata, company documents, internal messages) that are often outdated, contradictory, or superseded, forcing the agent to determine the most authoritative information."
Why this is uniquely a data-agent problem¶
A coding agent doesn't face this in the same form. The latest commit on the canonical branch is the source of truth by definition. A README comment from 2019 may contradict the current code, but that conflict resolves trivially: the code wins.
For a data agent answering business questions, multiple sources have overlapping authority:
| Source type | Authority | Common failure mode |
|---|---|---|
| Production tables | Operationally authoritative | May not reflect business semantics correctly |
| Dashboards | Stakeholder-facing canonical view | Two dashboards can encode contradictory definitions (concepts/measure-proliferation) |
| Wiki documents | Stated business definitions | Often stale; superseded by operational changes |
| Slack threads | Recent decisions / overrides | Ephemeral; not authoritative on their own |
| Notebook code | Working analyst logic | One analyst's view, not necessarily authoritative |
| Governance metadata (Unity Catalog) | Curated authority signals | Only as good as governance discipline |
A correct answer to a business question requires ranking among these when they conflict.
Failure modes if the agent doesn't disambiguate¶
- Confident wrong answer — agent picks the first plausible source and reports it as fact, ignoring the contradiction.
- Garbage-in garbage-out — agent dutifully answers from a stale source, reproducing the staleness.
- Inconsistency across sessions — same question gets different answers depending on which source the agent retrieved that time.
- Compounding errors — wrong source feeds wrong intermediate result feeds wrong final conclusion.
Disambiguation signals¶
The source post does not enumerate the signals exhaustively; plausible signals the agent can use include:
| Signal | What it tells the agent |
|---|---|
| Recency | Newer over older when both purport to be canonical |
| Tier / governance label | Production-tier table over experimental |
| Lineage authority | Upstream of a dashboard ranks higher than the dashboard itself |
| Owner / steward | Documents owned by data-platform team rank higher than ad-hoc |
| Cross-corroboration | Multiple sources agreeing rank higher than one |
| Explicit override markers | "This document supersedes X" signals explicit ordering |
| Usage frequency | Heavily-queried tables more likely canonical |
| Data-completeness | Sources with fuller coverage rank higher for questions that need complete data |
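One way to combine several of the signals above is a weighted score with explicit override markers applied as a hard filter. The sketch below is illustrative only: the `Candidate` fields, the `WEIGHTS` values, and the recency decay are assumptions made for this example, not anything disclosed in the source post, and it omits lineage, usage frequency, and data-completeness for brevity.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    # All fields are hypothetical stand-ins for real catalog metadata.
    name: str
    last_updated: date
    production_tier: bool      # tier / governance label
    steward_owned: bool        # owned by a data-platform or steward team
    corroborated_by: int       # number of other sources that agree
    supersedes: tuple = ()     # names this source explicitly overrides


# Assumed weights; a real system would tune or learn these.
WEIGHTS = {"recency": 1.0, "tier": 2.0, "steward": 1.5, "corroboration": 0.5}


def authority_score(c: Candidate, today: date) -> float:
    age_days = max((today - c.last_updated).days, 0)
    recency = 1.0 / (1.0 + age_days / 365.0)   # decays with age, in (0, 1]
    return (
        WEIGHTS["recency"] * recency
        + WEIGHTS["tier"] * c.production_tier
        + WEIGHTS["steward"] * c.steward_owned
        + WEIGHTS["corroboration"] * min(c.corroborated_by, 3)  # capped
    )


def rank_candidates(candidates, today):
    # Explicit override markers take precedence over any score.
    superseded = {name for c in candidates for name in c.supersedes}
    live = [c for c in candidates if c.name not in superseded]
    # Sort by descending score; break ties by name so ranking is deterministic.
    return sorted(live, key=lambda c: (-authority_score(c, today), c.name))
```

The deterministic tie-break is a deliberate choice: it avoids the "inconsistency across sessions" failure mode where the same question is answered from a different source on each run.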
Composes with semantic context¶
concepts/specialized-knowledge-search surfaces candidate sources; source-of-truth disambiguation ranks among them. The two are complementary stages of the agent's discovery pipeline:
Query
│
▼
Specialised knowledge search
│ (returns candidate set)
▼
Source-of-truth disambiguation
│ (ranks / picks authoritative subset)
▼
Ground reasoning in the chosen sources
If specialised search returns 50 plausibly relevant tables, source-of-truth disambiguation must pick the 1–3 actually authoritative ones.
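The staging above can be sketched as two small functions feeding a third. This is a minimal illustration, not Genie's implementation: the `Source` type, the keyword-match `search`, and the precomputed `authority` value are all assumptions made for the example.

```python
from typing import NamedTuple


class Source(NamedTuple):
    name: str
    description: str
    authority: float  # assumed precomputed, e.g. from governance metadata


def search(query: str, catalog: list[Source]) -> list[Source]:
    # Stage 1 (stand-in): naive keyword match returns the candidate set.
    terms = query.lower().split()
    return [s for s in catalog if any(t in s.description.lower() for t in terms)]


def disambiguate(candidates: list[Source], k: int = 3) -> list[Source]:
    # Stage 2: keep only the k most authoritative candidates.
    return sorted(candidates, key=lambda s: (-s.authority, s.name))[:k]


def build_context(query: str, catalog: list[Source], k: int = 3) -> list[str]:
    # Stage 3 input: names of the sources that reasoning will be grounded in.
    return [s.name for s in disambiguate(search(query, catalog), k)]
```

Keeping the two stages as separate functions makes the division of labour explicit: search optimises for recall, disambiguation for precision.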
Composes with self-correction¶
When source-of-truth disambiguation is wrong (the agent committed to a source that turns out to be non-authoritative), self-correction lets the agent revise the choice mid-trajectory. The two compose:
- Disambiguation makes the initial authoritative-source choice.
- Self-correction revises when intermediate evidence contradicts.
Without self-correction, a wrong initial source-of-truth choice would cascade into the rest of the reasoning. With it, the agent can backtrack.
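A minimal sketch of that backtracking loop, assuming the sources arrive already ranked by authority. The `compute` and `is_consistent` callables are hypothetical hooks standing in for the agent's query execution and its sanity checks on intermediate evidence; neither is described in the source post.

```python
def answer_with_backtracking(ranked_sources, compute, is_consistent):
    """Try sources in authority order; fall back when evidence contradicts."""
    rejected = []
    for source in ranked_sources:
        result = compute(source)
        if is_consistent(source, result):
            # Evidence supports this source: commit to it.
            return {"source": source, "result": result, "rejected": rejected}
        # Self-correction: abandon this choice and try the next candidate.
        rejected.append(source)
    # Every candidate was contradicted; nothing trustworthy to report.
    return {"source": None, "result": None, "rejected": rejected}
```

Returning the `rejected` list alongside the answer keeps the revision visible, so a caller can report which sources were tried and discarded.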
Composes with upstream governance discipline¶
Source-of-truth disambiguation is only as good as the upstream data layer's governance discipline allows. The Trinity Industries case study (Source: sources/2026-04-29-databricks-companies-winning-with-ai-built-the-data-layer-first) illustrates this empirically:
- Pre-migration state: 600 conflicting measure variants — no canonical source-of-truth.
- Post-migration state: consolidated canonical measure layer — Genie can disambiguate.
- Architectural insight: "Genie cannot disambiguate 600 conflicting measure variants" — i.e., the agent's disambiguation reasoning is bounded by the existence of an authoritative source somewhere in the workspace.
This is the main reason that upstream measure consolidation (patterns/transform-upstream-to-collapse-measures, Medallion discipline) is a prerequisite for Genie effectiveness.
When this fits / doesn't¶
Fits (agent disambiguation can succeed):
- Workspace has at least one authoritative source per business concept; the agent's job is to find it among the noise.
- Governance metadata (Unity Catalog tier, ownership, lineage) provides ranking signals.
- Recency / authority signals are reliable.
Doesn't fit / breaks down:
- No authoritative source exists — every variant is a fork. Disambiguation cannot manufacture authority. Upstream consolidation required.
- Authority signals contradict: the most recent source is owned by a junior engineer, while the most authoritative owner holds the stale doc. The agent must surface the ambiguity rather than pick wrongly.
- No metadata — workspace is a data swamp; agent must guess.
Surfacing irreconcilable ambiguity¶
When sources cannot be disambiguated (truly contradictory authoritative signals), the agent's correct response is to surface the ambiguity to the user, not to pick arbitrarily. This connects back to the unanswerability sub-property: some queries should be answered with "the data is inconsistent on this point; here's the discrepancy" rather than a confident wrong answer.
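One way to make this behaviour explicit is to return a result type that can say "ambiguous" instead of always returning a winner. The sketch below assumes each source already has an authority score and a reported value; the `Resolution` type, the status strings, and the `margin` threshold are illustrative assumptions, not part of the source post.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Resolution:
    status: str                       # "resolved", "ambiguous", or "unanswerable"
    chosen: Optional[str] = None
    conflicting: list = field(default_factory=list)


def resolve(scores: dict, values: dict, margin: float = 0.1) -> Resolution:
    """Pick the top-scoring source unless a near-equal rival disagrees with it."""
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    if not ranked:
        return Resolution("unanswerable")
    top = ranked[0]
    # A rival is a source of near-equal authority that reports a different value.
    rivals = [
        n for n in ranked[1:]
        if scores[top] - scores[n] <= margin and values[n] != values[top]
    ]
    if rivals:
        # Surface the discrepancy instead of picking arbitrarily.
        return Resolution("ambiguous", conflicting=[top, *rivals])
    return Resolution("resolved", chosen=top)
```

Sources that score close to the top but agree with it are not treated as rivals, since agreement is corroboration rather than conflict.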
Relationship to related concepts¶
- concepts/data-agent-unique-challenges lists the three challenges; this is a deep dive on the second.
- concepts/specialized-knowledge-search finds candidate sources; this concept ranks among them.
- concepts/single-source-of-truth-dashboard is the upstream discipline that makes disambiguation tractable.
- concepts/measure-proliferation is the upstream failure mode that makes disambiguation impossible.
- concepts/agent-self-correction-loop is the mid-trajectory revision mechanism that fires when the initial disambiguation proves wrong.
Seen in¶
- sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie: canonical first wiki naming of source-of-truth disambiguation as a data-agent reasoning capability. Verbatim challenge: sources are "often outdated, contradictory, or superseded, forcing the agent to determine the most authoritative information." Positioned as the second of three data-agent unique challenges.
- sources/2026-04-29-databricks-companies-winning-with-ai-built-the-data-layer-first: earlier wiki disclosure of the upstream-discipline dependency. Trinity Industries' 600-measure-variants problem framed as the load-bearing precondition for Genie's source-of-truth disambiguation to succeed.