
CONCEPT Cited by 1 source

Source-of-truth disambiguation

Source-of-truth disambiguation is the data-agent reasoning capability of determining the most authoritative information when multiple sources (table metadata, company documents, internal messages, dashboards) are "often outdated, contradictory, or superseded". Named in the 2026-05-08 Databricks post on Genie as the second of the three unique challenges of data agents.

The verbatim framing: "Answering business questions needs deep, specific knowledge drawn from many sources (e.g., table metadata, company documents, internal messages) that are often outdated, contradictory, or superseded, forcing the agent to determine the most authoritative information."

Why this is uniquely a data-agent problem

A coding agent doesn't face this. The latest commit on the canonical branch is the source of truth — by definition. There's no "this comment in the README from 2019 contradicts the actual current code" problem in the same way; the code wins.

For a data agent answering business questions, multiple sources have overlapping authority:

Source type | Authority | Common failure mode
--- | --- | ---
Production tables | Operationally authoritative | May not reflect business semantics correctly
Dashboards | Stakeholder-facing canonical view | Two dashboards can encode contradictory definitions (concepts/measure-proliferation)
Wiki documents | Stated business definitions | Often stale; superseded by operational changes
Slack threads | Recent decisions / overrides | Ephemeral; not authoritative on their own
Notebook code | Working analyst logic | One analyst's view, not necessarily authoritative
Governance metadata (Unity Catalog) | Curated authority signals | Only as good as governance discipline

A correct answer to a business question requires ranking among these when they conflict.
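Before anything can be ranked, the agent has to notice that its sources disagree. The following is a minimal sketch of that detection step, assuming each source has already been reduced to a (source type, concept, definition) record. The `SourceClaim` type, the `find_conflicts` helper, and the example definitions are hypothetical illustrations and do not come from the Databricks post.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceClaim:
    """Hypothetical record: what one source says about one business concept."""
    source_type: str   # e.g. "dashboard", "wiki", "notebook"
    concept: str       # e.g. "active_customer"
    definition: str    # normalised definition text or SQL fragment


def find_conflicts(claims):
    """Return {concept: [claims]} for concepts whose sources disagree."""
    by_concept = defaultdict(list)
    for claim in claims:
        by_concept[claim.concept].append(claim)
    return {
        concept: group
        for concept, group in by_concept.items()
        if len({c.definition for c in group}) > 1
    }


# Made-up example data: two sources disagree on "active_customer".
claims = [
    SourceClaim("wiki", "active_customer", "purchase in last 90 days"),
    SourceClaim("dashboard", "active_customer", "login in last 30 days"),
    SourceClaim("dashboard", "churn_rate", "cancellations / starting customers"),
    SourceClaim("notebook", "churn_rate", "cancellations / starting customers"),
]
conflicts = find_conflicts(claims)
```

Here only `active_customer` is flagged, because both `churn_rate` sources state the same definition. Ranking is needed only for the flagged concepts.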

Failure modes if the agent doesn't disambiguate

  • Confident wrong answer — agent picks the first plausible source and reports it as fact, ignoring the contradiction.
  • Garbage-in garbage-out — agent dutifully answers from a stale source, reproducing the staleness.
  • Inconsistency across sessions — same question gets different answers depending on which source the agent retrieved that time.
  • Compounding errors — wrong source feeds wrong intermediate result feeds wrong final conclusion.

Disambiguation signals

The source post does not list these exhaustively, but plausible signals the agent can use include:

Signal | What it tells the agent
--- | ---
Recency | Newer over older when both purport to be canonical
Tier / governance label | Production-tier table over experimental
Lineage authority | Upstream of a dashboard ranks higher than the dashboard itself
Owner / steward | Documents owned by data-platform team rank higher than ad-hoc
Cross-corroboration | Multiple sources agreeing rank higher than one
Explicit override markers | "This document supersedes X" signals explicit ordering
Usage frequency | Heavily-queried tables more likely canonical
Data-completeness | Sources with fuller coverage rank higher for full questions
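One way to combine several of these signals is a weighted score per candidate source. The sketch below is an assumption-laden illustration, not Genie's method: the `Candidate` fields, the tier labels, the weights, and the normalisation caps are all invented for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    """Hypothetical metadata for one candidate source."""
    name: str
    last_updated: date
    tier: str                     # assumed labels: "production" / "experimental"
    owned_by_platform_team: bool
    corroborating_sources: int    # how many other sources agree with it
    monthly_queries: int


# Illustrative weights only; a real system would tune or learn these.
WEIGHTS = {"recency": 0.3, "tier": 0.3, "owner": 0.2,
           "corroboration": 0.1, "usage": 0.1}


def authority_score(c: Candidate, today: date) -> float:
    """Weighted sum of normalised signals, each clamped to [0, 1]."""
    age_days = (today - c.last_updated).days
    signals = {
        # Linear decay over one year, clamped to [0, 1].
        "recency": min(1.0, max(0.0, 1.0 - age_days / 365)),
        "tier": 1.0 if c.tier == "production" else 0.0,
        "owner": 1.0 if c.owned_by_platform_team else 0.0,
        "corroboration": min(c.corroborating_sources, 5) / 5,
        "usage": min(c.monthly_queries, 1000) / 1000,
    }
    return sum(WEIGHTS[k] * v for k, v in signals.items())


today = date(2026, 5, 8)
candidates = [
    Candidate("sandbox.revenue_tmp", date(2026, 5, 1), "experimental", False, 0, 20),
    Candidate("finance.revenue_gold", date(2026, 4, 8), "production", True, 3, 800),
]
ranked = sorted(candidates, key=lambda c: authority_score(c, today), reverse=True)
```

With these made-up inputs the governed production table ranks first even though the sandbox table is more recent, which is the point of using more than one signal.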

Composes with semantic context

concepts/specialized-knowledge-search surfaces candidate sources; source-of-truth disambiguation ranks among them. The two are complementary stages of the agent's discovery pipeline:

Query
  │
  ▼
Specialised knowledge search
  │  (returns candidate set)
  ▼
Source-of-truth disambiguation
  │  (ranks / picks authoritative subset)
  ▼
Ground reasoning in the chosen sources

If specialised search returns 50 plausibly-relevant tables, source-of-truth disambiguation must pick the 1–3 actually authoritative ones.
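The two stages can be sketched as plain functions chained together. This is a toy illustration under stated assumptions: keyword overlap stands in for specialised knowledge search, each catalog entry carries a precomputed `authority` number, and every name in the example catalog is hypothetical.

```python
def search_candidates(query, catalog):
    """Stage 1 (stand-in for specialised knowledge search): keyword overlap."""
    terms = set(query.lower().split())
    return [t for t in catalog if terms & t["keywords"]]


def disambiguate(candidates, top_k=3):
    """Stage 2: keep the top_k candidates by precomputed authority score."""
    ranked = sorted(candidates, key=lambda t: t["authority"], reverse=True)
    return ranked[:top_k]


def grounding_sources(query, catalog, top_k=3):
    """Names of the sources the agent should ground its reasoning in."""
    return [t["name"] for t in disambiguate(search_candidates(query, catalog), top_k)]


# Hypothetical catalog; the authority values are invented.
catalog = [
    {"name": "gold.revenue", "keywords": {"revenue", "sales"}, "authority": 0.9},
    {"name": "sandbox.revenue_v2", "keywords": {"revenue"}, "authority": 0.2},
    {"name": "gold.headcount", "keywords": {"headcount"}, "authority": 0.8},
    {"name": "silver.sales_raw", "keywords": {"sales", "revenue"}, "authority": 0.5},
]
chosen = grounding_sources("quarterly revenue by region", catalog, top_k=2)
```

Search returns three revenue-related tables; disambiguation keeps `gold.revenue` and `silver.sales_raw` and drops the sandbox copy. Keeping the stages separate means the search step can favour recall while the ranking step enforces authority.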

Composes with self-correction

When source-of-truth disambiguation is wrong (the agent committed to a source that turns out to be non-authoritative), self-correction lets the agent revise the choice mid-trajectory. The two compose:

  • Disambiguation makes the initial authoritative-source choice.
  • Self-correction revises when intermediate evidence contradicts.

Without self-correction, a wrong initial source-of-truth choice would cascade into the rest of the reasoning. With it, the agent can backtrack.
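A minimal sketch of that backtracking loop, assuming the agent has an ordered list of sources from disambiguation and some consistency check on intermediate results. The function name, the callables, and the sample data are hypothetical; the source post does not describe how Genie implements self-correction.

```python
def answer_with_backtracking(ranked_sources, compute, is_consistent):
    """Try sources in authority order; fall back when evidence contradicts.

    ranked_sources: source names, most authoritative first.
    compute:        callable(source) -> intermediate result.
    is_consistent:  callable(result) -> bool, a sanity check on that result.
    Returns (source, result, rejected); source is None if every source fails.
    """
    rejected = []
    for source in ranked_sources:
        result = compute(source)
        if is_consistent(result):
            return source, result, rejected
        rejected.append(source)   # self-correction: drop this choice, try the next
    return None, None, rejected


# Hypothetical data: the top-ranked source is missing one quarter.
quarterly_revenue = {
    "dash.revenue_summary": [120, 130, None, 150],
    "gold.revenue": [120, 130, 140, 150],
}
source, result, rejected = answer_with_backtracking(
    ["dash.revenue_summary", "gold.revenue"],
    compute=lambda s: quarterly_revenue[s],
    is_consistent=lambda rows: all(v is not None for v in rows),
)
```

The first choice fails the check, so the agent falls back to `gold.revenue` and records the rejected source instead of carrying the gap into the final answer.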

Composes with upstream governance discipline

Source-of-truth disambiguation is only as good as the upstream data layer's governance discipline allows. The Trinity Industries case study (Source: sources/2026-04-29-databricks-companies-winning-with-ai-built-the-data-layer-first) demonstrated this empirically:

  • Pre-migration state: 600 conflicting measure variants — no canonical source-of-truth.
  • Post-migration state: consolidated canonical measure layer — Genie can disambiguate.
  • Architectural insight: "Genie cannot disambiguate 600 conflicting measure variants" — i.e., the agent's disambiguation reasoning is bounded by the existence of an authoritative source somewhere in the workspace.

This is the load-bearing reason why upstream measure-consolidation / patterns/transform-upstream-to-collapse-measures / Medallion discipline is a prerequisite for Genie effectiveness.
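A governance team could check for this condition with a simple audit: count the distinct definitions per measure and flag measures that have several variants and no designated canonical one. The sketch below assumes measures are available as records with `name`, `expression`, and `canonical` fields; those fields and the example measures are hypothetical and are not taken from the Trinity case study.

```python
def variant_report(measures):
    """Count distinct expressions per measure and note whether one is canonical."""
    expressions = {}
    canonical = set()
    for m in measures:
        expressions.setdefault(m["name"], set()).add(m["expression"])
        if m.get("canonical"):
            canonical.add(m["name"])
    return {
        name: {"variants": len(exprs), "has_canonical": name in canonical}
        for name, exprs in expressions.items()
    }


# Invented example measures.
measures = [
    {"name": "on_time_delivery", "expression": "delivered <= promised", "canonical": False},
    {"name": "on_time_delivery", "expression": "delivered <= promised + 1", "canonical": False},
    {"name": "on_time_delivery", "expression": "shipped <= promised", "canonical": False},
    {"name": "backlog_units", "expression": "ordered - shipped", "canonical": True},
]
report = variant_report(measures)

# Measures an agent cannot disambiguate: many variants, none marked canonical.
needs_consolidation = [
    name for name, r in report.items()
    if r["variants"] > 1 and not r["has_canonical"]
]
```

In this example `on_time_delivery` has three competing definitions and no canonical one, so it is a candidate for upstream consolidation rather than agent-side ranking.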

When this fits / doesn't

Fits (agent disambiguation can succeed):

  • Workspace has at least one authoritative source per business concept; the agent's job is to find it among the noise.
  • Governance metadata (Unity Catalog tier, ownership, lineage) provides ranking signals.
  • Recency / authority signals are reliable.

Doesn't fit / breaks down:

  • No authoritative source exists — every variant is a fork. Disambiguation cannot manufacture authority. Upstream consolidation required.
  • Authority signals contradict — the most-recent source is owned by a junior engineer; the most-authorised owner has the stale doc. Agent must surface the ambiguity rather than pick wrongly.
  • No metadata — workspace is a data swamp; agent must guess.

Surfacing irreconcilable ambiguity

When sources cannot be disambiguated (truly contradictory authoritative signals), the agent's correct response is to surface the ambiguity to the user, not to pick arbitrarily. This connects back to the unanswerability sub-property: some queries should be answered with "the data is inconsistent on this point; here's the discrepancy" rather than a confident wrong answer.
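One possible shape for this behaviour is a decision rule that answers only when the leading source clearly outranks every source that disagrees with it. The sketch below assumes each candidate arrives as a `(source, value, authority_score)` tuple; the `min_margin` threshold, the status labels, and the example numbers are assumptions made for illustration.

```python
def resolve_or_surface(scored, min_margin=0.15):
    """Answer from the top source, or report the discrepancy if it is too close.

    scored: list of (source, value, authority_score) tuples.
    """
    if not scored:
        return {"status": "unanswerable", "reason": "no candidate sources"}
    ranked = sorted(scored, key=lambda s: s[2], reverse=True)
    top_source, top_value, top_score = ranked[0]
    # Rivals are sources that report a different value than the leader.
    rivals = [s for s in ranked[1:] if s[1] != top_value]
    if rivals and top_score - rivals[0][2] < min_margin:
        return {
            "status": "ambiguous",
            "discrepancy": {src: val for src, val, _ in ranked},
        }
    return {"status": "answer", "value": top_value, "source": top_source}


# Hypothetical cases: near-tied authority versus a clear leader.
close_call = resolve_or_surface([
    ("wiki.kpi_definitions", 1200, 0.80),
    ("gold.active_customers", 950, 0.78),
])
clear_call = resolve_or_surface([
    ("gold.active_customers", 950, 0.90),
    ("sandbox.tmp_customers", 1200, 0.30),
])
```

In the near-tied case the function returns both values for the user to adjudicate; in the clear case it answers from `gold.active_customers`. Sources that agree with the leader never trigger the ambiguity path, since agreement is corroboration rather than conflict.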

