CONCEPT Cited by 1 source

Data agent unique challenges

A data agent answers questions over enterprise data — structured tables, dashboards, notebooks, plus unstructured workspace files, documents, and chat logs. The 2026-05-08 Databricks post on Genie argues that data agents face three structural challenges that do not apply to coding agents (such as GitHub Copilot, Devin, Cursor's agent mode), which operate on a "static, deterministic" file-system substrate. The three challenges shape the architectural choices that distinguish data-agent designs from coding-agent designs.

The three challenges

| # | Challenge | Coding agent baseline | Data agent reality |
|---|-----------|-----------------------|--------------------|
| 1 | Scale of data discovery | File system tree (10K–100K files); directory hierarchy is the index | Hundreds of thousands of tables / dashboards / documents across heterogeneous stores; "a scale that breaks conventional search methods" |
| 2 | Determining "source of truth" | Source code is canonical (latest commit on the branch is correct by definition) | Multiple sources (table metadata, company docs, internal messages) are "often outdated, contradictory, or superseded, forcing the agent to determine the most authoritative information" |
| 3 | Lack of verifiable tests | Unit tests, type checkers, compilers, integration tests: deterministic, verifiable oracles | "The 'specification' is just the high-level user query without a notion of the expected correct answer"; queries may also not always be answerable due to data incompleteness |

These three properties are named verbatim in the source post (Source: sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie).

Why each one matters architecturally

1. Scale of data discovery

Coding agents can rely on "directory tree + filename + symbol table" as the primary index. A data agent must build a semantic index over heterogeneous assets that don't share a uniform schema:

  • Tables expose schema (column names, types) but not business meaning.
  • Dashboards expose visual structure + queries but not the why.
  • Documents expose text but not relationships to data.
  • Notebooks mix code + prose + intermediate state.

Architectural response: concepts/specialized-knowledge-search + patterns/semantic-context-grounded-search-index — derive rich semantic context from relationships across asset types and use it to construct multiple search indices in parallel.
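The idea of borrowing context across asset types and building several indices in parallel can be sketched as follows. This is a minimal illustration, not Genie's implementation: the `Asset` type, the `related` links, the token-overlap scoring, and the sample assets are all assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    kind: str   # "table", "dashboard", "document", ...
    text: str   # what the asset exposes natively (columns, titles, prose)
    related: list = field(default_factory=list)  # names of linked assets


def enrich(assets):
    """Append the text of directly related assets as borrowed context."""
    by_name = {a.name: a for a in assets}
    return {
        a.name: " ".join([a.text] + [by_name[r].text for r in a.related if r in by_name])
        for a in assets
    }


def build_index(group, context):
    """Inverted index for one asset kind: token -> set of asset names."""
    index = defaultdict(set)
    for a in group:
        for token in context[a.name].lower().split():
            index[token].add(a.name)
    return index


def build_all(assets):
    """Build one index per asset kind, in parallel."""
    context = enrich(assets)
    kinds = sorted({a.kind for a in assets})
    groups = [[a for a in assets if a.kind == k] for k in kinds]
    with ThreadPoolExecutor() as pool:
        indices = list(pool.map(lambda g: build_index(g, context), groups))
    return dict(zip(kinds, indices))


def search(indices, query):
    """Rank assets by number of matched query tokens; ties broken by name."""
    scores = defaultdict(int)
    for index in indices.values():
        for token in query.lower().split():
            for name in index.get(token, ()):
                scores[name] += 1
    return sorted(scores, key=lambda n: (-scores[n], n))


# Hypothetical assets: the table's own text never mentions "revenue".
ASSETS = [
    Asset("rev_daily", "table", "amt_usd dt sku", related=["Revenue overview"]),
    Asset("Revenue overview", "dashboard", "daily product revenue", related=["rev_daily"]),
    Asset("Pricing FAQ", "document", "contract pricing tiers"),
]

print(search(build_all(ASSETS), "product revenue"))
```

In this toy setup the `rev_daily` table is retrievable for "product revenue" only because it inherits wording from the dashboard that reads it, which is the gap between schema and business meaning described above. A production system would presumably use embeddings rather than token overlap.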

2. Determining source of truth

A data agent must operate when:

  • Two dashboards report contradictory numbers for the same metric.
  • A wiki document says one thing; the latest table says another.
  • An internal Slack thread overrides a stale documentation page.

Architectural response: concepts/source-of-truth-disambiguation as the central reasoning property. Genie must rank candidate answers by source authority, recency, and corroboration. This is part of why upstream measure-consolidation work (Trinity Industries case study) matters so much: if the data layer has not already disambiguated competing definitions, the agent cannot do so either.
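One way to picture ranking by authority, recency, and corroboration is a simple multiplicative score. The sketch below is only an illustration under stated assumptions: the authority weights, the 90-day half-life, the scoring formula, and the sample claims are all invented for the example and do not come from the source post.

```python
from dataclasses import dataclass
from datetime import date

# Assumed authority weights per source kind (illustrative only).
AUTHORITY = {"table": 1.0, "dashboard": 0.8, "document": 0.5, "message": 0.3}


@dataclass(frozen=True)
class Claim:
    source: str
    kind: str
    value: str    # what the source asserts, e.g. a metric definition
    as_of: date   # when the source was last updated


def score(claim, claims, today, half_life_days=90):
    """Authority x exponential recency decay x (1 + number of agreeing sources)."""
    age_days = (today - claim.as_of).days
    recency = 0.5 ** (age_days / half_life_days)
    corroboration = sum(1 for c in claims if c is not claim and c.value == claim.value)
    return AUTHORITY[claim.kind] * recency * (1 + corroboration)


def rank(claims, today):
    """Most authoritative candidate first."""
    return sorted(claims, key=lambda c: score(c, claims, today), reverse=True)


# Made-up claims about how a revenue metric is defined.
CLAIMS = [
    Claim("wiki page", "document", "net of refunds", date(2025, 1, 1)),
    Claim("table metadata", "table", "gross", date(2026, 4, 1)),
    Claim("chat thread", "message", "gross", date(2026, 4, 20)),
]

for c in rank(CLAIMS, today=date(2026, 5, 1)):
    print(c.source, c.value)
```

Here the stale wiki page ranks last even though documents carry more assumed authority than chat messages, because the recent thread is both fresher and corroborated by the table metadata.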

3. Lack of verifiable tests

A coding agent can iterate against a deterministic oracle: write code, run tests, refine. A data agent has no equivalent — the user asked "why did revenue spike on Tuesday?" and the "correct" answer is not pre-known.

Architectural response: concepts/parallel-thinking-trajectory-sampling. Sample multiple agent trajectories and aggregate their findings; agreement across trajectories stands in for the missing oracle as a soft correctness signal. Self-correction complements this: the agent detects that its own intermediate calculations are inconsistent and revises them without an external test.
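Trajectory agreement as a soft signal can be sketched with a majority vote over independently sampled runs. This is a minimal sketch, not the mechanism the post describes in detail: `run_trajectory`, the vote-based aggregation, the `min_agreement` threshold, and the stubbed agent are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def sample_and_aggregate(run_trajectory, question, n=5, min_agreement=0.6):
    """Run n trajectories in parallel and return (answer, agreement ratio).

    The answer is None when agreement is below the threshold, which lets the
    agent abstain on questions the data may not be able to answer.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        findings = list(pool.map(lambda seed: run_trajectory(question, seed), range(n)))
    answer, votes = Counter(findings).most_common(1)[0]
    agreement = votes / n
    return (answer if agreement >= min_agreement else None), agreement


def fake_trajectory(question, seed):
    """Stand-in for one agent run; a real one would call an LLM with tools."""
    return "data backfill" if seed % 5 == 0 else "price change"


answer, agreement = sample_and_aggregate(
    fake_trajectory, "why did revenue spike on Tuesday?"
)
print(answer, agreement)
```

Agreement is evidence, not proof: trajectories that share the same stale source can agree and still be wrong, which is why the signal is described as soft.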

Worked example (from the post)

A real (anonymised) Genie query, reproduced from the source: a user "notices that two enterprise dashboards reporting the same product's revenue show contradictory spikes on different dates and asks the agent to explain why."

This single query touches all three challenges:

  • Scale: relevant data lives across tables (revenue tables), dashboards (the two contradictory ones), documents (pricing contracts), and possibly internal messages (recent rate changes).
  • Source of truth: the two dashboards disagree — neither is unambiguously authoritative without further reasoning.
  • No verifiable test: there's no unit test that says "the explanation is correct." The agent must reason its way to a consistent story and verify by reconciling intermediate calculations.

The post canonicalises the four-phase trajectory the agent uses: discovery → investigation → self-correction → verification.
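The four phases can be pictured as a loop in which a reconciliation check takes the place of a unit test. The sketch below assumes a particular check (segment figures must sum to the reported total within a tolerance) and uses hypothetical `discover` and `investigate` stubs with made-up numbers; the post names the phases but this control flow is an assumption.

```python
def reconcile(parts, reported_total, tolerance=0.01):
    """True when the intermediate figures add up to the reported total."""
    return abs(sum(parts.values()) - reported_total) <= tolerance * abs(reported_total)


def run(question, discover, investigate, max_revisions=2):
    """discovery -> investigation -> (self-correction -> investigation)* -> verification."""
    trace = ["discovery"]
    sources = discover(question)
    for attempt in range(max_revisions + 1):
        trace.append("investigation")
        parts, total = investigate(sources, attempt)
        if reconcile(parts, total):
            trace.append("verification")
            return parts, trace
        trace.append("self-correction")  # figures disagree: revise and retry
    return None, trace  # could not reach a consistent story


def discover(question):
    return ["revenue_by_region"]  # hypothetical asset name


def investigate(sources, attempt):
    parts = {"EMEA": 40.0, "AMER": 50.0}
    if attempt > 0:
        parts["APAC"] = 10.0  # segment missed on the first pass
    return parts, 100.0


parts, trace = run("why do the dashboards disagree?", discover, investigate)
print(trace)
```

The first pass fails reconciliation because one segment is missing, so the loop revises once before reaching verification. Returning `None` after the revision budget is spent reflects the case where the data cannot support a consistent answer.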

How this concept differs from neighbouring ones:

  • vs concepts/agentic-workflow-governance: that concept is about governing what an agent is allowed to do; this concept is about what makes the data-agent task hard in the first place.
  • vs concepts/context-engineering: context engineering is the general discipline of preparing context for any LLM-based system; data-agent challenges are a specific slice driven by the dynamic, contradictory, and unverifiable nature of enterprise data.
  • vs concepts/objective-abstraction (model-serving): that concept is about routing to the right model; data-agent challenges are about the structural shape of the problem domain itself.

Seen in

  • sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie: canonical wiki home for the data-agent vs coding-agent distinction. Three challenges named verbatim: scale of data discovery, source-of-truth determination, lack of verifiable tests. Each challenge is mapped to the architectural response Genie deploys (specialised knowledge search, source-of-truth disambiguation reasoning, parallel thinking + self-correction).
Last updated · 542 distilled / 1,571 read