CONCEPT Cited by 1 source

Verifiable-test gap (data queries)¶

The verifiable-test gap is the structural absence of a deterministic oracle for "is this answer correct?" in open-ended data queries — the property that makes data agents fundamentally harder than coding agents. Named in the 2026-05-08 Databricks post on Genie as one of the three unique challenges of data agents.

The verbatim framing¶

From the source: "Unlike coding agents that can use deterministic, verifiable tests to iteratively refine code, data agents have no corresponding test because the 'specification' is just the high-level user query without a notion of the expected correct answer. Moreover, the queries may not always be answerable because of incompleteness in data, and it is important for data agents to be able to identify such cases and surface it back to users."

Two distinct properties bundled in this gap:

No oracle — no deterministic test of "the answer is correct."
No guarantee of answerability — some queries cannot be answered from available data, and the agent must detect this rather than confabulate an answer.

The contrast with coding agents¶

Property	Coding agent	Data agent
Specification	"Make this test pass" (deterministic)	"Why did revenue spike Tuesday?" (open-ended)
Oracle	Test suite, type checker, compiler	None — no pre-known correct answer
Iteration loop	Edit → test → repeat until green	No clear "green" state
Fail-loud signal	Test failure	None — wrong answer looks like right answer
Always-answerable?	Yes in the typical case (the tests define the target, so a passing state is assumed to exist)	No (data may be incomplete)

For a coding agent, "my code passes the tests" is a meaningful correctness claim. For a data agent, "my answer is consistent with the data I retrieved" is a much weaker claim — the retrieved data might be incomplete, contradictory, or stale.

Why this matters architecturally¶

The verifiable-test gap is the load-bearing reason data agents need parallel thinking + self-correction:

Coding agent strategy	Why it doesn't transfer
Run tests after each edit	No equivalent — can't "test" an explanation of a revenue spike
Iterate-until-green	No "green" state — when do you stop?
Refactor with confidence	Refactor to what? No correctness target
Single-trajectory works	Without an oracle, a single trajectory commits to a potentially wrong answer

The architectural responses Genie deploys all flow from this gap:

concepts/parallel-thinking-trajectory-sampling: sample N trajectories, aggregate; trajectory agreement substitutes for test-pass.
concepts/agent-self-correction-loop: detect intra-trajectory inconsistencies as a soft oracle.
Surface unanswerability: when the agent cannot reach a consistent answer (low trajectory agreement or detected data incompleteness), tell the user rather than confabulate.
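The first and third responses above can be sketched together as a majority vote with abstention. This is an illustrative sketch, not Genie's implementation: the function name `aggregate_trajectories`, the 0.6 threshold, and the convention that `None` marks a trajectory that reported missing data are all assumptions, and exact-match voting presumes the answers have already been normalized to comparable values.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple

def aggregate_trajectories(
    answers: Sequence[Optional[Hashable]],
    min_agreement: float = 0.6,   # hypothetical threshold
) -> Tuple[Optional[Hashable], float]:
    """Majority vote over trajectory answers; abstain when agreement is low.

    Returns (answer, agreement). The answer is None when the agent should
    surface unanswerability instead of committing to a result.
    """
    if not answers:
        return None, 0.0
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    # Abstain if most trajectories hit missing data or if no answer dominates.
    if top is None or agreement < min_agreement:
        return None, agreement
    return top, agreement

print(aggregate_trajectories(["promo", "promo", "promo", "refund_bug", "promo"]))
print(aggregate_trajectories(["promo", "refund_bug", "pricing", None]))
```

Agreement is only a proxy: trajectories that share a flawed assumption can agree on a wrong answer, which is why the other signals on this page are stacked alongside it.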

The unanswerability dimension¶

A subtle property highlighted in the source: "queries may not always be answerable because of incompleteness in data, and it is important for data agents to be able to identify such cases and surface it back to users."

Three failure modes the agent must distinguish:

Failure	Coding agent equivalent	Data agent handling
Question unclear	Compiler error on syntax	Ask clarifying question
Data missing for question	Test references undefined function	Surface "data is incomplete for this question" — don't confabulate
Data contradictory	Tests disagree	concepts/source-of-truth-disambiguation reasoning + surface ambiguity if irreconcilable

A data agent that silently confabulates in the missing-data case is strictly worse than one that admits "I cannot answer this with the data available" — the confabulation looks correct and is therefore acted on.
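The three-way distinction in the table can be sketched as a triage step that runs before any answer is returned. Everything named here (`Outcome`, `QueryState`, `triage`, the column-set check) is hypothetical, and the contradiction branch skips the source-of-truth disambiguation step and goes straight to surfacing ambiguity, to keep the sketch short.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Set

class Outcome(Enum):
    ASK_CLARIFYING_QUESTION = "ask_clarifying_question"
    SURFACE_INCOMPLETE_DATA = "surface_incomplete_data"
    SURFACE_AMBIGUITY = "surface_ambiguity"
    ANSWER = "answer"

@dataclass
class QueryState:
    question_is_clear: bool
    required_columns: Set[str]       # what the question needs
    available_columns: Set[str]      # what the catalog actually has
    source_values: Dict[str, float]  # source name -> value it reports

def triage(state: QueryState) -> Outcome:
    if not state.question_is_clear:
        return Outcome.ASK_CLARIFYING_QUESTION
    # Missing data: say so instead of answering from a partial picture.
    if not state.required_columns <= state.available_columns:
        return Outcome.SURFACE_INCOMPLETE_DATA
    # Contradictory data: sources report different values for the same fact.
    if len(set(state.source_values.values())) > 1:
        return Outcome.SURFACE_AMBIGUITY
    return Outcome.ANSWER

state = QueryState(True, {"revenue", "region"}, {"revenue"}, {})
print(triage(state).value)
```

The ordering is a design choice: an unclear question is resolved first because the other two checks depend on knowing what data the question requires.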

Soft correctness signals (substitutes for the missing oracle)¶

Without a hard oracle, agent designs use multiple soft signals:

Signal	Where it comes from
Trajectory agreement	Parallel thinking — N trajectories agree
Internal consistency	Self-correction — no intermediate-step contradictions
Source-of-truth ranking	Authoritative-source signals corroborate the answer
Confidence calibration	Judge sub-agent rates the answer's confidence
Data-completeness check	Required data exists / is fresh / covers the question's scope
Statistical anomaly check	E.g., "this answer has a very low sample size; flag it", as in Trinity's "automatic low-sample-size anomaly flagging" (Source: sources/2026-04-29-databricks-companies-winning-with-ai-built-the-data-layer-first)

No single signal is sufficient. Stacking signals is how data agents approximate the missing oracle.
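One way to picture stacking is a small decision function over the signals in the table. This is a sketch under stated assumptions: the `SoftSignals` fields, the `decide` function, the three outcome labels, and every threshold are hypothetical and not taken from the source.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class SoftSignals:
    trajectory_agreement: float  # fraction of trajectories that agree, 0..1
    internally_consistent: bool  # no contradictions between intermediate steps
    sources_corroborate: bool    # authoritative sources support the answer
    judge_confidence: float      # judge sub-agent score, 0..1
    data_complete: bool          # required data exists, is fresh, covers scope
    sample_size: int             # rows behind the headline number

def decide(s: SoftSignals, min_agreement: float = 0.6,
           min_confidence: float = 0.7, min_sample: int = 30) -> Tuple[str, List[str]]:
    """Combine soft signals into (decision, reasons). Thresholds are illustrative."""
    # Gates: without complete data or a consistent answer, do not answer at all.
    if not s.data_complete:
        return "surface_unanswerable", ["required data missing or stale"]
    if s.trajectory_agreement < min_agreement or not s.internally_consistent:
        return "surface_unanswerable", ["no consistent answer across trajectories"]
    # Weaker signals downgrade the answer to one delivered with caveats.
    caveats: List[str] = []
    if not s.sources_corroborate:
        caveats.append("authoritative sources do not corroborate")
    if s.judge_confidence < min_confidence:
        caveats.append("judge confidence is low")
    if s.sample_size < min_sample:
        caveats.append("low sample size")
    return ("answer_with_caveats" if caveats else "answer"), caveats

print(decide(SoftSignals(0.9, True, True, 0.85, True, 12)))
```

Treating completeness and consistency as gates and the remaining signals as caveats is one possible layering; a weighted score would be another.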

When the gap is most painful¶

High-stakes decisions (CFO asks revenue question; wrong answer drives wrong financial decision).
Adversarial or stale data (sources contradict; agent must disambiguate without ground truth).
Long-horizon queries (multi-step reasoning compounds errors; no per-step oracle).
Cross-system queries (no single source of truth exists; each source is partial).

When the gap is less painful¶

Single-source queries with strong governance (one canonical table; no ambiguity).
Closed-form queries (e.g., "how many rows in table X?" — has a deterministic answer in SQL).
Sandboxed queries (the agent runs against a fixed snapshot; re-runs are deterministic).

These cases shrink the gap toward the coding-agent shape.
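A closed-form query against a fixed snapshot can be checked the way a unit test checks code. The sketch below uses Python's standard `sqlite3` module with an in-memory database standing in for the snapshot; the `orders` table, the `count_rows` helper, and the `agent_answer` value are hypothetical.

```python
import sqlite3

def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Deterministic oracle for 'how many rows in table X?' on a fixed snapshot."""
    # Table names cannot be bound as SQL parameters, so check the catalog first.
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table}")
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]

# Fixed snapshot: re-running the query always yields the same result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)",
                 [(10.0,), (25.5,), (7.25,)])
conn.commit()

agent_answer = 3  # hypothetical value reported by a data agent
print("verified" if agent_answer == count_rows(conn, "orders") else "mismatch")
```

Here the comparison plays the role of a passing test, which is what makes this class of query behave like the coding-agent case.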

concepts/data-agent-unique-challenges lists the three challenges; this concept is a deep dive on the third.
concepts/parallel-thinking-trajectory-sampling is a primary architectural response.
concepts/agent-self-correction-loop is the second architectural response (intra-trajectory).
concepts/source-of-truth-disambiguation is the response to the data-contradictory variant of the gap.

Seen in¶

sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie: the canonical first naming on this wiki of the verifiable-test gap as a data-agent design property. Verbatim: "data agents have no corresponding test because the 'specification' is just the high-level user query without a notion of the expected correct answer." Includes the unanswerability sub-property: "queries may not always be answerable because of incompleteness in data." The architectural responses (parallel thinking, self-correction, source-of-truth disambiguation) all derive from this gap.