CONCEPT Cited by 1 source
Verifiable-test gap (data queries)¶
The verifiable-test gap is the structural absence of a deterministic oracle for "is this answer correct?" in open-ended data queries — the property that makes data agents fundamentally harder than coding agents. Named in the 2026-05-08 Databricks post on Genie as one of the three unique challenges of data agents.
The verbatim framing¶
From the source: "Unlike coding agents that can use deterministic, verifiable tests to iteratively refine code, data agents have no corresponding test because the 'specification' is just the high-level user query without a notion of the expected correct answer. Moreover, the queries may not always be answerable because of incompleteness in data, and it is important for data agents to be able to identify such cases and surface it back to users."
Two distinct properties bundled in this gap:
- No oracle — no deterministic test of "the answer is correct."
- No guarantee of answerability — some queries cannot be answered from available data, and the agent must detect this rather than confabulate an answer.
The contrast with coding agents¶
| Property | Coding agent | Data agent |
|---|---|---|
| Specification | "Make this test pass" (deterministic) | "Why did revenue spike Tuesday?" (open-ended) |
| Oracle | Test suite, type checker, compiler | None — no pre-known correct answer |
| Iteration loop | Edit → test → repeat until green | No clear "green" state |
| Fail-loud signal | Test failure | None — wrong answer looks like right answer |
| Always answerable? | Usually (the test suite defines what a solution must do) | No (data may be incomplete) |
For a coding agent, "my code passes the tests" is a meaningful correctness claim. For a data agent, "my answer is consistent with the data I retrieved" is a much weaker claim — the retrieved data might be incomplete, contradictory, or stale.
Why this matters architecturally¶
The verifiable-test gap is the main reason data agents need parallel thinking and self-correction: the strategies that make coding agents reliable do not transfer.
| Coding agent strategy | Why it doesn't transfer |
|---|---|
| Run tests after each edit | No equivalent — can't "test" an explanation of a revenue spike |
| Iterate-until-green | No "green" state — when do you stop? |
| Refactor with confidence | Refactor to what? No correctness target |
| A single trajectory suffices | Without an oracle, a single trajectory commits to a potentially wrong answer with nothing to catch it |
The architectural responses Genie deploys all flow from this gap:
- concepts/parallel-thinking-trajectory-sampling — sample N trajectories, aggregate; trajectory agreement substitutes for test-pass.
- concepts/agent-self-correction-loop — detect intra-trajectory inconsistencies as a soft oracle.
- Surface unanswerability — when the agent cannot reach a consistent answer (low trajectory agreement OR detected data incompleteness), tell the user rather than confabulate.
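The first and third responses can be sketched together. The following is a minimal illustration, not Genie's implementation: it assumes each sampled trajectory yields a normalized answer string (or `None` when that trajectory judged the question unanswerable), compares answers by exact match, and uses a hypothetical `min_agreement` threshold.

```python
from collections import Counter
from typing import Optional, Sequence


def aggregate_trajectories(
    answers: Sequence[Optional[str]],
    min_agreement: float = 0.6,  # hypothetical threshold, not from the source
) -> dict:
    """Majority-vote over sampled answers and refuse when agreement is weak.

    Each entry is one trajectory's normalized answer, or None when that
    trajectory concluded the question could not be answered.
    """
    if not answers:
        raise ValueError("need at least one trajectory")

    counts = Counter(answers)
    top_answer, votes = counts.most_common(1)[0]
    agreement = votes / len(answers)

    # Surface unanswerability instead of returning a weakly supported answer.
    if top_answer is None or agreement < min_agreement:
        return {"status": "unanswerable", "answer": None, "agreement": agreement}
    return {"status": "answered", "answer": top_answer, "agreement": agreement}


result = aggregate_trajectories(["42", "42", "42", "42", "17"])
print(result["status"], result["answer"])
```

Exact-match voting is the simplest possible aggregator; free-text explanations would need some form of semantic comparison, which this sketch does not attempt.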
The unanswerability dimension¶
A subtle property highlighted in the source: "queries may not always be answerable because of incompleteness in data, and it is important for data agents to be able to identify such cases and surface it back to users."
Three failure modes the agent must distinguish:
| Failure | Coding agent equivalent | Data agent handling |
|---|---|---|
| Question unclear | Compiler error on syntax | Ask clarifying question |
| Data missing for question | Test references undefined function | Surface "data is incomplete for this question" — don't confabulate |
| Data contradictory | Tests disagree | Apply concepts/source-of-truth-disambiguation reasoning; surface the ambiguity if the sources are irreconcilable |
A data agent that silently confabulates in the missing-data case is strictly worse than one that admits "I cannot answer this with the data available" — the confabulation looks correct and is therefore acted on.
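The three-way distinction above can be written as a small triage step. This is a sketch under stated assumptions: the `Evidence` fields, the `Outcome` names, and the `tolerance` parameter are all hypothetical, and the contradictory-data branch simply reports the conflict rather than attempting source-of-truth disambiguation first.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    ASK_CLARIFICATION = "ask_clarification"
    REPORT_INCOMPLETE = "report_incomplete_data"
    REPORT_CONFLICT = "report_conflicting_sources"
    ANSWER = "answer"


@dataclass
class Evidence:
    """Hypothetical summary of what the agent found for one question."""
    interpretations: int  # number of plausible readings of the question
    missing_fields: list[str] = field(default_factory=list)
    source_values: dict[str, float] = field(default_factory=dict)  # source -> value


def triage(e: Evidence, tolerance: float = 0.0) -> Outcome:
    # 1. Question unclear: ask rather than guess a reading.
    if e.interpretations != 1:
        return Outcome.ASK_CLARIFICATION
    # 2. Data missing: say so instead of confabulating.
    if e.missing_fields or not e.source_values:
        return Outcome.REPORT_INCOMPLETE
    # 3. Data contradictory: sources disagree beyond the tolerance.
    values = list(e.source_values.values())
    if max(values) - min(values) > tolerance:
        return Outcome.REPORT_CONFLICT
    return Outcome.ANSWER


print(triage(Evidence(interpretations=1, missing_fields=["region"])))
```

The checks run in order of how early the problem can be detected: an unclear question is caught before any data is examined, and conflicts are only meaningful once the required data is known to exist.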
Soft correctness signals (substitutes for the missing oracle)¶
Without a hard oracle, agent designs use multiple soft signals:
| Signal | Where it comes from |
|---|---|
| Trajectory agreement | Parallel thinking — N trajectories agree |
| Internal consistency | Self-correction — no intermediate-step contradictions |
| Source-of-truth ranking | Authoritative-source signals corroborate the answer |
| Confidence calibration | Judge sub-agent rates the answer's confidence |
| Data-completeness check | Required data exists / is fresh / covers the question's scope |
| Statistical anomaly check | E.g., "this answer rests on a very low sample size; flag it", as in Trinity's "automatic low-sample-size anomaly flagging" (Source: sources/2026-04-29-databricks-companies-winning-with-ai-built-the-data-layer-first) |
No single signal is sufficient. Stacking signals is how data agents approximate the missing oracle.
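One way to picture the stacking is a function that takes every signal from the table and returns a verdict plus the list of concerns that were raised. This is an illustrative sketch only: the `SoftSignals` fields, the verdict names, and all thresholds are assumptions, and a real system would likely weight signals rather than treat each as a hard gate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftSignals:
    trajectory_agreement: float  # share of sampled trajectories that agree, 0..1
    internally_consistent: bool  # self-correction found no contradictions
    sources_corroborate: bool    # authoritative sources back the answer
    judge_confidence: float      # judge sub-agent score, 0..1
    data_complete: bool          # required data exists, is fresh, covers scope
    sample_size: int             # rows behind the headline number


def decide(s, min_agreement=0.6, min_confidence=0.7, min_sample=30):
    """Return (verdict, concerns). All thresholds are illustrative assumptions."""
    # Missing data cannot be compensated for by the other signals.
    if not s.data_complete:
        return "unanswerable", ["data incomplete for this question"]

    concerns = []
    if s.trajectory_agreement < min_agreement:
        concerns.append("low trajectory agreement")
    if not s.internally_consistent:
        concerns.append("intermediate steps contradict each other")
    if not s.sources_corroborate:
        concerns.append("no authoritative source corroborates the answer")
    if s.judge_confidence < min_confidence:
        concerns.append("low judge confidence")
    if s.sample_size < min_sample:
        concerns.append("low sample size")

    verdict = "answer_with_caveats" if concerns else "answer"
    return verdict, concerns


thin = SoftSignals(0.9, True, True, 0.85, True, sample_size=12)
print(decide(thin))
```

Returning the concerns alongside the verdict lets the agent tell the user which signal was weak, which is the point of surfacing uncertainty rather than hiding it.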
When the gap is most painful¶
- High-stakes decisions (CFO asks revenue question; wrong answer drives wrong financial decision).
- Adversarial or stale data (sources contradict; agent must disambiguate without ground truth).
- Long-horizon queries (multi-step reasoning compounds errors; no per-step oracle).
- Cross-system queries (no single source of truth exists; each source is partial).
When the gap is less painful¶
- Single-source queries with strong governance (one canonical table; no ambiguity).
- Closed-form queries (e.g., "how many rows in table X?" — has a deterministic answer in SQL).
- Sandboxed queries (the agent runs against a fixed snapshot; re-runs are deterministic).
These cases shrink the gap toward the coding-agent shape.
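A closed-form query against a fixed snapshot shows what the coding-agent shape looks like: the answer can be asserted exactly. The sketch below uses Python's built-in `sqlite3` with an in-memory database standing in for the snapshot; the `orders` table and the `count_rows` helper are made up for illustration.

```python
import sqlite3


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Answer 'how many rows in table X?' deterministically."""
    # Table names cannot be bound as SQL parameters, so check the name
    # against the catalog before interpolating it into the statement.
    known = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    if table not in known:
        raise ValueError(f"unknown table: {table}")
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# A fixed snapshot: the data does not change between runs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (25.5,), (7.25,)])
conn.commit()

first = count_rows(conn, "orders")
second = count_rows(conn, "orders")
assert first == second == 3  # a hard oracle exists for this question
```

Here the re-run acts like a test going green. An open-ended question such as "why did revenue spike?" has no equivalent assertion to write.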
Relationship to related concepts¶
- concepts/data-agent-unique-challenges lists the three challenges; this concept is a deep dive on the third.
- concepts/parallel-thinking-trajectory-sampling is a primary architectural response.
- concepts/agent-self-correction-loop is the second architectural response (intra-trajectory).
- concepts/source-of-truth-disambiguation is the response to the data-contradictory variant of the gap.
Seen in¶
- sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie: the first source on this wiki to name the verifiable-test gap as a data-agent design property. Verbatim: "data agents have no corresponding test because the 'specification' is just the high-level user query without a notion of the expected correct answer." Includes the unanswerability sub-property: "queries may not always be answerable because of incompleteness in data." The architectural responses (parallel thinking, self-correction, and source-of-truth disambiguation) all derive from this gap.