CONCEPT Cited by 1 source

Verifiable-test gap (data queries)

The verifiable-test gap is the structural absence of a deterministic oracle for "is this answer correct?" in open-ended data queries — the property that makes data agents fundamentally harder than coding agents. Named in the 2026-05-08 Databricks post on Genie as one of the three unique challenges of data agents.

The verbatim framing

From the source: "Unlike coding agents that can use deterministic, verifiable tests to iteratively refine code, data agents have no corresponding test because the 'specification' is just the high-level user query without a notion of the expected correct answer. Moreover, the queries may not always be answerable because of incompleteness in data, and it is important for data agents to be able to identify such cases and surface it back to users."

Two distinct properties bundled in this gap:

  1. No oracle — no deterministic test of "the answer is correct."
  2. No guarantee of answerability — some queries cannot be answered from available data, and the agent must detect this rather than confabulate an answer.

The contrast with coding agents

| Property | Coding agent | Data agent |
| --- | --- | --- |
| Specification | "Make this test pass" (deterministic) | "Why did revenue spike Tuesday?" (open-ended) |
| Oracle | Test suite, type checker, compiler | None (no pre-known correct answer) |
| Iteration loop | Edit → test → repeat until green | No clear "green" state |
| Fail-loud signal | Test failure | None (a wrong answer looks like a right one) |
| Always answerable? | Yes (the task is solvable iff some implementation passes the tests) | No (data may be incomplete) |

For a coding agent, "my code passes the tests" is a meaningful correctness claim. For a data agent, "my answer is consistent with the data I retrieved" is a much weaker claim — the retrieved data might be incomplete, contradictory, or stale.

Why this matters architecturally

The verifiable-test gap is the load-bearing reason data agents need parallel thinking + self-correction:

| Coding agent strategy | Why it doesn't transfer |
| --- | --- |
| Run tests after each edit | No equivalent: can't "test" an explanation of a revenue spike |
| Iterate-until-green | No "green" state: when do you stop? |
| Refactor with confidence | Refactor to what? No correctness target |
| Single-trajectory works | Without an oracle, a single trajectory commits to a potentially wrong answer |

The architectural responses Genie deploys all flow from this gap:

  • concepts/parallel-thinking-trajectory-sampling — sample N trajectories, aggregate; trajectory agreement substitutes for test-pass.
  • concepts/agent-self-correction-loop — detect intra-trajectory inconsistencies as a soft oracle.
  • Surface unanswerability — when the agent cannot reach a consistent answer (low trajectory agreement OR detected data incompleteness), tell the user rather than confabulate.
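A minimal sketch of how trajectory agreement could stand in for a test pass, and how low agreement could turn into an explicit "cannot answer" instead of a guess. The source does not describe Genie's aggregation logic; the majority vote, the `Verdict` type, and the `min_agreement` threshold are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Verdict:
    answer: Optional[str]  # None means "surface unanswerability to the user"
    agreement: float       # share of all trajectories backing the top answer
    reason: str


def aggregate_trajectories(
    answers: Sequence[Optional[str]],
    min_agreement: float = 0.6,  # hypothetical threshold, not from the source
) -> Verdict:
    """Majority vote over N trajectory answers.

    A None entry marks a trajectory that reported missing data.
    """
    if not answers:
        return Verdict(None, 0.0, "no trajectories were sampled")
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return Verdict(None, 0.0, "every trajectory reported incomplete data")
    top_answer, top_count = counts.most_common(1)[0]
    # Trajectories that found no data count against agreement.
    agreement = top_count / len(answers)
    if agreement < min_agreement:
        return Verdict(None, agreement, "trajectories disagree")
    return Verdict(top_answer, agreement, "trajectories agree")
```

Calling `aggregate_trajectories(["promo", "promo", "promo", "promo", "pricing bug"])` returns the majority answer, while three mutually different answers fall below the threshold and yield `answer=None`, which the caller would report to the user. Exact string matching is a simplification; real answers would need some normalization before voting.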

The unanswerability dimension

A subtle property highlighted in the source: "queries may not always be answerable because of incompleteness in data, and it is important for data agents to be able to identify such cases and surface it back to users."

Three failure modes the agent must distinguish:

| Failure | Coding agent equivalent | Data agent handling |
| --- | --- | --- |
| Question unclear | Compiler error on syntax | Ask clarifying question |
| Data missing for question | Test references undefined function | Surface "data is incomplete for this question"; do not confabulate |
| Data contradictory | Tests disagree | concepts/source-of-truth-disambiguation reasoning + surface ambiguity if irreconcilable |

A data agent that silently confabulates in the missing-data case is strictly worse than one that admits "I cannot answer this with the data available" — the confabulation looks correct and is therefore acted on.
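One way to detect the missing-data case before answering is to compare what the question needs against what the table actually covers. The sketch below is illustrative only: `TableCoverage`, `find_gaps`, and the column names are hypothetical, and a real agent would read coverage from catalog metadata rather than a hand-built object.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence, Set


@dataclass
class TableCoverage:
    columns: Set[str]
    first_day: date  # earliest day with data
    last_day: date   # latest day with data


def find_gaps(
    required_columns: Sequence[str],
    start: date,
    end: date,
    coverage: TableCoverage,
) -> List[str]:
    """Return human-readable gaps; an empty list means the data covers the question."""
    gaps = []
    missing = sorted(set(required_columns) - coverage.columns)
    if missing:
        gaps.append("missing columns: " + ", ".join(missing))
    if start < coverage.first_day or end > coverage.last_day:
        gaps.append(
            f"data covers {coverage.first_day} to {coverage.last_day}, "
            f"question asks for {start} to {end}"
        )
    return gaps
```

If `find_gaps` returns anything, the agent can show those gaps to the user as the reason it cannot answer, rather than computing a number from partial data.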

Soft correctness signals (substitutes for the missing oracle)

Without a hard oracle, agent designs use multiple soft signals:

| Signal | Where it comes from |
| --- | --- |
| Trajectory agreement | Parallel thinking: N trajectories agree |
| Internal consistency | Self-correction: no intermediate-step contradictions |
| Source-of-truth ranking | Authoritative-source signals corroborate the answer |
| Confidence calibration | Judge sub-agent rates the answer's confidence |
| Data-completeness check | Required data exists / is fresh / covers the question's scope |
| Statistical anomaly check | E.g., "this answer has very-low-sample-size; flag it", as in Trinity's "automatic low-sample-size anomaly flagging" (Source: sources/2026-04-29-databricks-companies-winning-with-ai-built-the-data-layer-first) |

No single signal is sufficient. Stacking signals is how data agents approximate the missing oracle.
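Stacking can be sketched as a hard gate plus a weighted sum of soft scores. The source does not say how signals are combined, so the signal names, the weights, and the choice to treat data completeness as a gate are assumptions for illustration only.

```python
from typing import Dict

# Hypothetical weights; the source does not specify how signals are combined.
WEIGHTS: Dict[str, float] = {
    "trajectory_agreement": 0.4,
    "internal_consistency": 0.3,
    "source_authority": 0.2,
    "judge_confidence": 0.1,
}


def stacked_confidence(signals: Dict[str, float], data_complete: bool) -> float:
    """Combine soft signals (each expected in [0, 1]) into one score in [0, 1]."""
    # Data completeness acts as a gate: no soft signal can rescue missing data.
    if not data_complete:
        return 0.0
    total = 0.0
    for name, weight in WEIGHTS.items():
        score = signals.get(name, 0.0)    # an absent signal contributes nothing
        score = min(1.0, max(0.0, score))  # clamp out-of-range inputs
        total += weight * score
    return total
```

Under these weights, perfect trajectory agreement alone yields only 0.4, which reflects the point that no single signal is enough to trust an answer.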

When the gap is most painful

  • High-stakes decisions (CFO asks revenue question; wrong answer drives wrong financial decision).
  • Adversarial or stale data (sources contradict; agent must disambiguate without ground truth).
  • Long-horizon queries (multi-step reasoning compounds errors; no per-step oracle).
  • Cross-system queries (no single source of truth exists; each source is partial).

When the gap is less painful

  • Single-source queries with strong governance (one canonical table; no ambiguity).
  • Closed-form queries (e.g., "how many rows in table X?" — has a deterministic answer in SQL).
  • Sandboxed queries (the agent runs against a fixed snapshot; re-runs are deterministic).

These cases shrink the gap toward the coding-agent shape.
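The closed-form, fixed-snapshot case can be shown with an ordinary assertion, exactly as a coding agent would use a test. In this sketch an in-memory SQLite database stands in for a governed snapshot; the `orders` table and the `row_count` helper are hypothetical.

```python
import sqlite3


def row_count(conn: sqlite3.Connection, table: str) -> int:
    """Count rows in a table that is known to exist in the snapshot."""
    # Table names cannot be bound as SQL parameters, so check the catalog first.
    known = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    if table not in known:
        raise ValueError(f"unknown table: {table}")
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# Build a tiny fixed snapshot.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (25.5,), (7.25,)])
conn.commit()

# A deterministic oracle exists here: the expected answer is known in advance,
# and re-running the query against the same snapshot gives the same result.
assert row_count(conn, "orders") == 3
assert row_count(conn, "orders") == row_count(conn, "orders")
```

The assertion is only possible because the expected value is known ahead of time, which is precisely what an open-ended question such as "why did revenue spike?" lacks.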

Seen in

  • sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie: canonical first wiki naming of the verifiable-test gap as a data-agent design property. Verbatim: "data agents have no corresponding test because the 'specification' is just the high-level user query without a notion of the expected correct answer." Includes the unanswerability sub-property: "queries may not always be answerable because of incompleteness in data." Architectural responses (parallel thinking + self-correction + source-of-truth disambiguation) all derive from this gap.