PATTERN Cited by 1 source
Four-phase data agent trajectory¶
A data agent processing a complex enterprise question proceeds in four named phases: (1) parallel multi-agent data discovery, (2) data investigation, (3) self-correction loop, (4) verification. This phase decomposition is canonicalised in the 2026-05-08 Databricks post on Genie, which presents it via a worked example (a CFO question about contradictory revenue dashboards). The four-phase shape is architecturally distinct from coding-agent loops: coding agents typically run write-test-iterate cycles without an explicit asset-discovery phase, because their "corpus" is a file system rather than a heterogeneous data graph.
The four phases¶
┌─────────────────────────────────────────────────────────────────┐
│ Phase 1: Parallel multi-agent data discovery │
│ Search sub-agents run in parallel across multiple indices, │
│ finding candidate tables, dashboards, documents, notebooks │
│ relevant to the user query. │
└────────────────────────────┬────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ Phase 2: Data investigation │
│ 2a. SQL extraction (compose queries against candidate │
│ tables) │
│ 2b. Comparative analysis (cross-source comparison) │
│ 2c. Root-cause investigation (drill into discrepancies) │
└────────────────────────────┬────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ Phase 3: Self-correction loop / reconciliation │
│ Detect when intermediate calculations contradict initial │
│ assumptions; revise reasoning chain; re-investigate as │
│ needed. │
└────────────────────────────┬────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ Phase 4: Verification │
│ Final answer presented with the reconciled reasoning chain; │
│ confidence signals + supporting evidence surfaced. │
└─────────────────────────────────────────────────────────────────┘
The verbatim source: "the agent is able to successfully solve the task by proceeding in different phases: (1) parallel multi-agent data discovery, (2) data investigation, (3) self-correction loop, and (4) verification."
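The control flow of the four phases can be sketched in a few lines of Python. This is a minimal illustration, not Genie's implementation: the `Trajectory` record, the three `search_*` sub-agents, the asset names, and the toy contradiction rule in `revise_if_contradicted` are all hypothetical stand-ins chosen to make the sketch runnable.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Optional

Searcher = Callable[[str], list[str]]


@dataclass
class Trajectory:
    question: str
    assets: list[str] = field(default_factory=list)
    findings: dict[str, str] = field(default_factory=dict)
    revisions: int = 0
    answer: str = ""
    evidence: list[str] = field(default_factory=list)


# Hypothetical search sub-agents, one per index. Real ones would query a
# catalog or search service; these return fixed, made-up asset names.
def search_tables(query: str) -> list[str]:
    return ["table:revenue_daily"]


def search_dashboards(query: str) -> list[str]:
    return ["dashboard:revenue_a", "dashboard:revenue_b"]


def search_documents(query: str) -> list[str]:
    return ["doc:pricing_contract"]


def discover(traj: Trajectory, searchers: list[Searcher]) -> None:
    """Phase 1: run every search sub-agent in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(searchers)) as pool:
        hit_lists = list(pool.map(lambda search: search(traj.question), searchers))
    traj.assets = sorted({asset for hits in hit_lists for asset in hits})


def investigate(traj: Trajectory, assumption: str) -> None:
    """Phase 2: placeholder for SQL extraction, comparison, root-cause work."""
    traj.findings = {asset: f"examined assuming {assumption}" for asset in traj.assets}


def revise_if_contradicted(traj: Trajectory, assumption: str) -> Optional[str]:
    """Phase 3 check (toy rule): two dashboards contradict the initial assumption."""
    dashboards = [a for a in traj.assets if a.startswith("dashboard:")]
    if assumption == "one shared revenue computation" and len(dashboards) > 1:
        return "different revenue computations per dashboard"
    return None


def verify(traj: Trajectory, assumption: str) -> None:
    """Phase 4: attach the final assumption and supporting assets to the answer."""
    traj.answer = f"Reconciled under: {assumption}"
    traj.evidence = list(traj.assets)


def run_trajectory(question: str, searchers: list[Searcher], max_revisions: int = 3) -> Trajectory:
    traj = Trajectory(question)
    discover(traj, searchers)
    assumption = "one shared revenue computation"
    # Phases 2 and 3 alternate until no contradiction remains or the budget runs out.
    for _ in range(max_revisions + 1):
        investigate(traj, assumption)
        revised = revise_if_contradicted(traj, assumption)
        if revised is None:
            break
        assumption = revised
        traj.revisions += 1
    verify(traj, assumption)
    return traj
```

Calling `run_trajectory("Why do the two revenue dashboards disagree?", [search_tables, search_dashboards, search_documents])` walks all four phases once. The `max_revisions` cap is an assumption of this sketch, added so that the self-correction loop always terminates.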
Worked example (from the post)¶
User question: "Why do two enterprise dashboards reporting the same product's revenue show contradictory spikes on different dates?"
| Phase | What the agent does |
|---|---|
| 1. Discovery | Search sub-agents (in parallel) find: (a) the two dashboards, (b) underlying revenue tables, (c) pricing contract docs, (d) possibly Slack threads referring to recent rate changes |
| 2. Investigation | Extract SQL for both dashboards; compare values; identify which day each spikes; investigate why each spike occurred (root cause: contract pricing change, multi-day report cadence, etc.) |
| 3. Self-correction | Discover that an early assumption (e.g., "both dashboards use the same revenue computation") is wrong; revise; re-run analysis with corrected assumption |
| 4. Verification | Present reconciled explanation with both dashboards' computations + the contract-rate context that explains the timing discrepancy |
Without phase 1 (parallel discovery), the agent operates with incomplete asset coverage. Without phase 2 (multi-step investigation), the agent can't compose the cross-source reasoning. Without phase 3 (self-correction), the agent commits to wrong intermediate assumptions. Without phase 4 (verification), the agent presents unsupported claims.
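The comparative step of the investigation phase in the worked example (identifying which day each dashboard spikes) can be illustrated with a small sketch. The daily figures below are invented purely for illustration and do not come from the post; `spike_day` is a hypothetical helper that treats the largest day-over-day increase as the spike.

```python
from datetime import date


def spike_day(series: dict[date, float]) -> date:
    """Return the day with the largest increase over the previous day.

    Needs at least two days of data; max() raises ValueError otherwise.
    """
    days = sorted(series)
    jumps = {cur: series[cur] - series[prev] for prev, cur in zip(days, days[1:])}
    return max(jumps, key=jumps.get)


# Made-up daily revenue for the same product as shown by two dashboards.
dashboard_a = {date(2026, 3, d): v for d, v in
               [(1, 100.0), (2, 102.0), (3, 180.0), (4, 101.0), (5, 103.0)]}
dashboard_b = {date(2026, 3, d): v for d, v in
               [(1, 100.0), (2, 101.0), (3, 103.0), (4, 102.0), (5, 179.0)]}

spike_a = spike_day(dashboard_a)
spike_b = spike_day(dashboard_b)
lag_days = (spike_b - spike_a).days  # how far apart the two spikes are
```

A non-zero `lag_days` is the kind of intermediate result that would send the agent into root-cause investigation, for example checking whether the two dashboards key revenue on different dates.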
Why this shape (and not the coding-agent shape)¶
A coding agent's trajectory is typically a write-test-iterate loop: write or edit code, run the tests, and repeat until they pass.
Three properties make this work for code:
- The corpus is the file tree; trivial to enumerate.
- Tests are the oracle; deterministic correctness signal.
- Iteration is cheap; edit-test cycles run fast.
A data agent has none of these:
- Corpus is heterogeneous — needs the parallel discovery phase to surface candidate assets.
- No oracle — needs the self-correction phase as a substitute for test-pass.
- Iteration is expensive — full SQL re-runs over warehouse data cost real money + time; the agent can't afford the same loop density as a coding agent.
The four-phase shape is the structural fit for the data-agent challenges:
| Phase | Addresses challenge |
|---|---|
| Phase 1: Discovery | #1 Scale of data discovery |
| Phase 2: Investigation | (the actual reasoning step — leverages discovered assets) |
| Phase 3: Self-correction | #3 No verifiable oracle (intra-trajectory check) |
| Phase 4: Verification | #2 Source-of-truth + #3 Surface confidence (aggregate check) |
Composes with parallel thinking¶
In a parallel thinking design, the four-phase sequence runs once per sampled trajectory. Each of the N trajectories goes through all four phases independently, and aggregation across trajectories happens only after those phases have finished. So the layered design is:
For each of N trajectories:
Phase 1 → Phase 2 → Phase 3 → Phase 4
│
▼
[Aggregator across N]
│
▼
Final answer
This gives two redundant layers of correctness signalling: self-correction within each trajectory and parallel-thinking aggregation across trajectories, both substituting for the missing oracle.
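The outer layer can be sketched as below. The source does not say how the aggregator combines trajectories, so majority voting over final answers is an assumption made here for illustration; `toy_trajectory` and its answer strings are likewise hypothetical, and a real system would run the N trajectories concurrently rather than in a list comprehension.

```python
from collections import Counter
from typing import Callable


def sample_and_aggregate(trajectory_fn: Callable[[int], str], n: int) -> tuple[str, float]:
    """Run n independent trajectories and return (majority answer, agreement ratio).

    Requires n >= 1. Ties go to the answer that was produced first.
    """
    answers = [trajectory_fn(seed) for seed in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


def toy_trajectory(seed: int) -> str:
    """Stand-in for one full four-phase run; one seed lands on a different answer."""
    if seed == 3:
        return "data pipeline outage"
    return "dashboards differ in date basis"


answer, agreement = sample_and_aggregate(toy_trajectory, 5)
```

The agreement ratio is one possible confidence signal: a low value suggests that the trajectories did not converge and that the answer deserves less trust.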
Variants / when phases collapse¶
Not every query needs all four phases at full depth:
- Trivial queries ("how many rows in table X?") — discovery is one-shot; investigation is one SQL query; self-correction is unnecessary; verification is trivial.
- Pre-discovered scope (the user is already in a Genie room scoped to specific tables) — phase 1 is partially short-circuited.
- Repeat queries — earlier discovery cached; later phases shorter.
The phases are logical, not always literal — but the shape of the agent's reasoning maps to them.
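One way to picture the collapse is a planner that chooses a depth for each phase from a few query traits. Everything in this sketch is hypothetical: the `QueryTraits` fields, the depth labels, and the rule order are illustrative choices that mirror the three cases above, not behaviour documented in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryTraits:
    scoped_tables: tuple[str, ...] = ()  # user is already in a scoped room
    cached_assets: tuple[str, ...] = ()  # discovery results from an earlier query
    single_lookup: bool = False          # e.g. "how many rows in table X?"


def plan_phases(traits: QueryTraits) -> dict[str, str]:
    """Pick a depth per phase; all phases default to full depth."""
    plan = {
        "discovery": "full",
        "investigation": "full",
        "self_correction": "full",
        "verification": "full",
    }
    if traits.scoped_tables:
        plan["discovery"] = "partial"
    if traits.cached_assets:
        plan["discovery"] = "cached"
    if traits.single_lookup:
        # Trivial queries collapse every phase to its minimal form.
        plan.update(
            discovery="one-shot",
            investigation="single query",
            self_correction="skip",
            verification="trivial",
        )
    return plan
```

For example, `plan_phases(QueryTraits(scoped_tables=("sales.orders",)))` shortens only the discovery phase and leaves the other three at full depth.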
Anti-patterns this prevents¶
| Anti-pattern | Phase that prevents it |
|---|---|
| Agent commits to first plausible answer | Phase 3 (self-correction) |
| Agent answers from a single source without checking others | Phase 1 (discovery surfaces all candidates) |
| Agent proceeds with wrong source-of-truth assumption | Phase 3 + concepts/source-of-truth-disambiguation |
| Agent presents unsupported answer | Phase 4 (verification) |
| Agent confabulates when data is incomplete | Phase 4 (verification surfaces unanswerability) |
When this fits / doesn't¶
Fits:
- Agent operating over heterogeneous enterprise data sources.
- Multi-step questions requiring cross-source reasoning.
- High-stakes questions where wrong answers have real cost.
- Lakehouse / data-warehouse substrate with semantic context to ground discovery in.
Doesn't fit:
- Single-source closed-form queries (running all four phases is over-engineering).
- Agents operating on deterministic substrates (file system, code) — use coding-agent loop instead.
- Latency-critical paths (running all four phases adds latency to every query).
Relationship to related patterns¶
- patterns/parallel-trajectory-sampling-and-aggregation runs this four-phase trajectory N times; aggregates across.
- patterns/llm-per-subagent-with-optimized-prompts is what each phase's sub-agent uses; per-phase model assignment.
- patterns/semantic-context-grounded-search-index is what phase 1 uses internally.
Seen in¶
- sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie — canonical first wiki disclosure of the four-phase data-agent trajectory shape. Worked example: CFO question about contradictory revenue dashboards proceeds through (1) parallel multi-agent asset discovery, (2) data investigation (SQL + comparative + root-cause), (3) self-correction loop, (4) final verification. Positioned as the structural fit for the three data-agent unique challenges.
Related¶
- systems/databricks-genie
- concepts/data-agent-unique-challenges
- concepts/parallel-thinking-trajectory-sampling
- concepts/agent-self-correction-loop
- concepts/source-of-truth-disambiguation
- concepts/specialized-knowledge-search
- patterns/parallel-trajectory-sampling-and-aggregation
- patterns/llm-per-subagent-with-optimized-prompts
- patterns/semantic-context-grounded-search-index