
PATTERN Cited by 1 source

Four-phase data agent trajectory

A data agent processing a complex enterprise question proceeds in four named phases: (1) parallel multi-agent data discovery, (2) data investigation, (3) self-correction loop, (4) verification. This decomposition is canonicalised in the 2026-05-08 Databricks post on Genie, which presents it through a worked example (a CFO question about contradictory revenue dashboards). The four-phase shape differs architecturally from coding-agent loops: coding agents typically run write-test-iterate cycles without an explicit asset-discovery phase, because their "corpus" is a flat file system rather than a heterogeneous data graph.

The four phases

┌─────────────────────────────────────────────────────────────────┐
│ Phase 1: Parallel multi-agent data discovery                    │
│  Search sub-agents run in parallel across multiple indices,     │
│  finding candidate tables, dashboards, documents, notebooks     │
│  relevant to the user query.                                    │
└────────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Phase 2: Data investigation                                      │
│   2a. SQL extraction (compose queries against candidate         │
│       tables)                                                    │
│   2b. Comparative analysis (cross-source comparison)            │
│   2c. Root-cause investigation (drill into discrepancies)       │
└────────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Phase 3: Self-correction loop / reconciliation                   │
│   Detect when intermediate calculations contradict initial      │
│   assumptions; revise reasoning chain; re-investigate as        │
│   needed.                                                        │
└────────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Phase 4: Verification                                            │
│   Final answer presented with the reconciled reasoning chain;   │
│   confidence signals + supporting evidence surfaced.            │
└─────────────────────────────────────────────────────────────────┘

Verbatim from the source: "the agent is able to successfully solve the task by proceeding in different phases: (1) parallel multi-agent data discovery, (2) data investigation, (3) self-correction loop, and (4) verification."
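To make the control flow concrete, here is a minimal sketch of the four phases chained into one trajectory. It is an illustration under stated assumptions, not Genie's implementation: the in-memory `INDICES`, the asset names, and every function name are hypothetical, and phases 2 to 4 are stubs that only show where each responsibility sits.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory indices; a real agent would call search services.
INDICES = {
    "tables": ["revenue_daily", "orders"],
    "dashboards": ["revenue_by_booking_date", "revenue_by_invoice_date"],
    "documents": ["pricing_contract_2026"],
}

def search_index(index_name, keyword):
    """One search sub-agent: return (index, asset) hits for a keyword."""
    return [(index_name, asset) for asset in INDICES[index_name] if keyword in asset]

def discover(keyword):
    """Phase 1: one sub-agent per index, run in parallel, hits merged."""
    with ThreadPoolExecutor(max_workers=len(INDICES)) as pool:
        per_index = pool.map(lambda name: search_index(name, keyword), INDICES)
        return sorted(hit for hits in per_index for hit in hits)

def investigate(assets):
    """Phase 2 (stub): gather evidence and record the starting assumption."""
    return {"evidence": assets, "assumption": "all dashboards share one date basis"}

def self_correct(findings):
    """Phase 3 (stub): revise the assumption if the evidence contradicts it."""
    dashboards = [name for kind, name in findings["evidence"] if kind == "dashboards"]
    date_bases = sorted({name.rsplit("_by_", 1)[-1] for name in dashboards})
    if len(date_bases) > 1:
        revised = "dashboards use different date bases: " + ", ".join(date_bases)
        findings = dict(findings, assumption=revised)
    return findings

def verify(findings):
    """Phase 4: return the conclusion with its evidence, or flag unanswerable."""
    if not findings["evidence"]:
        return {"status": "unanswerable", "answer": None, "evidence": []}
    return {"status": "supported", "answer": findings["assumption"],
            "evidence": findings["evidence"]}

def run_trajectory(keyword):
    """Chain the four phases in order for a single trajectory."""
    return verify(self_correct(investigate(discover(keyword))))

result = run_trajectory("revenue")
print(result["status"], "-", result["answer"])
```

The point of the sketch is the ordering: discovery fans out before any reasoning starts, and verification is the only phase allowed to declare the question unanswerable.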

Worked example (from the post)

User question: "Why do two enterprise dashboards reporting the same product's revenue show contradictory spikes on different dates?"

Phase | What the agent does
----- | -------------------
1. Discovery | Search sub-agents (in parallel) find: (a) the two dashboards, (b) underlying revenue tables, (c) pricing contract docs, (d) possibly Slack threads referring to recent rate changes
2. Investigation | Extract SQL for both dashboards; compare values; identify the day on which each spikes; investigate why each spike occurred (root cause: contract pricing change, multi-day report cadence, etc.)
3. Self-correction | Discover that an early assumption (e.g., "both dashboards use the same revenue computation") is wrong; revise; re-run the analysis with the corrected assumption
4. Verification | Present the reconciled explanation with both dashboards' computations plus the contract-rate context that explains the timing discrepancy

Without phase 1 (parallel discovery), the agent operates with incomplete asset coverage. Without phase 2 (multi-step investigation), the agent can't compose the cross-source reasoning. Without phase 3 (self-correction), the agent commits to wrong intermediate assumptions. Without phase 4 (verification), the agent presents unsupported claims.
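The comparison step of the worked example (find the day each dashboard spikes, then check whether the two agree) can be sketched in a few lines. The revenue figures below are invented for illustration and do not come from the post; `spike_day` and `compare_spikes` are hypothetical helper names.

```python
from datetime import date

# Invented daily revenue for the same product as shown by two dashboards.
dashboard_a = {"2026-03-01": 100, "2026-03-02": 105, "2026-03-03": 180, "2026-03-04": 182}
dashboard_b = {"2026-03-01": 100, "2026-03-02": 103, "2026-03-03": 106, "2026-03-04": 185}

def spike_day(series):
    """Return the ISO date with the largest day-over-day increase."""
    days = sorted(series)  # ISO dates sort chronologically as strings
    jumps = {curr: series[curr] - series[prev] for prev, curr in zip(days, days[1:])}
    return max(jumps, key=jumps.get)

def compare_spikes(a, b):
    """Compare the spike days of two series and report the lag between them."""
    day_a, day_b = spike_day(a), spike_day(b)
    lag = (date.fromisoformat(day_b) - date.fromisoformat(day_a)).days
    return {"spike_a": day_a, "spike_b": day_b, "lag_days": lag, "contradiction": lag != 0}

report = compare_spikes(dashboard_a, dashboard_b)
if report["contradiction"]:
    # A non-zero lag is the signal that hands control to the self-correction phase.
    print(f"Spikes differ by {report['lag_days']} day(s); revisit the shared-computation assumption.")
```

A non-zero lag does not explain the discrepancy by itself; it only tells the agent that the "same revenue computation" assumption needs to be re-examined against the discovered contract and pricing assets.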

Why this shape (and not the coding-agent shape)

A coding agent's trajectory is typically:

Plan → Edit → Test → (loop until tests pass)

Three properties make this work for code:

  • The corpus is the file tree; trivial to enumerate.
  • Tests are the oracle; deterministic correctness signal.
  • Iteration is cheap; edit-test cycles run fast.

A data agent has none of these:

  • Corpus is heterogeneous — needs the parallel discovery phase to surface candidate assets.
  • No oracle — needs the self-correction phase as a substitute for test-pass.
  • Iteration is expensive — full SQL re-runs over warehouse data cost real money + time; the agent can't afford the same loop density as a coding agent.
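One way to picture the lower loop density is a correction loop that is bounded by a cost budget rather than by "until tests pass". This is a sketch under assumptions: the cost units, `budgeted_loop`, and the toy `run_query` and `is_consistent` callables are all hypothetical and not described in the source.

```python
def budgeted_loop(run_query, is_consistent, budget_units, cost_per_run):
    """Re-run an expensive step until the result is consistent or the budget runs out."""
    spent, attempt, result = 0, 0, None
    while spent + cost_per_run <= budget_units:
        result = run_query(attempt)
        spent += cost_per_run
        attempt += 1
        if is_consistent(result):
            return {"result": result, "runs": attempt, "spent": spent, "converged": True}
    # Budget exhausted: return the last result and say so explicitly.
    return {"result": result, "runs": attempt, "spent": spent, "converged": False}

# Toy stand-ins: the third run (attempt index 2) is the first consistent one.
def toy_query(attempt):
    return attempt

def toy_check(result):
    return result >= 2

tight = budgeted_loop(toy_query, toy_check, budget_units=10, cost_per_run=4)
roomy = budgeted_loop(toy_query, toy_check, budget_units=12, cost_per_run=4)
print(tight["converged"], roomy["converged"])
```

With a budget of 10 units and 4 units per run, only two runs fit and the loop stops without converging; a coding agent running cheap tests would rarely need this kind of cap.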

The four-phase shape is the structural fit for the data-agent challenges:

Phase | Addresses challenge
----- | -------------------
Phase 1: Discovery | #1 Scale of data discovery
Phase 2: Investigation | (the actual reasoning step; leverages discovered assets)
Phase 3: Self-correction | #3 No verifiable oracle (intra-trajectory check)
Phase 4: Verification | #2 Source-of-truth + #3 Surface confidence (aggregate check)

Composes with parallel thinking

In a parallel thinking design, the four-phase sequence runs once per trajectory. Each of the N trajectories goes through all four phases independently; aggregation across trajectories happens only after each has finished its fourth phase. So the layered design is:

For each of N trajectories:
  Phase 1 → Phase 2 → Phase 3 → Phase 4
                          [Aggregator across N]
                                Final answer

This is double-redundant correctness signalling — self-correction within each trajectory + parallel-thinking aggregation across trajectories, both substituting for the missing oracle.
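The layering can be sketched as N independent trajectories followed by an aggregator. The source does not say how aggregation is done; majority voting with an agreement ratio is an assumption made here for illustration, and `aggregate`, `run_parallel_thinking`, and `toy_trajectory` are hypothetical names.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def aggregate(answers):
    """Majority vote over trajectory answers; returns (answer, agreement ratio)."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None, 0.0
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

def run_parallel_thinking(trajectory, n):
    """Run n independent trajectories, then aggregate their final answers."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(trajectory, range(n)))
    return aggregate(answers)

def toy_trajectory(seed):
    """Stand-in for a full four-phase run; one seed reaches a different answer."""
    return "data error" if seed == 2 else "contract rate change"

answer, agreement = run_parallel_thinking(toy_trajectory, 4)
print(answer, agreement)
```

The agreement ratio is one possible confidence signal: low agreement across trajectories suggests the question is harder or the sources more ambiguous than a single trajectory would reveal.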

Variants / when phases collapse

Not every query needs all four phases at full depth:

  • Trivial queries ("how many rows in table X?") — discovery is one-shot; investigation is one SQL query; self-correction is unnecessary; verification is trivial.
  • Pre-discovered scope (the user is already in a Genie room scoped to specific tables) — phase 1 is partially short-circuited.
  • Repeat queries — earlier discovery cached; later phases shorter.

The phases are logical, not always literal — but the shape of the agent's reasoning maps to them.
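The three variants above can be read as a small planning step that decides how deep each phase runs. The sketch below is hypothetical: `QueryContext`, `plan_phases`, and the depth labels are illustrative names, not part of Genie.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryContext:
    """Hypothetical description of what is already known about a query."""
    trivial: bool = False
    scoped_tables: tuple = ()   # e.g. tables the current room is scoped to
    cached_assets: tuple = ()   # assets found by an earlier, identical query

def plan_phases(ctx):
    """Choose a depth for each phase; every phase stays in the plan."""
    plan = {
        "discovery": "full",
        "investigation": "full",
        "self_correction": "full",
        "verification": "full",
    }
    if ctx.cached_assets:
        plan["discovery"] = "cached"       # repeat query
    elif ctx.scoped_tables:
        plan["discovery"] = "partial"      # pre-discovered scope
    if ctx.trivial:
        plan.update(discovery="one-shot", investigation="single-query",
                    self_correction="skip", verification="trivial")
    return plan

print(plan_phases(QueryContext(scoped_tables=("revenue_daily",))))
```

Keeping all four keys in the plan, even when a phase is skipped, mirrors the point that the phases are logical: the shape is preserved while individual phases collapse.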

Anti-patterns this prevents

Anti-pattern | Phase that prevents it
------------ | ----------------------
Agent commits to first plausible answer | Phase 3 (self-correction)
Agent answers from a single source without checking others | Phase 1 (discovery surfaces all candidates)
Agent proceeds with wrong source-of-truth assumption | Phase 3 + concepts/source-of-truth-disambiguation
Agent presents unsupported answer | Phase 4 (verification)
Agent confabulates when data is incomplete | Phase 4 (verification surfaces unanswerability)

When this fits / doesn't

Fits:

  • Agent operating over heterogeneous enterprise data sources.
  • Multi-step questions requiring cross-source reasoning.
  • High-stakes questions where wrong answers have real cost.
  • Lakehouse / data-warehouse substrate with semantic context to ground discovery in.

Doesn't fit:

  • Single-source closed-form queries (full four phases is over-engineering).
  • Agents operating on deterministic substrates (file system, code) — use coding-agent loop instead.
  • Latency-critical paths (running all four phases adds latency to every query).

Seen in

  • sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie: canonical first wiki disclosure of the four-phase data-agent trajectory shape. Worked example: CFO question about contradictory revenue dashboards proceeds through (1) parallel multi-agent asset discovery, (2) data investigation (SQL + comparative + root-cause), (3) self-correction loop, (4) final verification. Positioned as the structural fit for the three data-agent unique challenges.