PATTERN Cited by 1 source

Four-phase data agent trajectory

A data agent answering a complex enterprise question proceeds through four named phases: (1) parallel multi-agent data discovery, (2) data investigation, (3) self-correction loop, (4) verification. The decomposition comes from the 2026-05-08 Databricks post on Genie, which presents it through a worked example (a CFO question about contradictory revenue dashboards). The four-phase shape differs from a coding-agent loop: coding agents typically run write-test-iterate cycles without an explicit asset-discovery phase, because their "corpus" is a file tree rather than a heterogeneous data graph.

The four phases

┌─────────────────────────────────────────────────────────────────┐
│ Phase 1: Parallel multi-agent data discovery                    │
│  Search sub-agents run in parallel across multiple indices,     │
│  finding candidate tables, dashboards, documents, notebooks     │
│  relevant to the user query.                                    │
└────────────────────────────┬────────────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│ Phase 2: Data investigation                                     │
│   2a. SQL extraction (compose queries against candidate         │
│       tables)                                                   │
│   2b. Comparative analysis (cross-source comparison)            │
│   2c. Root-cause investigation (drill into discrepancies)       │
└────────────────────────────┬────────────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│ Phase 3: Self-correction loop / reconciliation                  │
│   Detect when intermediate calculations contradict initial      │
│   assumptions; revise reasoning chain; re-investigate as        │
│   needed.                                                       │
└────────────────────────────┬────────────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│ Phase 4: Verification                                           │
│   Final answer presented with the reconciled reasoning chain;   │
│   confidence signals + supporting evidence surfaced.            │
└─────────────────────────────────────────────────────────────────┘

Verbatim from the source: "the agent is able to successfully solve the task by proceeding in different phases: (1) parallel multi-agent data discovery, (2) data investigation, (3) self-correction loop, and (4) verification."
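
The phase sequence can be sketched as a small Python pipeline. This is an illustrative skeleton only: `Trajectory`, `contradiction_found`, and the stand-in phase bodies are hypothetical names chosen for this sketch, not anything described in the Databricks post.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    question: str
    assets: list = field(default_factory=list)    # filled by phase 1
    findings: list = field(default_factory=list)  # filled by phase 2
    revisions: int = 0                            # counted by phase 3
    verified: bool = False                        # set by phase 4
    log: list = field(default_factory=list)       # order in which phases ran


def discover(t):
    # Phase 1 stand-in: a real system would fan out to search sub-agents.
    t.assets = ["dashboard_a", "dashboard_b", "revenue_table"]
    t.log.append("discovery")


def investigate(t):
    # Phase 2 stand-in: one finding per discovered asset.
    t.findings = [f"checked {a}" for a in t.assets]
    t.log.append("investigation")


def contradiction_found(t):
    # Toy assumption: the first pass always contradicts an initial assumption.
    return t.revisions == 0


def self_correct(t, max_revisions=2):
    # Phase 3 stand-in: re-investigate while a contradiction remains,
    # up to a fixed revision limit.
    while contradiction_found(t) and t.revisions < max_revisions:
        t.revisions += 1
        investigate(t)
    t.log.append("self-correction")


def verify(t):
    # Phase 4 stand-in: only mark the answer verified if findings exist.
    t.verified = bool(t.findings)
    t.log.append("verification")


def run_trajectory(question):
    t = Trajectory(question)
    discover(t)
    investigate(t)
    self_correct(t)
    verify(t)
    return t


result = run_trajectory("Why do the two revenue dashboards disagree?")
print(result.log)
```

In this toy run the investigation step appears twice in the log, because the self-correction phase triggers one re-investigation before verification.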

Worked example (from the post)

User question: "Why do two enterprise dashboards reporting the same product's revenue show contradictory spikes on different dates?"

Phase	What the agent does
1. Discovery	Search sub-agents, running in parallel, find: (a) the two dashboards, (b) the underlying revenue tables, (c) pricing contract docs, (d) possibly Slack threads referring to recent rate changes
2. Investigation	Extract the SQL behind both dashboards; compare values; identify the day on which each spikes; investigate why each spike occurred (root cause: contract pricing change, multi-day report cadence, etc.)
3. Self-correction	Discover that an early assumption (e.g., "both dashboards use the same revenue computation") is wrong; revise; re-run analysis with corrected assumption
4. Verification	Present reconciled explanation with both dashboards' computations + the contract-rate context that explains the timing discrepancy

Without phase 1 (parallel discovery), the agent operates with incomplete asset coverage. Without phase 2 (multi-step investigation), the agent can't compose the cross-source reasoning. Without phase 3 (self-correction), the agent commits to wrong intermediate assumptions. Without phase 4 (verification), the agent presents unsupported claims.
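
As a rough illustration of the phase 1 fan-out, the sketch below runs one search sub-agent per index in parallel with a thread pool. The `INDICES` contents, the keyword matching, and the function names are all hypothetical stand-ins; a real deployment would call actual search services rather than in-memory lists.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "indices", one per asset type.
INDICES = {
    "tables": ["revenue_daily", "orders", "contracts"],
    "dashboards": ["Revenue by product (Finance)", "Revenue by product (Sales)"],
    "documents": ["Pricing contract 2025", "Onboarding guide"],
}


def search_index(name, keywords):
    """One search sub-agent: naive keyword match over a single index."""
    hits = [item for item in INDICES[name]
            if any(k in item.lower() for k in keywords)]
    return name, hits


def discover(keywords):
    """Fan out one sub-agent per index and merge the candidate assets."""
    with ThreadPoolExecutor(max_workers=len(INDICES)) as pool:
        results = pool.map(lambda name: search_index(name, keywords), INDICES)
        return dict(results)


candidates = discover(["revenue", "pricing"])
for index_name, hits in candidates.items():
    print(index_name, hits)
```

Threads suit this sketch because each sub-agent would mostly wait on I/O; the merge step simply keys results by index so later phases can see which asset type each candidate came from.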

Why this shape (and not the coding-agent shape)

A coding agent's trajectory is typically:

Plan → Edit → Test → (loop until tests pass)

Three properties make this work for code:

The corpus is the file tree; trivial to enumerate.
Tests are the oracle; deterministic correctness signal.
Iteration is cheap; edit-test cycles run fast.

A data agent has none of these:

Corpus is heterogeneous — needs the parallel discovery phase to surface candidate assets.
No oracle — needs the self-correction phase as a substitute for test-pass.
Iteration is expensive: full SQL re-runs over warehouse data cost real money and time, so the agent cannot afford the loop density of a coding agent.
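
One way to picture the cost constraint is a self-correction loop that stops when a spending budget is exhausted. The sketch below is an assumption-laden toy: `self_correct`, `find_contradiction`, `rerun`, and the cost figures are hypothetical, and the source does not describe how Genie bounds its loop.

```python
def self_correct(result, find_contradiction, rerun, budget, cost_per_run):
    """Revise `result` until no contradiction remains or the budget runs out.

    Returns (result, spent, resolved).
    """
    spent = 0.0
    while True:
        issue = find_contradiction(result)
        if issue is None:
            return result, spent, True    # nothing left to fix
        if spent + cost_per_run > budget:
            return result, spent, False   # cannot afford another re-run
        result = rerun(result, issue)
        spent += cost_per_run


# Toy stand-ins for the investigation step (hypothetical).
def find_contradiction(result):
    if result["assumes_same_formula"]:
        return "dashboards use different revenue formulas"
    return None


def rerun(result, issue):
    # Pretend to re-run the SQL with the faulty assumption dropped.
    return {**result, "assumes_same_formula": False, "fixed": issue}


first_pass = {"assumes_same_formula": True}
revised, spent, resolved = self_correct(
    first_pass, find_contradiction, rerun, budget=5.0, cost_per_run=2.0
)
print(revised, spent, resolved)
```

Returning the `resolved` flag alongside the result lets a later verification step distinguish a reconciled answer from one that merely ran out of budget.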

The four-phase shape is the structural fit for the data-agent challenges:

Phase	Addresses challenge
Phase 1: Discovery	#1 Scale of data discovery
Phase 2: Investigation	(the actual reasoning step — leverages discovered assets)
Phase 3: Self-correction	#3 No verifiable oracle (intra-trajectory check)
Phase 4: Verification	#2 Source-of-truth + #3 Surface confidence (aggregate check)

Composes with parallel thinking

In a parallel thinking design, the four phases run once per sampled trajectory. Each of the N trajectories goes through all four phases independently, and aggregation across trajectories happens only after the four phases complete. The layered design is:

For each of N trajectories:
  Phase 1 → Phase 2 → Phase 3 → Phase 4
                                     │
                                     ▼
                          [Aggregator across N]
                                     │
                                     ▼
                                Final answer

This gives two layers of correctness signalling: self-correction within each trajectory and parallel-thinking aggregation across trajectories, both substituting for the missing oracle.
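
The aggregation step can be sketched with a simple majority vote over the final answers of N trajectories. Majority voting is an assumption made for this example; the source does not say which aggregation method is used, and `run_trajectory` here is a toy stand-in for a full four-phase run.

```python
from collections import Counter


def run_trajectory(seed):
    # Toy stand-in for one four-phase trajectory: every third run
    # lands on a different explanation.
    return "pricing change" if seed % 3 else "reporting lag"


def aggregate(answers):
    """Return the most common answer and the fraction of trajectories agreeing."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


answers = [run_trajectory(seed) for seed in range(5)]
final_answer, agreement = aggregate(answers)
print(final_answer, agreement)
```

The agreement fraction doubles as a crude confidence signal: low agreement across trajectories suggests the question may not have a well-supported answer.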

Variants / when phases collapse

Not every query needs all four phases at full depth:

Trivial queries ("how many rows in table X?") — discovery is one-shot; investigation is one SQL query; self-correction is unnecessary; verification is trivial.
Pre-discovered scope (the user is already in a Genie room scoped to specific tables) — phase 1 is partially short-circuited.
Repeat queries — earlier discovery cached; later phases shorter.

The phases are logical, not always literal — but the shape of the agent's reasoning maps to them.
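
The collapse cases above can be expressed as a small planning function that decides how deep each phase runs. The depth labels and the `plan_phases` signature are hypothetical, chosen only to mirror the three variants listed.

```python
FULL = {
    "discovery": "full",
    "investigation": "full",
    "self_correction": "full",
    "verification": "full",
}


def plan_phases(trivial=False, pre_scoped=False, cached_discovery=False):
    """Return a per-phase depth plan for one query."""
    plan = dict(FULL)
    if pre_scoped:
        plan["discovery"] = "partial"   # scope already narrows the assets
    if cached_discovery:
        plan["discovery"] = "cached"    # reuse an earlier discovery result
    if trivial:
        plan.update(
            discovery="one-shot",
            investigation="single query",
            self_correction="skipped",
            verification="trivial",
        )
    return plan


print(plan_phases(pre_scoped=True))
print(plan_phases(trivial=True))
```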

Anti-patterns this prevents

Anti-pattern	Phase that prevents it
Agent commits to first plausible answer	Phase 3 (self-correction)
Agent answers from a single source without checking others	Phase 1 (discovery surfaces all candidates)
Agent proceeds with wrong source-of-truth assumption	Phase 3 + concepts/source-of-truth-disambiguation
Agent presents unsupported answer	Phase 4 (verification)
Agent confabulates when data is incomplete	Phase 4 (verification surfaces unanswerability)
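
A minimal sketch of the last two rows: a verification step that refuses to present claims lacking evidence. The `verify` function and its return shape are hypothetical, intended only to show how unsupported or unanswerable cases can be surfaced instead of papered over.

```python
def verify(claims, evidence):
    """Check that every claim has at least one supporting evidence item.

    `evidence` maps a claim to a list of supporting sources.
    """
    unsupported = [c for c in claims if not evidence.get(c)]
    if unsupported:
        # Surface the gap rather than presenting an unsupported answer.
        return {"status": "unsupported", "missing": unsupported}
    return {"status": "verified", "evidence": {c: evidence[c] for c in claims}}


claims = ["dashboards use different formulas", "contract rate changed"]
evidence = {
    "dashboards use different formulas": ["dashboard_a.sql", "dashboard_b.sql"],
    "contract rate changed": [],  # nothing found for this claim
}
print(verify(claims, evidence))
```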

When this fits / doesn't

Fits:

Agent operating over heterogeneous enterprise data sources.
Multi-step questions requiring cross-source reasoning.
High-stakes questions where wrong answers have real cost.
Lakehouse / data-warehouse substrate with semantic context to ground discovery in.

Doesn't fit:

Single-source closed-form queries (running all four phases is over-engineering).
Agents operating on deterministic substrates (file system, code); use a coding-agent loop instead.
Latency-critical paths (running all four phases adds latency to every query).

Related patterns:

patterns/parallel-trajectory-sampling-and-aggregation runs this four-phase trajectory N times and aggregates across the runs.
patterns/llm-per-subagent-with-optimized-prompts is what each phase's sub-agent uses, with a model assigned per phase.
patterns/semantic-context-grounded-search-index is what phase 1 uses internally.

Seen in

sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie: canonical first wiki disclosure of the four-phase data-agent trajectory shape. Worked example: a CFO question about contradictory revenue dashboards proceeds through (1) parallel multi-agent asset discovery, (2) data investigation (SQL + comparative + root-cause), (3) self-correction loop, (4) final verification. Positioned as the structural fit for the three challenges unique to data agents.