PATTERN Cited by 1 source
Parallel trajectory sampling and aggregation¶
Parallel trajectory sampling and aggregation is the agent-design pattern of running an agent's full reasoning trajectory N times in parallel over the same query and aggregating findings across the N trajectories to compute the final answer. The pattern is the structural compensation for the verifiable-test gap in data-agent design — without an oracle to test correctness against, trajectory agreement substitutes as a soft correctness signal.
Disclosed by Databricks for Genie in the 2026-05-08 post under the name "Parallel Thinking" — see the underlying concepts/parallel-thinking-trajectory-sampling concept for the conceptual framing.
The pattern¶
                             Query
                               │
          ┌────────────────────┼────────────────────┐
          │                    │                    │
          ▼                    ▼                    ▼
    Trajectory 1         Trajectory 2   ...   Trajectory N
     (4 phases)           (4 phases)           (4 phases)
          │                    │                    │
          └────────────────────┴────────────────────┘
                               │
                               ▼
                          Aggregator
                               │
                               ▼
                         Final answer
             (+ confidence / disagreement signals)
Each of the N trajectories runs the full four-phase data-agent trajectory independently; aggregation happens after all N complete, or after a subset completes and a quorum-style stopping rule is satisfied.
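The fan-out and the quorum-style stopping rule can be sketched in Python with `asyncio`. This is a minimal illustration, not Genie's implementation: `run_trajectory` is a hypothetical stand-in for a full four-phase trajectory, and the `quorum` parameter and the exact-match comparison of answers are assumptions made for the example.

```python
import asyncio
from collections import Counter
from typing import List, Optional, Tuple


async def run_trajectory(query: str, seed: int) -> str:
    # Hypothetical stand-in for one full four-phase trajectory.
    await asyncio.sleep(0.01 * seed)    # trajectories finish at different times
    return "41" if seed == 3 else "42"  # one trajectory errs, for illustration


async def sample_with_quorum(
    query: str, n: int, quorum: int
) -> Tuple[Optional[str], List[str]]:
    """Run n trajectories concurrently; stop once one answer has `quorum` votes."""
    tasks = [asyncio.create_task(run_trajectory(query, seed)) for seed in range(n)]
    answers: List[str] = []
    winner: Optional[str] = None
    try:
        # as_completed yields results in completion order, not submission order.
        for finished in asyncio.as_completed(tasks):
            answers.append(await finished)
            answer, votes = Counter(answers).most_common(1)[0]
            if votes >= quorum:
                winner = answer
                break
    finally:
        # Cancel stragglers so they stop consuming tokens after the quorum.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    return winner, answers
```

Calling `asyncio.run(sample_with_quorum("some query", n=5, quorum=3))` returns the winning answer (or `None` if no answer reached the quorum) together with the answers collected before stopping. Cancelling the remaining tasks is one way to recover part of the cost overhead; whether a real system can cancel mid-trajectory depends on its runtime.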
Components¶
- Sampling diversity source: some randomness in the trajectories so they don't collapse to identical reasoning. Sources of diversity:
  - LLM sampling temperature on planning and intermediate steps.
  - Different prompt variations.
  - Different LLM choices per trajectory (some trajectories use Opus, others GPT, others Gemini; this is where Multi-LLM composes).
  - Different ordering of search results / asset prioritisation.
- N independent trajectories: each runs its own discovery, investigation, self-correction, verification.
- Aggregator: combines findings. Strategies (not all disclosed for Genie):
  - Vote / consensus: pick the most frequent answer.
  - Judge: a separate LLM evaluates the N candidates.
  - Weighted: trajectories self-report confidence; the aggregator weights accordingly.
  - Union of evidence: assemble a reasoning chain from intermediate findings across trajectories rather than picking one.
- Disagreement handling: when the N trajectories disagree substantially, the aggregator should surface the disagreement rather than commit to a confident-looking but wrong answer. This connects to the unanswerability property of the verifiable-test gap: high disagreement = low confidence = surface to user.
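As an illustration of the vote strategy combined with disagreement handling, the sketch below picks the most frequent answer but declines to commit when agreement is low. The `Aggregate` type, the `min_agreement` threshold of 0.6, and the assumption that answers are already normalised to comparable strings are all hypothetical choices for this example; the source does not disclose Genie's aggregator.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Aggregate:
    answer: Optional[str]    # None when the aggregator declines to commit
    agreement: float         # share of trajectories backing the top answer
    needs_review: bool       # True when disagreement should be surfaced
    votes: Dict[str, int]    # full tally, so the disagreement can be shown


def aggregate_by_vote(answers: List[str], min_agreement: float = 0.6) -> Aggregate:
    """Majority vote that surfaces disagreement instead of hiding it."""
    if not answers:
        raise ValueError("need at least one trajectory answer")
    counts = Counter(answers)
    top, top_votes = counts.most_common(1)[0]
    agreement = top_votes / len(answers)
    if agreement < min_agreement:
        # High disagreement = low confidence = surface to the user.
        return Aggregate(None, agreement, True, dict(counts))
    return Aggregate(top, agreement, False, dict(counts))
```

For example, `aggregate_by_vote(["a", "a", "a", "b", "a"])` commits to `"a"` with agreement 0.8, while `aggregate_by_vote(["a", "b", "c", "a", "d"])` returns no answer and sets `needs_review`, leaving the caller to present the competing candidates.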
Disclosed cost / benefit (Genie)¶
| Property | Single trajectory | Parallel sampling |
|---|---|---|
| Accuracy | (baseline) | Significantly improved (Figure 5) |
| Latency | (baseline) | Some additional latency |
| Token cost | (baseline) | Some additional cost |
| Models tested | n/a | GPT-5.4, Opus-4.6 (Figure 5) |
| Pareto with Multi-LLM | n/a | Combined → simultaneous accuracy + cost + latency improvement (Figure 1 end-state) |
The disclosed Pareto move: parallel sampling alone trades cost for accuracy. Combined with Multi-LLM + GEPA-optimised prompts, the end-state hits simultaneous improvement on all three axes.
Why this works (and when it doesn't)¶
The pattern works because:
- Independent samples have independent errors — when trajectories err in different ways, they disagree; agreement is informative.
- Soft consensus substitutes for hard oracle — "4 of 5 agree" is a meaningful correctness signal even without ground truth.
- Disagreement is itself useful — agents can surface disagreement as a confidence signal to users.
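The value of independence can be made concrete with a small calculation. The sketch below assumes a simplified model that is not from the source: each trajectory is independently correct with probability `p`, outcomes are binary (right or wrong), and the aggregator needs a strict majority of correct trajectories. Under those assumptions the majority-vote accuracy is a binomial tail sum.

```python
from math import comb


def majority_correct_probability(p: float, n: int) -> float:
    """P(strict majority of n independent trajectories is correct).

    Simplifying assumptions (hypothetical model): each trajectory is correct
    with probability p, independently of the others, and a strict majority
    of correct trajectories is required for the vote to be correct.
    """
    threshold = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5, 9):
        print(n, round(majority_correct_probability(0.7, n), 4))
```

With `p = 0.7`, a single trajectory is right 70% of the time, while a majority of five independent trajectories is right about 83.7% of the time. The gain comes entirely from the independence assumption: if all trajectories make the same mistake together, the vote is no more accurate than one trajectory.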
The pattern fails when:
- Errors are systematic — all N trajectories err in the same way (e.g., all use the same wrong table because the search index ranks it first); agreement is meaningless.
- Sampling diversity is too low — trajectories collapse to identical reasoning chains; no independent signal.
- Aggregator is the weak link — picking the wrong aggregation strategy (e.g., simple vote when answers are continuous-valued) degrades the gain.
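The continuous-valued case shows why the aggregation strategy matters. In the sketch below, exact-match voting sees five different answers, while a median with a relative tolerance recognises that four of them agree. The function name, the 1% default tolerance, and the sample values are assumptions chosen for illustration.

```python
from statistics import median
from typing import List, Tuple


def aggregate_numeric(values: List[float], rel_tolerance: float = 0.01) -> Tuple[float, float]:
    """Aggregate continuous-valued answers by median instead of exact vote.

    Returns the median and the share of trajectories whose answer lies
    within `rel_tolerance` (relative) of it, as a soft agreement signal.
    """
    if not values:
        raise ValueError("need at least one trajectory answer")
    center = median(values)
    scale = max(abs(center), 1e-12)  # avoid division by zero near 0
    agreeing = [v for v in values if abs(v - center) / scale <= rel_tolerance]
    return center, len(agreeing) / len(values)


# Five trajectories report slightly different totals; one is an outlier.
values = [1050.0, 1049.5, 1050.2, 1310.0, 1050.1]
center, agreement = aggregate_numeric(values)
```

Here `center` is `1050.1` and `agreement` is `0.8`, whereas a simple vote over the raw values would find no repeated answer at all. The median is also robust to the single outlier, which a mean would not be.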
Compositions¶
| Composes with | How |
|---|---|
| patterns/four-phase-data-agent-trajectory | Each trajectory IS a four-phase trajectory |
| patterns/llm-per-subagent-with-optimized-prompts | Different sub-agents per trajectory can use different LLMs |
| patterns/semantic-context-grounded-search-index | Each trajectory's discovery sub-agent uses the same index but may rank differently |
| concepts/agent-self-correction-loop | Each trajectory has its own self-correction (intra-trajectory); aggregation handles cross-trajectory |
When this fits / doesn't¶
Fits:
- Open-ended queries with no verifiable oracle.
- Cost budget can absorb N× model invocation (or N is small enough).
- Latency budget allows true parallel execution (not serial).
- Aggregator has a quality signal stronger than any single trajectory (judge LLM, voting, etc.).
- Sampling diversity is achievable (temperature > 0, model variety, prompt variation).
Doesn't fit:
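- Tight latency budgets that can't absorb waiting for the slowest of N trajectories plus the aggregation step (or N× invocation when execution falls back to serial).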
- Cost budgets that can't absorb N× tokens.
- Tasks with cheap deterministic oracles (use the oracle, not sampling).
- Tasks where errors are systematic across samples.
- Single-trajectory tasks where the full reasoning chain is short.
Anti-patterns¶
- Sampling without aggregation: running N trajectories and picking the first to finish is just an expensive single trajectory.
- Aggregator picks majority blindly: when 3 of 5 agree on a wrong answer because they share a systematic bias, majority vote reproduces the bias. The aggregator should use independent signals (judge, evidence corroboration).
- N is fixed regardless of task complexity: simple queries don't need N=10; complex queries may need more. Adaptive N is better.
- Trajectories don't surface confidence: without per-trajectory confidence the aggregator can't weight trajectories and can't detect the "all 5 are uncertain" case.
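One way to avoid a fixed N is to keep sampling only until agreement is high enough. The sketch below is a hypothetical illustration, not a disclosed Genie mechanism: `run_trajectory` is a stand-in callable, the `min_n`, `max_n`, and `min_agreement` defaults are arbitrary, and trajectories are run one at a time for brevity where a real system would launch them in parallel batches.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def adaptive_sample(
    run_trajectory: Callable[[int], str],
    min_n: int = 3,
    max_n: int = 9,
    min_agreement: float = 0.8,
) -> Tuple[Optional[str], List[str]]:
    """Sample trajectories until agreement is high enough or max_n is hit.

    Easy queries stop at min_n; contested queries keep sampling. Returns
    (answer, answers); answer is None if agreement was never reached.
    """
    answers: List[str] = []
    for i in range(max_n):
        answers.append(run_trajectory(i))
        if len(answers) < min_n:
            continue  # too few samples to judge agreement yet
        top, votes = Counter(answers).most_common(1)[0]
        if votes / len(answers) >= min_agreement:
            return top, answers
    return None, answers  # budget exhausted: surface the disagreement
```

A query on which every trajectory agrees stops after `min_n` samples, while a query whose trajectories keep disagreeing consumes the full `max_n` budget and returns `None`, which the caller can treat as a low-confidence result.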
Relationship to related patterns¶
- Self-consistency in chain-of-thought prompting — same idea at the prompt level (sample N final answers, vote); this pattern extends to full multi-step trajectories with intermediate state.
- Ensemble methods in ML — bag of independent models; pattern is the agent-architecture analog.
- patterns/llm-cascade (if exists) — same task with escalation; this pattern is parallel rather than serial.
Seen in¶
- sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie: canonical first wiki disclosure of parallel trajectory sampling + aggregation as a named agent design pattern. Genie's "Parallel Thinking" technique, sampling N trajectories + aggregating across them. Disclosed accuracy improvement (Figure 5) on GPT-5.4 + Opus-4.6 baselines; cost/latency overhead recovered when combined with Multi-LLM (Figure 1 end-state). Positioned as the structural compensation for the verifiable-test gap that data agents face.