PATTERN Cited by 1 source

Parallel trajectory sampling and aggregation¶

Parallel trajectory sampling and aggregation is the agent-design pattern of running an agent's full reasoning trajectory N times in parallel over the same query and aggregating findings across the N trajectories to compute the final answer. The pattern is the structural compensation for the verifiable-test gap in data-agent design — without an oracle to test correctness against, trajectory agreement substitutes as a soft correctness signal.

Disclosed by Databricks for Genie in the 2026-05-08 post under the name "Parallel Thinking" — see the underlying concepts/parallel-thinking-trajectory-sampling concept for the conceptual framing.

The pattern¶

                          Query
                            │
       ┌────────────────────┼────────────────────┐
       │                    │                    │
       ▼                    ▼                    ▼
  Trajectory 1       Trajectory 2  ...    Trajectory N
  (4 phases)         (4 phases)           (4 phases)
       │                    │                    │
       └────────────────────┴────────────────────┘
                            │
                            ▼
                       Aggregator
                            │
                            ▼
                       Final answer
                  (+ confidence / disagreement signals)

Each of the N trajectories runs the full four-phase data-agent trajectory independently; aggregation happens after all N complete, or after a subset completes and a quorum-style stopping rule is satisfied.
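The fan-out and fan-in in the diagram above can be sketched with `asyncio`. This is a minimal illustration, not Genie's implementation: `run_trajectory` is a hypothetical deterministic stub standing in for one full four-phase trajectory, and plurality voting is only one of several possible aggregators.

```python
import asyncio
from collections import Counter


async def run_trajectory(query: str, seed: int) -> str:
    """Hypothetical stand-in for one full four-phase trajectory."""
    await asyncio.sleep(0.01 * (seed % 3))  # simulated, uneven work
    # Simulated agent: every fourth trajectory lands on a different answer.
    return "42" if seed % 4 != 3 else f"other-{seed}"


async def sample_and_aggregate(query: str, n: int) -> tuple[str, float]:
    """Run n trajectories concurrently, then aggregate by plurality vote."""
    answers = await asyncio.gather(
        *(run_trajectory(query, seed) for seed in range(n))
    )
    winner, votes = Counter(answers).most_common(1)[0]
    # The agreement share doubles as a soft confidence signal.
    return winner, votes / n


if __name__ == "__main__":
    print(asyncio.run(sample_and_aggregate("total revenue last quarter", 5)))
```

Because `asyncio.gather` waits for every trajectory, this variant corresponds to the "all N complete" case; a quorum-style rule would need to inspect results as they arrive.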

Components¶

Sampling diversity source — some randomness in the trajectories so they don't collapse to identical reasoning. Sources of diversity:
LLM sampling temperature on planning + intermediate steps.
Different prompt variations.
Different LLM choices per trajectory (for example, some trajectories use Opus, others GPT, others Gemini); this composes with the Multi-LLM pattern.
Different ordering of search results / asset prioritisation.
N independent trajectories — each runs its own discovery, investigation, self-correction, and verification.
Aggregator — combines findings. Strategies (not all disclosed for Genie):
Vote / consensus — pick most-frequent answer.
Judge — separate LLM evaluates the N candidates.
Weighted — trajectories self-report confidence; aggregator weights accordingly.
Union of evidence — assemble a reasoning chain from intermediate findings across trajectories rather than picking one.
Disagreement handling — when the N trajectories disagree substantially, the aggregator should surface the disagreement rather than commit to a confident-looking but wrong answer. This connects to the unanswerability property of the verifiable-test gap: high disagreement = low confidence = surface to user.
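As an illustration of the aggregator and disagreement-handling components, here is a minimal sketch of a confidence-weighted vote that declines to answer when no candidate dominates. The `Finding` type, the `confidence` field, and the `min_share` threshold are assumptions made for this example; the source does not disclose which strategy Genie uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Finding:
    answer: str        # normalised final answer of one trajectory
    confidence: float  # self-reported, in [0, 1] (hypothetical field)


def aggregate(findings: list[Finding], min_share: float = 0.6) -> dict:
    """Confidence-weighted vote that surfaces disagreement instead of hiding it."""
    if not findings:
        raise ValueError("need at least one finding")
    weight: dict[str, float] = defaultdict(float)
    for f in findings:
        weight[f.answer] += f.confidence
    total = sum(weight.values())
    candidates = sorted(weight)
    if total == 0:
        # Every trajectory reported zero confidence: nothing to commit to.
        return {"answer": None, "share": 0.0,
                "status": "disagreement", "candidates": candidates}
    best = max(weight, key=weight.get)
    share = weight[best] / total
    if share >= min_share:
        return {"answer": best, "share": share,
                "status": "consensus", "candidates": candidates}
    # Low share: return the candidates so the caller can show them to the user.
    return {"answer": None, "share": share,
            "status": "disagreement", "candidates": candidates}
```

With three findings `("A", 0.9)`, `("A", 0.8)`, `("B", 0.3)` the weighted share of `A` is 0.85 and the sketch commits to it; with three equally confident, mutually different answers the share is one third and the result is flagged as a disagreement.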

Disclosed cost / benefit (Genie)¶

Property	Single trajectory	Parallel sampling
Accuracy	(baseline)	Significantly improved (Figure 5)
Latency	(baseline)	Some additional latency
Token cost	(baseline)	Some additional cost
Models tested	n/a	GPT-5.4, Opus-4.6 (Figure 5)
Pareto with Multi-LLM	n/a	Combined → simultaneous accuracy + cost + latency improvement (Figure 1 end-state)

The disclosed Pareto move: parallel sampling alone trades extra cost and latency for accuracy. Combined with Multi-LLM + GEPA-optimised prompts, the end-state hits simultaneous improvement on all three axes.
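The source gives no numbers for the overhead, so the following is only a back-of-envelope model under two stated assumptions: trajectories run truly in parallel, and the aggregator waits for all of them. Under those assumptions wall-clock latency follows the slowest trajectory while token cost adds up across all N. The function name and the figures in the usage note are hypothetical.

```python
def parallel_overhead(
    traj_latencies_s: list[float],
    traj_tokens: list[int],
    agg_latency_s: float,
    agg_tokens: int,
) -> tuple[float, int]:
    """Estimate (wall-clock seconds, total tokens) for one parallel run."""
    # Latency: the slowest trajectory gates the aggregator.
    latency = max(traj_latencies_s) + agg_latency_s
    # Cost: every trajectory's tokens are paid for, plus the aggregator's.
    tokens = sum(traj_tokens) + agg_tokens
    return latency, tokens
```

For example, three trajectories taking 12, 15, and 11 seconds and using 8000, 9500, and 7800 tokens, followed by a 2-second, 1500-token aggregation step, come out at 17 seconds and 26800 tokens: latency grows modestly while cost grows roughly with N.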

Why this works (and when it doesn't)¶

The pattern works because:

Independent samples have independent errors — when trajectories err in different ways, their wrong answers rarely coincide, so agreement is informative.
Soft consensus substitutes for hard oracle — "4 of 5 agree" is a meaningful correctness signal even without ground truth.
Disagreement is itself useful — agents can surface disagreement as a confidence signal to users.
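The value of agreement can be made concrete with a small calculation. Under the idealised assumption that each trajectory is correct independently with the same probability p (real trajectories are rarely this independent), the probability that a strict majority of N trajectories is correct is a binomial tail sum. The function name is illustrative.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """P(more than half of n independent trajectories are correct)."""
    need = n // 2 + 1  # smallest strict majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )
```

With p = 0.7, a single trajectory is right 70% of the time while a majority of five is right about 83.7% of the time. With p = 0.4 the same calculation gives about 31.7%, so voting amplifies whichever side of 0.5 the per-trajectory accuracy sits on.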

The pattern fails when:

Errors are systematic — all N trajectories err in the same way (e.g., all use the same wrong table because the search index ranks it first); agreement is meaningless.
Sampling diversity is too low — trajectories collapse to identical reasoning chains; no independent signal.
Aggregator is the weak link — picking the wrong aggregation strategy (e.g., simple vote when answers are continuous-valued) degrades the gain.
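The continuous-valued case is easy to demonstrate. In the sketch below, exact-match voting sees five distinct numbers and finds no majority, while a median with a relative tolerance (one possible alternative, chosen here for illustration) recognises that four of the five trajectories effectively agree. The sample values and the 1% tolerance are made up.

```python
from collections import Counter
from statistics import median


def aggregate_numeric(values: list[float], rel_tol: float = 0.01) -> tuple[float, float]:
    """Return (median, share of values within rel_tol of the median)."""
    if not values:
        raise ValueError("need at least one value")
    centre = median(values)
    scale = max(abs(centre), 1e-12)  # avoid a zero-width band around 0
    close = [v for v in values if abs(v - centre) <= rel_tol * scale]
    return centre, len(close) / len(values)


values = [1041.2, 1041.9, 1040.7, 1310.0, 1041.5]

# Exact-match vote: every value is unique, so the "winner" has one vote.
top_votes = Counter(values).most_common(1)[0][1]

# Tolerance-based aggregation: the outlier 1310.0 is the only dissenter.
centre, share = aggregate_numeric(values)
```

Here `top_votes` is 1, whereas `aggregate_numeric` returns a centre of 1041.5 with an agreement share of 0.8.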

Compositions¶

Composes with	How
patterns/four-phase-data-agent-trajectory	Each trajectory IS a four-phase trajectory
patterns/llm-per-subagent-with-optimized-prompts	Different sub-agents per trajectory can use different LLMs
patterns/semantic-context-grounded-search-index	Each trajectory's discovery sub-agent uses the same index but may rank differently
concepts/agent-self-correction-loop	Each trajectory has its own self-correction (intra-trajectory); aggregation handles cross-trajectory

When this fits / doesn't¶

Fits:

Open-ended queries with no verifiable oracle.
Cost budget can absorb N× model invocation (or N is small enough).
Latency budget allows true parallel execution (not serial).
Aggregator has a quality signal stronger than any single trajectory (judge LLM, voting, etc.).
Sampling diversity is achievable (temperature > 0, model variety, prompt variation).

Doesn't fit:

Tight latency budgets that can't absorb the slowest of the N trajectories plus the aggregation step.
Cost budgets that can't absorb N× tokens.
Tasks with cheap deterministic oracles (use the oracle, not sampling).
Tasks where errors are systematic across samples.
Tasks whose full reasoning chain is short enough for a single trajectory.

Anti-patterns¶

Sampling without aggregation — running N trajectories and picking the first to finish is just an expensive single trajectory.
Aggregator picks majority blindly — when 3-of-5 agree on a wrong answer because they share a systematic bias, majority vote reproduces the bias. Aggregator should use independent signals (judge, evidence corroboration).
N is fixed regardless of task complexity — simple queries don't need N=10; complex queries may need more. Adaptive N is better.
Trajectories don't surface confidence — without per-trajectory confidence the aggregator can't weight answers and can't detect the "all 5 are uncertain" signal.
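One way to avoid a fixed N is to keep sampling only until some answer reaches a quorum. The sketch below draws trajectories one at a time for clarity; a real system would more plausibly launch them in parallel waves. `run_once`, `quorum`, and `max_n` are hypothetical names, and nothing here is taken from the Genie disclosure.

```python
from collections import Counter
from typing import Callable, Optional


def sample_until_quorum(
    run_once: Callable[[int], str],
    quorum: int,
    max_n: int,
) -> tuple[Optional[str], int]:
    """Sample until one answer has `quorum` votes or max_n is exhausted.

    Returns (answer or None, number of trajectories actually run).
    """
    votes: Counter = Counter()
    for i in range(max_n):
        votes[run_once(i)] += 1
        answer, count = votes.most_common(1)[0]
        if count >= quorum:
            return answer, i + 1  # easy query: stop early
    return None, max_n  # no quorum: surface as low confidence
```

An easy query whose trajectories all answer `"A"` stops after three samples with `quorum=3`, while a contested sequence such as `A, B, C, A, B, A` needs all six, and the same sequence capped at `max_n=4` returns `None`.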

Related¶

Self-consistency in chain-of-thought prompting — same idea at the prompt level (sample N final answers, vote); this pattern extends to full multi-step trajectories with intermediate state.
Ensemble methods in ML — a bag of independent models whose outputs are combined; this pattern is the agent-architecture analogue.
patterns/llm-cascade (if exists) — same task with escalation; this pattern is parallel rather than serial.

Seen in¶

sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie — canonical first wiki disclosure of parallel trajectory sampling + aggregation as a named agent design pattern. Genie's "Parallel Thinking" technique, sampling N trajectories + aggregating across them. Disclosed accuracy improvement (Figure 5) on GPT-5.4 + Opus-4.6 baselines; cost/latency overhead recovered when combined with Multi-LLM (Figure 1 end-state). Positioned as the structural compensation for the verifiable-test gap that data agents face.