PATTERN Cited by 1 source

Parallel trajectory sampling and aggregation

Parallel trajectory sampling and aggregation is the agent-design pattern of running an agent's full reasoning trajectory N times in parallel over the same query and aggregating findings across the N trajectories to compute the final answer. The pattern is the structural compensation for the verifiable-test gap in data-agent design — without an oracle to test correctness against, trajectory agreement substitutes as a soft correctness signal.

Disclosed by Databricks for Genie in the 2026-05-08 post under the name "Parallel Thinking" — see the underlying concepts/parallel-thinking-trajectory-sampling concept for the conceptual framing.

The pattern

                          Query
       ┌────────────────────┼────────────────────┐
       │                    │                    │
       ▼                    ▼                    ▼
  Trajectory 1       Trajectory 2  ...    Trajectory N
  (4 phases)         (4 phases)           (4 phases)
       │                    │                    │
       └────────────────────┴────────────────────┘
                       Aggregator
                       Final answer
                  (+ confidence / disagreement signals)

Each of the N trajectories runs the full four-phase data-agent trajectory independently; aggregation happens after all N complete, or earlier once a subset has completed and a quorum-style stopping rule is satisfied.
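A minimal sketch of the fan-out and quorum-style stopping rule described above, assuming a hypothetical `run_trajectory` stub in place of a real four-phase agent. The function names, the `quorum` parameter, and the exact-match vote are illustrative assumptions, not Genie's disclosed implementation.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_trajectory(query: str, seed: int) -> str:
    """Hypothetical stand-in for one full four-phase trajectory."""
    rng = random.Random(seed)
    # Assumption for illustration: each trajectory is right about 70% of the time.
    return "42" if rng.random() < 0.7 else f"wrong-{seed}"


def sample_and_aggregate(query, n=5, quorum=3, runner=run_trajectory):
    """Run n trajectories in parallel; stop early once one answer has `quorum` votes."""
    counts = Counter()
    pool = ThreadPoolExecutor(max_workers=n)
    futures = [pool.submit(runner, query, seed) for seed in range(n)]
    try:
        for fut in as_completed(futures):
            counts[fut.result()] += 1
            _, votes = counts.most_common(1)[0]
            if votes >= quorum:
                break  # quorum reached: remaining results are not needed
    finally:
        # Queued work is cancelled; threads already running cannot be interrupted,
        # so a real system would need cooperative cancellation.
        pool.shutdown(wait=False, cancel_futures=True)
    answer, votes = counts.most_common(1)[0]
    completed = sum(counts.values())
    return {
        "answer": answer,
        "votes": votes,
        "completed": completed,
        "agreement": votes / completed,  # soft confidence signal
    }
```

With `quorum` set to N, the sketch degenerates to "wait for all trajectories"; a lower quorum trades some of the agreement signal for latency.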

Components

  1. Sampling diversity source: some randomness in the trajectories so they don't collapse to identical reasoning. Sources of diversity:
     • LLM sampling temperature on planning + intermediate steps.
     • Different prompt variations.
     • Different LLM choices per trajectory (some trajectories use Opus, others GPT, others Gemini; Multi-LLM composes).
     • Different ordering of search results / asset prioritisation.

  2. N independent trajectories: each runs its own discovery, investigation, self-correction, verification.

  3. Aggregator: combines findings. Strategies (not all disclosed for Genie):
     • Vote / consensus: pick the most frequent answer.
     • Judge: a separate LLM evaluates the N candidates.
     • Weighted: trajectories self-report confidence; the aggregator weights accordingly.
     • Union of evidence: assemble a reasoning chain from intermediate findings across trajectories rather than picking one.

  4. Disagreement handling: when the N trajectories disagree substantially, the aggregator should surface the disagreement rather than commit to a confident-looking but wrong answer. This connects to the unanswerability property of the verifiable-test gap: high disagreement = low confidence = surface to user.
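The weighted strategy and the disagreement-handling component can be sketched together. This is an illustrative assumption about how such an aggregator might look: the `TrajectoryResult` type, the `min_share` threshold, and the output shape are hypothetical and are not taken from the Genie disclosure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TrajectoryResult:
    answer: str
    confidence: float  # self-reported, assumed to lie in [0, 1]


def aggregate(results, min_share=0.6):
    """Confidence-weighted vote that surfaces disagreement instead of guessing."""
    weights = defaultdict(float)
    for r in results:
        weights[r.answer] += r.confidence
    total = sum(weights.values())
    ranked = sorted(weights, key=weights.get, reverse=True)
    if total == 0:
        # Every trajectory reported zero confidence: nothing to commit to.
        return {"answer": None, "share": 0.0, "surface_disagreement": True,
                "candidates": ranked}
    best = ranked[0]
    share = weights[best] / total
    committed = share >= min_share
    return {
        "answer": best if committed else None,
        "share": share,
        "surface_disagreement": not committed,
        "candidates": ranked,  # shown to the user when no answer is committed
    }
```

When the winning answer's share of total confidence falls below `min_share`, the sketch returns no answer and exposes the ranked candidates, which matches the "high disagreement = surface to user" rule.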

Disclosed cost / benefit (Genie)

Property               Single trajectory   Parallel sampling
Accuracy               (baseline)          Significantly improved (Figure 5)
Latency                (baseline)          Some additional latency
Token cost             (baseline)          Some additional cost
Models tested          n/a                 GPT-5.4, Opus-4.6 (Figure 5)
Pareto with Multi-LLM  n/a                 Combined → simultaneous accuracy + cost + latency improvement (Figure 1 end-state)

The disclosed Pareto move: parallel sampling alone trades cost for accuracy. Combined with Multi-LLM + GEPA-optimised prompts, the end-state hits simultaneous improvement on all three axes.

Why this works (and when it doesn't)

The pattern works because:

  • Independent samples have independent errors — when trajectories err in different ways, they disagree; agreement is informative.
  • Soft consensus substitutes for a hard oracle: "4 of 5 agree" is a meaningful correctness signal even without ground truth.
  • Disagreement is itself useful — agents can surface disagreement as a confidence signal to users.
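To make the independence argument concrete, the sketch below computes the probability that a strict majority of N trajectories is correct. It assumes each trajectory is correct independently with the same probability `p`. That is an idealisation: the source does not report per-trajectory accuracy, and the 0.7 used in the usage note is purely illustrative.

```python
from math import comb


def p_majority_correct(n: int, p: float) -> float:
    """Probability that more than half of n independent trajectories are correct."""
    k_min = n // 2 + 1  # smallest strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))
```

Under these assumptions, `p_majority_correct(5, 0.7)` is about 0.837, compared with 0.7 for a single trajectory. The gain disappears as soon as errors are correlated, which is the failure case described next.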

The pattern fails when:

  • Errors are systematic — all N trajectories err in the same way (e.g., all use the same wrong table because the search index ranks it first); agreement is meaningless.
  • Sampling diversity is too low — trajectories collapse to identical reasoning chains; no independent signal.
  • Aggregator is the weak link — picking the wrong aggregation strategy (e.g., simple vote when answers are continuous-valued) degrades the gain.
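The continuous-valued case shows why the aggregation strategy matters: an exact-match vote over near-identical numbers treats every trajectory as a dissenter. One possible alternative, sketched here under the assumption that answers are plain floats, is to take the median and report how many trajectories fall within a tolerance of it. The `rel_tol` parameter and the function name are hypothetical.

```python
from statistics import median


def aggregate_numeric(values, rel_tol=0.01):
    """Median answer plus the fraction of trajectories within rel_tol of it."""
    center = median(values)
    scale = max(abs(center), 1e-12)  # avoid a zero-width band when the median is 0
    support = [v for v in values if abs(v - center) <= rel_tol * scale]
    return {"answer": center, "support": len(support) / len(values)}
```

For the inputs `[100.0, 100.4, 99.8, 100.1, 250.0]`, an exact-match vote sees five distinct answers, while this sketch returns 100.1 with a support of 0.8. It does nothing for systematic errors, since a shared bias moves the median along with it.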

Compositions

Composes with                                       How
patterns/four-phase-data-agent-trajectory           Each trajectory IS a four-phase trajectory
patterns/llm-per-subagent-with-optimized-prompts    Different sub-agents per trajectory can use different LLMs
patterns/semantic-context-grounded-search-index     Each trajectory's discovery sub-agent uses the same index but may rank differently
concepts/agent-self-correction-loop                 Each trajectory has its own self-correction (intra-trajectory); aggregation handles cross-trajectory

When this fits / doesn't

Fits:

  • Open-ended queries with no verifiable oracle.
  • Cost budget can absorb N× model invocation (or N is small enough).
  • Latency budget allows true parallel execution (not serial).
  • Aggregator has a quality signal stronger than any single trajectory (judge LLM, voting, etc.).
  • Sampling diversity is achievable (temperature > 0, model variety, prompt variation).

Doesn't fit:

  • Tight latency budgets that can't absorb N× invocation.
  • Cost budgets that can't absorb N× tokens.
  • Tasks with cheap deterministic oracles (use the oracle, not sampling).
  • Tasks where errors are systematic across samples.
  • Single-trajectory tasks where the full reasoning chain is short.

Anti-patterns

  • Sampling without aggregation: running N trajectories and picking the first to finish is just an expensive single trajectory.
  • Aggregator picks the majority blindly: when 3 of 5 agree on a wrong answer because they share a systematic bias, majority vote reproduces the bias. The aggregator should use independent signals (judge, evidence corroboration).
  • N is fixed regardless of task complexity: simple queries don't need N=10; complex queries may need more. Adaptive N is better.
  • Trajectories don't surface confidence: without per-trajectory confidence the aggregator can't weight results and can't detect an "all 5 are uncertain" signal.

Related

  • Self-consistency in chain-of-thought prompting — same idea at the prompt level (sample N final answers, vote); this pattern extends to full multi-step trajectories with intermediate state.
  • Ensemble methods in ML — bag of independent models; pattern is the agent-architecture analog.
  • patterns/llm-cascade (if exists) — same task with escalation; this pattern is parallel rather than serial.

Seen in

  • sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie: canonical first wiki disclosure of parallel trajectory sampling + aggregation as a named agent design pattern. Genie's "Parallel Thinking" technique, sampling N trajectories + aggregating across them. Disclosed accuracy improvement (Figure 5) on GPT-5.4 + Opus-4.6 baselines; cost/latency overhead recovered when combined with Multi-LLM (Figure 1 end-state). Positioned as the structural compensation for the verifiable-test gap that data agents face.
Last updated · 542 distilled / 1,571 read