
PATTERN

Parallel retrieval fusion

Problem

No single retrieval method works best for all queries. The distribution of query shapes looks like:

  • "What was the version of library X?" — FTS exact-token match wins.
  • "Where did we decide to host the frontend?" — fact-key exact-topic match wins.
  • "Who introduced the rate-limit change?" — direct-query vector embedding finds the right memory when both use the same vocabulary.
  • "What framework does the user prefer for data fetching?" — HyDE (hypothetical answer embedding) finds memories the direct embedding misses because the question and answer use different vocabulary.
  • "Did X ever say 'we should never use Y'?" — raw-message FTS catches verbatim detail the extraction step generalised away.

Picking one of these — or even picking a two-channel hybrid — leaves recall on the table for the queries that channel doesn't fit.

The pattern

Run multiple retrieval channels in parallel over the same corpus; merge results via Reciprocal Rank Fusion with channel-specific weights calibrated to the strength of each channel's signal.

                       query
           ┌─────────────┴─────────────┐
           ▼                           ▼
  query analysis              embedding of raw query
  ├── ranked topic keys
  ├── FTS terms + synonyms
  └── HyDE (answer-shaped statement)
           │
   ┌───────┬─────────┬─────────┬──────────┐
   ▼       ▼         ▼         ▼          ▼
  FTS    fact-key  raw-msg   direct      HyDE
 (stem)  (exact)   (FTS      vector     vector
                   safety
                   net)
   └───────┴─────────┴─────────┴──────────┘
                     │
                     ▼
        RRF fusion with channel weights
        (strongest signal → highest weight;
         safety-net channel → low weight;
         ties broken by recency)
                     │
                     ▼
        top candidates → synthesis model
        (deterministic sub-queries like date math
         pre-computed and injected as facts)
                     │
                     ▼
            natural-language answer

Structural properties:

  • Query analysis in Stage 1 produces multiple artefacts that feed different downstream channels — topic keys (for exact-key lookup), FTS terms + synonyms (for stemmed keyword match), HyDE statement (for answer-shaped vector search). Stage 1 runs concurrently with raw-query embedding.
  • Channels run in parallel, not sequentially. The fan-out is the whole point; a miss on one channel doesn't block the others.
  • RRF with channel-specific weights. Not all channels carry the same signal strength. Exact-topic-key matches should outvote safety-net raw-message matches even when the raw-message channel ranks its hit higher within its own result list.
  • Safety-net channel included with low weight. A raw-message FTS over the original transcript catches verbatim details the extraction pipeline generalised; giving it low weight means it doesn't dominate but still contributes when no other channel fires.
  • Ties broken by recency. Newer memories rank above older ones at equal fused score — what-you-said-yesterday outranks what-you-said-last-month when both are otherwise equivalent matches.
  • Special-case queries get deterministic handling. Temporal computation ("what did we decide last Tuesday?") is handled via regex + arithmetic before the synthesis LLM sees it; the pre-computed date is injected into the prompt as a fact rather than asking the LLM to do date math.
  • Synthesis is one final LLM call that takes the fused candidates and produces a natural-language answer — not a concatenation of hits.
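The fusion step described above can be sketched as weighted Reciprocal Rank Fusion with a recency tie-break. The channel names, weight values, and `recency` map below are illustrative assumptions, not Agent Memory's actual configuration:

```python
from collections import defaultdict

# Illustrative channel weights: strongest signal highest, safety net lowest.
CHANNEL_WEIGHTS = {
    "fact_key": 1.0,       # exact topic match: strongest signal
    "fts": 0.6,
    "direct_vector": 0.6,
    "hyde_vector": 0.6,
    "raw_message": 0.2,    # safety net: contributes but never dominates
}

def rrf_fuse(channel_results, recency, k=60):
    """channel_results: {channel: [doc_id, ...]} ranked best-first.
    recency: {doc_id: timestamp}, used only to break exact score ties."""
    scores = defaultdict(float)
    for channel, ranked in channel_results.items():
        weight = CHANNEL_WEIGHTS[channel]
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += weight / (k + rank)  # reciprocal-rank contribution
    # Sort by fused score descending, then by recency (newer first) at equal score.
    return sorted(scores, key=lambda d: (-scores[d], -recency.get(d, 0)))
```

Because contributions accumulate, a memory retrieved by several channels outranks one retrieved by a single channel at similar ranks — this is the "precision holds where channels agree" property.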

Canonical wiki instance: Cloudflare Agent Memory

Agent Memory's recall(query) pipeline implements exactly this:

Stage 1 — query analysis + embedding (concurrent). The query analyser produces ranked topic keys + FTS terms with synonyms + a HyDE statement. The raw query is embedded in parallel.

Stage 2 — five retrieval channels in parallel:

  1. FTS with Porter stemming — keyword precision for queries where you know the exact term.
  2. Exact fact-key lookup — the query maps directly to a known topic key.
  3. Raw-message FTS — safety net over stored conversation messages, catching verbatim detail the extraction generalised.
  4. Direct vector search — embedded raw query → ANN over memory vectors.
  5. HyDE vector search — embedded hypothetical-answer → ANN; catches abstract / multi-hop queries where question and answer use different vocabulary.
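The Stage 1 concurrency and Stage 2 fan-out can be sketched with `asyncio`. Every backend function below is a hypothetical stand-in (a real implementation would query FTS indexes, a key-value store, and an ANN index), returning a best-first list of memory ids per channel:

```python
import asyncio

# Hypothetical stand-ins for the real channel backends.
async def analyse_query(q):
    return {"topic_keys": [q], "fts_terms": [q],
            "hyde_statement": f"a plausible answer to: {q}"}
async def embed(text):             return [float(len(text))]  # placeholder vector
async def fts_search(terms):       return ["m1"]
async def fact_key_lookup(keys):   return ["m2"]
async def raw_message_fts(terms):  return ["m3"]
async def vector_search(vec):      return ["m4"]

async def recall(query):
    # Stage 1: query analysis and raw-query embedding run concurrently.
    analysis, query_vec = await asyncio.gather(analyse_query(query), embed(query))
    hyde_vec = await embed(analysis["hyde_statement"])
    # Stage 2: five channels fan out in parallel; a miss on one doesn't block the rest.
    ranked = await asyncio.gather(
        fts_search(analysis["fts_terms"]),
        fact_key_lookup(analysis["topic_keys"]),
        raw_message_fts(analysis["fts_terms"]),
        vector_search(query_vec),
        vector_search(hyde_vec),
    )
    names = ["fts", "fact_key", "raw_message", "direct_vector", "hyde_vector"]
    return dict(zip(names, ranked))
```

The per-channel result lists are then handed to the RRF fusion step.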

Stage 3 — RRF fusion:

"Fact-key matches get the highest weight because an exact topic match is the strongest signal. Full-text search, HyDE vectors, and direct vectors are each weighted based on strength of signal. Finally, raw message matches are also included with low weight as a safety net to identify candidate results the extraction pipeline may have missed. Ties are broken by recency, with newer results ranked higher."

— (Cloudflare, 2026-04-17)

Temporal computation is deterministic, not LLM'd:

"Temporal computation is handled deterministically via regex and arithmetic, not by the LLM. The results are injected into the synthesis prompt as pre-computed facts. Models are unreliable at things like date math, so we don't ask them to do it."

— (Cloudflare, 2026-04-17)
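A minimal sketch of that deterministic path, using regex plus `datetime` arithmetic instead of the model. The supported phrases and the prompt-injection format are illustrative assumptions:

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve_temporal(query: str, today: date):
    """Turn phrases like 'last Tuesday' into a concrete date, or None."""
    m = re.search(r"last (\w+)", query.lower())
    if not m:
        return None
    word = m.group(1)
    if word == "week":
        return today - timedelta(days=7)
    if word in WEEKDAYS:
        target = WEEKDAYS.index(word)
        # Most recent strictly-past occurrence of that weekday.
        delta = (today.weekday() - target) % 7 or 7
        return today - timedelta(days=delta)
    return None

# The resolved date is injected into the synthesis prompt as a pre-computed
# fact, e.g. "NOTE: 'last Tuesday' = 2025-06-10", rather than asking the
# LLM to do the date math.
```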

Stage 4 — synthesis takes top candidates and produces a natural-language answer to the original query.

Why fuse, not pick-one

Each channel has a query-shape sweet spot; plotted across all query shapes, per-channel recall traces a Pareto front rather than crowning a single winner. Running all channels and fusing means:

  • No query-class regression. The direct-vector channel still runs even when HyDE is added, so queries where raw-question embedding lands on the target don't regress.
  • Recall adds across channels where they disagree, precision holds where they agree (RRF gives co-retrieved items a cumulative boost).
  • Tunable without re-architecting. Channel weights are a config dial; a bad HyDE implementation can be downweighted rather than removed, preserving the other four channels.
  • Iterable. New channels can be added (e.g. "learned-relevance reranker over top 50") without disturbing existing ones.

Trade-offs

  • Latency. All channels run in parallel, so latency ≈ max(channel latency) + fusion + synthesis. LLM call cost is additive: Stage 1 (analyser) plus Stage 4 (synthesis).
  • Cost. 5× storage reads and two LLM calls per query (analysis + synthesis); HyDE adds a third. Partially mitigated by parallel execution of channels and prompt caching via session affinity.
  • Quality on single-channel queries. Neutral: RRF preserves correctly-ranked channel winners.
  • Quality on multi-hop / abstract queries. Significant lift: the HyDE channel catches these.
  • Implementation complexity. N channels + analyser + fusion + synthesis versus a single channel, but each stage is tunable independently.

Complementary moves on the write path

Read-side fusion is one attack on the question-answer vocabulary asymmetry. The symmetric attack on the write side is to prepend anticipated questions to the stored embedding text:

"The embedding text prepends the 3-5 search queries generated during classification to the memory content itself, bridging the gap between how memories are written (declaratively: 'user prefers dark mode') and how they're searched (interrogatively: 'what theme does the user want?')."

— (Cloudflare, 2026-04-17)

Combined: the asymmetry is attacked on both sides — questions become hypothetical answers (HyDE on read), stored facts carry anticipated questions (on write).
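The write-side move can be sketched as follows. The anticipated queries would come from the classification step; the joining format is an assumption for illustration:

```python
def build_embedding_text(memory: str, anticipated_queries: list) -> str:
    """Prepend the 3-5 anticipated questions to the declarative memory so
    the stored vector sits closer to interrogative query vectors."""
    return "\n".join(anticipated_queries + [memory])

# Hypothetical example: the declarative fact plus queries generated at
# classification time; `text` is what gets embedded and stored.
text = build_embedding_text(
    "user prefers dark mode",
    ["what theme does the user want?",
     "does the user like dark mode or light mode?",
     "what are the user's UI preferences?"],
)
```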
