
PATTERN

RAG candidate pruning cascade

RAG candidate pruning cascade is a two-LLM cascade where the first LLM emits freeform concepts from its universal knowledge, those concepts are embedded and used to retrieve a small neighbour set from a large candidate corpus, and only the pruned neighbour set is passed to the second LLM as the selection space. The net effect: the second LLM sees ~0.03% of the corpus instead of 100% of it, and all-in per-generation cost drops by 15–20% in the canonical instance below.

The pattern specifically lives inside a cascaded generation pipeline — it's the Phase-1→Phase-2 cost-shape that makes the cascade cheaper than a single-step LLM with the full candidate corpus in context.

Shape

[user context] ──► LLM#1 ──► freeform concepts ("eggs", "Mediterranean", "keto")
                                │ embed each concept
                          [embedding space]
                                │ k-NN against large candidate corpus
                          pruned candidate set (~100 / 300K)
                                │
                              LLM#2  ←── ~100 candidates in context
                                │       (vs 300K in single-step design)
                           selected outputs

The load-bearing invariants:

  • LLM#1 doesn't need to know the candidate corpus. It emits in natural language from its own prior. This makes LLM#1 cheap to run and cheap to adapt (different corpora, same LLM#1).
  • Embedding similarity is the router. A good embedding model is load-bearing — if "eggs" doesn't land near "Grade A eggs" + "organic eggs" + "liquid eggs" in embedding space, recall is capped regardless of LLM#2 quality.
  • LLM#2 sees only the pruned set. Its job is selection, not generation-from-vocabulary — it picks from ~100 candidates it can reason over jointly.
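
A minimal sketch of this shape end to end, assuming hypothetical call_llm, embed, and load_corpus helpers and an illustrative prompt; none of these names, prompts, or shapes come from the source, and the corpus embedding matrix is assumed to be precomputed offline:

```python
import numpy as np

# Hypothetical stand-ins -- the source names no APIs, models, or prompts.
def call_llm(prompt: str) -> str:
    """Call whichever LLM backs this stage; returns raw text."""
    raise NotImplementedError

def embed(texts: list[str]) -> np.ndarray:
    """One L2-normalised embedding row per input text."""
    raise NotImplementedError

def load_corpus() -> list[str]:
    """The large candidate corpus, e.g. ~300K keyword strings."""
    raise NotImplementedError

# Offline, once: embed the whole candidate corpus.
corpus = load_corpus()
corpus_emb = embed(corpus)                        # shape (n_corpus, d)

def generate(user_context: str, k: int = 100) -> str:
    # LLM#1: freeform concepts from its own prior -- it never sees the corpus.
    concepts = call_llm(
        f"List product concepts relevant to this shopper context:\n{user_context}"
    ).splitlines()

    # Router: embed the concepts, take the k nearest corpus entries per concept.
    concept_emb = embed(concepts)                 # (n_concepts, d)
    sims = concept_emb @ corpus_emb.T             # cosine sims (rows are normalised)
    top_idx = np.argsort(-sims, axis=1)[:, :k]
    pruned = {corpus[i] for row in top_idx for i in row}

    # LLM#2: selection over ~100s of candidates instead of the full corpus.
    return call_llm(
        f"Context: {user_context}\nPick the best candidates:\n" + "\n".join(sorted(pruned))
    )
```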

Canonical wiki instance — Instacart generative recommendations platform (2026-02-26)

Source: sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms

Instacart's Phase-1 page-design LLM emits "freeform product concepts" (the post's named example: "eggs") from its universal knowledge, and embeddings are generated for these concepts. For each theme, ~100 nearest neighbours are retrieved from the 300,000-term keyword corpus via embedding similarity, and only this pruned subset is passed to the Phase-2 keyword-generation LLM as its keyword-candidate set.

The post's cost disclosure:

"This first-pass candidate pruning reduces input context significantly in the second LLM, reducing all-in generation costs by 15–20% in each generation. This became a core motivator for adopting a cascaded generation architecture. A single-LLM setup would instead require the full keyword corpus to be passed directly into the prompt to maintain the same level of precision."

300K → ~100 = ~0.03% of the corpus visible to the second LLM, for 15–20% all-in generation cost reduction. The cost win is sub-linear in the corpus shrink ratio because the fixed cost of LLM#1 and the per-generation output tokens are not eliminated.
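
A back-of-envelope model of that sub-linearity. The token budgets below are illustrative assumptions, not figures from the post (which discloses only the 300K corpus, the ~100-candidate set, and the 15–20% outcome); the point is that pruning removes only the variable per-candidate term, so the saving shrinks as the fixed terms grow:

```python
# Illustrative token accounting; every number here is an assumption.
def all_in_savings(n_corpus: int, n_pruned: int, tok_per_candidate: float,
                   fixed_tokens: float, llm1_tokens: float) -> float:
    """Fraction of all-in tokens saved by pruning the LLM#2 candidate context."""
    single_step = fixed_tokens + n_corpus * tok_per_candidate             # no LLM#1, full corpus in prompt
    cascaded = fixed_tokens + llm1_tokens + n_pruned * tok_per_candidate  # LLM#1 call + pruned set
    return 1 - cascaded / single_step

print(f"corpus visible to LLM#2: {100 / 300_000:.3%}")      # ~0.033%
for fixed in (10_000, 300_000, 1_500_000):                   # assumed fixed budgets (prompts + outputs)
    saved = all_in_savings(300_000, 100, 1.0, fixed, 2_000)
    print(f"fixed={fixed:>9,} tokens -> all-in saving {saved:.1%}")
```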

Why it works

Two empirical facts compound:

  1. The candidate corpus is huge relative to the per-theme need. Only ~100 of 300K keywords are plausibly relevant to "Flavor builders for weeknight meals"; the remaining 299,900 are pure context overhead for LLM#2.
  2. Embedding similarity is a cheap router. The embedding cost is dominated by the one-time corpus embedding, which is pre-computed; the per-theme cost is just a query embedding plus a k-NN lookup, orders of magnitude cheaper than LLM tokens.
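
A sketch of the router step on its own, assuming L2-normalised corpus embeddings precomputed offline and a brute-force dot-product lookup (the source does not name an index technology); the per-theme work is one matrix-vector product and a partial sort:

```python
import numpy as np

def build_index(corpus_emb: np.ndarray) -> np.ndarray:
    """One-time, offline: L2-normalise the (n_corpus, d) embedding matrix."""
    return corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)

def prune(query_emb: np.ndarray, index: np.ndarray, k: int = 100) -> np.ndarray:
    """Per-theme, online: indices of the k nearest corpus entries by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = index @ q                        # (n_corpus,) dot products
    top = np.argpartition(-sims, k)[:k]     # O(n) partial selection of the k best
    return top[np.argsort(-sims[top])]      # sort just those k
```

At 300K candidates and a few hundred embedding dimensions this is tens of millions of multiply-adds per theme, which is why the router's marginal cost is negligible next to LLM context tokens.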

This is the same economic logic as cross-encoder reranking and LLM cascades, applied one level up: instead of cascading between models of different sizes, the cheap stage here is an embedding index that prunes the selection space before the second, more expensive LLM call.

When the pattern fits

  • The candidate space is a static-ish large corpus. Keywords, product catalogs, taxonomies, policy libraries, API schemas. Embeddings pay off when the corpus is stable enough to pre-embed.
  • LLM#1 can emit natural-language concepts. If the first-stage output has to be structured against the corpus (e.g. direct product IDs), embedding retrieval fails because the bridge language doesn't exist.
  • LLM#2's selection quality depends on joint reasoning over a short candidate list. If the selection is one-shot per candidate, the pattern reduces to retrieval-rerank.

When it doesn't

  • The corpus is small enough to pass whole. If you can fit 10K candidates in context for the same cost, pruning adds operational complexity for no win.
  • Embedding model is poorly aligned with the domain. Generic embeddings on specialist corpora (legal, medical, SKU taxonomies) can miss the right ~100 neighbours; recall failures show up as LLM#2 missing the right answer without evidence.
  • LLM#1's freeform concepts are too narrow. If the page-design LLM's prior is too general to cover the long tail of user contexts, the k-NN query misses regions of the corpus.

Failure modes

  • Neighbour-set recall hole. The right candidate is in the corpus but not in the top-100 — LLM#2 can't recover. Monitor recall against a ground-truth subset (a minimal monitoring sketch follows this list).
  • Phase-1 concept drift. A new user segment appears that Phase-1's freeform vocabulary doesn't cover; Phase-2's candidate set is systematically wrong for that segment.
  • Static-corpus staleness. New keywords / SKUs / policies added post-embedding-job aren't reachable. Re-embedding cadence is an explicit platform responsibility.
  • Embedding-model version skew. If LLM#1's concept embeddings and the corpus embeddings are generated by different models or versions, neighbourhood semantics drift. See concepts/embedding-version-skew.
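
One way to monitor the recall hole named above, assuming a small labelled slice of themes with known-relevant candidates; the source calls for recall monitoring but does not prescribe an implementation:

```python
def pruning_recall(ground_truth: dict[str, set[str]],
                   pruned_sets: dict[str, set[str]]) -> float:
    """Average per-theme fraction of known-relevant candidates that survive pruning.

    A miss here is unrecoverable downstream: LLM#2 cannot select a candidate
    that never reaches its context.
    """
    per_theme = [
        len(truth & pruned_sets.get(theme, set())) / len(truth)
        for theme, truth in ground_truth.items() if truth
    ]
    return sum(per_theme) / len(per_theme)

# Example over a tiny labelled slice (values are made up):
recall = pruning_recall(
    {"weeknight flavor builders": {"soy sauce", "garlic paste", "harissa"}},
    {"weeknight flavor builders": {"soy sauce", "harissa", "sesame oil"}},
)
print(f"recall@k on labelled slice: {recall:.0%}")   # 67%
```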

Relation to sibling patterns

Pattern                                        | What's cascaded                                | Coverage at LLM#2
RAG candidate pruning (this page)              | Two LLMs + embedding retrieval                 | ~0.03% of corpus
concepts/cross-encoder-reranking               | Bi-encoder retriever → cross-encoder reranker  | Top-K candidates
concepts/llm-cascade                           | Cheap LLM → expensive LLM on confidence gate   | 100% see cheap, tail sees expensive
patterns/head-cache-plus-tail-finetuned-model  | Head-cache → tail LLM                          | Head bypasses LLM entirely

All four are "cheap-then-authoritative" cost shapes at different granularities. RAG candidate pruning is distinctive because the router is embedding similarity, not a confidence score or a cache-hit bit.
