
PATTERN

RAG candidate pruning cascade

RAG candidate pruning cascade is a two-LLM cascade where the first LLM emits freeform concepts from its universal knowledge, those concepts are embedded and used to retrieve a small neighbour set from a large candidate corpus, and only the pruned neighbour set is passed to the second LLM as the selection space. The net effect: the second LLM sees ~0.03% of the corpus instead of 100% of it, and all-in per-generation cost drops by 15–20% in the canonical instance below.

The pattern specifically lives inside a cascaded generation pipeline — it's the Phase-1→Phase-2 cost-shape that makes the cascade cheaper than a single-step LLM with the full candidate corpus in context.

Shape

[user context] ──► LLM#1 ──► freeform concepts ("eggs", "Mediterranean", "keto")
                                │ embed each concept
                          [embedding space]
                                │ k-NN against large candidate corpus
                          pruned candidate set (~100 / 300K)
                                │
                              LLM#2  ←── ~100 candidates in context
                                │       (vs 300K in single-step design)
                           selected outputs

The load-bearing invariants:

  • LLM#1 doesn't need to know the candidate corpus. It emits in natural language from its own prior. This makes LLM#1 cheap to run and cheap to adapt (different corpora, same LLM#1).
  • Embedding similarity is the router. A good embedding model is load-bearing — if "eggs" doesn't land near "Grade A eggs" + "organic eggs" + "liquid eggs" in embedding space, recall is capped regardless of LLM#2 quality.
  • LLM#2 sees only the pruned set. Its job is selection, not generation-from-vocabulary — it picks from ~100 candidates it can reason over jointly.
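
A minimal sketch of this shape end to end, assuming hypothetical call_llm, embed, and load_corpus helpers and an illustrative prompt; none of these names, prompts, or shapes come from the source, and the corpus embedding matrix is assumed to be precomputed offline:

```python
import numpy as np

# Hypothetical stand-ins -- the source names no APIs, models, or prompts.
def call_llm(prompt: str) -> str:
    """Call whichever LLM backs this stage; returns raw text."""
    raise NotImplementedError

def embed(texts: list[str]) -> np.ndarray:
    """One L2-normalised embedding row per input text."""
    raise NotImplementedError

def load_corpus() -> list[str]:
    """The large candidate corpus, e.g. ~300K keyword strings."""
    raise NotImplementedError

# Offline, once: embed the whole candidate corpus.
corpus = load_corpus()
corpus_emb = embed(corpus)                        # shape (n_corpus, d)

def generate(user_context: str, k: int = 100) -> str:
    # LLM#1: freeform concepts from its own prior -- it never sees the corpus.
    concepts = call_llm(
        f"List product concepts relevant to this shopper context:\n{user_context}"
    ).splitlines()

    # Router: embed the concepts, take the k nearest corpus entries per concept.
    concept_emb = embed(concepts)                 # (n_concepts, d)
    sims = concept_emb @ corpus_emb.T             # cosine sims (rows are normalised)
    top_idx = np.argsort(-sims, axis=1)[:, :k]
    pruned = {corpus[i] for row in top_idx for i in row}

    # LLM#2: selection over ~100s of candidates instead of the full corpus.
    return call_llm(
        f"Context: {user_context}\nPick the best candidates:\n" + "\n".join(sorted(pruned))
    )
```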

Canonical wiki instance — Instacart generative recommendations platform (2026-02-26)

Source: sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms

Instacart's Phase-1 page-design LLM emits "freeform product concepts" (the post's named example: "eggs") from its universal knowledge, and embeddings are generated for these concepts. For each theme, ~100 nearest neighbours are retrieved from the 300,000-term keyword corpus via embedding similarity, and only this pruned subset is passed to the Phase-2 keyword-generation LLM as its keyword-candidate set.

The post's cost disclosure:

"This first-pass candidate pruning reduces input context significantly in the second LLM, reducing all-in generation costs by 15–20% in each generation. This became a core motivator for adopting a cascaded generation architecture. A single-LLM setup would instead require the full keyword corpus to be passed directly into the prompt to maintain the same level of precision."

300K → ~100 = ~0.03% of the corpus visible to the second LLM, for 15–20% all-in generation cost reduction. The cost win is sub-linear in the corpus shrink ratio because the fixed cost of LLM#1 and the per-generation output tokens are not eliminated.
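
A back-of-envelope model of that sub-linearity. The token budgets below are illustrative assumptions, not figures from the post (which discloses only the 300K corpus, the ~100-candidate set, and the 15–20% outcome); the point is that pruning removes only the variable per-candidate term, so the saving shrinks as the fixed terms grow:

```python
# Illustrative token accounting; every number here is an assumption.
def all_in_savings(n_corpus: int, n_pruned: int, tok_per_candidate: float,
                   fixed_tokens: float, llm1_tokens: float) -> float:
    """Fraction of all-in tokens saved by pruning the LLM#2 candidate context."""
    single_step = fixed_tokens + n_corpus * tok_per_candidate             # no LLM#1, full corpus in prompt
    cascaded = fixed_tokens + llm1_tokens + n_pruned * tok_per_candidate  # LLM#1 call + pruned set
    return 1 - cascaded / single_step

print(f"corpus visible to LLM#2: {100 / 300_000:.3%}")      # ~0.033%
for fixed in (10_000, 300_000, 1_500_000):                   # assumed fixed budgets (prompts + outputs)
    saved = all_in_savings(300_000, 100, 1.0, fixed, 2_000)
    print(f"fixed={fixed:>9,} tokens -> all-in saving {saved:.1%}")
```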

Why it works

Two empirical facts compound:

  1. The candidate corpus is huge relative to the per-theme need. Only ~100 of 300K keywords are plausibly relevant to "Flavor builders for weeknight meals"; the remaining 299,900 are pure context overhead for LLM#2.
  2. Embedding similarity is a cheap router. The embedding cost is dominated by the one-time corpus embedding, which is pre-computed; the per-theme cost is just a query embedding plus a k-NN lookup, orders of magnitude cheaper than LLM tokens.
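
A sketch of the router step on its own, assuming L2-normalised corpus embeddings precomputed offline and a brute-force dot-product lookup (the source does not name an index technology); the per-theme work is one matrix-vector product and a partial sort:

```python
import numpy as np

def build_index(corpus_emb: np.ndarray) -> np.ndarray:
    """One-time, offline: L2-normalise the (n_corpus, d) embedding matrix."""
    return corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)

def prune(query_emb: np.ndarray, index: np.ndarray, k: int = 100) -> np.ndarray:
    """Per-theme, online: indices of the k nearest corpus entries by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = index @ q                        # (n_corpus,) dot products
    top = np.argpartition(-sims, k)[:k]     # O(n) partial selection of the k best
    return top[np.argsort(-sims[top])]      # sort just those k
```

At 300K candidates and a few hundred embedding dimensions this is tens of millions of multiply-adds per theme, which is why the router's marginal cost is negligible next to LLM context tokens.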

This is the same economic logic as cross-encoder reranking and LLM cascades, applied one level up: instead of cascading between models of different sizes, the cheap stage here is an embedding index that prunes the selection space before the second, more expensive LLM call.

When the pattern fits

  • The candidate space is a static-ish large corpus. Keywords, product catalogs, taxonomies, policy libraries, API schemas. Embeddings pay off when the corpus is stable enough to pre-embed.
  • LLM#1 can emit natural-language concepts. If the first-stage output has to be structured against the corpus (e.g. direct product IDs), embedding retrieval fails because the bridge language doesn't exist.
  • LLM#2's selection quality depends on joint reasoning over a short candidate list. If the selection is one-shot per candidate, the pattern reduces to retrieval-rerank.

When it doesn't

  • The corpus is small enough to pass whole. If you can fit 10K candidates in context for the same cost, pruning adds operational complexity for no win.
  • Embedding model is poorly aligned with the domain. Generic embeddings on specialist corpora (legal, medical, SKU taxonomies) can miss the right ~100 neighbours; recall failures show up as LLM#2 missing the right answer without evidence.
  • LLM#1's freeform concepts are too narrow. If the page-design LLM's prior is too general to cover the long tail of user contexts, the k-NN query misses regions of the corpus.

Failure modes

  • Neighbour-set recall hole. The right candidate is in the corpus but not in the top-100 — LLM#2 can't recover. Monitor recall against a ground-truth subset (a minimal monitoring sketch follows this list).
  • Phase-1 concept drift. A new user segment appears that Phase-1's freeform vocabulary doesn't cover; Phase-2's candidate set is systematically wrong for that segment.
  • Static-corpus staleness. New keywords / SKUs / policies added post-embedding-job aren't reachable. Re-embedding cadence is an explicit platform responsibility.
  • Embedding-model version skew. If LLM#1's concept embeddings and the corpus embeddings are generated by different models or versions, neighbourhood semantics drift. See concepts/embedding-version-skew.
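
One way to monitor the recall hole named above, assuming a small labelled slice of themes with known-relevant candidates; the source calls for recall monitoring but does not prescribe an implementation:

```python
def pruning_recall(ground_truth: dict[str, set[str]],
                   pruned_sets: dict[str, set[str]]) -> float:
    """Average per-theme fraction of known-relevant candidates that survive pruning.

    A miss here is unrecoverable downstream: LLM#2 cannot select a candidate
    that never reaches its context.
    """
    per_theme = [
        len(truth & pruned_sets.get(theme, set())) / len(truth)
        for theme, truth in ground_truth.items() if truth
    ]
    return sum(per_theme) / len(per_theme)

# Example over a tiny labelled slice (values are made up):
recall = pruning_recall(
    {"weeknight flavor builders": {"soy sauce", "garlic paste", "harissa"}},
    {"weeknight flavor builders": {"soy sauce", "harissa", "sesame oil"}},
)
print(f"recall@k on labelled slice: {recall:.0%}")   # 67%
```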

Relation to sibling patterns

Pattern                                        | What's cascaded                                | Coverage at LLM#2
RAG candidate pruning (this page)              | Two LLMs + embedding retrieval                 | ~0.03% of corpus
concepts/cross-encoder-reranking               | Bi-encoder retriever → cross-encoder reranker  | Top-K candidates
concepts/llm-cascade                           | Cheap LLM → expensive LLM on confidence gate   | 100% see cheap, tail sees expensive
patterns/head-cache-plus-tail-finetuned-model  | Head-cache → tail LLM                          | Head bypasses LLM entirely

All four are "cheap-then-authoritative" cost shapes at different granularities. RAG candidate pruning is distinctive because the router is embedding similarity, not a confidence score or a cache-hit bit.
