PATTERN Cited by 1 source
RAG candidate pruning cascade¶
A RAG candidate pruning cascade is a two-LLM pipeline: the first LLM emits freeform concepts from its universal knowledge; those concepts are embedded and used to retrieve a small neighbour set from a large candidate corpus; and only the pruned neighbour set is passed to the second LLM as the selection space. The net effect: the second LLM sees ~0.03% of the corpus instead of 100% of it, and all-in per-generation cost drops by a material double-digit percentage.
The pattern specifically lives inside a cascaded generation pipeline — it's the Phase-1→Phase-2 cost-shape that makes the cascade cheaper than a single-step LLM with the full candidate corpus in context.
Shape¶
```
[user context] ──► LLM#1 ──► freeform concepts ("eggs", "Mediterranean", "keto")
                                  │
                                  │ embed each concept
                                  ▼
                          [embedding space]
                                  │
                                  │ k-NN against large candidate corpus
                                  ▼
                  pruned candidate set (~100 / 300K)
                                  │
                                  ▼
                               LLM#2 ←── ~100 candidates in context
                                  │      (vs 300K in single-step design)
                                  ▼
                          selected outputs
```
The load-bearing invariants:
- LLM#1 doesn't need to know the candidate corpus. It emits in natural language from its own prior. This makes LLM#1 cheap to run and cheap to adapt (different corpora, same LLM#1).
- Embedding similarity is the router. A good embedding model is load-bearing — if "eggs" doesn't land near "Grade A eggs" + "organic eggs" + "liquid eggs" in embedding space, recall is capped regardless of LLM#2 quality.
- LLM#2 sees only the pruned set. Its job is selection, not generation-from-vocabulary — it picks from ~100 candidates it can reason over jointly.
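The three invariants can be sketched end-to-end. A minimal illustration of the pruning step, assuming pre-normalised embedding vectors and a toy corpus (the `knn_prune` helper and all vectors here are hypothetical, not Instacart's implementation):

```python
import numpy as np

def knn_prune(concept_vecs: np.ndarray,
              corpus_vecs: np.ndarray,
              k: int = 100) -> np.ndarray:
    """Return indices of the union of k nearest corpus rows per concept.

    Both inputs are L2-normalised, so dot product == cosine similarity.
    """
    sims = concept_vecs @ corpus_vecs.T       # (n_concepts, corpus_size)
    top_k = np.argsort(-sims, axis=1)[:, :k]  # k best per concept
    return np.unique(top_k)                   # dedupe across concepts

# Toy demo: 3 concept vectors pruning a 300-row "corpus".
rng = np.random.default_rng(0)
corpus = rng.normal(size=(300, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
concepts = corpus[[5, 42, 250]] + 0.01 * rng.normal(size=(3, 64))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

pruned = knn_prune(concepts, corpus, k=5)
# pruned holds at most 15 of 300 indices; only these rows' keywords
# would be placed in the second LLM's context.
```

The union across concepts matters: one theme can emit several concepts, and the candidate set the second LLM sees is their merged neighbourhoods.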
Canonical wiki instance — Instacart generative recommendations platform (2026-02-26)¶
Source: sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms
Instacart's Phase-1 page-design LLM emits "freeform product concepts" (the post's named example: "eggs") from its universal knowledge, and embeddings are generated for these concepts. For each theme, ~100 nearest neighbours are retrieved from a 300,000-term keyword corpus via embedding similarity; only this pruned subset is passed to the Phase-2 keyword-generation LLM as its candidate set.
The post's cost disclosure:
"This first-pass candidate pruning reduces input context significantly in the second LLM, reducing all-in generation costs by 15–20% in each generation. This became a core motivator for adopting a cascaded generation architecture. A single-LLM setup would instead require the full keyword corpus to be passed directly into the prompt to maintain the same level of precision."
300K → ~100 = ~0.03% of the corpus visible to the second LLM, for 15–20% all-in generation cost reduction. The cost win is sub-linear in the corpus shrink ratio because the fixed cost of LLM#1 and the per-generation output tokens are not eliminated.
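A back-of-envelope sketch of that sub-linearity, with an assumed cost breakdown (the post discloses only the 15–20% headline, not the split):

```python
# Made-up split of all-in generation cost. Only the candidate-corpus
# context tokens shrink under pruning; Phase-1 and Phase-2's fixed
# prompt + output tokens are untouched.
cost = {
    "llm1_in_out":     0.30,  # Phase-1 prompt + output, untouched
    "llm2_fixed":      0.50,  # Phase-2 instructions + output, untouched
    "llm2_corpus_ctx": 0.20,  # corpus tokens in context: the pruned part
}
shrink = 100 / 300_000                 # ~0.033% of the corpus survives
saved = cost["llm2_corpus_ctx"] * (1 - shrink)
print(f"all-in saving ≈ {saved:.1%}")  # ≈ 20.0%, despite a 99.97% shrink
```

Under this (assumed) split, a near-total corpus shrink caps out at the corpus-context share of the bill, which is why the headline number lands at 15–20% rather than anywhere near 99%.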
Why it works¶
Two empirical facts compound:
- The candidate corpus is huge relative to the per-theme need. Only ~100 of 300K keywords are plausibly relevant to "Flavor builders for weeknight meals"; the remaining 299,900 are pure context overhead for LLM#2.
- Embedding similarity is a cheap router. The embedding cost is dominated by the one-time corpus embedding (pre-computed) plus a cheap per-theme query-embedding + k-NN lookup. Orders of magnitude cheaper than LLM tokens.
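To make "orders of magnitude cheaper" concrete, a rough sketch with assumed prices and token counts (none of these figures are from the post):

```python
# Assumed figures, for scale only.
corpus_terms = 300_000
tokens_per_term = 3
embed_price_per_1k = 0.0001  # assumed $/1K embedding tokens
llm_price_per_1k = 0.01      # assumed $/1K LLM input tokens

# One-time, amortised: embed the whole corpus once. ~$0.09.
one_time_corpus_embed = corpus_terms * tokens_per_term / 1000 * embed_price_per_1k

# Per theme: embed a ~10-token concept query, then a k-NN lookup. ~$0.000001.
per_theme_query = 10 / 1000 * embed_price_per_1k

# Counterfactual: the full corpus as LLM input tokens, every generation. ~$9.
full_corpus_in_prompt = corpus_terms * tokens_per_term / 1000 * llm_price_per_1k
```

The asymmetry is the point: the expensive step is paid once and amortised, while the per-generation router cost is negligible next to what the corpus would cost as recurring LLM context.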
This is the same economic logic as cross-encoder reranking and LLM cascades, applied one level up: instead of cascading between models of different sizes, the cascade puts a cheap LLM plus an embedding index in front of the expensive selection LLM.
When the pattern fits¶
- The candidate space is a static-ish large corpus. Keywords, product catalogs, taxonomies, policy libraries, API schemas. Embeddings pay off when the corpus is stable enough to pre-embed.
- LLM#1 can emit natural-language concepts. If the first-stage output has to be structured against the corpus (e.g. direct product IDs), embedding retrieval fails because the bridge language doesn't exist.
- LLM#2's selection quality depends on joint reasoning over a short candidate list. If the selection is one-shot per candidate, the pattern reduces to retrieval-rerank.
When it doesn't¶
- The corpus is small enough to pass whole. If you can fit 10K candidates in context for the same cost, pruning adds operational complexity for no win.
- Embedding model is poorly aligned with the domain. Generic embeddings on specialist corpora (legal, medical, SKU taxonomies) can miss the right ~100 neighbours; these recall failures are silent, because LLM#2 never sees the right answer and still produces a plausible-looking selection.
- LLM#1's freeform concepts are too narrow. If the page-design LLM's prior is too general to cover the long tail of user contexts, the k-NN query misses regions of the corpus.
Failure modes¶
- Neighbour-set recall hole. The right candidate is in the corpus but not in the top-100 — LLM#2 can't recover. Monitor recall against a ground-truth subset.
- Phase-1 concept drift. A new user segment appears that Phase-1's freeform vocabulary doesn't cover; Phase-2's candidate set is systematically wrong for that segment.
- Static-corpus staleness. New keywords / SKUs / policies added post-embedding-job aren't reachable. Re-embedding cadence is an explicit platform responsibility.
- Embedding-model version skew. If LLM#1's concept embeddings and the corpus embeddings are generated by different models or versions, neighbourhood semantics drift. See concepts/embedding-version-skew.
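The recall monitoring the first failure mode calls for can be a simple offline check against a labelled subset (theme names, keyword sets, and the `neighbour_recall` helper below are all hypothetical):

```python
def neighbour_recall(retrieved: dict[str, set[str]],
                     ground_truth: dict[str, set[str]]) -> float:
    """Fraction of ground-truth keywords that survive pruning, averaged
    across themes. A recall hole here is unrecoverable downstream:
    LLM#2 cannot select a candidate it never sees."""
    scores = []
    for theme, truth in ground_truth.items():
        if not truth:
            continue
        hits = len(truth & retrieved.get(theme, set()))
        scores.append(hits / len(truth))
    return sum(scores) / len(scores)

# Hypothetical labelled subset: which keywords *should* survive pruning.
truth = {"weeknight flavor builders": {"soy sauce", "garlic", "harissa"}}
got   = {"weeknight flavor builders": {"soy sauce", "garlic", "ketchup"}}
print(neighbour_recall(got, truth))  # 2/3: "harissa" was pruned away
```

Tracking this per segment, rather than only in aggregate, also surfaces the Phase-1 concept-drift failure mode: a new user segment shows up as a recall cliff on its themes.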
Relation to sibling patterns¶
| Pattern | What's cascaded | Coverage at LLM#2 |
|---|---|---|
| RAG candidate pruning (this page) | Two LLMs + embedding retrieval | ~0.03% of corpus |
| concepts/cross-encoder-reranking | Bi-encoder retriever → cross-encoder reranker | Top-K candidates |
| concepts/llm-cascade | Cheap LLM → expensive LLM on confidence gate | 100% see cheap, tail sees expensive |
| patterns/head-cache-plus-tail-finetuned-model | Head-cache → tail LLM | Head bypasses LLM entirely |
All four are "cheap-then-authoritative" cost shapes at different granularities. RAG candidate pruning is distinctive because the router is embedding similarity, not a confidence score or a cache-hit bit.
Seen in¶
- sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms — canonical wiki instance at Instacart's Phase-1→Phase-2 cascade. 300K-term keyword corpus, ~100 nearest-neighbour subset per theme, 15–20% all-in cost reduction per generation. Named by Instacart as "a core motivator for adopting a cascaded generation architecture."
Related¶
- patterns/top-down-cascaded-page-generation — the host pattern this one lives inside.
- patterns/teacher-student-model-compression — complementary Phase-2 cost lever (cheaper LLM#2 via distillation).
- concepts/retrieval-augmented-generation — the parent concept.
- concepts/vector-embedding — the retrieval mechanism.
- concepts/cascaded-llm-generation — the parent concept.
- concepts/context-engineering — the discipline this sits inside.
- systems/instacart-generative-recommendations-platform — canonical production consumer.
- companies/instacart