PATTERN
Retrieve-then-rank with an LLM¶
Summary¶
A two-stage cascaded-inference pattern for applying LLMs to large candidate populations:
- Stage 1 — retriever. A cheap, high-recall primitive (heuristics, lexical, vector, or hybrid) narrows an intractably large population to a rank-tractable set — typically 10-1000 candidates.
- Stage 2 — LLM ranker. An LLM scores/orders the narrowed set and produces a ranked short-list (top-K) to present to the user or downstream system.
The asymmetric cost structure — retriever runs on every request, LLM runs only on a narrowed set — is what makes LLM-based ranking affordable at production request volume.
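The two-stage shape can be sketched as a minimal pipeline. Everything here — the `retrieve` and `llm_rank` callables and the string candidate type — is an illustrative placeholder, not an API from any of the cited systems:

```python
# Minimal sketch of the two-stage cascade. `retrieve` and `llm_rank` are
# hypothetical stand-ins for the two stages, not a real API.
from typing import Callable, List

def retrieve_then_rank(
    query: str,
    population: List[str],
    retrieve: Callable[[str, List[str]], List[str]],  # stage 1: cheap, high-recall
    llm_rank: Callable[[str, List[str]], List[str]],  # stage 2: expensive, precise
    k: int = 5,
) -> List[str]:
    # Stage 1 runs over the full population on every request...
    narrowed = retrieve(query, population)            # e.g. thousands -> hundreds
    # ...while the LLM only ever sees the narrowed, rank-tractable set.
    return llm_rank(query, narrowed)[:k]              # hundreds -> top-K
```

The asymmetry lives entirely in which callable sees how many candidates; swapping either stage (heuristic vs. vector retrieval, LLM vs. cross-encoder ranking) leaves the skeleton unchanged.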
Canonical wiki reference¶
Meta's web-monorepo RCA system (2024-06; sources/2024-08-23-meta-leveraging-ai-for-efficient-incident-response) is the canonical instance:
- Retriever. Heuristic retrieval via code + directory ownership and runtime code-graph exploration of impacted systems. Narrows "thousands of changes to a few hundred."
- Ranker. Fine-tuned Llama 2 (7B) running ranking-via-election (B=20, K=5) to collapse few-hundred → top-5.
- Outcome. 42% top-5 accuracy at investigation-creation time on backtested historical investigations.
Why cascade¶
Three structural reasons the two-stage split outperforms either stage alone:
- Cost. LLMs are 2-4 orders of magnitude more expensive per inference than retrievers. Running the LLM over every candidate is prohibitive; running it over a pre-narrowed set is affordable.
- Latency. Retrievers respond in milliseconds; LLM calls take seconds. Narrowing before the LLM call keeps end-to-end latency acceptable.
- Quality ceilings. The retriever's recall caps end-to-end accuracy (the correct answer must survive stage 1). The ranker's precision caps how cleanly the top-K isolates the correct answer from the narrowed set. Both must meet their respective bars independently.
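The cost argument is easy to make concrete with back-of-envelope arithmetic. The per-call prices and set sizes below are illustrative round numbers consistent with the "2-4 orders of magnitude" gap, not figures from the source:

```python
# Back-of-envelope cost comparison: LLM over every candidate vs. LLM only
# over a retriever-narrowed set. All numbers are illustrative assumptions.
N = 10_000              # raw candidate population per request
narrowed = 200          # stage-1 output size
retriever_cost = 1e-5   # $ per candidate scored by the cheap primitive
llm_cost = 1e-2         # $ per candidate scored by the LLM (3 orders pricier)

full_scan = N * llm_cost                            # LLM over everything
cascade = N * retriever_cost + narrowed * llm_cost  # retriever + narrowed LLM
print(full_scan, cascade)                           # 100.0 vs 2.1 dollars/request
```

At these assumed prices the cascade is ~50x cheaper per request, and almost all of its residual cost still sits in stage 2 — which is why keeping the retriever's output tight matters so much.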
The retriever choice space¶
- Heuristic — domain rules (ownership, code graph, time windows). Fast, interpretable, limited to encoded knowledge. Meta RCA picks this.
- Lexical (BM25) — term-frequency scoring. Fast, interpretable, limited to surface-level keyword match. Canonical for text search.
- Vector + ANN — learned embeddings + approximate nearest-neighbour search. Handles semantic similarity; needs embedding infra.
- Hybrid — combined lexical + vector. Industry default for document search.
Choice is domain-driven. Monorepo RCA has structured ownership + code graph (heuristics win); open-domain document search benefits from hybrid.
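A toy sketch of the hybrid option — fusing a lexical overlap score with an embedding cosine. Production systems use BM25 and an ANN index; the scorers, the linear fusion, and the `alpha` weighting here are all illustrative assumptions:

```python
# Toy hybrid retriever: linear fusion of lexical overlap + embedding cosine.
# Stand-in for BM25 + ANN; every scoring choice here is illustrative.
import math
from typing import Callable, List

def lexical_score(query: str, doc: str) -> float:
    # Crude proxy for BM25: fraction of query terms appearing in the doc.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(
    query: str,
    docs: List[str],
    embed: Callable[[str], List[float]],  # hypothetical embedding function
    alpha: float = 0.5,                   # lexical/vector mixing weight
    n: int = 100,
) -> List[str]:
    qv = embed(query)
    scored = [
        (alpha * lexical_score(query, d) + (1 - alpha) * cosine(qv, embed(d)), d)
        for d in docs
    ]
    return [d for _, d in sorted(scored, reverse=True)[:n]]
```

The heuristic option from Meta's RCA system would replace both scorers with ownership and code-graph rules; the cascade skeleton around it is unchanged.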
The ranker choice space¶
- LLM natural-language top-K — prompt the LLM with N candidates, parse top-K from response. Meta RCA's primary path, via concepts/ranking-via-election for N > one-prompt capacity.
- LLM logprobs. Score each candidate under a fixed prompt template; rank by logprob. Produces calibrated continuous scores for confidence thresholding. Meta RCA's secondary path via a dedicated SFT round.
- Cross-encoder. A smaller Transformer that scores each (query, candidate) pair jointly. Cheaper than an LLM; less reasoning capacity. Canonical for document-search reranking.
- Pointwise classifier. Small domain-trained model outputs score per (query, candidate). Cheapest; weakest.
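The logprob path can be sketched as pointwise scoring under a fixed template. The `logprob` callable stands in for a real model call (e.g. the summed token logprobs of a "yes" continuation); the template wording and threshold mechanics are illustrative, not from the source:

```python
# Sketch of the logprob ranking path: score each candidate independently
# under a fixed prompt template, rank by score, optionally threshold.
# `logprob` is a hypothetical stand-in for a model scoring call.
from typing import Callable, List, Optional, Tuple

def logprob_rank(
    query: str,
    candidates: List[str],
    logprob: Callable[[str], float],  # higher = model finds it more likely
    template: str = "Investigation: {q}\nCandidate change: {c}\nRoot cause? yes",
    threshold: Optional[float] = None,
) -> List[Tuple[str, float]]:
    scored = [(c, logprob(template.format(q=query, c=c))) for c in candidates]
    scored.sort(key=lambda cs: cs[1], reverse=True)
    if threshold is not None:
        # Calibrated continuous scores make confidence cutoffs like this possible,
        # which the natural-language top-K path does not support.
        scored = [(c, s) for c, s in scored if s >= threshold]
    return scored
```

The natural-language top-K path instead puts all N candidates in one prompt and parses an ordered list out of the response, trading calibrated scores for cross-candidate comparison in a single call.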
How to decide top-K and batch size¶
- K is driven by downstream consumption. Human reviewers can scan ~5-10 items before fatigue; automated downstreams may take the top-1.
- Batch size (for ranking-via-election) is driven by ranker context window + reasoning quality per batch. Meta uses B=20; expect B=10-50 for modern LLMs depending on per-candidate context size.
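The election mechanics implied by B and K can be sketched as a tournament: rank candidates in batches of at most B, keep the top K of each batch, and repeat until at most K survive. B=20 and K=5 are Meta's values; `rank_batch`, standing in for one LLM ranking call, is a hypothetical placeholder:

```python
# Sketch of ranking-via-election: batched tournament rounds that collapse
# N candidates to a top-K. `rank_batch` stands in for one LLM ranking call
# that orders at most B candidates.
from typing import Callable, List

def election_rank(
    candidates: List[str],
    rank_batch: Callable[[List[str]], List[str]],  # one LLM call per batch
    B: int = 20,   # batch size: bounded by context window + reasoning quality
    K: int = 5,    # survivors per batch, and the final short-list size
) -> List[str]:
    pool = candidates
    while len(pool) > K:
        survivors: List[str] = []
        for i in range(0, len(pool), B):
            survivors.extend(rank_batch(pool[i:i + B])[:K])
        if len(survivors) == len(pool):  # guard: round failed to shrink the pool
            break
        pool = survivors
    return rank_batch(pool)[:K]          # final ordering of the short-list
```

With B=20 and K=5, a few hundred candidates collapse in two to three rounds, matching the few-hundred → top-5 shape of the Meta RCA system.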
Caveats¶
- Retriever recall is the ceiling. If the correct answer is filtered out in stage 1, the ranker cannot recover. Retriever recall must be measured and tracked as a first-class metric.
- Data + format drift. SFT'd rankers and heuristic retrievers both need refresh as the underlying distribution shifts (new codebase structure, new change patterns, new incident types).
- Cost asymmetry can be misleading. Per call the LLM is expensive, and the number of LLM calls in ranking-via-election grows with N: the first round alone takes ⌈N/B⌉ calls, and subsequent rounds add a further (geometrically shrinking) tail. Cost grows quickly as the retriever's output grows; keep retriever output tight.
- Cascading failure modes. A bug in the retriever (e.g. a missing ownership rule) can systematically bias the candidate set in a way the ranker cannot detect. End-to-end evaluation against ground truth catches this; unit-testing each stage does not.
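Tracking retriever recall as a first-class metric reduces to a simple question over labeled cases: how often does the known-correct answer survive stage 1? A minimal sketch, with hypothetical field names:

```python
# Sketch of measuring stage-1 recall against ground truth: the fraction of
# labeled cases where the true answer survives the retriever. The case dict
# shape ("query", "culprit") is an illustrative assumption.
from typing import Callable, Dict, List

def retriever_recall(
    cases: List[Dict[str, str]],            # each: {"query": ..., "culprit": ...}
    retrieve: Callable[[str], List[str]],   # the stage-1 retriever under test
) -> float:
    hits = sum(1 for case in cases if case["culprit"] in retrieve(case["query"]))
    return hits / len(cases)
```

Because this measures stage 1 against end-to-end ground truth rather than against its own unit tests, it also catches the systematic-bias failure mode above: a missing ownership rule shows up as recall loss even when every stage passes its local checks.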
Seen in¶
- sources/2024-08-23-meta-leveraging-ai-for-efficient-incident-response — canonical RCA instance (heuristic retriever + Llama-2 ranker + ranking-via-election).
- sources/2026-03-06-pinterest-unified-context-intent-embeddings-for-scalable-text-to-sql — Pinterest's Analytics Agent applies retrieve-then-rank to the Text-to-SQL task over query history. Stage 1: analytical intent retrieval — embed the user question into the same unified context-intent embedding space as past-query descriptions, top-k by cosine. Stage 2: governance-tier ranking fusion — re-rank candidates by fusing similarity scores with tier + freshness + documentation + ownership + query-success signals. A new wiki instance of the pattern where the retriever is a learned-intent embedding index and the ranker is a rule-based fusion function — structurally distinct from both Meta's RCA shape (heuristic retrieval + LLM ranker) and the generic document-search shape (hybrid + cross-encoder). Shows the pattern is robust to substituting either stage independently.
Related¶
- concepts/llm-based-ranker — the stage-2 role.
- concepts/heuristic-retrieval — the stage-1 option Meta picked.
- concepts/hybrid-retrieval-bm25-vectors — the document-search stage-1 alternative.
- concepts/ranking-via-election — the prompt-structure primitive for N > one-prompt.
- concepts/cross-encoder-reranking — the cheaper stage-2 alternative for large N.
- concepts/llm-cascade — the sibling cascade pattern at model-size level (small → large).
- concepts/governance-aware-ranking — Pinterest's rule-based stage-2 variant.
- patterns/analytical-intent-retrieval — Pinterest's stage-1 variant.
- patterns/governance-tier-ranking-fusion — Pinterest's stage-2 implementation.
- patterns/closed-feedback-loop-ai-features — safety discipline that pairs with this architecture.
- systems/meta-rca-system — canonical Meta instance.
- systems/pinterest-analytics-agent — canonical Pinterest instance.