PATTERN Cited by 2 sources

Retrieve-then-rank with an LLM

Summary

A two-stage cascaded-inference pattern for applying LLMs to large candidate populations:

  1. Stage 1 — retriever. A cheap, high-recall primitive (heuristics, lexical, vector, or hybrid) narrows an intractably large population to a rank-tractable set — typically 10-1000 candidates.
  2. Stage 2 — LLM ranker. An LLM scores/orders the narrowed set and produces a ranked short-list (top-K) to present to the user or downstream system.

The asymmetric cost structure — retriever runs on every request, LLM runs only on a narrowed set — is what makes LLM-based ranking affordable at production request volume.
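The two-stage shape can be sketched as a minimal skeleton (the substring retriever and length-based ranker below are toy stand-ins for illustration, not real implementations):

```python
from typing import Callable, List

def retrieve_then_rank(
    query: str,
    population: List[str],
    retriever: Callable[[str, List[str]], List[str]],  # cheap, high-recall
    ranker: Callable[[str, List[str]], List[str]],     # expensive, high-precision
    k: int = 5,
) -> List[str]:
    """Two-stage cascade: narrow with a cheap retriever, order with an LLM ranker."""
    candidates = retriever(query, population)  # runs over the full population
    ranked = ranker(query, candidates)         # runs only over the narrowed set
    return ranked[:k]

# Toy stand-ins: substring match as the retriever, length as the "LLM" score.
retriever = lambda q, pop: [c for c in pop if q in c]
ranker = lambda q, cands: sorted(cands, key=len)

population = ["deploy api", "deploy web frontend", "docs update", "deploy cache"]
print(retrieve_then_rank("deploy", population, retriever, ranker, k=2))
# → ['deploy api', 'deploy cache']
```

The interface matters more than the internals: any retriever/ranker pair with these signatures drops into the same cascade.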

Canonical wiki reference

Meta's web-monorepo RCA system (2024-06; sources/2024-08-23-meta-leveraging-ai-for-efficient-incident-response) is the canonical instance:

  • Retriever. Heuristic retrieval via code + directory ownership and runtime code-graph exploration of impacted systems. Narrows "thousands of changes to a few hundred."
  • Ranker. Fine-tuned Llama 2 (7B) running ranking-via-election (B=20, K=5) to collapse few-hundred → top-5.
  • Outcome. 42% top-5 accuracy at investigation-creation time on backtested historical investigations.

Why cascade

Three structural reasons the two-stage split outperforms either stage alone:

  1. Cost. LLMs are 2-4 orders of magnitude more expensive per inference than retrievers. Running the LLM over every candidate is prohibitive; running it over a pre-narrowed set is affordable.
  2. Latency. Retrievers are millisecond; LLM calls are seconds. Narrowing before the LLM call keeps end-to-end latency acceptable.
  3. Quality decomposition. The retriever's recall is the ceiling on end-to-end accuracy: if the correct answer does not survive stage 1, no ranker can recover it. The ranker's precision then determines how cleanly the top-K isolates the correct answer within the narrowed set. Both must meet their respective bars independently.
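The cost argument is back-of-envelope arithmetic. A sketch with illustrative numbers (the per-candidate costs below are assumptions, not figures from the source):

```python
# Illustrative cost model (assumed numbers, not from the source):
population = 100_000    # candidates per request
narrowed = 200          # retriever output size
retriever_cost = 1e-6   # $ per candidate scored (heuristic/lexical)
llm_cost = 1e-3         # $ per candidate ranked (3 orders of magnitude more)

naive = population * llm_cost  # LLM over the entire population
cascade = population * retriever_cost + narrowed * llm_cost

print(f"naive:   ${naive:.2f}")    # → naive:   $100.00
print(f"cascade: ${cascade:.2f}")  # → cascade: $0.30
```

At these assumed rates the cascade is ~300x cheaper per request, which is the whole economic case for the pattern.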

The retriever choice space

  • Heuristic — domain rules (ownership, code graph, time windows). Fast, interpretable, limited to encoded knowledge. Meta RCA picks this.
  • Lexical (BM25) — term-frequency scoring. Fast, interpretable, limited to surface-level keyword match. Canonical for text search.
  • Vector + ANN — learned embeddings + approximate nearest-neighbour search. Handles semantic similarity; needs embedding infra.
  • Hybrid — combined lexical + vector. Industry default for document search.

Choice is domain-driven. Monorepo RCA has structured ownership + code graph (heuristics win); open-domain document search benefits from hybrid.
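For the hybrid option, reciprocal rank fusion (RRF) is a common way to combine lexical and vector rankings without tuning score scales. A minimal sketch (document names and the k=60 constant are conventional illustration, not from the source):

```python
from typing import Dict, List

def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each list contributes 1/(k + rank) per
    document, so agreement across retrievers beats a single high rank."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["doc_a", "doc_b", "doc_c"]  # e.g. BM25 order
vector  = ["doc_c", "doc a".replace(" ", "_"), "doc_d"]  # e.g. ANN order
vector  = ["doc_c", "doc_a", "doc_d"]
print(rrf([lexical, vector]))
# → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

doc_a wins here because it places well in both lists, which is exactly the behaviour that makes hybrid the document-search default.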

The ranker choice space

  • LLM natural-language top-K — prompt the LLM with N candidates, parse top-K from response. Meta RCA's primary path, via concepts/ranking-via-election for N > one-prompt capacity.
  • LLM logprobs. Score each candidate under a fixed prompt template; rank by logprob. Produces calibrated continuous scores for confidence thresholding. Meta RCA's secondary path via a dedicated SFT round.
  • Cross-encoder. Smaller Transformer scoring (query, candidate). Cheaper than LLM; less reasoning capacity. Canonical for document search reranking.
  • Pointwise classifier. Small domain-trained model outputs score per (query, candidate). Cheapest; weakest.
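The logprob path can be sketched generically: score each candidate under a fixed template, rank, and threshold for confidence. The `fake_logprob` stand-in below substitutes token overlap for a real LLM call, and the 0.25 threshold is an illustrative assumption:

```python
import math
from typing import Callable, List, Tuple

def rank_by_logprob(
    query: str,
    candidates: List[str],
    logprob_yes: Callable[[str, str], float],  # log P("yes" | template(query, cand))
    threshold: float = math.log(0.25),         # assumed confidence cutoff
) -> List[Tuple[str, float]]:
    """Pointwise ranking: score each candidate independently under a fixed
    prompt template, sort by score, drop low-confidence candidates."""
    scored = [(c, logprob_yes(query, c)) for c in candidates]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [(c, s) for c, s in scored if s >= threshold]

# Toy stand-in for an LLM call: token overlap as a fake "logprob".
def fake_logprob(query: str, cand: str) -> float:
    overlap = len(set(query.split()) & set(cand.split()))
    return math.log(max(overlap, 1e-6) / (len(cand.split()) or 1))

print(rank_by_logprob("cache outage", ["cache config change", "docs change"], fake_logprob))
```

The continuous scores are what distinguish this path from natural-language top-K: they support calibrated thresholding, so the system can abstain instead of surfacing a low-confidence candidate.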

How to decide top-K and batch size

  • K is driven by downstream consumption. Human reviewers can scan ~5-10 items before fatigue; automated downstreams may take the top-1.
  • Batch size (for ranking-via-election) is driven by ranker context window + reasoning quality per batch. Meta uses B=20; expect B=10-50 for modern LLMs depending on per-candidate context size.
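Ranking-via-election with these parameters is a batched elimination tournament. A sketch with Meta's B=20, K=5 (the numeric-sort "LLM" is a toy stand-in so the example is runnable):

```python
from typing import Callable, List

def ranking_via_election(
    candidates: List[str],
    rank_batch: Callable[[List[str]], List[str]],  # one LLM call: orders one batch
    B: int = 20,   # batch size per LLM call
    K: int = 5,    # winners advancing from each batch
) -> List[str]:
    """Tournament ranking: split into batches of B, keep each batch's
    top-K, repeat until one batch remains, then rank it for the final top-K."""
    pool = list(candidates)
    while len(pool) > B:
        survivors = []
        for i in range(0, len(pool), B):
            survivors.extend(rank_batch(pool[i:i + B])[:K])
        pool = survivors
    return rank_batch(pool)[:K]

# Toy stand-in: numeric sort as the "LLM" batch ranker.
rank_batch = lambda batch: sorted(batch, key=int)
candidates = [str(n) for n in range(300, 0, -1)]  # "300" down to "1"
print(ranking_via_election(candidates, rank_batch, B=20, K=5))
# → ['1', '2', '3', '4', '5']
```

For 300 candidates this makes 15 + 4 + 1 = 20 batch calls: the per-round call count shrinks geometrically, which is why total cost stays roughly linear in the retriever's output size.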

Caveats

  • Retriever recall is the ceiling. If the correct answer is filtered out in stage 1, the ranker cannot recover. Retriever recall must be measured and tracked as a first-class metric.
  • Data + format drift. SFT'd rankers and heuristic retrievers both need refresh as the underlying distribution shifts (new codebase structure, new change patterns, new incident types).
  • Cost asymmetry can be misleading. While the LLM is expensive per call, the number of LLM calls in ranking-via-election, summed across elimination rounds, is roughly ⌈N/(B−K)⌉ — linear in the retriever's output N (e.g. ~20 calls for N=300, B=20, K=5). Cost grows directly with what the retriever emits; keep retriever output tight.
  • Cascading failure modes. A bug in the retriever (e.g. a missing ownership rule) can systematically bias the candidate set in a way the ranker cannot detect. End-to-end evaluation against ground truth catches this; unit-testing each stage does not.
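Tracking retriever recall as a first-class metric needs only labeled (ground truth, retriever output) pairs from backtested cases. A minimal sketch (the `change_*` identifiers are hypothetical):

```python
from typing import List, Tuple

def retriever_recall(cases: List[Tuple[str, List[str]]]) -> float:
    """Fraction of labeled cases where the ground-truth answer survived
    stage 1 — the hard ceiling on end-to-end top-K accuracy."""
    hits = sum(1 for truth, retrieved in cases if truth in retrieved)
    return hits / len(cases)

# Hypothetical (ground_truth_change, retriever_output) pairs from backtests.
cases = [
    ("change_17", ["change_17", "change_3"]),
    ("change_42", ["change_9", "change_11"]),  # filtered out: unrecoverable
    ("change_8",  ["change_8"]),
]
print(round(retriever_recall(cases), 2))
# → 0.67
```

No ranker improvement can lift end-to-end accuracy above this number, so it deserves its own dashboard alongside top-K accuracy.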
