PATTERN Cited by 2 sources

Retrieve-then-rank with an LLM

Summary

A two-stage cascaded-inference pattern for applying LLMs to large candidate populations:

  1. Stage 1 — retriever. A cheap, high-recall primitive (heuristics, lexical, vector, or hybrid) narrows an intractably large population to a rank-tractable set — typically 10-1000 candidates.
  2. Stage 2 — LLM ranker. An LLM scores/orders the narrowed set and produces a ranked short-list (top-K) to present to the user or downstream system.

The asymmetric cost structure — retriever runs on every request, LLM runs only on a narrowed set — is what makes LLM-based ranking affordable at production request volume.
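The two-stage shape can be sketched as a minimal skeleton (the substring retriever and length-based ranker below are toy stand-ins for illustration, not real implementations):

```python
from typing import Callable, List

def retrieve_then_rank(
    query: str,
    population: List[str],
    retriever: Callable[[str, List[str]], List[str]],  # cheap, high-recall
    ranker: Callable[[str, List[str]], List[str]],     # expensive, high-precision
    k: int = 5,
) -> List[str]:
    """Two-stage cascade: narrow with a cheap retriever, order with an LLM ranker."""
    candidates = retriever(query, population)  # runs over the full population
    ranked = ranker(query, candidates)         # runs only over the narrowed set
    return ranked[:k]

# Toy stand-ins: substring match as the retriever, length as the "LLM" score.
retriever = lambda q, pop: [c for c in pop if q in c]
ranker = lambda q, cands: sorted(cands, key=len)

population = ["deploy api", "deploy web frontend", "docs update", "deploy cache"]
print(retrieve_then_rank("deploy", population, retriever, ranker, k=2))
# → ['deploy api', 'deploy cache']
```

The interface matters more than the internals: any retriever/ranker pair with these signatures drops into the same cascade.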

Canonical wiki reference

Meta's web-monorepo RCA system (2024-06; sources/2024-08-23-meta-leveraging-ai-for-efficient-incident-response) is the canonical instance:

  • Retriever. Heuristic retrieval via code + directory ownership and runtime code-graph exploration of impacted systems. Narrows "thousands of changes to a few hundred."
  • Ranker. Fine-tuned Llama 2 (7B) running ranking-via-election (B=20, K=5) to collapse few-hundred → top-5.
  • Outcome. 42% top-5 accuracy at investigation-creation time on backtested historical investigations.

Why cascade

Three structural reasons the two-stage split outperforms either stage alone:

  1. Cost. LLMs are 2-4 orders of magnitude more expensive per inference than retrievers. Running the LLM over every candidate is prohibitive; running it over a pre-narrowed set is affordable.
  2. Latency. Retrievers are millisecond; LLM calls are seconds. Narrowing before the LLM call keeps end-to-end latency acceptable.
  3. Quality decomposition. The retriever's recall is the ceiling on end-to-end accuracy: if the correct answer does not survive stage 1, no ranker can recover it. The ranker's precision then determines how cleanly the top-K isolates the correct answer within the narrowed set. Both must meet their respective bars independently.
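The cost argument is back-of-envelope arithmetic. A sketch with illustrative numbers (the per-candidate costs below are assumptions, not figures from the source):

```python
# Illustrative cost model (assumed numbers, not from the source):
population = 100_000    # candidates per request
narrowed = 200          # retriever output size
retriever_cost = 1e-6   # $ per candidate scored (heuristic/lexical)
llm_cost = 1e-3         # $ per candidate ranked (3 orders of magnitude more)

naive = population * llm_cost  # LLM over the entire population
cascade = population * retriever_cost + narrowed * llm_cost

print(f"naive:   ${naive:.2f}")    # → naive:   $100.00
print(f"cascade: ${cascade:.2f}")  # → cascade: $0.30
```

At these assumed rates the cascade is ~300x cheaper per request, which is the whole economic case for the pattern.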

The retriever choice space

  • Heuristic — domain rules (ownership, code graph, time windows). Fast, interpretable, limited to encoded knowledge. Meta RCA picks this.
  • Lexical (BM25) — term-frequency scoring. Fast, interpretable, limited to surface-level keyword match. Canonical for text search.
  • Vector + ANN — learned embeddings + approximate nearest-neighbour search. Handles semantic similarity; needs embedding infra.
  • Hybrid — combined lexical + vector. Industry default for document search.

Choice is domain-driven. Monorepo RCA has structured ownership + code graph (heuristics win); open-domain document search benefits from hybrid.
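For the hybrid option, reciprocal rank fusion (RRF) is a common way to combine lexical and vector rankings without tuning score scales. A minimal sketch (document names and the k=60 constant are conventional illustration, not from the source):

```python
from typing import Dict, List

def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each list contributes 1/(k + rank) per
    document, so agreement across retrievers beats a single high rank."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["doc_a", "doc_b", "doc_c"]  # e.g. BM25 order
vector  = ["doc_c", "doc a".replace(" ", "_"), "doc_d"]  # e.g. ANN order
vector  = ["doc_c", "doc_a", "doc_d"]
print(rrf([lexical, vector]))
# → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

doc_a wins here because it places well in both lists, which is exactly the behaviour that makes hybrid the document-search default.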

The ranker choice space

  • LLM natural-language top-K — prompt the LLM with N candidates, parse top-K from response. Meta RCA's primary path, via concepts/ranking-via-election for N > one-prompt capacity.
  • LLM logprobs. Score each candidate under a fixed prompt template; rank by logprob. Produces calibrated continuous scores for confidence thresholding. Meta RCA's secondary path via a dedicated SFT round.
  • Cross-encoder. Smaller Transformer scoring (query, candidate). Cheaper than LLM; less reasoning capacity. Canonical for document search reranking.
  • Pointwise classifier. Small domain-trained model outputs score per (query, candidate). Cheapest; weakest.
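The logprob path can be sketched generically: score each candidate under a fixed template, rank, and threshold for confidence. The `fake_logprob` stand-in below substitutes token overlap for a real LLM call, and the 0.25 threshold is an illustrative assumption:

```python
import math
from typing import Callable, List, Tuple

def rank_by_logprob(
    query: str,
    candidates: List[str],
    logprob_yes: Callable[[str, str], float],  # log P("yes" | template(query, cand))
    threshold: float = math.log(0.25),         # assumed confidence cutoff
) -> List[Tuple[str, float]]:
    """Pointwise ranking: score each candidate independently under a fixed
    prompt template, sort by score, drop low-confidence candidates."""
    scored = [(c, logprob_yes(query, c)) for c in candidates]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [(c, s) for c, s in scored if s >= threshold]

# Toy stand-in for an LLM call: token overlap as a fake "logprob".
def fake_logprob(query: str, cand: str) -> float:
    overlap = len(set(query.split()) & set(cand.split()))
    return math.log(max(overlap, 1e-6) / (len(cand.split()) or 1))

print(rank_by_logprob("cache outage", ["cache config change", "docs change"], fake_logprob))
```

The continuous scores are what distinguish this path from natural-language top-K: they support calibrated thresholding, so the system can abstain instead of surfacing a low-confidence candidate.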

How to decide top-K and batch size

  • K is driven by downstream consumption. Human reviewers can scan ~5-10 items before fatigue; automated downstreams may take the top-1.
  • Batch size (for ranking-via-election) is driven by ranker context window + reasoning quality per batch. Meta uses B=20; expect B=10-50 for modern LLMs depending on per-candidate context size.
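Ranking-via-election with these parameters is a batched elimination tournament. A sketch with Meta's B=20, K=5 (the numeric-sort "LLM" is a toy stand-in so the example is runnable):

```python
from typing import Callable, List

def ranking_via_election(
    candidates: List[str],
    rank_batch: Callable[[List[str]], List[str]],  # one LLM call: orders one batch
    B: int = 20,   # batch size per LLM call
    K: int = 5,    # winners advancing from each batch
) -> List[str]:
    """Tournament ranking: split into batches of B, keep each batch's
    top-K, repeat until one batch remains, then rank it for the final top-K."""
    pool = list(candidates)
    while len(pool) > B:
        survivors = []
        for i in range(0, len(pool), B):
            survivors.extend(rank_batch(pool[i:i + B])[:K])
        pool = survivors
    return rank_batch(pool)[:K]

# Toy stand-in: numeric sort as the "LLM" batch ranker.
rank_batch = lambda batch: sorted(batch, key=int)
candidates = [str(n) for n in range(300, 0, -1)]  # "300" down to "1"
print(ranking_via_election(candidates, rank_batch, B=20, K=5))
# → ['1', '2', '3', '4', '5']
```

For 300 candidates this makes 15 + 4 + 1 = 20 batch calls: the per-round call count shrinks geometrically, which is why total cost stays roughly linear in the retriever's output size.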

Caveats

  • Retriever recall is the ceiling. If the correct answer is filtered out in stage 1, the ranker cannot recover. Retriever recall must be measured and tracked as a first-class metric.
  • Data + format drift. SFT'd rankers and heuristic retrievers both need refresh as the underlying distribution shifts (new codebase structure, new change patterns, new incident types).
  • Cost asymmetry can be misleading. While the LLM is expensive per call, the number of LLM calls in ranking-via-election, summed across elimination rounds, is roughly ⌈N/(B−K)⌉ — linear in the retriever's output N (e.g. ~20 calls for N=300, B=20, K=5). Cost grows directly with what the retriever emits; keep retriever output tight.
  • Cascading failure modes. A bug in the retriever (e.g. a missing ownership rule) can systematically bias the candidate set in a way the ranker cannot detect. End-to-end evaluation against ground truth catches this; unit-testing each stage does not.
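Tracking retriever recall as a first-class metric needs only labeled (ground truth, retriever output) pairs from backtested cases. A minimal sketch (the `change_*` identifiers are hypothetical):

```python
from typing import List, Tuple

def retriever_recall(cases: List[Tuple[str, List[str]]]) -> float:
    """Fraction of labeled cases where the ground-truth answer survived
    stage 1 — the hard ceiling on end-to-end top-K accuracy."""
    hits = sum(1 for truth, retrieved in cases if truth in retrieved)
    return hits / len(cases)

# Hypothetical (ground_truth_change, retriever_output) pairs from backtests.
cases = [
    ("change_17", ["change_17", "change_3"]),
    ("change_42", ["change_9", "change_11"]),  # filtered out: unrecoverable
    ("change_8",  ["change_8"]),
]
print(round(retriever_recall(cases), 2))
# → 0.67
```

No ranker improvement can lift end-to-end accuracy above this number, so it deserves its own dashboard alongside top-K accuracy.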
