

LLM-based ranker

Definition

An LLM-based ranker is an architectural role in which a large language model scores or orders items from a finite candidate set rather than generating unconstrained text. The LLM sits downstream of a cheaper retriever (e.g. heuristic retrieval, lexical, vector), consuming the retriever's output and producing a ranked shortlist (top-K) of the most likely answers.

Two common output modes:

  1. Natural-language top-K. Prompt the LLM with N candidates and ask for the top-K by ID/name; parse the response.
  2. Logprob ranking. Pass each candidate (or a structured format listing all candidates) through the model using a consistent prompt template and read the token-level log-probabilities of the candidate identifier or the "is-the-root-cause" token sequence; rank by logprob.

Logprob ranking has the useful property of producing a continuous, calibratable signal (enabling confidence thresholding) rather than a brittle top-K string that must be parsed.
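A minimal sketch of the logprob mode with thresholding. Here `rank_by_logprob` and the `fake_logprobs` stub are illustrative assumptions; a real implementation would sum token-level logprobs returned by a model API under a fixed prompt template.

```python
import math

def rank_by_logprob(candidates, logprob_fn, threshold=None):
    """Rank candidate IDs by the log-probability a model assigns to
    each one under a consistent prompt template.

    logprob_fn(candidate_id) -> float stands in for a real model call
    that sums token-level logprobs of the candidate identifier."""
    scored = sorted(((logprob_fn(c), c) for c in candidates), reverse=True)
    if threshold is not None:
        # Confidence thresholding: keep only candidates whose implied
        # probability exp(logprob) clears the cutoff.
        scored = [(lp, c) for lp, c in scored if math.exp(lp) >= threshold]
    return [c for _, c in scored]

# Toy stand-in: pretend the model strongly prefers change "D123".
fake_logprobs = {"D123": -0.1, "D456": -2.3, "D789": -5.0}
ranking = rank_by_logprob(fake_logprobs, fake_logprobs.get, threshold=0.05)
# "D789" (implied probability ~0.007) falls below the 0.05 cutoff.
```

Because the scores are continuous, the same function supports both a ranked shortlist and an abstain path (return nothing when no candidate clears the threshold).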

Canonical wiki reference

Meta's web-monorepo RCA system (2024-06; sources/2024-08-23-meta-leveraging-ai-for-efficient-incident-response) is the canonical instance. A fine-tuned Llama 2 (7B) ranks candidate code changes for incident root-cause identification, achieving 42% top-5 accuracy at investigation-creation time. Both output modes are used: a natural-language top-5 produced via ranking-via-election, and a logprob-ranked list produced by a dedicated SFT round.

Why "ranker" and not "generator"

Three structural properties differentiate a ranker from a general-purpose LLM:

  1. Bounded output space. The ranker chooses from a known set of N candidates; it cannot hallucinate novel items. (Post-processing usually enforces this against a whitelist of candidate IDs regardless.)
  2. Calibrated scoring. Fine-tuning on a ranker-shaped dataset teaches the model to emit ordered lists with expected-root-cause at the start, enabling logprob scoring.
  3. Cheap recourse on failure. If the ranker misses (true root cause not in top-K), a human inspects the original candidate set — the retriever's output is still available. The ranker improves precision; it doesn't gate access to the data.
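The bounded-output property (point 1) is typically enforced in post-processing when using the natural-language output mode. A minimal sketch, where `parse_top_k`, the `D\d+` ID shape, and the sample reply are all illustrative assumptions:

```python
import re

def parse_top_k(response, allowed_ids, k):
    """Extract up to k candidate IDs from a natural-language top-K
    reply, keeping only IDs in the known candidate set and preserving
    the order in which the model listed them."""
    # Assumed ID shape for illustration: "D" followed by digits.
    seen, ranked = set(), []
    for cid in re.findall(r"D\d+", response):
        if cid in allowed_ids and cid not in seen:  # drop hallucinated and duplicate IDs
            seen.add(cid)
            ranked.append(cid)
    return ranked[:k]

reply = "Most likely root causes: 1. D123  2. D999  3. D456 (and D123 again)"
top = parse_top_k(reply, allowed_ids={"D123", "D456", "D789"}, k=5)
# D999 is not in the candidate set and is dropped; the repeated D123 is deduplicated.
```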
Contrast with adjacent patterns

  • LLM-as-judge — LLM evaluates a single candidate against a rubric (binary accept/reject or score). Judge answers "is X acceptable?"; ranker answers "which of these N is best?"
  • Cross-encoder reranking — smaller Transformer encoder scores (query, candidate) pairs. More efficient than LLM ranking for large candidate sets; less able to reason about semantic/code relationships without task-specific training.
  • Full generative RAG — LLM synthesises an answer over retrieved context. Ranker is narrower: output shape is a list of candidate IDs, not prose.

Pairing with retrieval

An LLM ranker is almost always downstream of a cheaper first stage — the retrieve-then-rank-LLM pattern. The retriever's job is to cut the population from intractable (all code changes in a monorepo, millions of documents) to ranker-tractable (a few hundred at most) without dropping the correct answer. The ranker's job is to produce the top-K the responder/user will actually see.
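The two-stage shape can be sketched as a pipeline with pluggable stages. The `toy_retrieve` and `toy_rank` functions below are deliberately trivial stand-ins (lexical overlap and length, respectively), not real retriever or ranker implementations:

```python
def retrieve_then_rank(query, corpus, retrieve, rank, n=200, k=5):
    """Two-stage pipeline: a cheap retriever cuts the corpus to at
    most n candidates, then an LLM ranker orders them and returns the
    top k. retrieve(query, corpus, n) and rank(query, candidates) are
    placeholders for real implementations."""
    candidates = retrieve(query, corpus, n)
    return rank(query, candidates)[:k]

# Toy stages: lexical-overlap retriever, length-as-score "ranker".
def toy_retrieve(query, corpus, n):
    hits = [d for d in corpus if any(w in d for w in query.split())]
    return hits[:n]

def toy_rank(query, candidates):
    return sorted(candidates, key=len, reverse=True)

docs = ["fix cache bug", "update readme", "cache eviction rewrite"]
shortlist = retrieve_then_rank("cache", docs, toy_retrieve, toy_rank, k=2)
```

The division of labour mirrors the text: the retriever bounds the candidate count without dropping the answer; the ranker determines what the responder actually sees.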

Caveats

  • Retriever recall is load-bearing. Ranker accuracy is conditional on the correct answer surviving retrieval. A retriever with poor recall caps the end-to-end accuracy the ranker can deliver.
  • Context-window pressure drives prompt structure. When N exceeds what fits in one prompt, tournament-style ranking-via-election is one answer; logprob-per-candidate is another; sliding-window or chunked ranking are others. The choice interacts with latency and cost.
  • Self-consistency and position bias are real. LLMs can be biased toward items listed first, or produce different top-K on re-run. Mitigations include temperature=0, candidate shuffling, or averaging across multiple rank passes.
  • Ranker-SFT data is specialised. Teaching a model to rank requires examples of (candidates, correct answer, correct order) — Meta's RCA SFT set was ~5,000 instruction-tuning examples assembled from historical investigations with known root causes.
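One possible shape for tournament-style ranking-via-election, addressing both the context-window and position-bias caveats above. This is a sketch under assumptions, not Meta's actual mechanism: `rank_chunk`, the shuffle schedule, and the per-chunk survivor count are all hypothetical.

```python
import random

def ranking_via_election(candidates, rank_chunk, chunk_size=10, k=5, seed=0):
    """Tournament ranking for candidate sets too large for one prompt:
    shuffle (to counter position bias), rank each chunk with a single
    model call, advance each chunk's top k picks, and repeat until one
    chunk remains.

    rank_chunk(chunk) -> list ordered best-first stands in for one LLM
    ranking call over at most chunk_size candidates."""
    pool = list(candidates)
    rng = random.Random(seed)
    while len(pool) > chunk_size:
        rng.shuffle(pool)
        survivors = []
        for i in range(0, len(pool), chunk_size):
            survivors.extend(rank_chunk(pool[i:i + chunk_size])[:k])
        pool = survivors
    return rank_chunk(pool)[:k]

# Toy stand-in: "the model" simply prefers higher-numbered candidates.
winners = ranking_via_election(range(100), lambda c: sorted(c, reverse=True),
                               chunk_size=10, k=3)
```

Each round costs one model call per chunk, so the pool shrinks geometrically; shuffling before every round averages out any preference for items listed first.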
