CONCEPT

Ranking via election

Definition

Ranking via election is a tournament-style prompt-structure pattern for applying an LLM ranker to a candidate set larger than any single prompt can hold. Instead of ranking all N candidates in one pass, the population is split into batches of size B; each batch is ranked in its own prompt; the top-K survivors from each batch are aggregated; the process repeats until only the desired number of candidates remain.

Canonical wiki reference

Meta's web-monorepo RCA system (2024-06; sources/2024-08-23-meta-leveraging-ai-for-efficient-incident-response) uses ranking-via-election with B=20, K=5:

"We structure prompts to contain a maximum of 20 changes at a time, asking the LLM to identify the top five changes. The output across the LLM requests are aggregated and the process is repeated until we have only five candidates left."

Given a retriever output of a "few hundred" candidates, the election collapses to 5 in O(log N) rounds.

Why it exists

Two constraints drive the design:

  1. Context-window budget. A Llama 2 (7B) prompt holding a few hundred code changes (each with diff + metadata) would overflow the model's context window. Meta's B=20 lets each prompt fit comfortably within the ranker's usable context.
  2. Reasoning-quality degradation with N. Even if the context fits, LLM ranking quality over large lists is lower than ranking over small ones (attention dilutes, position bias grows). Small-N prompts produce cleaner ordering.

The three round shapes

round 1:  [20 cands] → top 5   (×  k  prompts in parallel)
          [20 cands] → top 5
          ...
round 2:  aggregate all top-5s (5k candidates)
          [20 cands] → top 5   (×  k/4  prompts)
          ...
round n:  5 candidates remain  → return

Rounds are deliberately shallow — at B=20/K=5, each round cuts the population by 4×. For a starting population of 320, 3 rounds suffice (320 → 80 → 20 → 5).
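The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not Meta's implementation: `rank_batch` stands in for one LLM ranking call, and the shuffle step is the position-bias mitigation discussed under Caveats.

```python
import random

def elect_top_k(candidates, rank_batch, batch_size=20, keep=5, rng=None):
    """Tournament-style election: rank candidates in batches of `batch_size`,
    keep the top `keep` from each batch, and repeat until at most `keep`
    candidates remain. `rank_batch` stands in for one LLM ranking call: it
    takes a list of at most `batch_size` candidates and returns them best-first.
    Requires keep < batch_size so each round strictly shrinks the pool."""
    rng = rng or random.Random(0)
    pool = list(candidates)
    while len(pool) > keep:
        rng.shuffle(pool)  # re-shuffle between rounds to mitigate position bias
        survivors = []
        for i in range(0, len(pool), batch_size):
            batch = pool[i:i + batch_size]
            survivors.extend(rank_batch(batch)[:keep])  # top-K of this batch
        pool = survivors
    return pool
```

With a consistent ranker, the global top-K always survive: a globally top-5 candidate can be outranked by at most four others in any batch it lands in, so it is always among that batch's top five.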

Trade-offs vs alternatives

  • vs pointwise scoring. Score each candidate independently; sort. Avoids position bias but loses cross-candidate reasoning ("X is a better fit than Y because their diffs overlap").
  • vs pairwise preference. Ask the LLM "is A better than B?" over all pairs; aggregate via a tournament-style algorithm. Quadratic in N; high precision; very expensive.
  • vs listwise in one prompt. Dump all N into one prompt and ask for top-K. Limited by context window + reasoning quality at large N.
  • vs hierarchical / logprob-based ranking. Score via a dedicated logprob-producing SFT format (Meta does this alongside election). More uniform calibration; requires a second fine-tuning round.

Ranking-via-election sits at a sweet spot: cross-candidate reasoning preserved, linear rather than quadratic work, each prompt bounded in size.

Caveats

  • Shuffle candidates between rounds. Position bias in LLMs makes candidates near the top of a list more likely to survive. Re-shuffling between rounds mitigates, but does not eliminate, this effect.
  • Recurse-floor effect. Candidates eliminated in round 1 never reappear; if a mistake happens early, it's permanent. Meta's retrieve-then-rank pipeline compensates with a retriever that already narrows to high-likelihood candidates before the election starts.
  • Cost multiplier. LLM calls form a geometric series: ⌈N/B⌉ prompts in round 1, shrinking by a factor of roughly K/B each round, for a total of about ⌈N/B⌉ × B/(B−K). At B=20, K=5, N=320 that's 16 + 4 + 1 = 21 ranker calls per investigation. Higher than a single-prompt approach; much cheaper than pairwise.
  • Tie-handling is underspecified. Meta does not disclose how ties are broken across prompts in the aggregation step, or how the logprob-ranked list integrates with the election output.
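The call-count arithmetic from the cost bullet can be checked with a short helper (an illustrative sketch; `election_calls` is not from the source):

```python
def election_calls(n, batch=20, keep=5):
    """Return (prompts per round, total LLM ranker calls) for an election
    starting from n candidates, stopping once at most `keep` remain."""
    per_round = []
    while n > keep:
        full, rem = divmod(n, batch)
        sizes = [batch] * full + ([rem] if rem else [])
        per_round.append(len(sizes))                  # one prompt per batch
        n = sum(min(keep, s) for s in sizes)          # each batch keeps top `keep`
    return per_round, sum(per_round)

# Meta-scale example: 320 retrieved candidates, B=20, K=5
rounds, total = election_calls(320)  # rounds == [16, 4, 1], total == 21
```

At these parameters the series shrinks 4× per round, so total calls stay within about a third above the first round's prompt count.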

Generalisation

Ranking-via-election generalises beyond RCA to any problem where:

  • The candidate population is larger than one prompt can hold.
  • An LLM's cross-candidate reasoning is valuable (not replaceable by pointwise scoring).
  • The output needs to be a ranked short-list, not a single answer.

Candidates include: code-review-comment prioritisation, bug-triage queues, test-flake clustering, log-anomaly triage, and legal/policy-review pipelines where deep document comparisons outperform pairwise similarity.
