
Mean Reciprocal Rank (MRR)

Mean Reciprocal Rank (MRR) is a statistical metric for evaluating any process that produces a ranked list of candidate responses to a query, where each query has a designated "correct" answer. It is the average of the reciprocals of the rank of the first correct answer across all queries in the test set:

MRR = (1/|Q|) * Σ_q (1 / rank_q)

where |Q| is the total query count and rank_q is the 1-based position of the first correct answer for query q (a query whose returned list contains no correct answer conventionally contributes 0). MRR = 1.0 means every query's first correct answer is at rank 1; MRR = 0.5 is consistent with, for example, every first correct answer sitting at rank 2, though many rank distributions yield the same average (it is the reciprocal that is averaged, not the rank).
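The definition above is a one-liner in code. A minimal sketch (the function name and toy data are illustrative, not from the source):

```python
def mean_reciprocal_rank(ranked_lists, correct_answers):
    """MRR over a query set.

    ranked_lists: one ranked candidate list per query.
    correct_answers: the designated correct answer per query.
    A query whose correct answer never appears contributes 0.
    """
    total = 0.0
    for candidates, answer in zip(ranked_lists, correct_answers):
        if answer in candidates:
            total += 1.0 / (candidates.index(answer) + 1)  # 1-based rank
    return total / len(ranked_lists)

# Three queries: correct answer at rank 1, at rank 2, and absent.
results = [["a", "b"], ["x", "y"], ["p", "q"]]
answers = ["a", "y", "z"]
print(mean_reciprocal_rank(results, answers))  # (1 + 0.5 + 0) / 3 = 0.5
```

Note that the 0.5 here comes from a mix of a perfect hit, a rank-2 hit, and a miss, illustrating why a given MRR does not pin down an "average rank".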

(Defined externally; the S2R post links to https://en.wikipedia.org/wiki/Mean_reciprocal_rank.)

Role in the S2R benchmark

In Google Research's Speech-to-Retrieval post (2025-10-07), MRR is the retrieval-quality axis of the core motivating experiment. Two systems are run over the same query set:

  • Cascade ASR — real-world voice-search pipeline (audio → real ASR → text retriever).
  • Cascade groundtruth — the ASR step is replaced by human transcription ("perfect ASR"), so the retriever's quality is bounded only by itself.

The difference in MRR between Cascade groundtruth and Cascade ASR is a direct measurement of the retrieval-quality cost of real-world ASR errors, and therefore a lower bound on the improvement S2R (which removes the cascade entirely) must clear to justify the architectural change (Source: sources/2025-10-07-google-speech-to-retrieval-s2r-voice-search).

The post correlates this per-language MRR gap with WER on the SVQ dataset to establish that ASR-quality variation drives most of the retrieval-quality variation across languages.
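The gap measurement itself is just an MRR difference over matched query sets. A sketch with hypothetical per-query ranks (the numbers are invented for illustration; `None` marks a query where no correct answer was returned):

```python
# First-correct-answer ranks for the same queries under both pipelines.
groundtruth_ranks = [1, 1, 2, 1]   # human transcription ("perfect ASR")
asr_ranks = [1, 3, None, 2]        # real ASR, with recognition errors

def mrr(ranks):
    # Queries with no correct answer returned (None) contribute 0.
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# The gap is the retrieval-quality cost attributable to ASR errors,
# i.e. the headroom a cascade-free system like S2R can try to reclaim.
gap = mrr(groundtruth_ranks) - mrr(asr_ranks)
print(round(gap, 3))
```

Computed per language, this gap is the quantity the post correlates with WER.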

Caveats of MRR

  • Only the first correct answer matters: a query where the correct answer is at rank 1 scores 1.0; at rank 2 scores 0.5 — a large drop. A query where three near-correct answers sit at ranks 1–3 and the exact correct answer at rank 4 scores 0.25. The metric is harsh on near-misses and indifferent to how many correct answers are returned overall (contrast with nDCG / precision@K).
  • Requires a unique designated correct answer: queries with multiple legitimate answers are either collapsed to one (losing signal) or scored against whichever of their acceptable answers appears first (rewarding retrievers that diverge from the rater's choice).
  • Binary per-query: each query either contributes a reciprocal rank (a correct answer appeared) or 0 (it did not); there is no partial credit for nearly-correct retrievals.
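The first and last caveats can be seen numerically. A toy sketch (rankings invented for illustration):

```python
def rr(ranking, answer):
    """Reciprocal rank of the first occurrence of the correct answer."""
    return 1.0 / (ranking.index(answer) + 1) if answer in ranking else 0.0

# Harsh on near-misses: rank 1 vs. rank 4 behind three near-correct hits.
print(rr(["exact", "b", "c", "d"], "exact"))              # 1.0
print(rr(["near1", "near2", "near3", "exact"], "exact"))  # 0.25

# Indifferent to additional correct answers: both lists score 0.5,
# even though the second returns the correct answer twice.
print(rr(["x", "exact"], "exact"))           # 0.5
print(rr(["x", "exact", "exact"], "exact"))  # 0.5
```

A graded metric such as nDCG would distinguish the last two cases; MRR by construction cannot.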

The S2R post's cross-check with human raters (who subjectively compare the two systems' returned documents against the true query) is the guard-rail against over-optimizing for MRR: if the MRR gap closes but raters still prefer the groundtruth system's results, MRR is not capturing the user-relevant signal.
