
Mean Reciprocal Rank (MRR)

Mean Reciprocal Rank (MRR) is a statistical metric for evaluating any process that produces a ranked list of candidate responses to a query, where each query has a designated "correct" answer. It is the average of the reciprocals of the rank of the first correct answer across all queries in the test set:

MRR = (1/|Q|) * Σ_q (1 / rank_q)

where |Q| is the total query count and rank_q is the 1-based position of the first correct answer for query q (a query whose returned list contains no correct answer conventionally contributes 0). MRR = 1.0 means every query's first correct answer is at rank 1; MRR = 0.5 is consistent with, for example, every first correct answer sitting at rank 2, though many rank distributions yield the same average (it is the reciprocal that is averaged, not the rank).
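The definition above is a one-liner in code. A minimal sketch (the function name and toy data are illustrative, not from the source):

```python
def mean_reciprocal_rank(ranked_lists, correct_answers):
    """MRR over a query set.

    ranked_lists: one ranked candidate list per query.
    correct_answers: the designated correct answer per query.
    A query whose correct answer never appears contributes 0.
    """
    total = 0.0
    for candidates, answer in zip(ranked_lists, correct_answers):
        if answer in candidates:
            total += 1.0 / (candidates.index(answer) + 1)  # 1-based rank
    return total / len(ranked_lists)

# Three queries: correct answer at rank 1, at rank 2, and absent.
results = [["a", "b"], ["x", "y"], ["p", "q"]]
answers = ["a", "y", "z"]
print(mean_reciprocal_rank(results, answers))  # (1 + 0.5 + 0) / 3 = 0.5
```

Note that the 0.5 here comes from a mix of a perfect hit, a rank-2 hit, and a miss, illustrating why a given MRR does not pin down an "average rank".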

(Defined externally; the S2R post links to https://en.wikipedia.org/wiki/Mean_reciprocal_rank.)

Role in the S2R benchmark

In Google Research's Speech-to-Retrieval post (2025-10-07), MRR is the retrieval-quality axis of the core motivating experiment. Two systems are run over the same query set:

  • Cascade ASR — real-world voice-search pipeline (audio → real ASR → text retriever).
  • Cascade groundtruth — the ASR step is replaced by human transcription ("perfect ASR"), so the retriever's quality is bounded only by itself.

The difference in MRR between Cascade groundtruth and Cascade ASR is a direct measurement of the retrieval-quality cost of real-world ASR errors, and therefore a lower bound on the improvement S2R (which removes the cascade entirely) must clear to justify the architectural change (Source: sources/2025-10-07-google-speech-to-retrieval-s2r-voice-search).

The post correlates this per-language MRR gap with WER on the SVQ dataset to establish that ASR-quality variation drives most of the retrieval-quality variation across languages.
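The gap measurement itself is just an MRR difference over matched query sets. A sketch with hypothetical per-query ranks (the numbers are invented for illustration; `None` marks a query where no correct answer was returned):

```python
# First-correct-answer ranks for the same queries under both pipelines.
groundtruth_ranks = [1, 1, 2, 1]   # human transcription ("perfect ASR")
asr_ranks = [1, 3, None, 2]        # real ASR, with recognition errors

def mrr(ranks):
    # Queries with no correct answer returned (None) contribute 0.
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# The gap is the retrieval-quality cost attributable to ASR errors,
# i.e. the headroom a cascade-free system like S2R can try to reclaim.
gap = mrr(groundtruth_ranks) - mrr(asr_ranks)
print(round(gap, 3))
```

Computed per language, this gap is the quantity the post correlates with WER.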

Caveats of MRR

  • Only the first correct answer matters: a query where the correct answer is at rank 1 scores 1.0; at rank 2 scores 0.5 — a large drop. A query where three near-correct answers sit at ranks 1–3 and the exact correct answer at rank 4 scores 0.25. The metric is harsh on near-misses and indifferent to how many correct answers are returned overall (contrast with nDCG / precision@K).
  • Requires a unique designated correct answer: queries with multiple legitimate answers are either collapsed to one (losing signal) or scored against whichever of their acceptable answers appears first (rewarding retrievers that diverge from the rater's choice).
  • Binary per-query: each query either contributes a reciprocal rank (a correct answer appeared) or 0 (it did not); there is no partial credit for nearly-correct retrievals.
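The first and last caveats can be seen numerically. A toy sketch (rankings invented for illustration):

```python
def rr(ranking, answer):
    """Reciprocal rank of the first occurrence of the correct answer."""
    return 1.0 / (ranking.index(answer) + 1) if answer in ranking else 0.0

# Harsh on near-misses: rank 1 vs. rank 4 behind three near-correct hits.
print(rr(["exact", "b", "c", "d"], "exact"))              # 1.0
print(rr(["near1", "near2", "near3", "exact"], "exact"))  # 0.25

# Indifferent to additional correct answers: both lists score 0.5,
# even though the second returns the correct answer twice.
print(rr(["x", "exact"], "exact"))           # 0.5
print(rr(["x", "exact", "exact"], "exact"))  # 0.5
```

A graded metric such as nDCG would distinguish the last two cases; MRR by construction cannot.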

The S2R post's cross-check with human raters (who subjectively compare the two systems' returned documents against the true query) is the guard-rail against over-optimizing for MRR: if the MRR gap closes but raters still prefer the groundtruth system's results, MRR is not capturing the user-relevant signal.
