CONCEPT Cited by 1 source

Word Error Rate (WER)

Word Error Rate (WER) is the canonical metric for automatic speech recognition quality:

WER = (S + D + I) / N

where S, D, and I are the numbers of substitutions, deletions, and insertions in a minimum-edit alignment of the ASR hypothesis against the reference transcript, and N is the total number of words in the reference. Lower is better; a perfect transcript has WER = 0, and because insertions are counted, WER can exceed 1. It is the word-level analogue of edit distance (Levenshtein distance), normalised by reference length.
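The definition above can be sketched as a word-level edit distance divided by the reference length. This is a minimal illustration, assuming whitespace tokenisation and exact string matching with no text normalisation; the function name `wer` is chosen here for illustration and is not taken from the S2R post.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()   # assumption: plain whitespace tokenisation
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,               # deletion: reference word missing
                curr[j - 1] + 1,           # insertion: extra hypothesis word
                prev[j - 1] + (r != h),    # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


print(wer("play the scream painting", "play the screen painting"))  # 0.25
print(wer("a b", "a b c d e"))                                      # 1.5
```

The second call shows that three inserted words against a two-word reference give a WER above 1. Real evaluation pipelines usually normalise case and punctuation before scoring, which this sketch deliberately leaves out.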

(Defined externally; the S2R post links to https://en.wikipedia.org/wiki/Word_error_rate.)

Role in the S2R benchmark

In Google Research's Speech-to-Retrieval post (2025-10-07), WER is the ASR-quality axis of the core motivating chart: per language in the SVQ dataset, WER is plotted against the MRR gap between real Cascade ASR and a human-transcribed "Cascade groundtruth" oracle. The post uses this correlation to show that ASR errors directly degrade retrieval quality, which is the motivation for collapsing the cascade entirely with S2R (Source: sources/2025-10-07-google-speech-to-retrieval-s2r-voice-search).

Note: the raw capture confirms that WER is used as the ASR-side axis but does not report the numeric values per language (the chart is referenced but the numbers live in the unscraped body).

Caveats of WER

  • Word-count-sensitive: short queries amplify each error; one wrong word in a two-word query is already a WER of 50%.
  • Semantic-insensitive: a substitution of a function word costs the same as a substitution of a content word, though content-word mistakes hurt retrieval far more.
  • Doesn't capture homophone-level errors systematically: "flower" vs "flour" may or may not be penalised depending on the reference's tokenisation conventions.
  • Doesn't translate linearly to downstream metrics: a 10% WER reduction doesn't imply a 10% improvement in retrieval MRR; the relationship depends on which words were misrecognised and how central they were to the user's intent.
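The first two caveats can be made concrete with a toy calculation. The queries below are invented for illustration and are not drawn from the SVQ dataset. The helper `substitution_wer` is a hypothetical simplification: it counts position-wise mismatches, which matches WER only when the hypothesis has the same length as the reference and every error is a substitution at an aligned position.

```python
def substitution_wer(reference: str, hypothesis: str) -> float:
    """Position-wise mismatch rate; equals WER only for aligned substitutions."""
    ref, hyp = reference.split(), hypothesis.split()
    if len(ref) != len(hyp):
        raise ValueError("this simplified helper needs equal-length inputs")
    return sum(r != h for r, h in zip(ref, hyp)) / len(ref)


# Word-count sensitivity: the same single mistake, two query lengths.
short = substitution_wer("weather paris", "whether paris")
long_ = substitution_wer(
    "what is the weather in paris this weekend",
    "what is the whether in paris this weekend",
)
print(short, long_)  # 0.5 0.125

# Semantic insensitivity: a function-word error and a content-word error
# receive the same score.
reference = "pictures of the scream painting"
function_word_error = substitution_wer(reference, "pictures of a scream painting")
content_word_error = substitution_wer(reference, "pictures of the screen painting")
print(function_word_error, content_word_error)  # 0.2 0.2
```

Both hypotheses in the second comparison score 0.2, yet only the second one changes what the user is asking for, which is the gap a retrieval-side metric is meant to expose.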

The S2R post's pairing of WER with MRR (and with human rater judgement) is specifically designed to bridge WER's locality-to-the-transcript view with retrieval quality's locality-to-the-user-goal view.

Seen in
