Speech recognition (ASR)¶
Automatic Speech Recognition (ASR) is the ML primitive that converts an audio signal of human speech into a text transcript. Traditionally, it is the first stage of voice-driven applications: audio → ASR → text → downstream task (retrieval, command dispatch, translation, dictation display, etc.).
Role as a cascade-boundary component¶
In voice search, ASR has historically been the upstream stage of a two-stage cascade: audio → ASR → text query → text retrieval → results.
The Google Research 2025-10-07 S2R post frames this exact shape as the Cascade ASR architecture, and identifies two structural problems created by putting ASR at this cascade boundary:
- Information loss: ASR's output — a single text string — is strictly less expressive than its input (audio with prosody, emphasis, speaker acoustics, homophone-disambiguating context).
- Error propagation: ASR's mistakes are locked in downstream, because the text retriever never gets to re-examine the audio (Source: sources/2025-10-07-google-speech-to-retrieval-s2r-voice-search).
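Both failure modes follow directly from the cascade shape. A minimal sketch of that shape, with all component names (`cascade`, `fake_asr`, `fake_retriever`) as illustrative placeholders rather than anything from the source:

```python
from typing import Callable, List

def cascade(asr: Callable[[bytes], str],
            downstream: Callable[[str], List[str]]) -> Callable[[bytes], List[str]]:
    """Compose ASR with a text-only downstream stage.

    The downstream stage sees only the transcript string: prosody, emphasis,
    and speaker acoustics are discarded at the boundary (information loss),
    and any ASR mistake is committed before retrieval runs (error propagation).
    """
    def run(audio: bytes) -> List[str]:
        transcript = asr(audio)        # audio → text: the lossy boundary
        return downstream(transcript)  # text → results: cannot revisit audio
    return run

# Toy stand-ins to show the shape only.
fake_asr = lambda audio: "weather in paris"
fake_retriever = lambda query: [f"result for: {query}"]

voice_search = cascade(fake_asr, fake_retriever)
print(voice_search(b"..."))  # ['result for: weather in paris']
```

Note that `run` never passes `audio` to `downstream`; that omission is exactly the cascade-boundary problem the S2R post describes.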
Word Error Rate (WER) is the canonical metric of ASR quality. The S2R post uses WER as the ASR-side axis and correlates it against retrieval MRR (Mean Reciprocal Rank) to quantify the cost of imperfect ASR in voice-search terms.
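WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal self-contained sketch (function name is illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a four-word reference → WER = 0.25,
# even though the homophone-level error may flip the retrieval result entirely.
print(wer("flights to nice france", "flights to niece france"))  # 0.25
```

The homophone example hints at why the S2R post correlates WER against MRR: a single substitution costs 0.25 WER regardless of whether it changes the query's meaning.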
Not-in-raw¶
This wiki page describes ASR only in terms of its role at the voice-search cascade boundary — the framing supplied by the S2R post. ASR architectures themselves (end-to-end encoder-decoders like Whisper, RNN-T streaming models, hybrid HMM-DNN systems, etc.), training methodology, and deployment patterns are outside the scope of the current source and not elaborated here.
Seen in¶
- sources/2025-10-07-google-speech-to-retrieval-s2r-voice-search — ASR framed as the upstream stage of the Cascade ASR voice-search architecture Google's S2R is designed to replace.