Speech recognition (ASR)¶
Automatic Speech Recognition (ASR) is the ML primitive that converts an audio signal of human speech into a text transcript. Traditionally, it is the first stage of voice-driven applications: audio → ASR → text → downstream task (retrieval, command dispatch, translation, dictation display, etc.).
Role as a cascade-boundary component¶
In voice search, ASR has historically been the upstream stage of a two-stage cascade: audio → ASR → text query → text retrieval → results.
The Google Research 2025-10-07 S2R post frames this exact shape as the Cascade ASR architecture, and identifies two structural problems created by putting ASR at this cascade boundary:
- Information loss: ASR's output — a single text string — is strictly less expressive than its input (audio with prosody, emphasis, speaker acoustics, homophone-disambiguating context).
- Error propagation: ASR's mistakes are locked in downstream, because the text retriever never gets to re-examine the audio (Source: sources/2025-10-07-google-speech-to-retrieval-s2r-voice-search).
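Both failure modes follow directly from the cascade shape. A minimal sketch of that shape, with all component names (`cascade`, `fake_asr`, `fake_retriever`) as illustrative placeholders rather than anything from the source:

```python
from typing import Callable, List

def cascade(asr: Callable[[bytes], str],
            downstream: Callable[[str], List[str]]) -> Callable[[bytes], List[str]]:
    """Compose ASR with a text-only downstream stage.

    The downstream stage sees only the transcript string: prosody, emphasis,
    and speaker acoustics are discarded at the boundary (information loss),
    and any ASR mistake is committed before retrieval runs (error propagation).
    """
    def run(audio: bytes) -> List[str]:
        transcript = asr(audio)        # audio → text: the lossy boundary
        return downstream(transcript)  # text → results: cannot revisit audio
    return run

# Toy stand-ins to show the shape only.
fake_asr = lambda audio: "weather in paris"
fake_retriever = lambda query: [f"result for: {query}"]

voice_search = cascade(fake_asr, fake_retriever)
print(voice_search(b"..."))  # ['result for: weather in paris']
```

Note that `run` never passes `audio` to `downstream`; that omission is exactly the cascade-boundary problem the S2R post describes.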
Word Error Rate (WER) is the canonical metric of ASR quality. The S2R post uses WER as the ASR-side axis and correlates it against retrieval MRR (Mean Reciprocal Rank) to quantify the cost of imperfect ASR in voice-search terms.
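WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal self-contained sketch (function name is illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a four-word reference → WER = 0.25,
# even though the homophone-level error may flip the retrieval result entirely.
print(wer("flights to nice france", "flights to niece france"))  # 0.25
```

The homophone example hints at why the S2R post correlates WER against MRR: a single substitution costs 0.25 WER regardless of whether it changes the query's meaning.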
Not-in-raw¶
This wiki page describes ASR only in terms of its role at the voice-search cascade boundary — the framing supplied by the S2R post. ASR architectures themselves (end-to-end encoder-decoders like Whisper, RNN-T streaming models, hybrid HMM-DNN systems, etc.), training methodology, and deployment patterns are outside the scope of the current source and not elaborated here.
Seen in¶
- sources/2025-10-07-google-speech-to-retrieval-s2r-voice-search — ASR framed as the upstream stage of the Cascade ASR voice-search architecture Google's S2R is designed to replace.