SYSTEM Cited by 1 source

SVQ (Simulated Voice Queries) dataset

SVQ (Simulated Voice Queries) is a multilingual voice-search evaluation dataset produced by Google Research, used as the evaluation substrate for the 2025-10-07 S2R benchmark (Source: sources/2025-10-07-google-speech-to-retrieval-s2r-voice-search).

The raw capture confirms:

  • Multilingual coverage: SVQ spans "some of the most commonly used voice search languages."
  • Human-transcribed ground truth: queries in SVQ are manually transcribed by human annotators, providing the "perfect ASR" reference for the benchmark's ground-truth cascade leg.
  • Representative of voice-search traffic: described explicitly as "a representative set of test queries reflecting typical voice search traffic."

The per-language chart of WER (word error rate) against MRR (mean reciprocal rank) in the S2R post is the headline experimental artefact derived from SVQ. The raw markdown does not include the per-language numeric values.
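For readers unfamiliar with the two metrics on that chart, the following is a minimal sketch of their standard textbook definitions. It is not taken from the S2R post or from any SVQ tooling. The function names, the example query, and the document IDs are hypothetical, and whitespace tokenisation is an assumption that would not suit every language.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def mean_reciprocal_rank(
    ranked_results: Sequence[Sequence[str]],
    relevant: Sequence[str],
) -> float:
    """Average of 1/rank of the relevant document; a miss contributes 0."""
    if not relevant:
        raise ValueError("need at least one query")
    total = 0.0
    for results, target in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id == target:
                total += 1.0 / rank
                break
    return total / len(relevant)


# Hypothetical data: a human transcript versus an ASR output with one wrong word.
wer = word_error_rate("the scream painting", "the screen painting")

# Hypothetical retrieval results for three queries; the third one misses.
mrr = mean_reciprocal_rank(
    [["doc_a", "doc_b"], ["doc_c", "doc_d"], ["doc_e"]],
    ["doc_a", "doc_d", "doc_x"],
)
```

In a per-language comparison of this kind, WER would be computed against the human transcripts and MRR against the retrieval results, each averaged within a language.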

What's not specified in the raw capture

  • SVQ's total query count.
  • Per-language query counts.
  • Licensing / availability (internal-only vs public).
  • Audio source (synthetic TTS vs real user recordings vs both; "simulated" in the name is suggestive but not specified).
  • Any comparison to prior voice-search benchmarks (e.g. Voice Search LM datasets, SLURP).

Resolving these would require reading the full post body or any linked dataset or paper.

Seen in
