CONCEPT Cited by 1 source

Query preprocessing (tokenization, normalization, rewriting)

Definition

Query preprocessing is the pre-retrieval pipeline stage that transforms a raw user query into a clean form suitable for downstream retrieval. Three canonical sub-steps:

  • Tokenization — split the query string into terms (whitespace / BPE / WordPiece / language-specific).
  • Normalization — case-folding, Unicode normalization, stemming/lemmatization, stop-word handling, accent folding.
  • Query rewriting — spelling correction, synonym expansion, abbreviation expansion, intent rewrites (e.g. "how do I..." → core content), query reformulation.
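The three sub-steps can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the `STOP_WORDS` and `REWRITES` tables are hypothetical placeholders, the tokenizer is a simple word-character split rather than BPE or WordPiece, and normalization runs before tokenization here purely so that accented input splits cleanly (the order is a design choice).

```python
import re
import unicodedata

# Hypothetical, illustrative tables; a real system would load these from data.
STOP_WORDS = {"the", "a", "an", "of"}
REWRITES = {"nyc": "new york city", "recieve": "receive"}


def normalize(text: str) -> str:
    """Apply Unicode normalization, case-folding, and accent folding."""
    text = unicodedata.normalize("NFKC", text).casefold()
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks so that "cafés" becomes "cafes".
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list[str]:
    """Split into runs of word characters (a whitespace-style tokenizer)."""
    return re.findall(r"\w+", text)


def rewrite(tokens: list[str]) -> list[str]:
    """Expand abbreviations and fix known misspellings via a lookup table."""
    out: list[str] = []
    for tok in tokens:
        out.extend(REWRITES.get(tok, tok).split())
    return out


def preprocess(raw: str) -> list[str]:
    """Run normalization, tokenization, rewriting, then stop-word removal."""
    tokens = rewrite(tokenize(normalize(raw)))
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("  Cafés in NYC ")` yields `["cafes", "in", "new", "york", "city"]`: the accent and case are folded, and the abbreviation is expanded into three terms.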

The shared-stage insight

The traditional framing treats query preprocessing as a lexical-retrieval concern only — "clean the query so BM25 has better terms to match." In modern hybrid retrieval architectures, query preprocessing is a shared upstream stage feeding both the lexical and the dense-semantic arms:

  • The inverted-index arm needs normalized terms to match against its index.
  • The dense-semantic encoder also consumes the preprocessed form — the encoder's tokenizer operates on the normalized string, and consistent preprocessing across training and serving is a precondition for retrieval quality.
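A rough sketch of the shared-stage idea follows: the query is preprocessed once, and the same term list is handed to both arms. Everything here is a stand-in chosen for illustration. `preprocess` is a deliberately minimal case-fold-and-split, the inverted index is an in-memory dict, and `toy_encode` is a hashed bag-of-words vector standing in for a real dense encoder; none of these names come from the Meta post.

```python
import hashlib
from collections import defaultdict


def preprocess(raw: str) -> list[str]:
    """Minimal stand-in for the shared stage: case-fold, split on whitespace."""
    return raw.casefold().split()


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Index documents with the same preprocessing used for queries."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index


def lexical_candidates(index: dict[str, set[str]], terms: list[str]) -> set[str]:
    """Lexical arm: union of posting lists for the preprocessed terms."""
    hits: set[str] = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits


def toy_encode(terms: list[str], dim: int = 8) -> list[float]:
    """Toy stand-in for a dense encoder: hashed bag-of-words counts."""
    vec = [0.0] * dim
    for term in terms:
        digest = hashlib.sha256(term.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    return vec


def hybrid_query(raw: str, index: dict[str, set[str]]):
    """Preprocess once, then feed the identical terms to both arms."""
    terms = preprocess(raw)
    return lexical_candidates(index, terms), toy_encode(terms)
```

Because `hybrid_query` computes `terms` a single time, the two arms cannot disagree about what the query was; any change to preprocessing reaches both at once.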

Canonical statement from the 2026-04-21 Meta Engineering post:

"Before retrieval, user queries undergo tokenization, normalization, and rewriting. This is important for ensuring clean inputs for both the inverted index and the embedding model."

The phrase "for both" carries the weight here. Skimping on preprocessing because "the embedding model will figure it out" is a common anti-pattern; Meta explicitly places preprocessing as the first shared stage of the hybrid pipeline, ahead of both retrieval arms.

Implications

  • Training/serving parity: if the encoder was trained on one tokenization/normalization regime, the serving pipeline must match it. Preprocessing drift across training and serving silently degrades retrieval quality.
  • Consistent input across arms: feeding one preprocessed form to the inverted index and a different one to the encoder means the two arms are effectively answering different queries, which makes their results harder to fuse and compare.
  • Rewriting as a retrieval feature: query rewriting (spelling correction, synonym expansion) changes which documents the same raw query retrieves in both arms, so it should be tuned and evaluated as a relevance lever rather than treated as plumbing.
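One way to guard against preprocessing drift is to fingerprint the preprocessing configuration at training time and verify it when the serving pipeline starts. The sketch below assumes the regime can be described by a JSON-serializable dict; the config keys shown in the usage note are hypothetical, and this is one possible check rather than anything described in the source.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hash a canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_parity(training_config: dict, serving_config: dict) -> None:
    """Raise if serving preprocessing differs from what the encoder saw."""
    if config_fingerprint(training_config) == config_fingerprint(serving_config):
        return
    keys = set(training_config) | set(serving_config)
    changed = sorted(
        k for k in keys if training_config.get(k) != serving_config.get(k)
    )
    raise ValueError(f"preprocessing drift in: {changed}")
```

A training job might store `config_fingerprint({"unicode_form": "NFKC", "casefold": True})` next to the model artifact, with the serving process calling `assert_parity` at startup so that a mismatch fails loudly instead of silently degrading retrieval quality.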
