
CONCEPT

Sparse lexical retrieval

Definition

Sparse lexical retrieval is the class of information-retrieval approaches that match a query to documents by exact or closely-matched terms stored in an inverted index, producing a sparse feature vector per document. Canonical scoring functions include TF-IDF and BM25.

The vector is "sparse" because the dimensionality is the vocabulary size and most documents score zero on most terms.
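The inverted index and BM25 scoring can be sketched in a few lines. This is a toy illustration, not any production system's implementation; the corpus, parameter defaults (k1=1.5, b=0.75), and function names are illustrative assumptions.

```python
import math
from collections import Counter

def bm25_index(docs, k1=1.5, b=0.75):
    """Build a toy inverted index over tokenized docs and return a BM25 scorer.

    docs: list of token lists. The returned score(query_tokens) maps doc id ->
    BM25 score; documents sharing no term with the query are simply absent,
    which is exactly the sparsity property described above.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Inverted index: term -> {doc_id: term frequency}
    index = {}
    for doc_id, doc in enumerate(docs):
        for term, tf in Counter(doc).items():
            index.setdefault(term, {})[doc_id] = tf

    def score(query):
        scores = {}
        for term in query:
            postings = index.get(term, {})
            # Standard smoothed IDF; rare terms weigh more
            idf = math.log(1 + (N - len(postings) + 0.5) / (len(postings) + 0.5))
            for doc_id, tf in postings.items():
                dl = len(docs[doc_id])
                # BM25 term contribution with length normalization
                scores[doc_id] = scores.get(doc_id, 0.0) + idf * (
                    tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
                )
        return scores

    return score

docs = [
    "the cappuccino recipe uses espresso and milk".split(),
    "group search at scale".split(),
]
score = bm25_index(docs)
print(score(["cappuccino"]))  # only doc 0 appears; doc 1 is omitted (sparse)
```

Note that only terms present in the query touch any postings, so scoring cost scales with posting-list lengths rather than corpus size.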

Role in modern hybrid retrieval

In hybrid-retrieval architectures (cf. concepts/hybrid-retrieval-bm25-vectors), sparse lexical retrieval is the parallel counterpart to dense semantic retrieval:

  • Strengths: high precision on proper nouns, specific quotes, acronyms, exact term matches; interpretable; cheap to index; cheap to score.
  • Weaknesses: misses paraphrase, synonym, and cross-vocabulary matches (a query for "Italian coffee drink" never matches a post that only says "cappuccino", since no query term appears in the post).

These weaknesses are the forcing function for hybrid retrieval — not a condemnation of lexical retrieval. Pairing sparse lexical with dense semantic lets each cover the other's weakness.

Meta Groups Scoped Search instance

The 2026-04-21 Meta Engineering post names Unicorn — Facebook's in-house inverted-index system since the 2013 Graph Search era — as the sparse-lexical-retrieval arm of the modernized Groups scoped search pipeline:

"We utilize Facebook's Unicorn inverted index to fetch posts containing exact or closely matched terms. This ensures high precision for queries involving proper nouns or specific quotes."

The lexical arm is not deprecated by the move to hybrid — it runs in parallel with the dense SSR + Faiss arm, and its outputs contribute TF-IDF and BM25 features to the L2 MTML ranker.
