
BM25

Definition

BM25 (Best Matching 25, Okapi BM25) is the standard lexical retrieval scoring algorithm — a probabilistic extension of TF-IDF with saturating term frequency and document-length normalization. Given a query and a document, BM25 scores the document by how informative the query terms (treated as a bag of words) are in that document relative to the corpus.
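The saturation and length-normalization behavior can be sketched in a few lines. This is a minimal, self-contained implementation of the textbook Okapi BM25 formula with the common Lucene-style IDF smoothing; the function name and toy corpus are illustrative, not from any particular engine:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc in `docs` against a tokenized `query`."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each distinct query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    # IDF with +1 inside the log so it stays non-negative for common terms.
    idf = {t: math.log(1 + (N - n + 0.5) / (n + 0.5)) for t, n in df.items()}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in set(query):
            f = tf[t]
            # Saturating TF (k1) plus length normalization (b).
            s += idf[t] * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Because the term-frequency factor saturates, the tenth occurrence of a term adds far less than the first — which is the key behavioral difference from raw TF-IDF.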

It is the default ranking function in Elasticsearch, OpenSearch, Apache Lucene, and Apache Solr, and is widely treated as the canonical lexical baseline for IR.

Why it still matters in the agent era

A lot of modern RAG discussion frames BM25 as a legacy fallback once dense-vector retrieval is in place. Production systems tend to disagree.

Dropbox Dash explicitly keeps BM25 as the primary lexical surface alongside dense vectors in a hybrid index, and frames its role in unambiguous terms:

"Today we use both a lexical index—using BM25—and then store everything as dense vectors in a vector store. While this allows us to do hybrid retrieval, we found BM25 was very effective on its own with some relevant signals. It's an amazing workhorse for building out an index."

— Josh Clemm, Dropbox Dash (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash)

Strengths

  • Exact-term matching. Acronyms, proper nouns, and IDs are cases where embedding similarity can drift; BM25 matches them literally.
  • No training. Parameters (k1, b) are defaults in most engines; no embedding model to maintain / re-train / re-index against.
  • Fast, memory-efficient. Inverted indexes are decades of well-optimized engineering.
  • Explainable. Per-term contributions are visible; ranking decisions are auditable.
  • No embedding drift. Upgrading an embedding model forces a full re-index of the corpus; BM25 has no model to upgrade.

Limitations

  • Paraphrase / synonym blind spot. "Bought a bicycle" vs "purchased a bike" — BM25 sees different tokens. Dense vectors cover this case.
  • Static term weighting. Can't learn per-user or per-query relevance.
  • Stop-word + stemming fragility. Different languages / tokenizers yield different results; tuning per-language is real work.

Typical production role

See concepts/hybrid-retrieval-bm25-vectors for the full discussion. Summary:

  • BM25 + dense vector retrieval, both querying in parallel.
  • Score fusion (reciprocal rank fusion, weighted sum, or a learned re-ranker).
  • BM25 carries exact-term / acronym queries; vectors carry paraphrase.
  • Knowledge-graph signals (in Dash's case) layered on top as a ranking input, not a replacement.
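Of the fusion options above, reciprocal rank fusion is the simplest to sketch: each retriever contributes 1/(k + rank) per document, so a document ranked well by both BM25 and the vector index floats to the top. A minimal version, with `k=60` as the commonly used default (the helper name and toy IDs are illustrative):

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked doc-ID lists from multiple retrievers via reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1/(k + rank); k damps the head of each list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]   # lexical ranking
dense_hits = ["d3", "d1", "d4"]  # vector ranking
fused = rrf_fuse([bm25_hits, dense_hits])
```

RRF needs no score calibration between the two retrievers — only ranks — which is why it is a common first choice before reaching for a learned re-ranker.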

Seen in

  • sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash — Dropbox Dash running BM25 as the lexical index paired with dense vectors; "amazing workhorse" framing.
  • sources/2026-04-16-cloudflare-ai-search-the-search-primitive-for-your-agents — Cloudflare AI Search promotes BM25 to a first-class config knob on managed hybrid-search instances: index_method: { keyword: true, vector: true }, indexing_options.keyword_tokenizer: "porter" | "trigram" (Porter for natural-language docs, trigram for code / partial matches), retrieval_options.keyword_match_mode: "and" | "or". Cloudflare's framing: "BM25 scores documents by how often your query terms appear, how rare those terms are across the entire corpus, and how long the document is. It rewards matches on specific terms, penalizes common filler words, and normalizes for document length." The "ERR_CONNECTION_REFUSED timeout" worked example is the 2026 canonical illustration of why BM25 survives alongside dense vectors in production retrieval stacks — vector search loses the specific error string; BM25 finds it.
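Pieced together from only the knobs named in the Cloudflare entry above, a hybrid-search instance config might look roughly like this; the surrounding nesting is an assumption, not Cloudflare's documented schema:

```json
{
  "index_method": { "keyword": true, "vector": true },
  "indexing_options": {
    "keyword_tokenizer": "trigram"
  },
  "retrieval_options": {
    "keyword_match_mode": "or"
  }
}
```

Per the source's guidance, "porter" would suit natural-language docs, while "trigram" suits code and partial matches like the ERR_CONNECTION_REFUSED example.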