Sparse Vector¶
Definition¶
A sparse vector is a vector representation where most components are zero, stored and queried via a coordinate format ({index: value} pairs) rather than a dense array. In information retrieval, sparse vectors typically have vocabulary-sized dimensionality (tens of thousands to millions) with only a handful of non-zero components per document — one component per distinct term that appears, weighted by term frequency / importance.
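The coordinate-format idea can be sketched in a few lines (a minimal illustration with made-up indices, not any particular database's API):

```python
# The same document as a dense vocabulary-sized array vs. a coordinate-format
# sparse vector. Indices and weights here are invented for illustration.
vocab_size = 50_000
dense = [0.0] * vocab_size
dense[102] = 1.2    # weight for one distinct term
dense[7_431] = 0.8  # weight for another

# Coordinate format: only the non-zero components, as {index: value} pairs.
sparse = {102: 1.2, 7_431: 0.8}

assert all(dense[i] == v for i, v in sparse.items())
assert len(sparse) == 2  # storage is O(non-zeros), not O(vocab_size)
```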
This is the structural complement of a dense vector (a vector embedding, such as a 1024-dim Titan embedding), in which nearly every component is non-zero and the dimensionality is small and fixed.
Why sparse vectors exist in vector databases¶
MongoDB's 2025-09-30 post names sparse vectors as the bridging primitive vector-first platforms use to add lexical capabilities without a second index type:
"Vector-first search platforms faced the challenge of adding lexical search. Implementing lexical search through traditional inverted indexes was often too costly due to storage differences, increased query complexity, and architectural overhead. Many adopted sparse vectors, which represent keyword importance in a way similar to traditional term-frequency methods used in lexical search. Sparse vectors were key for vector-first databases in enabling a fast integration of lexical capabilities without overhauling the core architecture."
The key architectural insight: if a database already has an ANN index type for dense vectors, it can store sparse vectors using the same index infrastructure (with appropriate sparse-friendly similarity) and expose them via the same API — sidestepping the need to ship inverted indexes as a new index type.
Two kinds of sparse vectors¶
Term-frequency sparse vectors (BM25-like)¶
The simplest kind: one dimension per vocabulary term, component value = term's importance in the document (TF-IDF, BM25, or similar). Produced deterministically from the text without a learned model. This is essentially an in-vector representation of an inverted-index posting, and cosine similarity over these vectors approximates BM25 scoring.
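A minimal sketch of this kind of sparse vector, using raw term frequency as a stand-in for a BM25/TF-IDF weight (the vocabulary, tokenizer, and weighting here are simplified assumptions):

```python
import math
from collections import Counter

def tf_sparse(text, vocab):
    """Term-frequency sparse vector: one dimension per vocabulary term.
    Raw TF stands in here for a BM25/TF-IDF importance weight."""
    counts = Counter(text.lower().split())
    return {vocab[t]: float(c) for t, c in counts.items() if t in vocab}

def sparse_cosine(a, b):
    """Cosine similarity computed only over shared non-zero indices."""
    dot = sum(v * b[i] for i, v in a.items() if i in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vocab = {"sparse": 0, "vector": 1, "index": 2, "dense": 3}
doc = tf_sparse("sparse vector sparse index", vocab)
query = tf_sparse("sparse vector", vocab)
print(sparse_cosine(doc, query))  # only overlapping terms contribute
```

Note that scoring touches only the indices the two vectors share, which is what makes this representation posting-list-friendly.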
Learned sparse embeddings (SPLADE-family)¶
Produced by a trained model (e.g. SPLADE, ELSER) that reads the text and emits a sparse vector where non-zero components correspond to expanded terms — including synonyms, related terms, and concept tokens not literally in the source text. Captures some paraphrase robustness while preserving the sparse-vector structure. Sits conceptually between BM25 (purely lexical) and dense embeddings (purely semantic).
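To illustrate the *shape* of the output only (the expansion terms and weights below are invented, not real SPLADE/ELSER activations):

```python
# Illustrative only: what distinguishes a learned sparse embedding from a
# TF-based one. All terms and weights here are made up for the example.
text = "how to fix a flat tire"

# A TF-based sparse vector activates only the literal terms...
tf_terms = {"fix": 1.0, "flat": 1.0, "tire": 1.0}

# ...while a learned sparse model also activates expansion terms it predicts
# are relevant, still in the same {term: weight} coordinate format.
learned_terms = {
    "fix": 1.4, "flat": 1.1, "tire": 1.6,
    "repair": 0.9, "puncture": 0.7, "wheel": 0.5,  # expansions
}

expansions = set(learned_terms) - set(tf_terms)
print(sorted(expansions))  # terms not literally in the source text
```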
Properties vs dense vectors¶
| Property | Sparse vector | Dense vector |
|---|---|---|
| Dimensionality | Vocabulary size (10K–1M) | Small and fixed (384–4096) |
| Non-zero components | ~dozens per document | Nearly all |
| Storage | O(non-zeros) with coordinate format | O(dim) × float32 |
| Lookup structure | Posting-list-friendly | ANN graph (HNSW) / IVF |
| Exact-term match | Strong (the term is literally a dimension) | Weak (semantic approximation) |
| Paraphrase / synonym match | Weak (unless learned sparse) | Strong |
| Explainability | Per-term contributions visible | Opaque |
| Training cost | Zero for TF-based; moderate for learned sparse | Embedding-model training |
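The storage row can be made concrete with back-of-envelope arithmetic (the sizes below are illustrative assumptions, not measurements from any system):

```python
# Storage comparison per the table: O(non-zeros) coordinate format for
# sparse, O(dim) x float32 for dense. All sizes are illustrative.
vocab_size = 100_000   # sparse dimensionality (vocabulary size)
nnz = 50               # non-zero terms in one document
dense_dim = 1024       # e.g. a 1024-dim dense embedding

# Coordinate format: one (int32 index, float32 value) pair per non-zero.
sparse_bytes = nnz * (4 + 4)
dense_bytes = dense_dim * 4
naive_sparse_as_dense = vocab_size * 4  # storing the zeros too

print(sparse_bytes, dense_bytes, naive_sparse_as_dense)
# → 400 4096 400000
```

The third number is why sparse vectors are stored in coordinate format rather than as vocabulary-sized dense arrays.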
Trade-offs vs traditional BM25-on-inverted-index¶
Advantages of sparse vectors in a vector DB:
- Single index type + single query API for both lexical and semantic retrieval.
- Natural fit for hybrid-search fusion (sparse and dense retrieved through the same infrastructure).
- Learned sparse embeddings (SPLADE / ELSER) can outperform BM25 on paraphrase queries while preserving the inverted-index-like structure.
Disadvantages (MongoDB's positioning angle):
- The vector-DB sparse-vector index may be less mature than a dedicated BM25 inverted-index implementation (tokenization, stemming, stop-words, language-specific normalization, query-time expansion).
- Advanced lexical features (phrase queries, proximity, boost fields, per-field weighting) are typically stronger in dedicated lexical engines like Lucene.
- MongoDB argues: "if the lexical search requirements are advanced, commonly the optimal solution is served with a traditional lexical search solution coupled with vector search" — i.e. lexical-first + vector, not vector-first + sparse-vector-lexical.
Where sparse vectors are used¶
- Pinecone supports sparse-dense hybrid queries via sparse vectors.
- Weaviate offers native first-class BM25 over an inverted index, plus learned sparse retrieval via ingestion-time SPLADE models.
- Qdrant supports sparse vectors in hybrid mode.
- Milvus supports both dense and sparse vectors natively.
- Elasticsearch's ELSER emits learned sparse vectors for its native hybrid search (unusual case of a lexical-first platform adopting the sparse-vector approach for a second retrieval modality on top of its own BM25).
- MongoDB Atlas uses inverted-index-backed BM25 via Atlas Search rather than sparse vectors — consistent with its lexical-first architectural stance.
Relation to hybrid search¶
Sparse vectors are a specific implementation choice for the lexical half of hybrid retrieval. Vendors have different architectural origins:
- Lexical-first (MongoDB, Elasticsearch, OpenSearch, Solr): BM25 on inverted index + dense vectors in a second index; fuse results.
- Vector-first (Pinecone, Weaviate, Milvus, Qdrant): sparse vectors + dense vectors, both in the vector index; fuse results within one infrastructure.
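Either way, the two legs must be fused into one ranking. Reciprocal rank fusion (RRF) is one common method; the sketch below assumes RRF with the conventional k=60 constant, since the post itself does not prescribe a fusion algorithm:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge ranked result lists from the sparse
    (lexical) leg and the dense (semantic) leg into one ranking."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from each retrieval leg.
sparse_hits = ["d3", "d1", "d7"]  # lexical / sparse-vector leg
dense_hits = ["d1", "d9", "d3"]   # dense-embedding leg
print(rrf([sparse_hits, dense_hits]))
# → ['d1', 'd3', 'd9', 'd7']
```

Documents ranked well by both legs (here `d1` and `d3`) rise to the top, regardless of which index type produced each leg.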
MongoDB's framing of the distinction comes from its 2025-09-30 buyer's-guide post; the wiki treats it as one lens on the hybrid-search design space, not a universal win for either approach.
Seen in¶
- sources/2025-09-30-mongodb-top-considerations-when-choosing-a-hybrid-search-solution — MongoDB names sparse vectors as "key for vector-first databases in enabling a fast integration of lexical capabilities without overhauling the core architecture", positioning them as a bridging primitive with less-mature keyword capabilities than traditional inverted-index BM25.
Related¶
- concepts/vector-embedding — the dense counterpart; same "vector" word, very different data shape.
- concepts/hybrid-retrieval-bm25-vectors — the retrieval-pipeline pattern sparse vectors participate in.
- concepts/vector-similarity-search — the operation sparse vectors are queried by (with similarity functions adapted to sparseness).
- systems/bm25 — the inverted-index lexical primitive sparse vectors functionally approximate; MongoDB's positioning is that bespoke BM25 wins on advanced lexical features vs sparse-vector lexical.
- systems/atlas-hybrid-search — MongoDB's counter-example — lexical-first via native BM25 (Atlas Search), not sparse vectors.