
CONCEPT

Sparse Vector

Definition

A sparse vector is a vector representation where most components are zero, stored and queried via a coordinate format ({index: value} pairs) rather than a dense array. In information retrieval, sparse vectors typically have vocabulary-sized dimensionality (tens of thousands to millions) with only a handful of non-zero components per document — one component per distinct term that appears, weighted by term frequency / importance.
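The coordinate format described above can be sketched minimally as a dict of `{index: value}` pairs, with similarity computed only over the shared non-zero indices (the vocabulary indices and weights below are illustrative, not from any real model):

```python
def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product of two coordinate-format sparse vectors.

    Iterates over the smaller vector and looks up matching indices in
    the larger one, so cost is O(min(nnz(a), nnz(b))), independent of
    the nominal (vocabulary-sized) dimensionality.
    """
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

# A document touches only a handful of the vocabulary's dimensions:
doc = {1042: 1.7, 88231: 0.9, 530007: 2.3}   # 3 non-zeros out of ~1M dims
query = {1042: 1.0, 99999: 1.0}

score = sparse_dot(doc, query)  # only the shared dimension 1042 contributes
```

Storing only the non-zero coordinates is what makes vocabulary-sized dimensionality practical in the first place.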

This is the structural complement of a dense vector (a vector embedding such as a 1024-dim Titan embedding), in which nearly every component is non-zero and the dimensionality is small and fixed.

Why sparse vectors exist in vector databases

MongoDB's 2025-09-30 post names sparse vectors as the bridging primitive vector-first platforms use to add lexical capabilities without a second index type:

"Vector-first search platforms faced the challenge of adding lexical search. Implementing lexical search through traditional inverted indexes was often too costly due to storage differences, increased query complexity, and architectural overhead. Many adopted sparse vectors, which represent keyword importance in a way similar to traditional term-frequency methods used in lexical search. Sparse vectors were key for vector-first databases in enabling a fast integration of lexical capabilities without overhauling the core architecture."

— (MongoDB, 2025-09-30)

The key architectural insight: if a database already has an ANN index type for dense vectors, it can store sparse vectors using the same index infrastructure (with appropriate sparse-friendly similarity) and expose them via the same API — sidestepping the need to ship inverted indexes as a new index type.

Two kinds of sparse vectors

Term-frequency sparse vectors (BM25-like)

The simplest kind: one dimension per vocabulary term, component value = the term's importance in the document (TF-IDF, BM25, or similar). Produced deterministically from the text without a learned model. This is essentially an in-vector representation of an inverted-index posting, and a dot product (or cosine similarity) between query and document vectors reproduces BM25-style scoring.
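A hedged sketch of the deterministic pipeline: build TF-IDF sparse vectors over a toy corpus (one dimension per distinct term, smoothed IDF as the weighting — a stand-in for whichever TF-IDF/BM25 variant an engine actually uses) and rank by dot product, the in-vector analogue of an inverted-index scan:

```python
import math
from collections import Counter

corpus = [
    "sparse vectors store term weights",
    "dense vectors capture semantics",
    "hybrid search fuses sparse and dense results",
]

# Vocabulary: one dimension per distinct term across the corpus.
vocab = {t: i for i, t in enumerate(sorted({w for d in corpus for w in d.split()}))}

def idf(term: str) -> float:
    """Smoothed inverse document frequency."""
    df = sum(term in d.split() for d in corpus)
    return math.log((len(corpus) + 1) / (df + 1)) + 1

def to_sparse(text: str) -> dict[int, float]:
    """Deterministic TF-IDF weighting -- no learned model involved."""
    tf = Counter(text.split())
    return {vocab[t]: n * idf(t) for t, n in tf.items() if t in vocab}

def score(q: dict[int, float], d: dict[int, float]) -> float:
    return sum(w * d[i] for i, w in q.items() if i in d)

docs = [to_sparse(d) for d in corpus]
q = to_sparse("sparse vectors")
ranked = sorted(range(len(docs)), key=lambda i: score(q, docs[i]), reverse=True)
# Doc 0 ranks first: it is the only one containing both query terms.
```

A production engine adds tokenization, stemming, and length normalization on top of this skeleton, but the scoring reduction to a sparse dot product is the same.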

Learned sparse embeddings (SPLADE-family)

Produced by a trained model (e.g. SPLADE, ELSER) that reads the text and emits a sparse vector where non-zero components correspond to expanded terms — including synonyms, related terms, and concept tokens not literally in the source text. Captures some paraphrase robustness while preserving the sparse-vector structure. Sits conceptually between BM25 (purely lexical) and dense embeddings (purely semantic).
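To illustrate the output shape only: the toy expansion table below is an invented stand-in for a learned expander — real SPLADE/ELSER models derive these terms and weights from a transformer, not a lookup table — but it shows how non-zero components can include terms absent from the source text:

```python
# ILLUSTRATIVE ONLY: hand-written expansion weights standing in for a
# learned model's output. Real learned-sparse models (SPLADE, ELSER)
# produce these term->weight maps from a trained transformer.
EXPANSIONS = {
    "car": {"car": 2.1, "vehicle": 1.4, "automobile": 1.2},
    "cheap": {"cheap": 1.9, "inexpensive": 1.1, "affordable": 0.8},
}

def expand(text: str) -> dict[str, float]:
    """Emit a term-keyed sparse vector with expansion terms included."""
    out: dict[str, float] = {}
    for term in text.split():
        for t, w in EXPANSIONS.get(term, {term: 1.0}).items():
            out[t] = out.get(t, 0.0) + w
    return out

vec = expand("cheap car")
# Non-zero components now include "vehicle" and "inexpensive" -- terms
# not literally present in the input, which is what buys paraphrase
# robustness while keeping the sparse structure.
```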

Properties vs dense vectors

| Property | Sparse vector | Dense vector |
|---|---|---|
| Dimensionality | Vocabulary size (10K–1M) | Fixed small (384–4096) |
| Non-zero components | ~dozens per document | Nearly all |
| Storage | O(non-zeros) with coordinate format | O(dim) × float32 |
| Lookup structure | Posting-list-friendly | ANN graph (HNSW) / IVF |
| Exact-term match | Strong (the term is literally a dimension) | Weak (semantic approximation) |
| Paraphrase / synonym match | Weak (unless learned sparse) | Strong |
| Explainability | Per-term contributions visible | Opaque |
| Training cost | Zero for TF-based; moderate for learned sparse | Embedding-model training |
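The storage row is worth making concrete. A back-of-envelope comparison, assuming float32 values and int32 indices for the coordinate format (the non-zero count is an illustrative "~dozens" figure):

```python
DENSE_DIM = 1024          # e.g. a 1024-dim Titan-class embedding
SPARSE_NNZ = 40           # ~dozens of non-zero terms per document

dense_bytes = DENSE_DIM * 4             # one float32 per dimension
sparse_bytes = SPARSE_NNZ * (4 + 4)     # one (int32 index, float32 value) pair per non-zero

print(dense_bytes, sparse_bytes)        # 4096 vs 320 bytes per document
```

Despite nominal dimensionality in the hundreds of thousands, the sparse representation is an order of magnitude smaller per document than the dense one.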

Trade-offs vs traditional BM25-on-inverted-index

Advantages of sparse vectors in a vector DB:

  • Single index type + single query API for both lexical and semantic retrieval.
  • Natural fit for hybrid-search fusion (sparse and dense retrieved through the same infrastructure).
  • Learned sparse embeddings (SPLADE / ELSER) can outperform BM25 on paraphrase queries while preserving the inverted-index-like structure.
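The hybrid-fusion point above is commonly implemented with reciprocal rank fusion (RRF). A minimal sketch, combining a sparse (lexical) ranking with a dense (semantic) ranking; `k = 60` is the value typically used in the RRF literature, and the doc IDs are illustrative:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1/(k + rank) per doc."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_hits = ["d3", "d1", "d7"]   # e.g. BM25-like sparse-vector results
dense_hits = ["d1", "d9", "d3"]    # e.g. ANN results over dense embeddings

fused = rrf([sparse_hits, dense_hits])
# Docs appearing in both lists ("d1", "d3") rise above single-list hits.
```

RRF needs only ranks, not comparable scores, which is why it is a popular fusion choice when the sparse and dense scorers use incompatible scales.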

Disadvantages (MongoDB's positioning angle):

  • The vector-DB sparse-vector index may be less mature than a dedicated BM25 inverted-index implementation (tokenization, stemming, stop-words, language-specific normalization, query-time expansion).
  • Advanced lexical features (phrase queries, proximity, boost fields, per-field weighting) are typically stronger in dedicated lexical engines like Lucene.
  • MongoDB argues: "if the lexical search requirements are advanced, commonly the optimal solution is served with a traditional lexical search solution coupled with vector search" — i.e. lexical-first + vector, not vector-first + sparse-vector-lexical.

Where sparse vectors are used

  • Pinecone supports sparse-dense hybrid queries via sparse vectors.
  • Weaviate offers native, first-class BM25 over an inverted index, plus learned sparse vectors via ingestion-time SPLADE models.
  • Qdrant supports sparse vectors in hybrid mode.
  • Milvus supports both dense and sparse vectors natively.
  • Elasticsearch's ELSER emits learned sparse vectors for its native hybrid search (unusual case of a lexical-first platform adopting the sparse-vector approach for a second retrieval modality on top of its own BM25).
  • MongoDB Atlas uses inverted-index-backed BM25 via Atlas Search rather than sparse vectors — consistent with its lexical-first architectural stance.

Sparse vectors are a specific implementation choice for the lexical half of hybrid retrieval. Vendors have different architectural origins:

  • Lexical-first (MongoDB, Elasticsearch, OpenSearch, Solr): BM25 on inverted index + dense vectors in a second index; fuse results.
  • Vector-first (Pinecone, Weaviate, Milvus, Qdrant): sparse vectors + dense vectors, both in the vector index; fuse results within one infrastructure.

MongoDB's framing of the distinction comes from its 2025-09-30 buyer's-guide post; the wiki treats it as one lens on the hybrid-search design space, not a universal win for either approach.
