

Voyage AI

Voyage AI is MongoDB's embedding-and-reranking model line. Founded by Stanford's Tengyu Ma and team, it was acquired by MongoDB in 2025 to form the native embedding-generation and reranking layer for Atlas Vector Search.

Products

  • Embedding models — the voyage-3 family (systems/voyage-3-large, voyage-3, voyage-3-lite, voyage-3-xl, plus domain-specialised voyage-law / voyage-finance / voyage-code / voyage-multilingual), served as a hosted embedding API on the Voyage platform and integrated into MongoDB Atlas.
  • Reranking models — cross-encoder rerankers (rerank-2, rerank-2-lite and successors) designed to sit on top of hybrid-search first-stage retrieval.
  • Query-vs-document-aware serving — explicit distinction between query embeddings (short, latency-sensitive, 100–300 ms SLO) and document embeddings (long, batch-ingested) drives per-class serving optimisations.
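The query/document split above can be sketched as two serving classes with different batching and latency parameters. This is a hypothetical illustration of the idea, not Voyage AI's API: the class names, the document-side numbers, and `pick_class` are assumptions; only the 100–300 ms query SLO and the ~600-token query batch budget come from the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServingClass:
    name: str
    max_batch_tokens: int  # token budget per GPU batch
    max_wait_ms: int       # how long the batcher may hold a request
    latency_slo_ms: int    # end-to-end target

# Query embeddings: short inputs, tight SLO -> tiny wait window, small batches.
QUERY = ServingClass("query", max_batch_tokens=600, max_wait_ms=5,
                     latency_slo_ms=300)

# Document embeddings: long, batch-ingested inputs -> throughput over latency
# (the document-side numbers here are illustrative assumptions).
DOCUMENT = ServingClass("document", max_batch_tokens=8192, max_wait_ms=500,
                        latency_slo_ms=60_000)

def pick_class(is_query: bool) -> ServingClass:
    """Route a request to the serving class matching its latency profile."""
    return QUERY if is_query else DOCUMENT
```

The point of the split is that each class can be tuned independently: the batcher never holds a query long enough to threaten its SLO, while document ingestion happily trades latency for full GPU batches.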

Properties relevant to system design

  • Embedding-inference serving stack — the 2025-12-18 engineering blog post documents the production stack for the query side: vLLM with padding removal as the inference engine, Redis + Lua atomic script as the batch-claim queue (patterns/atomic-conditional-batch-claim), token-count-based batching to the model-and-hardware-specific saturation point (~600 tokens for voyage-3 on A100).
  • Non-durable queue + 503 fallback — the post concedes that "the probability of Redis losing data is very low. In the rare case that it does happen, users may receive 503 Service Unavailable errors and can simply retry." Clients must therefore be idempotent.
  • Gradual model onboarding — 7+ models migrated off the legacy Hugging Face Inference no-batching pipeline onto vLLM + token-count batching over time.
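The token-count batch claim can be sketched in plain Python. Here an in-memory deque stands in for the Redis list, and atomicity is trivial because everything happens in one function; in the production pattern an atomic Lua script gives the same all-or-nothing claim semantics. The 600-token budget is the saturation point the post reports for voyage-3 on an A100; the request shape is an assumption.

```python
from collections import deque

def claim_batch(queue: deque, max_tokens: int = 600) -> list:
    """Claim requests from the head of the queue until adding the next one
    would exceed the token budget (the GPU saturation point). Always take at
    least one request so an oversized input still makes progress."""
    batch, total = [], 0
    while queue:
        req_tokens = queue[0]["tokens"]
        if batch and total + req_tokens > max_tokens:
            break
        batch.append(queue.popleft())
        total += req_tokens
    return batch

q = deque([
    {"id": "a", "tokens": 250},
    {"id": "b", "tokens": 300},
    {"id": "c", "tokens": 200},  # would push the batch past 600 tokens
])
first = claim_batch(q)  # claims a + b (550 tokens); c stays queued
```

Batching by token count rather than request count is what lets the server fill each GPU batch to its saturation point regardless of how input lengths are distributed.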
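The client side of the non-durable-queue contract is a retry loop. A minimal sketch, assuming a generic callable API client: `ServiceUnavailable`, `embed_with_retry`, and `flaky_embed` are all hypothetical names, not Voyage AI's SDK. The retry is safe only because embedding is idempotent: re-sending the same text yields the same vector, so a request dropped from the non-durable Redis queue can simply be replayed.

```python
import time

class ServiceUnavailable(Exception):
    """Stands in for an HTTP 503 from the embedding API."""

def embed_with_retry(call, retries: int = 3, backoff_s: float = 0.01):
    """Call `call()`, retrying on 503 with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return call()
        except ServiceUnavailable:
            if attempt == retries:
                raise  # budget exhausted; surface the 503 to the caller
            time.sleep(backoff_s * (2 ** attempt))

# Simulated server that loses the first two requests (queue data loss),
# then recovers.
attempts = {"n": 0}
def flaky_embed():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise ServiceUnavailable()
    return [0.1, 0.2, 0.3]

vector = embed_with_retry(flaky_embed)
```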

Reported production numbers

From the 2025-12-18 post, for the query side of embedding serving:

  • 50 % GPU-inference-latency reduction (voyage-3-large vs old pipeline).
  • 3× fewer GPUs for the same workload.
  • Across 7+ models onboarded:
    • Up to ~20 ms GPU-inference-time drop via vLLM + padding removal.
    • Up to 8× throughput improvement via token-count batching.
    • P90 end-to-end latency drops by 60+ ms on some model servers under contention.
    • P90 more stable during traffic spikes, even with fewer GPUs.

Disclaimer: the numbers reflect Voyage AI's specific new-vs-old pipeline comparison and are, per the post, "not necessarily generalisable".

Integration with MongoDB Atlas

Following the 2025 acquisition, Voyage AI embeddings + rerankers are pitched as the native embedding + reranking layer for Atlas Vector Search + Atlas Hybrid Search. The 2025-09-30 hybrid-search post names "cross-encoders, learning-to-rank, and dynamic scoring profiles" as the emerging re-ranking layer above hybrid retrieval — an implicit pointer at the Voyage AI integration direction. The 2025-09-25 From Niche NoSQL to Enterprise Powerhouse post describes Voyage AI as "embedding-generation-as-a-service" inside MongoDB's unified developer experience.
