voyage-3 / voyage-3-large¶
The voyage-3 family is Voyage AI's third-generation general-purpose embedding-model line, introduced in Voyage's 2025-01-07 voyage-3-large announcement and positioned as the default embedding tier for retrieval, search, and recommendation workloads on Atlas post-acquisition. The family includes voyage-3, voyage-3-lite, voyage-3-xl, and voyage-3-large, with the large variant targeted at retrieval-quality-at-scale use cases.
Properties relevant to system design¶
- Serving regime — transformer-encoder inference; the query-side workload is memory-bound on common GPUs (the workload class that Voyage AI's 2025-12-18 post builds its serving infrastructure around).
- Saturation point on an NVIDIA A100: "~600 tokens". Composing batches around this point (via token-count-based batching on vLLM with padding removal) maximises MFU without adding latency.
- Headline serving result — voyage-3-large query serving on the new token-count-batched + vLLM pipeline vs the old no-batching + HF Inference pipeline: 50 % GPU-inference-latency reduction, 3× fewer GPUs.
- Query-latency SLO: typically 100–300 ms per request.
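The batching idea the properties above describe can be sketched as a greedy packer: group tokenized requests into batches whose total token count stays at or under the saturation-point budget, so each batch saturates the GPU without padding everything to a max sequence length. This is a minimal illustration, not Voyage AI's actual implementation; the function name and structure are assumptions, and only the ~600-token figure comes from the post.

```python
from typing import List

SATURATION_TOKENS = 600  # saturation point reported for voyage-3 on A100

def batch_by_token_count(requests: List[List[int]],
                         budget: int = SATURATION_TOKENS) -> List[List[List[int]]]:
    """Greedily pack tokenized requests into batches whose summed token
    count stays within the saturation-point budget (hypothetical helper,
    for illustration only)."""
    batches: List[List[List[int]]] = []
    current: List[List[int]] = []
    used = 0
    for tokens in requests:
        n = len(tokens)
        # Close the current batch if adding this request would exceed the budget.
        if current and used + n > budget:
            batches.append(current)
            current, used = [], 0
        current.append(tokens)
        used += n
    if current:
        batches.append(current)
    return batches
```

Compared with fixed-size batching, the batch boundary here is set by token volume rather than request count, which is what keeps arithmetic intensity near the saturation elbow regardless of how long individual queries are.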
Nuance between voyage-3 and voyage-3-large in the 2025-12-18 post¶
Both names appear in the post, and the distinction is worth flagging:
- The saturation-point profiling number ("~600 tokens") is attributed to "our voyage-3 model running on A100".
- The headline 50 % / 3×-GPU result is from "a production experiment on the Voyage-3-Large model serving".
So the saturation-point elbow is stated for voyage-3; the headline outcome is for voyage-3-large. The post doesn't further distinguish the batch-size analysis per variant.
Seen in¶
- 2025-12-18 Voyage AI / MongoDB — Token-count-based batching — canonical wiki reference for both variants in the context of serving-infra design. Sole MongoDB engineering post so far with model-specific saturation-point / GPU / latency numbers. (sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference)
Related¶
- systems/voyage-ai — the platform / company / product line.
- systems/vllm — production inference engine.
- concepts/saturation-point-inference, concepts/token-count-based-batching, concepts/memory-bound-vs-compute-bound.