
MONGODB 2025-12-18 Tier 2


MongoDB (Voyage AI) — Token-count-based Batching: Faster, Cheaper Embedding Inference for Queries

Summary

Voyage AI by MongoDB describes the production embedding-inference pipeline it runs for query embeddings — the short, latency-sensitive side of retrieval workloads — and the two compounding primitives that let it cut GPU inference latency by 50 % while using 3× fewer GPUs: padding removal (variable-length attention over a concatenated super-sequence instead of padded (B, S) tensors, supported in vLLM and SGLang) plus token-count-based batching (atomically claiming requests up to a target Σ token_count ≤ optimal_batch_size, rather than by request count or time window).

The post treats queries and documents as two different serving problems: query token-length distributions are short and highly skewed, query inference is memory-bound (far from the GPU's compute saturation point), and query traffic is spiky — autoscaling responds too slowly to smooth the spikes, so serving short requests sequentially wastes GPU.

Profiling voyage-3 on A100 yields a saturation point of ~600 tokens: latency is approximately flat below 600 tokens (per-request fixed costs dominate — scheduling, memory movement, pooling / normalisation) and approximately linear above it (model FLOPs saturate the compute units). Batching up to that point maximises model FLOPs utilisation (MFU) and throughput in the same forward pass.

The queue-design half of the post identifies peek + atomic-claim-up-to-budget as the required primitive and notes it is satisfied by neither RabbitMQ (prefetch is request-count-based, push model, no peek) nor Kafka (batches by bytes / messages within a partition; token count is unknown without tokenising). It articulates two practical paths: a lightweight aggregator sitting in front of Kafka / RabbitMQ that consumes into a token-count batcher before dispatching to model servers, or a store that natively supports fast peek + conditional batching — Redis with an atomic Lua script that pops items until the total-token budget is reached, setting per-item TTLs in the same call.

Voyage AI picked Redis-with-Lua. Headline production numbers from gradually onboarding 7+ models off the old pipeline (no batching + Hugging Face Inference) onto the new one (token-count-batched + vLLM): up to ~20 ms GPU-inference-time reduction per model from vLLM + padding removal alone; up to 8× throughput improvement from token-count batching; P90 end-to-end latency down 60+ ms on some model servers as queueing time shortens under contention; and P90 more stable during traffic spikes even with fewer GPUs. The post explicitly disclaims that "these results are based on our specific implementations of the 'new' and 'old' pipelines, and are not necessarily generalisable" — the claim is about the architecture shape, not the specific magnitudes.
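The padding-removal claim in the summary (latency scales with T = Σ token_count_i instead of B × S_max) can be made concrete with a small sketch. This is illustrative only, not Voyage AI code; the cu_seqlens offsets mirror the cumulative-offset interface that variable-length-attention kernels in engines such as vLLM and SGLang use to keep each sequence attending only to its own tokens, and the function name is an assumption.

```python
# Illustrative arithmetic: compare the token work implied by a padded
# (B, S_max) batch with the packed super-sequence of length
# T = sum(token_count_i), and build the cumulative offsets (cu_seqlens)
# that delimit each sequence inside the super-sequence.

def padded_vs_packed(lengths: list[int]) -> dict:
    """Token counts processed under padded vs padding-removed batching."""
    b, s_max = len(lengths), max(lengths)
    cu_seqlens = [0]                       # offsets delimiting each sequence
    for n in lengths:
        cu_seqlens.append(cu_seqlens[-1] + n)
    return {
        "padded": b * s_max,               # latency tracks B x S_max
        "packed": cu_seqlens[-1],          # latency tracks T = sum(lengths)
        "cu_seqlens": cu_seqlens,
    }

# A skewed short-query batch: one 250-token outlier makes padding dominate.
stats = padded_vs_packed([8, 12, 9, 250])  # padded: 1000 tokens, packed: 279
```

With a skewed short-query distribution, the padded layout does ~3.6× the token work of the packed one in this toy batch, which is the mechanism behind "padding dominates compute and inflates tail latency".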

Key takeaways

  1. Short-request embedding inference is memory-bound, not compute-bound — batching moves it to compute-bound. The GPU spends its fixed per-request overheads (kernel launches, scheduling, attention-mask setup, pooling, normalisation) on tiny sequences and never approaches the saturation point where latency starts scaling linearly with token count. Below 600 tokens on voyage-3 / A100 latency is roughly flat. Combining many short sequences into one forward pass (a super-sequence of length T = Σ token_count_i) amortises those fixed costs and moves the workload toward compute-bound, raising both MFU and throughput nearly linearly until the saturation point. Serving sequentially leaves that region of the GPU curve unused (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).

  2. Padding removal is the inference-engine primitive that makes efficient batching possible. Traditional inference engines accept (B, S_max)-shaped input: all sequences in the batch padded to the longest sequence's length so tensors line up. Padding tokens "do no useful work but still consume compute and memory bandwidth, so latency scales with B × S_max instead of the actual token count." For short-request workloads with a highly-skewed token-length distribution, padding dominates compute and inflates tail latency. Padding removal — supported in vLLM and SGLang — concatenates all active sequences into one long super-sequence of length T = Σ token_count_i; attention masks and position indices ensure each sequence attends only to its own tokens; inference time now tracks T instead of B × S_max. Without padding removal token-count batching gains nothing; with padding removal it is the primary lever (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).

  3. Token-count is the right batch-size budget — request-count and time-window both oscillate between under- and over-fill. The post dismisses the two obvious alternatives from first principles: time-window batching oscillates between under-filled batches (short window, low latency, wasted fixed cost per batch) and over-filled batches (long window, higher queueing delay, batch may exceed saturation point) — "a single window size oscillates between under- and over-filling" because traffic is bursty, so the system toggles between memory-bound and compute-bound regimes; request-count batching has the same problem on a different control axis (N requests × unknown per-request token count ≠ stable FLOPs workload). Token-count batching aligns the batch budget directly with the quantity that determines GPU work, so batch-to-batch FLOPs work is predictable and the system stays near the saturation point under varying traffic (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).

  4. The optimal batch size is the model-and-hardware-specific saturation point, not a universal constant. "Our inference-latency-vs-token-count profiling of query inference shows a clear pattern: latency is approximately flat up to a threshold (saturation point) and then becomes approximately linear." For voyage-3 on A100 that threshold is ~600 tokens; it depends on model architecture, inference engine, and GPU, and has to be re-measured per (model, engine, GPU) triple. Setting optimal_batch_size = saturation_point balances throughput (compute-bound) against latency (still in the flat region). Most query inferences live in the memory-bound zone, far from the saturation point — which is why the aggregate batching win is so large (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).

  5. The queue substrate must support peek + atomic-claim-up-to-budget — general-purpose brokers don't. Token-count batching needs three primitives not in the classical FIFO-delivery model: (a) attach an estimated token_count to each request at enqueue time; (b) peek across pending requests (not consume one at a time); (c) atomically claim a subset whose total tokens fit the optimal batch size. RabbitMQ's prefetch is request-count-based + push (consumers can't peek + selectively claim); Kafka batches by bytes / messages within a partition (token count varies with text + tokeniser — no efficient way to batch by Σ token_count_i). The post names two practical paths: (a) place a lightweight aggregator in front of Kafka / RabbitMQ that consumes batches by token count before dispatching to model servers; (b) use a store that natively supports peek + conditional batching — the Voyage AI choice is Redis + Lua script, popping items until the optimal batch size is reached in a single atomic call, with per-item TTLs set in the same script. Rare Redis data loss → user sees 503 Service Unavailable and retries. At low QPS batches are partially filled, GPU utilisation stays low — "but latency still improves" (same batch won't overfill; fixed overheads still amortise over fewer items) (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).

  6. Query traffic is spiky and autoscaling is too slow — batching is the primary smoothing mechanism, not horizontal scale-out. Voyage AI explicitly frames spiky query traffic as a first-class design constraint: "Query traffic is pretty spiky, so autoscaling is too slow." Provisioning GPUs to handle peak sequentially wastes capacity off-peak; relying on autoscaling means tail-latency spikes during every burst because new GPUs take minutes to come online. Token-count batching instead absorbs the burst in-batcher: when many short queries arrive close together they group into a larger combined workload in a single forward pass — the same GPU processes more requests at once without scaling out. The headline result — "P90 end-to-end latency is more stable during traffic spikes, even with fewer GPUs" — is the structural consequence (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).

  7. Queries vs documents are two distinct serving problems with different batching regimes. Voyage AI distinguishes them explicitly: queries are "short, and their token-length distribution is highly skewed" with hard latency budgets "typically 100–300 ms"; documents are longer + batch-ingested offline. Query inference is memory-bound so aggressive short-request batching wins; document inference is already compute-bound / saturation-adjacent so the batching-for-MFU argument is weaker and the batching knobs should be tuned differently. The optimal batching strategy is request-class-specific, not global — which is why the post is scoped to "queries" and not generic "embedding inference" (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
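The claim-up-to-budget primitive from takeaway 5 can be sketched in-process. The post implements it atomically as a Redis Lua script that pops items until the token budget is reached and sets per-item TTLs in the same call; an in-memory deque cannot reproduce that atomicity, so this is a minimal logic sketch only, and names like claim_batch are illustrative assumptions, not from the post.

```python
from collections import deque

# In-process sketch of token-count-based batch claiming (NOT the Voyage AI
# code). In production this loop runs inside an atomic Redis Lua script so
# no two model servers can claim the same request.

OPTIMAL_BATCH_SIZE = 600  # saturation point for voyage-3 on A100, per the post

def claim_batch(queue: deque, budget: int = OPTIMAL_BATCH_SIZE) -> list:
    """Pop (request_id, token_count) items until the next one would exceed
    the token budget. Always takes at least one item so a single oversized
    request cannot stall the queue."""
    batch, used = [], 0
    while queue:
        req_id, tokens = queue[0]            # peek without consuming
        if batch and used + tokens > budget:
            break                            # next item would overfill the batch
        queue.popleft()                      # the atomic claim in the Redis version
        batch.append((req_id, tokens))
        used += tokens
    return batch

q = deque([("a", 120), ("b", 300), ("c", 250), ("d", 40)])
first = claim_batch(q)   # claims a + b (420 tokens); c would push the total to 670
```

Note the peek-then-claim shape: this is exactly the primitive the post says RabbitMQ (push, request-count prefetch) and Kafka (byte/message batching per partition) do not expose.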

Operational numbers (from the post)

Numbers named verbatim in the post. Where the post qualifies a number with "up to" we preserve that qualifier.

  • Saturation point (voyage-3 on A100): ~600 tokens.
  • Query-latency budget: typically 100–300 ms.
  • Headline production result (voyage-3-large query serving, new vs old pipeline): 50 % reduction in GPU inference latency with 3× fewer GPUs.
  • Gradual onboarding rollout across 7+ models:
    • vLLM reduces GPU inference time by up to ~20 ms for most models (engine + padding removal, independent of batching).
    • GPU utilisation and MFU increase — post-batching, inference sits closer to the compute-bound regime.
    • Throughput improves by up to 8× via token-count batching.
    • P90 end-to-end latency drops by 60+ ms on some model servers under contention — queueing time reduced.
    • P90 more stable during traffic spikes — even with fewer GPUs.
  • Old pipeline: no batching + Hugging Face Inference.
  • New pipeline: token-count-batched + vLLM.
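The ~600-token saturation point is specific to the (model, engine, GPU) triple and has to be re-profiled when any of them changes; the post does not describe its fitting procedure. One minimal way to locate the flat-then-linear elbow from latency-vs-token-count samples is sketched below, under the assumption that the smallest sample represents the flat-region latency; the tolerance value is illustrative.

```python
# Hedged sketch: locate the flat-then-linear elbow in a latency profile by
# taking the largest token count whose latency is still within a tolerance
# of the flat-region latency. The post reports the shape but not its fitting
# method; this is one simple reconstruction.

def find_saturation_point(samples, tolerance=1.10):
    """samples: list of (token_count, latency_ms), sorted by token count.
    Returns the largest token count still in the ~flat latency region."""
    base = samples[0][1]                  # flat-region latency estimate
    saturation = samples[0][0]
    for tokens, latency in samples:
        if latency <= base * tolerance:
            saturation = tokens           # still flat at this token count
        else:
            break                         # latency has started scaling linearly
    return saturation

# Synthetic profile shaped like the post's observation: ~flat to 600, then linear.
profile = [(100, 10.0), (300, 10.2), (600, 10.5), (900, 15.5), (1200, 21.0)]
elbow = find_saturation_point(profile)  # -> 600
```

Setting optimal_batch_size to this elbow is the post's stated policy: throughput is maximised (compute-bound) while per-batch latency stays in the flat region.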

Caveats

  • Explicit non-generalisability disclaimer. "These results are based on our specific implementations of the 'new' and 'old' pipelines, and are not necessarily generalisable." The old-pipeline baseline was no batching on HF Inference, so the reported per-component gains — the "vLLM alone" latency delta and the 8× throughput from "batching alone" — aren't cleanly attributed; some of each gain is the switch between engines. Token-count batching on top of an already-optimised HF Inference deployment would show smaller deltas.
  • Saturation point is per-(model, engine, GPU). 600 tokens is voyage-3 + vLLM + A100. A different model (bigger embedding dimension, deeper transformer, different attention), inference engine (SGLang, TGI, TensorRT-LLM), or GPU (H100, H200, A10G, L4) shifts the flat-then-linear elbow — has to be re-profiled.
  • Token-count estimate is an estimate. Each request is enqueued with an "estimated token_count" — the post doesn't specify whether it's computed by a fast-path estimator (char-count / ratio) or by the actual tokeniser. If it's the actual tokeniser, enqueue path pays tokenisation cost twice (once for batch accounting, once inside the engine); if it's an estimator, batches can over- or under-fill relative to the true Σ token count.
  • Redis as the queue substrate isn't durable by default. The post acknowledges "the probability of Redis losing data is very low. In the rare case that it does happen, users may receive 503 Service Unavailable errors and can simply retry." That's a client-visible correctness trade, not a transparent fallback. Production callers must implement idempotent retry against 503s; long-running async embedding jobs wouldn't tolerate this.
  • No numbers on optimal-batch-size auto-tuning. The post names the saturation point as the right batch size but doesn't describe how the running system tracks or re-measures it under model / GPU / inference-engine changes.
  • No document-side pipeline detail. The query side of the serving stack is thoroughly characterised; document-ingestion batching is mentioned only as "other requests are called documents", not designed in detail.
  • No serving-stack topology details. GPU counts, model-server replica counts, horizontal-pod-autoscaler targets, Redis cluster topology, and cross-region routing are all undisclosed. The 3× GPU reduction is an aggregate ratio; absolute GPU counts are not given.
  • No cost or $/M-embedding figures. "Faster, Cheaper" is the title, but the cost side is only implied through the 3× GPU reduction; no explicit $ / M-embeddings or COGS delta disclosed.
  • Voyage-3 / voyage-3-large distinction is slightly muddled. The saturation-point profiling is attributed to "our voyage-3 model running on A100"; the headline 50 % latency / 3× GPU result comes from a production experiment on voyage-3-large serving. Both models are named, but the post doesn't report a per-model-family saturation-point analysis beyond voyage-3's 600 tokens.
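On the token-count-estimate caveat above: the post doesn't say whether the enqueue-time token_count comes from the real tokeniser or a fast estimator. A cheap char-ratio heuristic with a small safety margin is one hedged option: it avoids tokenising twice, and over-estimating slightly biases batches toward under-filling rather than overfilling relative to the true Σ token count. The ratio and margin below are illustrative assumptions, not from the post.

```python
# Hedged sketch of a fast-path token-count estimator. ~4 chars/token is a
# common rule of thumb for English text; the 15% safety margin makes the
# estimate err high so a claimed batch tends to stay under the true budget.
# Both constants are illustrative, not from the post.

CHARS_PER_TOKEN = 4.0
SAFETY_MARGIN = 1.15  # bias toward over-estimating

def estimate_token_count(text: str) -> int:
    """Cheap enqueue-time token estimate; avoids tokenising twice."""
    raw = len(text) / CHARS_PER_TOKEN
    return max(1, int(raw * SAFETY_MARGIN + 0.5))  # round, floor at 1 token
```

The trade-off matches the caveat as stated: an estimator keeps the enqueue path cheap but lets batches drift from the true token total; using the real tokeniser makes batches exact but pays tokenisation twice.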
