Token-count-based batching¶
Definition¶
Token-count-based batching is the GPU-inference-serving discipline of grouping pending requests into batches whose total token count is bounded by a fixed budget — typically the hardware + model's saturation point — instead of by total request count or by an arbitrary time window. The batch budget takes the form

Σ token_count_i ≤ optimal_batch_size

and the scheduler claims pending requests atomically up to that budget before dispatching to the model server (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
Why the other two axes don't work¶
The 2025-12-18 Voyage AI / MongoDB post dismisses the two obvious alternatives from first principles:
- Time-window batching — collect requests for Δt ms, then dispatch. "Time-window batching swings between under- and over-filled batches depending on traffic." A short window → under-filled batches: low latency, but the fixed cost per batch is wasted. A long window → over-filled batches: higher queueing delay and a risk of exceeding the saturation point. Under bursty traffic a single window size oscillates between under- and over-filling — the system toggles between memory-bound and compute-bound regimes.
- Request-count batching — collect N requests, then dispatch. Same oscillation, different control axis: N requests × an unknown per-request token count ≠ a stable FLOPs workload. The FLOPs demanded by a batch of N short queries vs N long documents differ by orders of magnitude.
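The oscillation argument can be made concrete with a toy simulation. The token counts, the batch size of 8, and the 2048-token budget below are illustrative assumptions, not figures from the source:

```python
import random

random.seed(0)

# Hypothetical mixed traffic: many short queries plus occasional long documents.
requests = [random.choice([8, 16, 32, 2000]) for _ in range(64)]

def batch_by_count(reqs, n):
    """Request-count batching: every N requests form a batch."""
    return [reqs[i:i + n] for i in range(0, len(reqs), n)]

def batch_by_tokens(reqs, budget):
    """Token-count batching: greedily fill each batch up to the token budget."""
    batches, cur, cur_tokens = [], [], 0
    for t in reqs:
        if cur and cur_tokens + t > budget:
            batches.append(cur)
            cur, cur_tokens = [], 0
        cur.append(t)
        cur_tokens += t
    if cur:
        batches.append(cur)
    return batches

count_totals = [sum(b) for b in batch_by_count(requests, 8)]
token_totals = [sum(b) for b in batch_by_tokens(requests, budget=2048)]
print("per-batch tokens, count-based:", min(count_totals), "to", max(count_totals))
print("per-batch tokens, token-based:", min(token_totals), "to", max(token_totals))
```

With count-based batches the per-batch token total (and hence GPU work) swings by orders of magnitude depending on how many long documents land in each group of 8; with token-based batches every total stays at or below the budget.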
Token count is the quantity that determines GPU work for transformer inference with padding removed: inference time tracks T = Σ token_count_i directly. Bounding the batch by T therefore bounds the batch's compute demand, so the system stays near the saturation point across varying traffic shapes (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
Why it's only possible with padding removal¶
On a classical inference engine that pads all sequences in a batch to the longest sequence's length (tensor shape (B, S_max)), inference time scales with B × S_max — not with Σ token_count_i. A batch containing one long sequence and many short ones pays for B × S_max compute regardless of how short the short ones are. In that regime, batching by token count provides no scheduling benefit, because the GPU's time cost is B × S_max, not Σ token_count_i.
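A toy batch makes the padded-regime waste concrete (the sequence lengths below are illustrative, not from the source):

```python
# Token counts for one illustrative batch: one long sequence, many short ones.
lengths = [512, 8, 8, 8, 8, 8, 8, 8]

B = len(lengths)
S_max = max(lengths)

padded_tokens = B * S_max     # what a pad-to-longest engine computes over
packed_tokens = sum(lengths)  # what a padding-removal engine computes over

print(f"padded: {padded_tokens} token slots")  # 8 * 512 = 4096
print(f"packed: {packed_tokens} token slots")  # 568
print(f"waste:  {1 - packed_tokens / padded_tokens:.0%}")
```

Here ~86% of the padded compute is spent on padding tokens, which is why bounding the batch by Σ token_count_i only controls GPU work once padding is removed.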
Padding removal (supported in vLLM and SGLang) concatenates all active sequences into a single super-sequence of length T = Σ token_count_i, with attention masks and position indices ensuring each sequence attends only to its own tokens. With padding removal, inference time tracks T; then, and only then, does bounding the batch by Σ token_count_i bound the batch's forward-pass time (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
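A minimal sketch of the bookkeeping padding removal implies: concatenation, per-sequence position indices, and boundary offsets. The `cu_seqlens` name follows the convention of varlen attention kernels; this is an illustrative sketch, not vLLM's or SGLang's actual implementation:

```python
def pack(sequences):
    """Concatenate sequences into one super-sequence, tracking per-sequence
    position indices and boundaries so attention can be masked per sequence."""
    tokens, positions, seq_ids = [], [], []
    cu_seqlens = [0]  # cumulative boundaries, as used by varlen attention kernels
    for sid, seq in enumerate(sequences):
        tokens.extend(seq)
        positions.extend(range(len(seq)))  # position indices restart per sequence
        seq_ids.extend([sid] * len(seq))
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return tokens, positions, seq_ids, cu_seqlens

toks, pos, sids, bounds = pack([[101, 7, 9], [101, 4], [101, 5, 6, 8]])
print(toks)    # one super-sequence of length T = 3 + 2 + 4 = 9
print(pos)     # [0, 1, 2, 0, 1, 0, 1, 2, 3]
print(bounds)  # [0, 3, 5, 9]
```

The mask rule is that token i may attend token j only when `seq_ids[i] == seq_ids[j]`; the forward pass then runs over T tokens rather than B × S_max slots.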
What it optimises¶
The combined effect of padding removal + token-count batching is:
- Near-linear MFU scaling toward the saturation point as small requests are batched together.
- Higher throughput in the same forward pass — more requests served per GPU-second.
- Lower per-request latency amortised over the batch (fixed per-batch overheads — kernel launches, scheduling, attention-mask setup, pooling, normalisation — are paid once per super-sequence, not once per request).
- Stable tail latency under traffic spikes — bursts of short queries are absorbed into the batch instead of requiring GPU scale-out. Voyage AI reports that P90 end-to-end latency stayed more stable during traffic spikes, even with fewer GPUs.
The observed production result of combining padding removal + token-count batching on voyage-3-large query serving vs the old no-batching + HF Inference pipeline: 50 % reduction in GPU inference latency, 3× fewer GPUs, up to 8× throughput improvement across 7+ models onboarded (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
What the substrate has to provide¶
Token-count batching requires the queue / store feeding the model server to support three primitives general-purpose brokers lack (see patterns/atomic-conditional-batch-claim):
- Per-request token-count attachment at enqueue time — the quantity the scheduler is budgeting against must travel with the request.
- Peek across pending requests — not single-item consume.
- Atomic claim up to a budget — pop all items whose cumulative token-count fits the optimal batch size in one operation, so two model-server workers don't race on the same item.
Neither RabbitMQ (request-count prefetch, push delivery, no peek) nor Kafka (partition-local byte/message batching; token count varies with text and tokeniser) satisfies these natively. Voyage AI uses Redis with a Lua script that pops items until the batch budget is reached and sets per-item TTLs in the same atomic call (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
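The production version is a Redis Lua script; the same claim semantics can be sketched in-process, with a lock standing in for Redis's single-threaded atomicity. Class and method names are hypothetical, and the 600-token budget simply echoes the ~600-token saturation point reported for voyage-3 on A100:

```python
import threading
from collections import deque

class TokenBudgetQueue:
    """In-memory sketch of the atomic budget-bounded claim. In production this
    is a Redis Lua script, which Redis executes atomically so two model-server
    workers can never claim the same item; here a lock plays that role."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = deque()  # (request_id, token_count)

    def enqueue(self, request_id, token_count):
        # Primitive 1: the token count travels with the request.
        with self._lock:
            self._pending.append((request_id, token_count))

    def claim(self, budget):
        # Primitives 2 + 3: peek across pending items and pop, in one atomic
        # step, every item that still fits the cumulative token budget.
        with self._lock:
            batch, total = [], 0
            while self._pending and total + self._pending[0][1] <= budget:
                rid, tc = self._pending.popleft()
                batch.append(rid)
                total += tc
            return batch, total

q = TokenBudgetQueue()
for rid, tc in [("a", 300), ("b", 200), ("c", 250), ("d", 50)]:
    q.enqueue(rid, tc)
batch, total = q.claim(budget=600)
print(batch, total)  # "a" and "b" fit; "c" would push the total past 600
```

The sketch stops at the first item that does not fit, preserving FIFO order rather than skipping ahead to smaller requests; that avoids starving long requests, but it is a design assumption of this sketch, not something the source specifies.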
When to apply¶
Token-count batching is most valuable when:
- Requests are short and the workload is memory-bound (far from the saturation point).
- Token-length distribution is highly skewed (few long requests, many short) — the compute-waste under padding is large.
- Traffic is spiky and autoscaling response time is slower than the spike.
- There is a hard latency SLO (e.g. 100–300 ms for retrieval queries).
It's less valuable when:
- Requests are already long or batched offline (document-side embedding ingestion).
- Inference is already compute-bound / saturation-adjacent — batching can still help but the delta is smaller.
- The inference engine pads inputs anyway — the token-count control axis doesn't align with actual GPU work.
Adjacent batching disciplines¶
- Continuous batching (vLLM's flagship technique) — dynamic insertion of new decoding requests into an in-flight batch at each decoding step, driven by per-request completion. Different control axis: token-count batching gates batch admission; continuous batching schedules per-step composition. The two compose.
- Dynamic batching (TF Serving, Triton) — group requests by shape / model within an adaptive time window. Classical server-side batching, doesn't address the token-count vs request-count axis described here.
- Prefix-shared batching — group requests sharing common prompt prefixes to amortise KV-cache compute. Orthogonal axis; token-count batching doesn't assume prefix structure.
Seen in¶
- 2025-12-18 Voyage AI / MongoDB — Token-count-based batching — canonical wiki instance; voyage-3 on A100, saturation point ~600 tokens; batch claimed from Redis via an atomic Lua script up to optimal_batch_size (sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).