

Token-count-based batching

Definition

Token-count-based batching is the GPU inference-serving discipline of grouping pending requests into batches whose total token count is bounded by a fixed budget — typically the saturation point of the hardware-and-model pair — rather than by total request count or by an arbitrary time window. The batch budget takes the form

Σ token_count_i ≤ optimal_batch_size

and the scheduler claims pending requests atomically up to that budget before dispatching to the model server (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
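A minimal sketch of the claim rule, assuming a FIFO queue of (request_id, token_count) pairs and an empirically measured `optimal_batch_size`; the names are hypothetical, not the source's implementation:

```python
def claim_batch(pending, optimal_batch_size):
    """Greedily claim pending requests (FIFO, no reordering) whose
    cumulative token count satisfies Σ token_count_i ≤ optimal_batch_size.

    `pending` is a list of (request_id, token_count) pairs; claimed items
    are removed from it. Returns the claimed ids and their token total.
    """
    batch, total = [], 0
    while pending and total + pending[0][1] <= optimal_batch_size:
        req_id, tokens = pending.pop(0)
        batch.append(req_id)
        total += tokens
    return batch, total

batch, total = claim_batch([("a", 120), ("b", 300), ("c", 4000), ("d", 50)], 512)
# "a" and "b" fit (420 ≤ 512); "c" would overflow the budget, so the claim stops there.
```

Stopping at the first item that would overflow (rather than skipping it to reach "d") preserves FIFO fairness; whether to skip-and-fill is a policy choice the source does not specify.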

Why the other two axes don't work

The 2025-12-18 Voyage AI / MongoDB post rules out the two obvious alternatives from first principles:

  • Time-window batching — collect requests for Δt ms, then dispatch. "Time-window batching swings between under- and over-filled batches depending on traffic": a short window yields under-filled batches (low latency, but the fixed per-batch cost is amortised over too little work); a long window yields over-filled batches (higher queueing delay and the risk of exceeding the saturation point). Under bursty traffic no single window size fits, and the system toggles between memory-bound and compute-bound regimes.
  • Request-count batching — collect N requests, then dispatch. Same oscillation, different control axis: N requests times an unknown per-request token count is not a stable FLOPs workload. The FLOPs demanded by a batch of N short queries versus N long documents can differ by orders of magnitude.
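The order-of-magnitude claim is easy to check with illustrative numbers (all values hypothetical, chosen to mimic a retrieval query versus a document chunk):

```python
# Same request-count batch (N = 32), wildly different token workloads:
N = 32
short_query_tokens = 16    # a short retrieval query
long_doc_tokens = 4000     # a long document chunk

work_short = N * short_query_tokens  # tokens of GPU work for the query batch
work_long = N * long_doc_tokens      # tokens of GPU work for the document batch
ratio = work_long / work_short       # same N, ~2 orders of magnitude apart
```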

Token count is the quantity that determines GPU work for transformer inference once padding is removed: inference time tracks T = Σ token_count_i directly. Bounding the batch by T therefore bounds the batch's compute demand, so the system stays near the saturation point across varying traffic shapes (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).

Why it's only possible with padding removal

On a classical inference engine that pads all sequences in a batch to the longest sequence's length (a (B, S_max) tensor), inference time scales with B × S_max — not with Σ token_count_i. A batch containing one long sequence and many short ones pays for B × S_max compute regardless of how short the short ones are. In that regime, batching by token count provides no scheduling benefit, because the GPU's time cost is B × S_max, not Σ token_count_i.

Padding removal (supported in vLLM and SGLang) concatenates all active sequences into a single super-sequence of length T = Σ token_count_i, with attention masks + position indices ensuring each sequence attends only to its own tokens. With padding removal, inference time tracks T; then and only then does bounding the batch by Σ token_count bound the batch's forward-pass time (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
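The packing step can be sketched as follows, assuming the cumulative-sequence-boundary (`cu_seqlens`) convention that varlen-attention kernels use to restrict attention per sequence; the interface here is illustrative, not vLLM's or SGLang's actual API:

```python
def pack_sequences(seqs):
    """Concatenate token sequences into one super-sequence of length
    T = Σ len(seq_i). Position indices restart per sequence, and
    cu_seqlens marks each sequence's boundary so an attention kernel
    can keep each sequence attending only to its own tokens."""
    tokens, positions, cu_seqlens = [], [], [0]
    for seq in seqs:
        tokens.extend(seq)
        positions.extend(range(len(seq)))  # per-sequence position ids
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return tokens, positions, cu_seqlens

tokens, positions, cu = pack_sequences([[5, 6, 7], [8, 9], [10]])
# T = 6 tokens of work, versus B × S_max = 3 × 3 = 9 padded slots.
```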

What it optimises

The combined effect of padding removal + token-count batching is:

  • Near-linear MFU scaling toward the saturation point as small requests are batched together.
  • Higher throughput in the same forward pass — more requests served per GPU-second.
  • Lower per-request latency amortised over the batch (fixed per-batch overheads — kernel launches, scheduling, attention-mask setup, pooling, normalisation — are paid once per super-sequence, not once per request).
  • Stable tail latency under traffic spikes — bursts of short queries are absorbed into the batch instead of requiring GPU scale-out. Voyage AI reports that P90 end-to-end latency stayed more stable during traffic spikes, even with fewer GPUs.

In production, combining padding removal + token-count batching for voyage-3-large query serving (versus the previous no-batching HF Inference pipeline) yielded a 50 % reduction in GPU inference latency, 3× fewer GPUs, and up to 8× throughput improvement across the 7+ models onboarded (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).

What the substrate has to provide

Token-count batching requires the queue / store feeding the model server to support three primitives general-purpose brokers lack (see patterns/atomic-conditional-batch-claim):

  1. Per-request token-count attachment at enqueue time — the quantity the scheduler is budgeting against must travel with the request.
  2. Peek across pending requests — not single-item consume.
  3. Atomic claim up to a budget — pop all items whose cumulative token-count fits the optimal batch size in one operation, so two model-server workers don't race on the same item.

Neither RabbitMQ (request-count prefetch, push delivery, no peek) nor Kafka (partition-local byte / message batching; token count varies with text and tokeniser) satisfies this natively. Voyage AI uses Redis with a Lua script that pops items until the batch budget is reached and sets per-item TTLs in the same atomic call (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
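An in-memory Python analogue of what the Lua script guarantees: peek and pop happen inside one atomicity domain, so two workers can never claim the same item. A lock stands in for Redis's single-threaded script execution, TTL handling is omitted, and all names are hypothetical:

```python
import threading
from collections import deque

class TokenBudgetQueue:
    """In-memory stand-in for the Redis + Lua pattern: the token count
    travels with each request, and workers claim up to a budget atomically."""

    def __init__(self):
        self._q = deque()            # (request_id, token_count), FIFO
        self._lock = threading.Lock()

    def enqueue(self, request_id, token_count):
        self._q.append((request_id, token_count))

    def claim(self, budget):
        # Peek + pop under one lock: concurrent workers serialise here,
        # mirroring the atomic claim the Lua script provides in Redis.
        with self._lock:
            batch, total = [], 0
            while self._q and total + self._q[0][1] <= budget:
                req_id, tokens = self._q.popleft()
                batch.append(req_id)
                total += tokens
            return batch
```

Two model-server workers calling `claim` concurrently each get a disjoint batch, which is exactly the race the source says general-purpose brokers cannot prevent natively.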

When to apply

Token-count batching is most valuable when:

  • Requests are short and the workload is memory-bound (far from the saturation point).
  • Token-length distribution is highly skewed (few long requests, many short) — the compute-waste under padding is large.
  • Traffic is spiky and autoscaling response time is slower than the spike.
  • There is a hard latency SLO (e.g. 100–300 ms for retrieval queries).

It's less valuable when:

  • Requests are already long / batched-offline (document-side embedding ingestion).
  • Inference is already compute-bound / saturation-adjacent — batching can still help but the delta is smaller.
  • The inference engine pads inputs anyway — the token-count control axis doesn't align with actual GPU work.

Adjacent batching disciplines

  • Continuous batching (vLLM's flagship technique) — dynamic insertion of new decoding requests into an in-flight batch at each decoding step, driven by per-request completion. Different control axis: token-count batching gates batch admission; continuous batching schedules per-step composition. The two compose.
  • Dynamic batching (TF Serving, Triton) — group requests by shape / model within an adaptive time window. Classical server-side batching, doesn't address the token-count vs request-count axis described here.
  • Prefix-shared batching — group requests sharing common prompt prefixes to amortise KV-cache compute. Orthogonal axis; token-count batching doesn't assume prefix structure.
