

Token-count-based batching

Definition

Token-count-based batching is the GPU inference-serving discipline of grouping pending requests into batches whose total token count is bounded by a fixed budget — typically the saturation point of the hardware-and-model pair — rather than by total request count or by an arbitrary time window. The batch budget takes the form

Σ token_count_i ≤ optimal_batch_size

and the scheduler claims pending requests atomically up to that budget before dispatching to the model server (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
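A minimal sketch of the claim rule, assuming a FIFO queue of (request_id, token_count) pairs and an empirically measured `optimal_batch_size`; the names are hypothetical, not the source's implementation:

```python
def claim_batch(pending, optimal_batch_size):
    """Greedily claim pending requests (FIFO, no reordering) whose
    cumulative token count satisfies Σ token_count_i ≤ optimal_batch_size.

    `pending` is a list of (request_id, token_count) pairs; claimed items
    are removed from it. Returns the claimed ids and their token total.
    """
    batch, total = [], 0
    while pending and total + pending[0][1] <= optimal_batch_size:
        req_id, tokens = pending.pop(0)
        batch.append(req_id)
        total += tokens
    return batch, total

batch, total = claim_batch([("a", 120), ("b", 300), ("c", 4000), ("d", 50)], 512)
# "a" and "b" fit (420 ≤ 512); "c" would overflow the budget, so the claim stops there.
```

Stopping at the first item that would overflow (rather than skipping it to reach "d") preserves FIFO fairness; whether to skip-and-fill is a policy choice the source does not specify.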

Why the other two axes don't work

The 2025-12-18 Voyage AI / MongoDB post rules out the two obvious alternatives from first principles:

  • Time-window batching — collect requests for Δt ms, then dispatch. "Time-window batching swings between under- and over-filled batches depending on traffic": a short window yields under-filled batches (low latency, but the fixed per-batch cost is amortised over too little work); a long window yields over-filled batches (higher queueing delay and the risk of exceeding the saturation point). Under bursty traffic no single window size fits, and the system toggles between memory-bound and compute-bound regimes.
  • Request-count batching — collect N requests, then dispatch. Same oscillation, different control axis: N requests times an unknown per-request token count is not a stable FLOPs workload. The FLOPs demanded by a batch of N short queries versus N long documents can differ by orders of magnitude.
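The order-of-magnitude claim is easy to check with illustrative numbers (all values hypothetical, chosen to mimic a retrieval query versus a document chunk):

```python
# Same request-count batch (N = 32), wildly different token workloads:
N = 32
short_query_tokens = 16    # a short retrieval query
long_doc_tokens = 4000     # a long document chunk

work_short = N * short_query_tokens  # tokens of GPU work for the query batch
work_long = N * long_doc_tokens      # tokens of GPU work for the document batch
ratio = work_long / work_short       # same N, ~2 orders of magnitude apart
```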

Token count is the quantity that determines GPU work for transformer inference once padding is removed: inference time tracks T = Σ token_count_i directly. Bounding the batch by T therefore bounds the batch's compute demand, so the system stays near the saturation point across varying traffic shapes (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).

Why it's only possible with padding removal

On a classical inference engine that pads all sequences in a batch to the longest sequence's length (a (B, S_max) tensor), inference time scales with B × S_max — not with Σ token_count_i. A batch containing one long sequence and many short ones pays for B × S_max compute regardless of how short the short ones are. In that regime, batching by token count provides no scheduling benefit, because the GPU's time cost is B × S_max, not Σ token_count_i.

Padding removal (supported in vLLM and SGLang) concatenates all active sequences into a single super-sequence of length T = Σ token_count_i, with attention masks + position indices ensuring each sequence attends only to its own tokens. With padding removal, inference time tracks T; then and only then does bounding the batch by Σ token_count bound the batch's forward-pass time (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
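The packing step can be sketched as follows, assuming the cumulative-sequence-boundary (`cu_seqlens`) convention that varlen-attention kernels use to restrict attention per sequence; the interface here is illustrative, not vLLM's or SGLang's actual API:

```python
def pack_sequences(seqs):
    """Concatenate token sequences into one super-sequence of length
    T = Σ len(seq_i). Position indices restart per sequence, and
    cu_seqlens marks each sequence's boundary so an attention kernel
    can keep each sequence attending only to its own tokens."""
    tokens, positions, cu_seqlens = [], [], [0]
    for seq in seqs:
        tokens.extend(seq)
        positions.extend(range(len(seq)))  # per-sequence position ids
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return tokens, positions, cu_seqlens

tokens, positions, cu = pack_sequences([[5, 6, 7], [8, 9], [10]])
# T = 6 tokens of work, versus B × S_max = 3 × 3 = 9 padded slots.
```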

What it optimises

The combined effect of padding removal + token-count batching is:

  • Near-linear MFU scaling toward the saturation point as small requests are batched together.
  • Higher throughput in the same forward pass — more requests served per GPU-second.
  • Lower per-request latency amortised over the batch (fixed per-batch overheads — kernel launches, scheduling, attention-mask setup, pooling, normalisation — are paid once per super-sequence, not once per request).
  • Stable tail latency under traffic spikes — bursts of short queries are absorbed into the batch instead of requiring GPU scale-out. Voyage AI reports that P90 end-to-end latency stayed more stable during traffic spikes, even with fewer GPUs.

In production, combining padding removal + token-count batching for voyage-3-large query serving (versus the previous no-batching HF Inference pipeline) yielded a 50 % reduction in GPU inference latency, 3× fewer GPUs, and up to 8× throughput improvement across the 7+ models onboarded (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).

What the substrate has to provide

Token-count batching requires the queue / store feeding the model server to support three primitives general-purpose brokers lack (see patterns/atomic-conditional-batch-claim):

  1. Per-request token-count attachment at enqueue time — the quantity the scheduler is budgeting against must travel with the request.
  2. Peek across pending requests — not single-item consume.
  3. Atomic claim up to a budget — pop all items whose cumulative token-count fits the optimal batch size in one operation, so two model-server workers don't race on the same item.

Neither RabbitMQ (request-count prefetch, push delivery, no peek) nor Kafka (partition-local byte / message batching; token count varies with text and tokeniser) satisfies this natively. Voyage AI uses Redis with a Lua script that pops items until the batch budget is reached and sets per-item TTLs in the same atomic call (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
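An in-memory Python analogue of what the Lua script guarantees: peek and pop happen inside one atomicity domain, so two workers can never claim the same item. A lock stands in for Redis's single-threaded script execution, TTL handling is omitted, and all names are hypothetical:

```python
import threading
from collections import deque

class TokenBudgetQueue:
    """In-memory stand-in for the Redis + Lua pattern: the token count
    travels with each request, and workers claim up to a budget atomically."""

    def __init__(self):
        self._q = deque()            # (request_id, token_count), FIFO
        self._lock = threading.Lock()

    def enqueue(self, request_id, token_count):
        self._q.append((request_id, token_count))

    def claim(self, budget):
        # Peek + pop under one lock: concurrent workers serialise here,
        # mirroring the atomic claim the Lua script provides in Redis.
        with self._lock:
            batch, total = [], 0
            while self._q and total + self._q[0][1] <= budget:
                req_id, tokens = self._q.popleft()
                batch.append(req_id)
                total += tokens
            return batch
```

Two model-server workers calling `claim` concurrently each get a disjoint batch, which is exactly the race the source says general-purpose brokers cannot prevent natively.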

When to apply

Token-count batching is most valuable when:

  • Requests are short and the workload is memory-bound (far from the saturation point).
  • Token-length distribution is highly skewed (few long requests, many short) — the compute-waste under padding is large.
  • Traffic is spiky and autoscaling response time is slower than the spike.
  • There is a hard latency SLO (e.g. 100–300 ms for retrieval queries).

It's less valuable when:

  • Requests are already long / batched-offline (document-side embedding ingestion).
  • Inference is already compute-bound / saturation-adjacent — batching can still help but the delta is smaller.
  • The inference engine pads inputs anyway — the token-count control axis doesn't align with actual GPU work.

Adjacent batching disciplines

  • Continuous batching (vLLM's flagship technique) — dynamic insertion of new decoding requests into an in-flight batch at each decoding step, driven by per-request completion. Different control axis: token-count batching gates batch admission; continuous batching schedules per-step composition. The two compose.
  • Dynamic batching (TF Serving, Triton) — group requests by shape / model within an adaptive time window. Classical server-side batching, doesn't address the token-count vs request-count axis described here.
  • Prefix-shared batching — group requests sharing common prompt prefixes to amortise KV-cache compute. Orthogonal axis; token-count batching doesn't assume prefix structure.
