Padding removal (variable-length inference)¶
Definition¶
Padding removal (also called variable-length processing) is the inference-engine technique of serving a batch of variable-length transformer inputs as a single concatenated super-sequence instead of padding each sequence to the batch's longest length. Attention masks + position indices restrict each sequence's attention to its own tokens, so the compute cost tracks the actual token count rather than the padded rectangle (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
Traditional (B, S_max) batching and its waste¶
Classical inference engines accept requests in the shape (B, S_max), where B is the batch size and S_max is the longest sequence in the batch. All shorter sequences are padded with special <PAD> tokens so tensors line up for uniform GPU kernels. Consequence: inference time scales with B × S_max, not with the actual token count. When the token-length distribution is highly skewed (a few long sequences, many short ones — the canonical pattern for search / retrieval / recommendation queries), the pad tokens "do no useful work but still consume compute and memory bandwidth, so latency scales with B × S_max instead of the actual token count." In Voyage AI's framing:
- Wasted compute on pad tokens can be the majority of the batch's GPU work at high skew.
- Tail latency is inflated — the longest sequence sets the batch's wall-clock time; one outlier slows everybody.
- Memory bandwidth for pad tokens is real — they move through the pipeline like useful tokens do.
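The B × S_max waste is easy to quantify. A minimal sketch, with hypothetical lengths chosen to illustrate the skewed distribution the text describes (one long outlier among short queries), not numbers from the source:

```python
# Sketch: fraction of a padded (B, S_max) batch that is <PAD> tokens.
# The lengths below are hypothetical illustration values.
def pad_waste(lengths):
    """Fraction of the (B, S_max) rectangle spent on padding."""
    b, s_max = len(lengths), max(lengths)
    return 1 - sum(lengths) / (b * s_max)

# Many short queries plus one long outlier: the outlier sets S_max.
lengths = [12, 9, 15, 11, 8, 14, 10, 512]
print(len(lengths) * max(lengths))   # padded rectangle: 8 * 512 = 4096 tokens
print(sum(lengths))                  # real tokens: 591
print(f"{pad_waste(lengths):.0%}")   # ~86% of the batch's work is padding
```

One outlier inflates every other request's cost: removing the 512-token sequence from this batch would shrink S_max to 15 and the waste to a few percent.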
The super-sequence primitive¶
Padding removal restructures the batch: concatenate all active sequences end-to-end into one long super-sequence of total length T = Σᵢ Sᵢ, and hand the inference engine a combined tensor of shape (1, T) (or equivalent). Per-sequence boundaries are preserved via two bookkeeping structures that travel with the super-sequence:
- Attention masks — block tokens from one sequence from attending to tokens of another. Usually implemented as sequence-ID vectors consumed by fused attention kernels (FlashAttention family variants with "varlen" support).
- Position indices — per-sequence position IDs so each sequence's positional encoding restarts at 0 at its own start, not at the super-sequence start.
The result: inference time now tracks T, so "GPU work aligned with what matters." No pad-token compute, no tail-latency inflation from one long sequence, no wasted memory bandwidth.
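The packing step can be sketched in plain Python. This builds the two bookkeeping structures described above: restart-at-0 position IDs, and cumulative boundary offsets in the `cu_seqlens` convention used by FlashAttention-family varlen kernels. The token-ID lists are placeholder values, not a real tokenizer's output:

```python
# Sketch: pack variable-length sequences into one (1, T) super-sequence.
def pack(sequences):
    packed, position_ids, cu_seqlens = [], [], [0]
    for seq in sequences:
        packed.extend(seq)
        # Positional encoding restarts at 0 at each sequence's own start.
        position_ids.extend(range(len(seq)))
        # Cumulative lengths mark where each sequence begins/ends; the
        # attention kernel uses these to block cross-sequence attention.
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return packed, position_ids, cu_seqlens

# Placeholder token IDs for three short sequences.
seqs = [[101, 7, 8, 102], [101, 9, 102], [101, 3, 4, 5, 102]]
tokens, pos, cu = pack(seqs)
print(pos)  # [0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4]
print(cu)   # [0, 4, 7, 12] -> three sequences, T = 12
```

A varlen attention kernel consumes `cu_seqlens` directly; equivalently, the boundaries define a block-diagonal attention mask over the super-sequence.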
Engine support¶
Not all inference engines implement padding removal. Voyage AI names two that do:
- vLLM — packed-sequence serving via paged attention + FlashAttention varlen kernels; canonical open-source engine for efficient transformer serving.
- SGLang — similar packed-sequence support; designed for LLM serving with structured generation.
Engines without padding removal (e.g. Hugging Face Inference in its default configuration — the baseline Voyage AI compared against) pay the B × S_max cost unconditionally.
Why padding removal unlocks token-count batching¶
Token-count-based batching is only meaningful on an engine with padding removal. Without padding removal, inference time is B × S_max, so bounding the batch by Σ token_count doesn't bound the batch's forward-pass time. Padding removal aligns the GPU's work axis with the scheduler's budget axis, at which point gating batch admission by token count gates batch latency.
The two primitives compose:
padding removal:      inference time = T = Σ token_count_i
token-count batching: admit requests until Σ token_count ≤ optimal_batch_size
→ batch forward-pass latency ≤ inference time at optimal_batch_size (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
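The admission rule composes with packing in a few lines. A minimal sketch of the scheduler side only: `optimal_batch_size` is the token budget named above, and the queue entries (request ID, token count) are hypothetical:

```python
# Sketch: gate batch admission by cumulative token count, not request count.
def admit(queue, optimal_batch_size):
    """Pop requests from the front of the queue until the next one
    would push the batch past the token budget."""
    batch, total = [], 0
    while queue and total + queue[0][1] <= optimal_batch_size:
        req = queue.pop(0)        # (request_id, token_count)
        batch.append(req)
        total += req[1]
    return batch, total

queue = [("q1", 40), ("q2", 300), ("q3", 25), ("q4", 700)]
batch, total = admit(queue, optimal_batch_size=400)
print(batch, total)  # admits q1, q2, q3 (365 tokens); q4 waits for the next batch
```

On a padding-removal engine, the admitted batch's forward pass costs at most the budget's worth of tokens, which is exactly the latency bound the composition above claims.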
Cost side¶
Padding removal is not free:
- Attention-kernel complexity — naive (B, S_max) attention is simpler than varlen; requires FlashAttention-family kernels with cu_seqlens / similar packed-sequence APIs.
- Batch composition overhead — the scheduler now has to track per-sequence boundaries, cumulative lengths, and per-sequence positions across batches.
- Tooling / debugging — the shape is no longer (B, S); sequence-level profiling requires boundary-aware tooling.
For high-skew, latency-sensitive, memory-bound workloads — exactly the query-side embedding case — the economic win dwarfs the cost.
Seen in¶
- 2025-12-18 Voyage AI / MongoDB — Token-count-based batching — padding removal in vLLM is framed as the "key technique that enables effective batching" for Voyage AI's query-embedding workload; combined with token-count batching to deliver 50 % GPU-inference latency reduction with 3× fewer GPUs (sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).