
Padding removal (variable-length inference)

Definition

Padding removal (also called variable-length processing) is the inference-engine technique of serving a batch of variable-length transformer inputs as a single concatenated super-sequence instead of padding each sequence to the batch's longest length. Attention masks + position indices restrict each sequence's attention to its own tokens, so the compute cost tracks the actual token count rather than the padded rectangle (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).

Traditional (B, S_max) batching and its waste

Classical inference engines batch requests into tensors of shape (B, S_max), where B is the batch size and S_max is the length of the longest sequence in the batch. All shorter sequences are padded with special <PAD> tokens so the tensors line up for uniform GPU kernels.

Consequence: inference time scales with B × S_max, not with the actual token count. When the token-length distribution is highly skewed (few long sequences, many short — the canonical pattern for search / retrieval / recommendation queries), the pad tokens "do no useful work but still consume compute and memory bandwidth, so latency scales with B × S_max instead of the actual token count." In Voyage AI's framing:

  • Wasted compute on pad tokens can be the majority of the batch's GPU work at high skew.
  • Tail latency is inflated — the longest sequence sets the batch's wall-clock time; one outlier slows everybody.
  • Memory bandwidth for pad tokens is real — they move through the pipeline like useful tokens do.
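The waste is easy to quantify. A minimal sketch with a made-up skewed batch (the lengths are illustrative, not from the source):

```python
# Hypothetical skewed batch: many short queries, one long outlier.
lengths = [12, 9, 15, 11, 512]     # tokens per sequence
B, S_max = len(lengths), max(lengths)

padded_tokens = B * S_max          # what (B, S_max) batching computes over
real_tokens = sum(lengths)         # what the model actually needs
waste = 1 - real_tokens / padded_tokens

print(padded_tokens, real_tokens)                       # 2560 559
print(f"{waste:.0%} of compute spent on <PAD> tokens")  # 78% ...
```

One 512-token outlier forces every 10-ish-token query to be padded to 512, so roughly four-fifths of the batch's GPU work is pad tokens.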

The super-sequence primitive

Padding removal restructures the batch: concatenate all active sequences end-to-end into one long super-sequence of length

T = Σ token_count_i

and hand the inference engine a combined tensor of shape (1, T) (or equivalent). Per-sequence boundaries are preserved via two bookkeeping structures that travel with the super-sequence:

  • Attention masks — block tokens from one sequence from attending to tokens of another. Usually implemented as sequence-ID vectors consumed by fused attention kernels (FlashAttention family variants with "varlen" support).
  • Position indices — per-sequence position IDs so each sequence's positional encoding restarts at 0 at its own start, not at the super-sequence start.
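A minimal sketch of the packing step and its two bookkeeping structures (pure Python; real engines build these as tensors and feed them to varlen attention kernels):

```python
from typing import List

def pack(batch: List[List[int]]):
    """Concatenate variable-length token sequences into one super-sequence.

    Returns the packed tokens plus the two bookkeeping structures:
    - seq_ids: which original sequence each packed token belongs to
      (consumed by the attention kernel to block cross-sequence attention)
    - positions: per-sequence position indices, restarting at 0 at each
      sequence boundary rather than running over the super-sequence
    """
    tokens, seq_ids, positions = [], [], []
    for i, seq in enumerate(batch):
        tokens.extend(seq)
        seq_ids.extend([i] * len(seq))
        positions.extend(range(len(seq)))
    return tokens, seq_ids, positions

batch = [[101, 7, 8, 102], [101, 9, 102]]
tokens, seq_ids, positions = pack(batch)
# tokens    = [101, 7, 8, 102, 101, 9, 102]   -> shape (1, T), T = 7
# seq_ids   = [0, 0, 0, 0, 1, 1, 1]
# positions = [0, 1, 2, 3, 0, 1, 2]
```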

The result: inference time now tracks T, so "GPU work aligned with what matters." No pad-token compute, no tail-latency inflation from one long sequence, no wasted memory bandwidth.

Engine support

Not all inference engines implement padding removal. Voyage AI names two that do:

  • vLLM — packed-sequence serving via paged attention + FlashAttention varlen kernels; canonical open-source engine for efficient transformer serving.
  • SGLang — similar packed-sequence support; designed for LLM serving with structured generation.

Engines without padding removal (e.g. Hugging Face Inference in its default configuration — the baseline Voyage AI compared against) pay the B × S_max cost unconditionally.

Why padding removal unlocks token-count batching

Token-count-based batching is only meaningful on an engine with padding removal. Without padding removal, inference time scales with B × S_max, so bounding the batch by Σ token_count does not bound the batch's forward-pass time. Padding removal aligns the GPU's work axis with the scheduler's budget axis, at which point gating batch admission by token count gates batch latency.

The two primitives compose:

padding removal  : inference time = T = Σ token_count_i
token-count batch: admit requests until Σ token_count ≤ optimal_batch_size

→ batch forward-pass latency ≤ inference-time-at-optimal_batch_size (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
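The admission side can be sketched as a greedy loop (the request lengths and budget below are made up; this is not Voyage AI's scheduler):

```python
from typing import List, Tuple

def admit(queue: List[int], token_budget: int) -> Tuple[List[int], int]:
    """Greedy admission: take requests from the queue until the batch's
    total token count would exceed the budget. With padding removal the
    forward pass scales with sum(lengths), so this budget directly caps
    the batch's forward-pass latency."""
    batch, total = [], 0
    for req_len in queue:
        if total + req_len > token_budget:
            break
        batch.append(req_len)
        total += req_len
    return batch, total

batch, total = admit([100, 300, 250, 400], token_budget=700)
# batch = [100, 300, 250], total = 650
# the 400-token request waits for the next batch instead of inflating this one
```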

Cost side

Padding removal is not free:

  • Attention-kernel complexity — naive (B, S_max) attention is simpler than varlen. Requires FlashAttention-family kernels with cu_seqlens / similar packed-sequence APIs.
  • Batch composition overhead — the scheduler now has to track per-sequence boundaries, cumulative lengths, per-sequence positions across batches.
  • Tooling / debugging — shape is no longer (B, S); sequence-level profiling requires boundary-aware tooling.
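The cu_seqlens structure mentioned above is just the cumulative sequence lengths, marking each sequence's start offset in the super-sequence; a minimal construction sketch:

```python
import itertools

lengths = [4, 3, 5]  # tokens per sequence in the packed batch (illustrative)

# cu_seqlens[i] is the start offset of sequence i in the super-sequence;
# the final entry is the total token count T. Sequence i occupies the
# packed slice cu_seqlens[i] : cu_seqlens[i + 1].
cu_seqlens = [0] + list(itertools.accumulate(lengths))
# cu_seqlens = [0, 4, 7, 12]
```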

For high-skew, latency-sensitive, memory-bound workloads — exactly the query-side embedding case — the economic win dwarfs the cost.
