
vLLM

vLLM is an open-source, high-throughput, low-latency inference and serving engine for large language models and transformer-based embedding models. It originated as an academic project at UC Berkeley's Sky Computing Lab (Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention", SOSP 2023) and is now an active community project at github.com/vllm-project/vllm.

Properties relevant to system design

  • PagedAttention — manages the KV cache in non-contiguous fixed-size blocks, eliminating memory fragmentation and enabling near-zero-waste KV cache for variable-length concurrent sequences.
  • Continuous batching — new requests are inserted into the in-flight batch at each decoding step as individual requests complete, instead of waiting for the whole batch to finish before the next one starts.
  • Padding removal / variable-length attention — via FlashAttention-family varlen kernels, allows sequences of different lengths in the same batch to be concatenated into a super-sequence without (B, S_max) padding waste.
  • Tensor / pipeline / expert parallelism — multi-GPU execution of large models out of the box.
  • OpenAI-compatible API — drop-in replacement for OpenAI API consumers.
  • Quantisation support — FP16 / BF16 / FP8 / INT8 / AWQ / GPTQ / FP4 depending on hardware.
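The PagedAttention bookkeeping described above can be illustrated with a toy allocator. This is a sketch of the idea only, not vLLM's real internals: the class names, the block size of 16, and the `block_table` layout are assumptions for illustration.

```python
# Toy sketch of PagedAttention-style KV-cache bookkeeping (illustrative;
# names and block size are assumptions, not vLLM's actual implementation).
BLOCK_SIZE = 16  # tokens stored per fixed-size KV-cache block

class BlockAllocator:
    """Pool of physical KV-cache blocks; sequences borrow and return them."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)

class Sequence:
    """Maps a growing token sequence onto non-contiguous physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block i -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new block only when the current one is full (or none exists);
        # at most one partially filled block per sequence, so waste < BLOCK_SIZE.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

alloc = BlockAllocator(num_blocks=64)
seq = Sequence(alloc)
for _ in range(40):               # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
print(len(seq.block_table))       # 3: blocks need not be contiguous in memory
```

Because each sequence wastes at most one partially filled block, many variable-length sequences can share the pool with near-zero fragmentation, which is the property the bullet above describes.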

Role in embedding inference

Although most widely associated with LLM text generation, vLLM also serves transformer-based embedding models — the encoder-only use case that drives search / retrieval / recommendation workloads. Padding removal is the critical property here: query-embedding workloads have highly skewed token-length distributions, so a (B, S_max) padded batch wastes most of the GPU's work on padding tokens. Combined with token-count-based batching driven by an external scheduler, vLLM's padding removal lets the inference system align GPU work with the actual token count and approach the saturation point at which MFU peaks.
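The padding-waste arithmetic and the token-count-based batching mentioned above can be sketched in a few lines. The length values, the token budget, and the `token_count_batches` helper are hypothetical illustrations, not anything from vLLM or the cited sources:

```python
# Illustrative arithmetic: work wasted by (B, S_max) padding for a skewed
# length distribution, vs. varlen packing (which computes only real tokens).
lengths = [8, 12, 9, 480]        # hypothetical query token lengths in one batch
B, S_max = len(lengths), max(lengths)
padded_tokens = B * S_max        # 4 * 480 = 1920 token slots computed
real_tokens = sum(lengths)       # 509 tokens of useful work
waste = 1 - real_tokens / padded_tokens
print(f"{waste:.0%} of padded work wasted")   # -> 73% of padded work wasted

def token_count_batches(lengths: list[int], budget: int) -> list[list[int]]:
    """Greedy sketch of token-count-based batching: close a batch when
    adding the next request would exceed the per-batch token budget,
    so every batch carries a similar amount of real GPU work."""
    batches, cur, cur_tokens = [], [], 0
    for n in lengths:
        if cur and cur_tokens + n > budget:
            batches.append(cur)
            cur, cur_tokens = [], 0
        cur.append(n)
        cur_tokens += n
    if cur:
        batches.append(cur)
    return batches

print(token_count_batches(lengths, budget=500))  # -> [[8, 12, 9], [480]]
```

Batching by token count rather than by request count is what keeps each batch near the GPU's saturation point once padding no longer inflates the work.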

Seen in

  • 2025-12-18 Voyage AI / MongoDB — Token-count-based batching — vLLM replaces Hugging Face Inference as the inference engine for Voyage AI's query embedding pipeline. Padding removal is named as "the key technique that enables effective batching". Engine-level result: "vLLM reduces GPU inference time by up to ~20 ms for most of our models." Combined with token-count batching, it drives the headline 50 % / 3×-GPU / up-to-8×-throughput win on voyage-3-large and 6 other models. (sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference)
  • 2026-04-16 Cloudflare — Building the foundation for running extra-large language models — vLLM named as the reference baseline Cloudflare's proprietary Infire engine measures against. Framed as memory-overhead-heavier than Infire: "While already having much lower GPU memory overhead than vLLM, we optimized Infire even further." Claim: Infire runs Llama 4 Scout on 2× H200 with >56 GiB KV room (~1.2M tokens), and Kimi K2.5 on 8× H100 (not H200) with >30 GiB KV room — "In both cases you would have trouble even booting vLLM in the first place." A one-sided comparison (Cloudflare's framing, no third-party benchmark); notable as an industrial claim that a bespoke engine beats vLLM's activation-memory discipline by margins wide enough to force a hardware-class difference. (sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)