vLLM
vLLM is an open-source, high-throughput, low-latency inference and serving engine for large language models and transformer-based embedding models. It began as an academic project at UC Berkeley's Sky Computing Lab (Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention", SOSP 2023) and is now an active community project at github.com/vllm-project/vllm.
Properties relevant to system design
- PagedAttention: manages the KV cache in non-contiguous, fixed-size blocks. This avoids external fragmentation and confines waste to the last partially filled block of each sequence, giving a near-zero-waste KV cache for variable-length concurrent sequences.
- Continuous batching: new requests are inserted into the in-flight batch at each decoding step as earlier requests complete, instead of waiting for the whole batch to finish before starting the next.
- Padding removal / variable-length attention: via FlashAttention-family varlen kernels, sequences of different lengths in the same batch are concatenated into one super-sequence, avoiding (B, S_max) padding waste.
- Tensor / pipeline / expert parallelism: multi-GPU execution of large models out of the box.
- OpenAI-compatible API: serves OpenAI-style HTTP endpoints, so existing OpenAI API consumers can use it as a drop-in replacement.
- Precision / quantisation support: FP16 / BF16 / FP8 / INT8 / AWQ / GPTQ / FP4, depending on hardware.
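The block-based KV cache bookkeeping described in the PagedAttention bullet can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's implementation: `BLOCK_SIZE`, `BlockAllocator`, and `Sequence` are hypothetical names chosen for illustration, and the sketch tracks only block indices, not the cached tensors themselves.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


@dataclass
class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared free list."""
    num_blocks: int
    free: list[int] = field(init=False)

    def __post_init__(self) -> None:
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def wasted_slots(self) -> int:
        # Unused slots exist only in the last, partially filled block.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


allocator = BlockAllocator(num_blocks=8)
a, b = Sequence(), Sequence()
# Interleave two growing sequences so their blocks end up non-contiguous.
for i in range(20):
    a.append_token(allocator)
    if i < 5:
        b.append_token(allocator)

print(a.block_table, a.wasted_slots())
print(b.block_table, b.wasted_slots())
```

Because each sequence claims a block only when it needs one, the two sequences share one pool without reserving space for a maximum length up front, and a finished sequence's blocks return to the pool through `release`.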
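The scheduling difference described in the continuous batching bullet can be shown with a small step-level simulation. This is a sketch under simplifying assumptions, not vLLM's scheduler: every request generates at least one token, each decoding step emits exactly one token per running request, and prefill cost is ignored. The function names are hypothetical.

```python
from collections import deque


def continuous_batching(requests, max_batch_size):
    """Simulate step-level scheduling.

    requests: list of (request_id, tokens_to_generate) in arrival order.
    Returns {request_id: decoding step at which it finished}.
    """
    waiting = deque(requests)
    running = {}  # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Fill any free slots before every decoding step.
        while waiting and len(running) < max_batch_size:
            rid, n = waiting.popleft()
            running[rid] = n
        step += 1
        # One decoding step emits one token per running request.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished_at[rid] = step
    return finished_at


def static_batching(requests, max_batch_size):
    """Baseline: a batch runs until its longest request finishes."""
    finished_at = {}
    step = 0
    for i in range(0, len(requests), max_batch_size):
        batch = requests[i:i + max_batch_size]
        step += max(n for _, n in batch)
        for rid, _ in batch:
            finished_at[rid] = step
    return finished_at


requests = [("a", 2), ("b", 8), ("c", 3), ("d", 2)]
continuous = continuous_batching(requests, max_batch_size=2)
static = static_batching(requests, max_batch_size=2)
print(continuous)
print(static)
```

In this toy workload the short requests no longer wait behind the long one: request `c` starts as soon as `a` frees its slot, and all four requests finish in 8 steps instead of 11.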
Role in embedding inference
Although most widely associated with LLM text generation, vLLM also serves transformer-based embedding models, the encoder-only use case that drives search / retrieval / recommendation workloads. Padding removal is the critical property for this use case: query-embedding workloads have highly skewed token-length distributions, so (B, S_max) padding spends most of the GPU's work on pad tokens. Combined with token-count-based batching driven by an external scheduler, vLLM's padding removal lets the inference system align GPU work with the actual token count and approach the saturation point at which model FLOPs utilization (MFU) peaks.
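The interaction between token-count-based batching and padding removal can be sketched in a few lines. This is an illustrative greedy batcher, not Voyage AI's scheduler or vLLM's internals: `token_budget_batches`, `padded_tokens`, and the 128-token budget are assumptions made for the example. `cu_seqlens` here mirrors the cumulative-offset array that FlashAttention-style varlen kernels take to locate sequence boundaries in a packed batch.

```python
from itertools import accumulate


def token_budget_batches(lengths, max_tokens):
    """Greedily group requests (in arrival order) under a token budget."""
    batches, current, used = [], [], 0
    for n in lengths:
        if n > max_tokens:
            raise ValueError("request exceeds the token budget")
        if current and used + n > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        batches.append(current)
    return batches


def cu_seqlens(batch):
    """Cumulative offsets marking sequence boundaries in a packed batch."""
    return [0, *accumulate(batch)]


def padded_tokens(batch):
    """Token slots a (B, S_max) padded layout would process."""
    return len(batch) * max(batch)


lengths = [5, 7, 120, 6, 9, 4]  # skewed query lengths, in tokens
batches = token_budget_batches(lengths, max_tokens=128)
for batch in batches:
    print(batch, cu_seqlens(batch), sum(batch), padded_tokens(batch))
```

For the batch `[120, 6]`, the packed layout processes 126 tokens while a padded layout would process 240 slots. Padding all six queries into a single (B, S_max) batch would cost 720 slots for 151 real tokens, which is the waste the paragraph above describes.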
Seen in
- 2025-12-18 Voyage AI / MongoDB: Token-count-based batching. vLLM replaces Hugging Face Inference as the inference engine for Voyage AI's query embedding pipeline. Padding removal is named as "the key technique that enables effective batching". Engine-level result: "vLLM reduces GPU inference time by up to ~20 ms for most of our models." Combined with token-count-based batching, this drives the headline 50% / 3×-GPU / up-to-8×-throughput win on voyage-3-large and 6 other models. (sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference)
- 2026-05-08 Databricks × Superhuman: 200K-QPS grammar-correction migration. The joint Databricks Model Serving / Superhuman post describes Superhuman's pre-migration setup as "a DIY serving stack built on vLLM, alongside internal tools for training and model management", running on L40S GPUs at 200K+ QPS peak with sub-1-second p99. After the serving engine moved to Databricks Model Serving on H100, vLLM stayed in the toolchain as the prequantisation library: "Superhuman's ML team prequantized the checkpoint to FP8 using vLLM's online quantization library, producing a compressed-tensor format checkpoint that Databricks loaded for serving." This is the wiki's reference data point for vLLM acting as an FP8 prequantisation toolchain even when a team moves to a different engine for production serving. Pre-migration pain points named: "Each new iteration of the model required months of manual performance tuning to onboard", "capacity planning, performance tuning, and autoscaling consuming time from a lean team". The post thus frames vLLM-on-L40S as a stack that scaled but burned ML-platform-team time at 200K QPS. (sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together)
- 2026-04-16 Cloudflare: Building the foundation for running extra-large language models. vLLM is named as the reference baseline that Cloudflare's proprietary Infire engine measures against, and is framed as having higher GPU memory overhead than Infire: "While already having much lower GPU memory overhead than vLLM, we optimized Infire even further." Claim: Infire runs Llama 4 Scout on 2× H200 with >56 GiB of KV room (~1.2M tokens) and Kimi K2.5 on 8× H100 (not H200) with >30 GiB of KV room: "In both cases you would have trouble even booting vLLM in the first place." This is a one-sided comparison (Cloudflare's framing, no third-party benchmark). It is notable as an industrial claim that a bespoke engine beats vLLM's activation-memory discipline by margins wide enough to force a hardware-class difference. (sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Related
- Sibling engine: systems/sglang — also supports padding removal / packed-sequence serving; more focused on structured LLM generation.
- Replaced: systems/huggingface-inference as Voyage AI's baseline engine.
- Concepts: concepts/padding-removal-inference, concepts/token-count-based-batching, concepts/memory-bound-vs-compute-bound, concepts/saturation-point-inference, concepts/model-flops-utilization.