SGLang¶
SGLang (Structured Generation Language) is an open-source LLM serving engine from UC Berkeley's Sky Computing Lab (github.com/sgl-project/sglang, arXiv:2312.07104) focused on high-throughput structured generation — complex LLM workflows with branching control flow, constrained decoding, and prefix sharing.
Properties relevant to system design¶
- RadixAttention — prefix-aware KV cache sharing across requests, large wins on workloads where many requests share prompt prefixes (few-shot examples, system prompts, retrieval augmentation).
- Padding removal / variable-length attention — the same FlashAttention-varlen class of kernels that vLLM supports: a concatenated super-sequence of length Σᵢ token_count_i instead of a padded (B, S_max) layout.
- Structured generation primitives — JSON mode, regex-constrained decoding, branching programs.
- Continuous batching, tensor parallelism, quantisation — production-serving-engine table stakes shared with vLLM / TensorRT-LLM.
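The padding-removal point above can be made concrete with a small sketch of varlen packing. This is illustrative only (the function name `pack_varlen` is made up, and real engines like vLLM and SGLang do this inside fused CUDA kernels); it just shows the cu_seqlens offset array that FlashAttention-varlen-style kernels consume, and how much work (B, S_max) padding would waste.

```python
# Sketch of padding removal: concatenate variable-length sequences into one
# super-sequence and track per-sequence boundaries via cumulative offsets
# (the "cu_seqlens" convention used by FlashAttention-varlen-class kernels).

def pack_varlen(seq_lens):
    """Return (packed length, cumulative-offset array) for a batch of lengths."""
    cu_seqlens = [0]
    for n in seq_lens:
        cu_seqlens.append(cu_seqlens[-1] + n)
    return cu_seqlens[-1], cu_seqlens

# Four requests of very different lengths:
seq_lens = [512, 17, 1024, 33]
packed_len, cu_seqlens = pack_varlen(seq_lens)

padded_tokens = len(seq_lens) * max(seq_lens)  # (B, S_max) layout: 4 * 1024
print(packed_len, cu_seqlens)   # 1586 [0, 512, 529, 1553, 1586]
print(padded_tokens - packed_len)  # 2510 padding tokens avoided
```

The attention kernel then runs over the 1586 real tokens, using cu_seqlens to keep each request's attention confined to its own span, rather than computing over a 4096-token padded tensor.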
Relevance¶
On the wiki, SGLang appears alongside vLLM as one of the two named engines that support padding removal, enabling token-count-based batching for short-request embedding / LLM inference workloads.
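Token-count-based batching, the technique padding removal enables, can be sketched as a greedy packer that caps each batch by total tokens rather than by request count. All names here (`greedy_token_batches`, `token_budget`) are illustrative, not SGLang or vLLM API; this is a minimal sketch of the batching policy, not any engine's scheduler.

```python
# Minimal sketch of token-count-based batching: pack requests into batches
# whose *total token count* stays under a budget, so many short requests
# share a batch while a long request effectively gets one to itself.

def greedy_token_batches(token_counts, token_budget):
    """Greedily group request indices so each batch fits within token_budget."""
    batches, current, current_tokens = [], [], 0
    for i, n in enumerate(token_counts):
        if current and current_tokens + n > token_budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n
    if current:
        batches.append(current)
    return batches

print(greedy_token_batches([100, 200, 4000, 150, 300], token_budget=4096))
# [[0, 1], [2], [3, 4]]
```

Without padding removal, the 4000-token request would force every batch-mate to be padded to its length; with it, the budget bounds actual compute per batch.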
Seen in¶
- 2025-12-18 Voyage AI / MongoDB — Token-count-based batching — named alongside vLLM as an inference engine supporting padding removal: "Padding removal, supported in inference engines like vLLM and SGLang, makes efficient batching possible." Voyage AI picked vLLM for production; SGLang cited as an equivalent choice for the same primitive. (sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference)
- 2026-04-16 Cloudflare — Building the foundation for running extra-large language models — SGLang HiCache named as one of two options (the other being LMCache) for cluster-wide shared KV cache above Mooncake Transfer Engine / Mooncake Store in Workers AI's serving stack: "When paired with LMCache or SGLang HiCache, the cache is shared across all nodes in the cluster, allowing a prefill node to identify and re-use a cache from a previous request that was originally pre-filled on a different node." (sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
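Both sightings above hinge on the same primitive: longest-prefix matching over token sequences to find reusable KV cache, whether in-process (RadixAttention) or cluster-wide (HiCache / LMCache). A minimal trie-based sketch, assuming a real radix tree would store KV-cache block handles at nodes and use compressed (radix) edges; the class names here are hypothetical.

```python
# Sketch of the longest-prefix lookup behind RadixAttention-style KV reuse.
# A cache hit of k tokens means k tokens of prefill can be skipped because
# their KV entries were already computed for an earlier request.

class PrefixNode:
    def __init__(self):
        self.children = {}  # token id -> PrefixNode

class PrefixCache:
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        """Record a served request's token sequence (and thus its KV cache)."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())

    def longest_prefix(self, tokens):
        """Number of leading tokens whose KV cache could be reused."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])              # e.g. system prompt + few-shot shots
print(cache.longest_prefix([1, 2, 3, 9]))  # 3 -> three tokens of prefill skipped
```

The win scales with how much of the workload shares prefixes (system prompts, few-shot examples, retrieved context), which is exactly the workload class called out under RadixAttention above.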
Stub — deeper SGLang-specific architectural content not yet ingested; expand when a dedicated SGLang post lands.