SGLang¶
SGLang (Structured Generation Language) is an open-source LLM serving engine from UC Berkeley's Sky Computing Lab (github.com/sgl-project/sglang, arXiv:2312.07104) focused on high-throughput structured generation — complex LLM workflows with branching control flow, constrained decoding, and prefix sharing.
Properties relevant to system design¶
- RadixAttention — prefix-aware KV cache sharing across requests, large wins on workloads where many requests share prompt prefixes (few-shot examples, system prompts, retrieval augmentation).
- Padding removal / variable-length attention — the same FlashAttention-varlen class of kernels that vLLM supports: a concatenated super-sequence of length Σᵢ token_count_i instead of a padded (B, S_max) layout.
- Structured generation primitives — JSON mode, regex-constrained decoding, branching programs.
- Continuous batching, tensor parallelism, quantisation — production-serving-engine table stakes shared with vLLM / TensorRT-LLM.
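The padding-removal point above can be made concrete with a small sketch of varlen packing. This is illustrative only (the function name `pack_varlen` is made up, and real engines like vLLM and SGLang do this inside fused CUDA kernels); it just shows the cu_seqlens offset array that FlashAttention-varlen-style kernels consume, and how much work (B, S_max) padding would waste.

```python
# Sketch of padding removal: concatenate variable-length sequences into one
# super-sequence and track per-sequence boundaries via cumulative offsets
# (the "cu_seqlens" convention used by FlashAttention-varlen-class kernels).

def pack_varlen(seq_lens):
    """Return (packed length, cumulative-offset array) for a batch of lengths."""
    cu_seqlens = [0]
    for n in seq_lens:
        cu_seqlens.append(cu_seqlens[-1] + n)
    return cu_seqlens[-1], cu_seqlens

# Four requests of very different lengths:
seq_lens = [512, 17, 1024, 33]
packed_len, cu_seqlens = pack_varlen(seq_lens)

padded_tokens = len(seq_lens) * max(seq_lens)  # (B, S_max) layout: 4 * 1024
print(packed_len, cu_seqlens)   # 1586 [0, 512, 529, 1553, 1586]
print(padded_tokens - packed_len)  # 2510 padding tokens avoided
```

The attention kernel then runs over the 1586 real tokens, using cu_seqlens to keep each request's attention confined to its own span, rather than computing over a 4096-token padded tensor.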
Relevance¶
On the wiki, SGLang appears alongside vLLM as one of the two named engines that support padding removal, enabling token-count-based batching for short-request embedding / LLM inference workloads.
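Token-count-based batching, the technique padding removal enables, can be sketched as a greedy packer that caps each batch by total tokens rather than by request count. All names here (`greedy_token_batches`, `token_budget`) are illustrative, not SGLang or vLLM API; this is a minimal sketch of the batching policy, not any engine's scheduler.

```python
# Minimal sketch of token-count-based batching: pack requests into batches
# whose *total token count* stays under a budget, so many short requests
# share a batch while a long request effectively gets one to itself.

def greedy_token_batches(token_counts, token_budget):
    """Greedily group request indices so each batch fits within token_budget."""
    batches, current, current_tokens = [], [], 0
    for i, n in enumerate(token_counts):
        if current and current_tokens + n > token_budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n
    if current:
        batches.append(current)
    return batches

print(greedy_token_batches([100, 200, 4000, 150, 300], token_budget=4096))
# [[0, 1], [2], [3, 4]]
```

Without padding removal, the 4000-token request would force every batch-mate to be padded to its length; with it, the budget bounds actual compute per batch.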
Seen in¶
- 2025-12-18 Voyage AI / MongoDB — Token-count-based batching — named alongside vLLM as an inference engine supporting padding removal: "Padding removal, supported in inference engines like vLLM and SGLang, makes efficient batching possible." Voyage AI picked vLLM for production; SGLang cited as an equivalent choice for the same primitive. (sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference)
- 2026-04-16 Cloudflare — Building the foundation for running extra-large language models — SGLang HiCache named as one of two options (the other being LMCache) for cluster-wide shared KV cache above Mooncake Transfer Engine / Mooncake Store in Workers AI's serving stack: "When paired with LMCache or SGLang HiCache, the cache is shared across all nodes in the cluster, allowing a prefill node to identify and re-use a cache from a previous request that was originally pre-filled on a different node." (sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
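Both sightings above hinge on the same primitive: longest-prefix matching over token sequences to find reusable KV cache, whether in-process (RadixAttention) or cluster-wide (HiCache / LMCache). A minimal trie-based sketch, assuming a real radix tree would store KV-cache block handles at nodes and use compressed (radix) edges; the class names here are hypothetical.

```python
# Sketch of the longest-prefix lookup behind RadixAttention-style KV reuse.
# A cache hit of k tokens means k tokens of prefill can be skipped because
# their KV entries were already computed for an earlier request.

class PrefixNode:
    def __init__(self):
        self.children = {}  # token id -> PrefixNode

class PrefixCache:
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        """Record a served request's token sequence (and thus its KV cache)."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())

    def longest_prefix(self, tokens):
        """Number of leading tokens whose KV cache could be reused."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])              # e.g. system prompt + few-shot shots
print(cache.longest_prefix([1, 2, 3, 9]))  # 3 -> three tokens of prefill skipped
```

The win scales with how much of the workload shares prefixes (system prompts, few-shot examples, retrieved context), which is exactly the workload class called out under RadixAttention above.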
Stub — deeper SGLang-specific architectural content not yet ingested; expand when a dedicated SGLang post lands.