SYSTEM Cited by 2 sources

SGLang

SGLang (Structured Generation Language) is an open-source LLM serving engine from UC Berkeley's Sky Computing Lab (github.com/sgl-project/sglang, arXiv:2312.07104). It focuses on high-throughput structured generation: complex LLM workflows that combine branching control flow, constrained decoding, and prefix sharing.

Properties relevant to system design

  • RadixAttention: prefix-aware KV cache sharing across requests, with large wins on workloads where many requests share prompt prefixes (few-shot examples, system prompts, retrieval augmentation).
  • Padding removal / variable-length attention: uses the same class of FlashAttention varlen kernels that vLLM supports. The batch is one concatenated super-sequence of length Σ token_count_i rather than a padded (B, S_max) tensor.
  • Structured generation primitives: JSON mode, regex-constrained decoding, branching programs.
  • Continuous batching, tensor parallelism, quantisation: table stakes for a production serving engine, shared with vLLM / TensorRT-LLM.
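
The first two properties can be illustrated with a small sketch. It is not SGLang's implementation: the function names are hypothetical, the prefix lookup is a linear scan standing in for a radix tree over token IDs, and the packing function only mirrors the flat-tokens-plus-offsets layout that varlen attention kernels typically consume.

```python
from itertools import accumulate


def longest_cached_prefix(cached, request):
    """Length of the longest prefix of `request` shared with any cached sequence.

    Hypothetical helper: a real prefix-aware cache would walk a radix tree
    instead of scanning every cached sequence.
    """
    best = 0
    for seq in cached:
        n = 0
        for a, b in zip(seq, request):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best


def pack_varlen(seqs):
    """Concatenate token sequences into one flat list plus cumulative offsets.

    Sequence i occupies flat[offsets[i]:offsets[i + 1]], so no padding
    tokens are stored or attended over.
    """
    flat = [tok for seq in seqs for tok in seq]
    offsets = [0, *accumulate(len(s) for s in seqs)]
    return flat, offsets


def padded_slots(seqs):
    """Token slots a padded (B, S_max) layout would allocate."""
    return len(seqs) * max((len(s) for s in seqs), default=0)


if __name__ == "__main__":
    seqs = [[1, 2, 3], [4], [5, 6]]
    flat, offsets = pack_varlen(seqs)
    print(len(flat), "packed slots vs", padded_slots(seqs), "padded slots")
    print("offsets:", offsets)
    print("reusable prefix:", longest_cached_prefix([[1, 2, 3, 4]], [1, 2, 3, 7]))
```

Only the tokens after the shared prefix need fresh KV computation, and the packed layout costs Σ token_count_i slots instead of B × S_max.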

Relevance

On the wiki, SGLang appears alongside vLLM as one of the two named engines that support padding removal. Because padding removal makes batch cost scale with total token count rather than B × S_max, it enables token-count-based batching for short-request embedding / LLM inference workloads.
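
As a rough sketch of what token-count-based batching means, the greedy batcher below groups requests by a total token budget rather than a fixed request count. The function name, the budget parameter, and the first-fit policy are assumptions for illustration; neither SGLang's nor vLLM's scheduler is claimed to work this way.

```python
def batch_by_token_budget(lengths, max_tokens):
    """Group request indices into batches whose total tokens stay within max_tokens.

    Hypothetical greedy policy: requests are taken in arrival order and a new
    batch is started whenever the next request would exceed the budget.
    """
    batches, current, used = [], [], 0
    for idx, n in enumerate(lengths):
        if n > max_tokens:
            raise ValueError(f"request {idx} has {n} tokens, budget is {max_tokens}")
        if current and used + n > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(idx)
        used += n
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Six short requests with these token counts, 10-token budget per batch.
    print(batch_by_token_budget([5, 3, 8, 2, 2, 9], max_tokens=10))
```

With padded batches, one long request inflates the cost of every short request beside it; a token budget keeps the work per batch roughly constant regardless of how lengths are mixed.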

Seen in

Stub — deeper SGLang-specific architectural content not yet ingested; expand when a dedicated SGLang post lands.
