LMCache¶
Overview¶
LMCache (github.com/LMCache/LMCache) is an open-source KV-cache layer for LLM serving engines that provides cluster-wide prefix / KV reuse across nodes and across inference engines. When paired with transport and persistence substrates such as Mooncake Transfer Engine and Mooncake Store, LMCache extends the vLLM-style PagedAttention KV cache beyond a single node.
Cloudflare Workers AI uses LMCache (or alternatively SGLang HiCache) as the software layer that exposes the cluster-shared cache to the serving engine: "When paired with LMCache or SGLang HiCache, the cache is shared across all nodes in the cluster, allowing a prefill node to identify and re-use a cache from a previous request that was originally pre-filled on a different node." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
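For concreteness, a hedged sketch of how a serving engine is commonly wired to LMCache: recent vLLM releases accept a KV-transfer config naming an LMCache connector. The flag and connector names below are from memory of recent vLLM versions and may differ in the deployment the post describes; the model name is a placeholder.

```shell
# Illustrative config sketch only: launch vLLM with LMCache as the external
# KV-cache layer. Connector/flag names vary across vLLM versions.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}'
```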
Role in the Workers AI stack¶
The layer split in Workers AI's multi-GPU / multi-node serving stack:
| Layer | Component | Role |
|---|---|---|
| Serving engine | Infire / vLLM / SGLang | per-replica forward-pass execution, per-replica KV management |
| Cross-node cache layer | LMCache / SGLang HiCache | lookup and share KV across replicas |
| Transport | Mooncake Transfer Engine | RDMA KV block transfer |
| Persistence | Mooncake Store | NVMe cold tier |
Without a cache layer like LMCache, a prefill node has no mechanism to discover that another node previously pre-filled the same prefix, so the cross-node sharing opportunity is invisible. LMCache exposes the lookup + policy API that turns per-node KV pools into a cluster-wide hit surface.
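A minimal sketch of the mechanism this describes: hash aligned token-prefix blocks into a chain of keys, publish which node owns each block, and let a prefill node ask for the longest matching prefix before recomputing. All names here (`ClusterKVIndex`, `prefix_key`, etc.) are hypothetical illustrations, not LMCache's actual API; in the real stack the hit blocks would then move over Mooncake Transfer Engine.

```python
from __future__ import annotations

import hashlib

# Hypothetical sketch of a cluster-wide KV lookup layer. Names are
# illustrative, not LMCache's API.

def prefix_key(token_ids: list[int], block_size: int = 16) -> list[str]:
    """Hash each aligned token-prefix block into a chain of block keys.

    The hash is cumulative, so a block's key depends on every token
    before it: two requests share keys exactly as far as their prefixes agree.
    """
    keys, h = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % block_size  # drop partial tail
    for i in range(0, full, block_size):
        h.update(str(token_ids[i:i + block_size]).encode("utf-8"))
        keys.append(h.hexdigest()[:16])
    return keys

class ClusterKVIndex:
    """Maps block keys to the node holding the KV blocks (cluster-wide view)."""

    def __init__(self) -> None:
        self._owner: dict[str, str] = {}

    def publish(self, keys: list[str], node: str) -> None:
        """After a prefill, advertise which node owns each block."""
        for k in keys:
            self._owner.setdefault(k, node)

    def longest_hit(self, keys: list[str]) -> tuple[int, str | None]:
        """Return (#matching leading blocks, owning node) for the best hit."""
        node, n = None, 0
        for k in keys:
            owner = self._owner.get(k)
            if owner is None:
                break
            node, n = owner, n + 1
        return n, node
```

The chained hashing is the property that makes cross-node prefix reuse safe: a key match guarantees the entire token prefix up to that block is identical, so the fetched KV blocks are valid for the new request.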
Consequence: session-aware routing becomes optional within a cluster¶
From the same Cloudflare post: "This eliminates the need for session aware routing within a cluster and allows us to load balance the traffic much more evenly."
With LMCache (+ Mooncake) in place, two same-prefix requests can land on different nodes and still hit shared cache — so the load balancer is free to optimise for node load, not session affinity. x-session-affinity hints still matter across clusters / regions, but not within.
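The routing consequence fits in a few lines. This is an illustrative sketch, not Cloudflare's balancer: with session-affinity routing a request follows its session's pinned node even when that node is hot; with a cluster-shared cache the balancer simply picks the least-loaded node, because any node can serve the cached prefix.

```python
# Illustrative sketch: routing with vs. without intra-cluster session affinity.
# All function and variable names here are hypothetical.

def route_with_affinity(session_pin: dict[str, str], session: str,
                        loads: dict[str, int]) -> str:
    """Affinity routing: the session's pinned node serves, regardless of load."""
    return session_pin.get(session) or min(loads, key=loads.get)

def route_by_load(loads: dict[str, int]) -> str:
    """Shared-cache routing: any node can hit the cache, so pick the coolest."""
    return min(loads, key=loads.get)
```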
Related primitives¶
- KV cache — what's being shared.
- Mooncake Transfer Engine — paired RDMA transport.
- Mooncake Store — paired NVMe cold tier.
- SGLang HiCache — sibling cache layer; the Cloudflare post names LMCache or SGLang HiCache as options.
- vLLM PagedAttention — the per-node KV-cache management LMCache extends cluster-wide.
Caveats¶
- The post mentions LMCache only in passing: no numbers, no integration detail, no per-lookup latency characterisation.
- No disclosure of which option Cloudflare chose — LMCache vs SGLang HiCache — or whether both are in use on different paths.
- Eviction policy / capacity / consistency model not discussed for the cluster-wide view.
Seen in¶
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — named (with SGLang HiCache as alternative) as the software layer that exposes cluster-wide shared KV cache to the serving engine.