
LMCache

Overview

LMCache (github.com/LMCache/LMCache) is an open-source KV-cache layer for LLM serving engines that enables prefix/KV reuse across nodes and across inference engines. When paired with transport and persistence substrates such as Mooncake Transfer Engine and Mooncake Store, LMCache extends vLLM-style PagedAttention KV caching beyond a single node.

Cloudflare Workers AI uses LMCache (or alternatively SGLang HiCache) as the software layer that exposes the cluster-shared cache to the serving engine: "When paired with LMCache or SGLang HiCache, the cache is shared across all nodes in the cluster, allowing a prefill node to identify and re-use a cache from a previous request that was originally pre-filled on a different node." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Role in the Workers AI stack

The layer split in Workers AI's multi-GPU / multi-node serving:

Layer                    Component                   Role
Serving engine           Infire / vLLM / SGLang      per-replica forward-pass execution, per-replica KV management
Cross-node cache layer   LMCache / SGLang HiCache    lookup and sharing of KV across replicas
Transport                Mooncake Transfer Engine    RDMA KV-block transfer
Persistence              Mooncake Store              NVMe cold tier
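The layer split above can be sketched as a tiered read path. Every name in this sketch is hypothetical, not a real LMCache or Mooncake API:

```python
# Hypothetical read path through the layer stack: hot KV held by a peer
# replica is fetched over the transport tier; failing that, the NVMe cold
# tier is tried; failing both, the serving engine must prefill from scratch.
def fetch_kv(key, local_pool, peer_index, fetch_from_peer, cold_store):
    if key in local_pool:                          # serving engine: per-replica KV
        return local_pool[key], "local"
    node = peer_index.get(key)                     # cache layer: cross-node lookup
    if node is not None:
        return fetch_from_peer(node, key), "peer"  # transport: RDMA block transfer
    blocks = cold_store.get(key)                   # persistence: NVMe cold tier
    if blocks is not None:
        return blocks, "cold"
    return None, "miss"                            # full prefill required

# Stub cluster state: node-b already holds the hot KV for "prefix-1".
peers = {"node-b": {"prefix-1": b"kv-blocks"}}
hit, tier = fetch_kv(
    "prefix-1",
    local_pool={},
    peer_index={"prefix-1": "node-b"},
    fetch_from_peer=lambda node, key: peers[node][key],
    cold_store={},
)
assert (hit, tier) == (b"kv-blocks", "peer")
```

The point of the tiering is that a miss at one layer degrades gracefully to the next rather than immediately forcing a full prefill.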

Without a cache layer like LMCache, a prefill node has no way to discover that another node has already pre-filled the same prefix; the cross-node sharing opportunity is invisible. LMCache exposes the lookup and policy API that turns per-node KV pools into a cluster-wide hit surface.
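A minimal sketch of what such a lookup surface could look like; the class and method names here are illustrative, not LMCache's actual API:

```python
import hashlib

class ClusterKVIndex:
    """Cluster-wide map from a token-prefix hash to the node holding its KV."""

    def __init__(self):
        self._index = {}

    @staticmethod
    def _prefix_hash(token_ids):
        # Hash the token prefix so same-prefix requests map to the same key.
        return hashlib.sha256(repr(tuple(token_ids)).encode()).hexdigest()

    def publish(self, token_ids, node_id):
        """A node announces that it has pre-filled this prefix."""
        self._index[self._prefix_hash(token_ids)] = node_id

    def lookup(self, token_ids):
        """Return the node holding KV for this prefix, or None on a cold miss."""
        return self._index.get(self._prefix_hash(token_ids))

index = ClusterKVIndex()
index.publish([1, 2, 3, 4], node_id="node-a")
# The same prefix landing on node-b can now discover node-a's cache:
assert index.lookup([1, 2, 3, 4]) == "node-a"
assert index.lookup([9, 9]) is None  # cold prefix: full prefill needed
```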

Consequence: session-aware routing becomes optional within a cluster

"This eliminates the need for session aware routing within a cluster and allows us to load balance the traffic much more evenly."

With LMCache (+ Mooncake) in place, two same-prefix requests can land on different nodes and still hit shared cache — so the load balancer is free to optimise for node load, not session affinity. x-session-affinity hints still matter across clusters / regions, but not within.
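With the cache shared inside the cluster, routing reduces to a pure load decision. A toy sketch (not Cloudflare's actual balancer):

```python
def route(node_loads, session_affinity_hint=None):
    """Pick a node for the next request.

    node_loads: node id -> in-flight request count. With a cluster-shared
    KV cache, the session-affinity hint can be ignored within the cluster:
    any node can hit the shared prefix, so we take the least-loaded one.
    """
    return min(node_loads, key=node_loads.get)

loads = {"node-a": 12, "node-b": 3, "node-c": 7}
# Even if the session previously hit node-a, node-b gets the request:
assert route(loads, session_affinity_hint="node-a") == "node-b"
```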

  • KV cache — what's being shared.
  • Mooncake Transfer Engine — paired RDMA transport.
  • Mooncake Store — paired NVMe cold tier.
  • SGLang HiCache — sibling cache layer; the Cloudflare post names LMCache or SGLang HiCache as options.
  • vLLM PagedAttention — the per-node KV-cache management LMCache extends cluster-wide.

Caveats

  • The post mentions LMCache only in passing: no numbers, no integration detail, no per-lookup latency characterisation.
  • No disclosure of which option Cloudflare chose — LMCache vs SGLang HiCache — or whether both are in use on different paths.
  • Eviction policy / capacity / consistency model not discussed for the cluster-wide view.
