Mooncake Store¶
Overview¶
Mooncake Store (github.com/kvcache-ai/Mooncake) is the cold / warm tier of Moonshot AI's Mooncake KV-cache stack, extending the KV cache from GPU VRAM onto NVMe storage. It complements Mooncake Transfer Engine (the fast-path RDMA transfer fabric) — Transfer Engine moves KV blocks, Store is where cold KV blocks live. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Role¶
"Mooncake Store also allows us to extend the cache beyond GPU VRAM, and leverage NVMe storage. This extends the time that sessions remain in cache, improving our cache hit ratio and allowing us to handle more traffic and offer better performance to users."
Conceptually the KV-cache memory hierarchy looks like:
| Tier | Latency | Capacity | Role |
|---|---|---|---|
| GPU HBM | ~ns | tens of GB | hot: actively decoding |
| Host DRAM | ~100 ns – µs | hundreds of GB | warm: paused sessions |
| Mooncake Store (NVMe) | ~ms | multiple TB | cold: idle sessions + shared long-lived prefixes |
Without a persistent NVMe tier, cache entries die with the GPU process; sessions that pause more than a short window lose their warm KV state and pay a re-prefill cost on resumption. Mooncake Store keeps the KV alive across those windows.
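The demote-on-eviction behaviour this hierarchy implies can be sketched in a few lines. This is an illustrative LRU model only — the class, slot counts, and promotion policy are assumptions for exposition, not Mooncake Store's actual API or eviction policy (the source discloses neither):

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy sketch of an HBM -> DRAM -> NVMe demotion path.
    Illustrative only; not Mooncake Store's real interface."""

    def __init__(self, hbm_slots=2, dram_slots=4):
        self.hbm_slots, self.dram_slots = hbm_slots, dram_slots
        self.hbm = OrderedDict()   # hot tier: actively decoding
        self.dram = OrderedDict()  # warm tier: paused sessions
        self.nvme = {}             # cold tier: survives GPU process restarts

    def put(self, key, block):
        self.hbm[key] = block
        self.hbm.move_to_end(key)
        # LRU eviction demotes down the hierarchy instead of discarding
        while len(self.hbm) > self.hbm_slots:
            k, v = self.hbm.popitem(last=False)
            self.dram[k] = v
        while len(self.dram) > self.dram_slots:
            k, v = self.dram.popitem(last=False)
            self.nvme[k] = v

    def get(self, key):
        # A hit in a lower tier promotes the block back to HBM
        for tier in (self.dram, self.nvme):
            if key in tier:
                self.put(key, tier.pop(key))
        if key in self.hbm:
            self.hbm.move_to_end(key)
            return self.hbm[key]
        return None  # full miss: caller pays the re-prefill cost
```

The design point the sketch captures is the last line: only a miss in all three tiers forces re-prefill, which is exactly the cost the NVMe tier is there to avoid for sessions idle longer than the HBM/DRAM residency window.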
Relationship to session affinity¶
The point of the x-session-affinity header is to route the next turn of a conversation back to the replica that served the previous turn, so the warm KV cache produces a hit. Without a persistent tier, that hit rate decays with idle time because HBM + DRAM have bounded capacity and evict. Mooncake Store extends the effective residency window — the same session returning after minutes (rather than seconds) can still hit.
Cloudflare measured 60% → 80% peak input-cache-hit ratio after onboarding heavy internal users to the session-affinity header; the Store tier is one of the load-bearing pieces making those hit ratios achievable. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
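The routing side of this can be sketched as deterministic hashing of the affinity header onto a replica pool. The replica names, the fallback branch, and the hash choice below are all assumptions for illustration; Cloudflare's actual routing logic is not disclosed in the source:

```python
import hashlib

# Hypothetical replica pool; names are illustrative
REPLICAS = ["replica-0", "replica-1", "replica-2"]

def route(headers: dict) -> str:
    """Pick a replica for a request. The same x-session-affinity
    value always maps to the same replica, so successive turns of a
    session land where its KV cache (HBM, DRAM, or the NVMe Store
    tier) already lives. Sketch only, not Cloudflare's router."""
    session = headers.get("x-session-affinity")
    if session is None:
        # No affinity signal: fall back to ordinary load balancing
        # (simplified here to a fixed choice)
        return REPLICAS[0]
    digest = hashlib.sha256(session.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:8], "big") % len(REPLICAS)]
```

The property that matters is determinism per session, not the specific hash: any stable mapping keeps a returning session pointed at the replica whose Store tier still holds its KV blocks.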
Related primitives¶
- Mooncake Transfer Engine — transport layer; Store is the persistence layer.
- KV cache — what's being stored.
- Session-affinity prompt caching — the client signal that makes the Store tier pay off (without it, the cache-hit-rate distribution is diffuse and Store helps less).
- LMCache / SGLang HiCache — software layers that pair with Mooncake to expose the cluster-wide shared cache to the serving engine (see systems/lmcache, systems/sglang).
Caveats¶
- The post discloses no numbers for Mooncake Store: capacity per node, eviction policy, hit-rate contribution, lookup latency, NVMe wear-leveling behaviour.
- Block-granularity not disclosed — whether cache is stored at KV-page granularity, per-sequence granularity, or something else.
- Consistency / coherence model not disclosed — what happens if the same prefix is cached on multiple nodes' NVMe tiers and one evicts.
- Failure-mode behaviour not discussed — NVMe failure, partial corruption, hot-node-cold-node load imbalance.
- Positioning relative to vLLM's CPU-offload and other tiered KV solutions not compared.
Open-source origin¶
From Moonshot AI, developers of Kimi K2.5. Part of the same github.com/kvcache-ai/Mooncake repo as the Transfer Engine. See the Mooncake paper for the original architecture.
Seen in¶
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — the NVMe tier extension named as load-bearing for cluster-wide KV cache reuse in Workers AI.
Related¶
- systems/mooncake-transfer-engine — paired transport substrate.
- systems/kimi-k2-5 — the model whose serving workloads motivated the design.
- systems/workers-ai / systems/infire — Cloudflare consumers.
- concepts/kv-cache / concepts/session-affinity-prompt-caching
- systems/lmcache / systems/sglang — software layers that expose this tier to the serving engine.
- companies/cloudflare