
Mooncake Store

Overview

Mooncake Store (github.com/kvcache-ai/Mooncake) is the cold / warm tier of Moonshot AI's Mooncake KV-cache stack, extending the KV cache from GPU VRAM onto NVMe storage. It complements Mooncake Transfer Engine (the fast-path RDMA transfer fabric) — Transfer Engine moves KV blocks, Store is where cold KV blocks live. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Role

"Mooncake Store also allows us to extend the cache beyond GPU VRAM, and leverage NVMe storage. This extends the time that sessions remain in cache, improving our cache hit ratio and allowing us to handle more traffic and offer better performance to users."

Conceptually the KV-cache memory hierarchy looks like:

Tier                    Latency         Capacity         Role
GPU HBM                 ~ns             tens of GB       hot: actively decoding
Host DRAM               ~100 ns – µs    hundreds of GB   warm: paused sessions
Mooncake Store (NVMe)   ~ms             multiple TB      cold: idle sessions + shared long-lived prefixes
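The hot-to-cold probe over this hierarchy can be sketched as follows. The tier sizes, LRU policy, and promote-on-hit / demote-on-evict behaviour are all illustrative assumptions — none of this is disclosed Mooncake Store behaviour.

```python
from collections import OrderedDict

class Tier:
    """One level of the KV hierarchy; an LRU-ordered map of KV blocks."""
    def __init__(self, name, capacity_blocks):
        self.name = name
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block_id -> payload, oldest first

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # refresh LRU position
            return self.blocks[block_id]
        return None

    def put(self, block_id, payload):
        """Insert a block; returns the (id, payload) evicted, if any."""
        evicted = None
        if block_id not in self.blocks and len(self.blocks) >= self.capacity:
            evicted = self.blocks.popitem(last=False)  # drop LRU block
        self.blocks[block_id] = payload
        return evicted

def lookup(tiers, block_id):
    """Probe tiers hot-to-cold. On a hit in a colder tier, promote the
    block into the hottest tier, demoting whatever that evicts one level
    down. A miss in every tier means a full re-prefill."""
    for i, tier in enumerate(tiers):
        payload = tier.get(block_id)
        if payload is not None:
            if i > 0:
                demoted = tiers[0].put(block_id, payload)
                if demoted is not None and len(tiers) > 1:
                    tiers[1].put(*demoted)
            return tier.name, payload
    return None, None
```

A block first found in the NVMe tier is served from there and promoted, so the next lookup for the same session hits HBM directly.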

Without a persistent NVMe tier, cache entries die with the GPU process; a session that pauses for longer than a short window loses its warm KV state and pays a full re-prefill cost on resumption. Mooncake Store keeps the KV alive across those windows.
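A rough sizing exercise shows why HBM alone cannot hold idle sessions. The per-token KV footprint formula below is the standard one (2 for K and V, times layers, KV heads, head dim, and bytes per element); the model dimensions are hypothetical, for illustration only.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # K and V each hold n_layers * n_kv_heads * head_dim elements per token;
    # dtype_bytes=2 assumes fp16/bf16 KV.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Hypothetical 70B-class dims (illustrative, not any real model's config):
per_tok = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
# 2 * 80 * 8 * 128 * 2 = 327,680 bytes/token, i.e. 320 KiB/token
session_bytes = per_tok * 32_000  # one idle 32k-token session
# ~10 GB of KV per session: a handful of paused sessions fills HBM,
# which is why the hierarchy needs a multi-TB NVMe tier underneath.
```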

Relationship to session affinity

The point of x-session-affinity is to route the next turn of a conversation back to the replica that served the previous turn, so that replica's warm KV cache can be reused. Without a persistent tier, the hit rate decays with idle time because HBM and DRAM have bounded capacity and evict. Mooncake Store extends the effective residency window: the same session returning after minutes (rather than seconds) can still hit.
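A minimal sketch of affinity routing, assuming the x-session-affinity value is hashed to pick a replica. The hashing scheme is an assumption for illustration; the post does not describe Cloudflare's actual routing mechanism.

```python
import hashlib

def pick_replica(session_key: str, replicas: list) -> str:
    """Stable mapping from an x-session-affinity value to a replica, so
    successive turns of one session land where its KV cache lives.
    (Simple modulo hashing reshuffles sessions when the replica set
    changes; a production router would likely use consistent hashing.)"""
    h = int(hashlib.sha256(session_key.encode()).hexdigest(), 16)
    return replicas[h % len(replicas)]

replicas = ["replica-0", "replica-1", "replica-2"]
# Every turn of the same session routes to the same replica:
turn1 = pick_replica("sess-abc", replicas)
turn2 = pick_replica("sess-abc", replicas)
```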

Cloudflare saw its peak input-cache-hit ratio rise from 60% to 80% after onboarding heavy internal users to the session-affinity header; the Store tier is one of the load-bearing pieces making those hit ratios achievable. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Caveats

  • The post discloses no numbers for Mooncake Store: per-node capacity, eviction policy, hit-rate contribution, lookup latency, or NVMe wear-leveling behaviour.
  • Block granularity not disclosed: whether the cache is stored at KV-page granularity, per-sequence granularity, or something else.
  • Consistency / coherence model not disclosed — what happens if the same prefix is cached on multiple nodes' NVMe tiers and one evicts.
  • Failure-mode behaviour not discussed — NVMe failure, partial corruption, hot-node-cold-node load imbalance.
  • Positioning relative to vLLM's CPU-offload and other tiered KV solutions not compared.

Open-source origin

From Moonshot AI, developers of Kimi K2.5. Part of the same github.com/kvcache-ai/Mooncake repo as the Transfer Engine. See Mooncake paper for the original architecture.
