SYSTEM Cited by 1 source
Databricks FMAPI Prompt Caching¶
Databricks FMAPI Prompt Caching is the named platform feature (GA on open-weights models 2026-05-22) that ships KV-cache reuse for repeated prompt prefixes as a default-on, zero-configuration capability of the Foundation Model APIs (FMAPIs). Customers do not opt in, do not configure cache keys, and do not change SDK calls — the platform automatically detects and reuses cached prefill state across requests sharing the same prompt prefix.
Prior to 2026-05-22, Databricks offered prompt caching only for the proprietary models (GPT, Gemini, Claude) routed through FMAPI. The 2026-05-22 announcement extends parity to open-weights models hosted directly on Databricks infrastructure — GPT-OSS 20B + 120B, Gemma 3 12B, Llama 3.1 8B (including PEFT-served fine-tuned variants), and Llama 3.3 70B — across all three commercial workload tiers (batch inference, pay-per-token, provisioned-throughput). Source: sources/2026-05-22-databricks-accelerating-llm-inference-with-prompt-caching-for-open-source-models.
Three load-bearing design choices¶
1. Implicit — no customer configuration¶
"The caching is implicit: customers do not need to configure anything, our system has built to automatically run the prompt caching and reuse to improve throughput."
Contrasts with the explicit caching APIs of frontier-model providers:
Anthropic's cache_control blocks, OpenAI's prompt_cache_key,
Google's context-cache create / TTL APIs. The architectural
consequence: caching is a substrate property of FMAPI rather
than a feature of the request API. See
concepts/implicit-prompt-caching.
2. Volatile-memory only, never persisted¶
"Prompt caches are isolated, only reside in volatile memory and are never persisted."
Three orthogonal properties compose the safety envelope: isolated (per-tenant cache pools), volatile-memory only (RAM-resident, not on disk), never persisted (replica restart wipes it). The composition lets Databricks ship caching as default-on on multi-tenant infrastructure without dragging in an encryption-at-rest / key-management story. See concepts/volatile-only-prompt-cache-isolation.
3. Inherited platform-wide¶
"It also applies to any and all higher-level services powered by a foundation model, e.g., Agent Bricks, Genie, AI Functions."
Because the caching is at the FMAPI substrate, every Databricks product layered on top inherits the throughput / latency / cost win without integration work. See patterns/implicit-prompt-cache-as-platform-default.
Disclosed numbers (GPT-OSS batch-inference rollout)¶
The 2026-05-22 post quantifies the GPT-OSS rollout in "one of the large-scale production batch-inference pipelines":
| Metric | Result |
|---|---|
| Per-replica input-token throughput | +2.5× |
| P50 latency | 3× reduction |
| Cache hit ratio | 30% (described as "relatively low") |
The asymmetry is structurally explained by prefill-skip economics: when a request hits the cache, it completely skips the prefill stage — typically the dominant cost on long-prefix prompts — so a 30% hit on a workload where prefill dominates per-request cost moves throughput dramatically.
Supported open-source models (as of 2026-05-22)¶
- GPT-OSS 20B + 120B
- Gemma 3 12B
- Fine-tuned Llama 3.1 8B (via PEFT serving — caching survives the adapter layer because the adapter is consistent across requests on the same endpoint, preserving prefix KV reusability)
- Llama 3.1 8B
- Llama 3.3 70B
Forward commitment: "We will continue to roll out this feature across our other models."
Workload-tier coverage¶
All three FMAPI commercial workload modes get caching:
- Batch inference — most cache-friendly: large batches over many documents typically share a static system / instruction prompt. The disclosed 2.5× / 3× / 30% numbers are from a batch-inference pipeline.
- Pay-per-token — interactive single requests; hit ratio driven by traffic shape (repeated system prompts across users on the same product).
- Provisioned-throughput — dedicated capacity workloads; caching is on by default at the same engine level.
Quality lever, not just cost lever¶
The post explicitly frames caching as a quality enabler:
"Prompt caching can be a powerful technique to raise a model's quality in specific domains without compromising the model's token throughput. Queries can share a large domain-specific system prompt, with the compute cost of that shared prompt amortized across all those queries. Frontier models, such as Claude, use system prompts that are many thousands of tokens long under the hood. Furthermore, in our recently published research we showed that automated prompt optimization allows open-source models to surpass frontier-model quality for enterprise tasks."
Without caching, a long domain-specific system prompt is a token-budget tax that pushes teams toward shorter prompts and lower quality. With caching, the long system prompt is prefilled once and reused — production teams can ship long, optimised prompts that previously wouldn't have paid back. This connects to Databricks' April-2026 automated prompt optimisation research as the serving primitive that makes long optimised prompts production-viable.
Relationship to other primitives¶
- concepts/kv-cache — the underlying transformer-inference primitive being reused.
- concepts/prompt-cache-consistency — sibling concept; with implicit caching, prompt-cache consistency still matters because byte-exact prefix matches drive hit rate. See also patterns/prompt-cache-aware-static-dynamic-ordering.
- concepts/session-affinity-prompt-caching — Cloudflare's
contrasting design (
x-session-affinityheader → client-driven routing → KV cache hit). Cloudflare exposes the cache as an explicit billed primitive ("discounted cached tokens"); Databricks hides the cache as an implicit substrate property without disclosing a separate billing line. - systems/databricks-model-serving — broader managed real-time-inference platform. FMAPI prompt caching is the feature-shape for foundation-model endpoints on this platform.
Caveats¶
- Mechanism not disclosed: cache-key derivation (token-block- aligned vs full-prefix), eviction policy (LRU? size-based?), per- replica cache size, TTL, cross-replica or cross-region cache sharing — none are specified.
- Tenant-isolation mechanism not disclosed: the post claims tenant-bounded caches but does not name the isolation primitive (process boundary? key-prefixed pool? VM boundary?).
- No disclosed pricing differential for cached vs uncached input tokens. Compare: Cloudflare's x-session-affinity documentation explicitly markets "discounted cached tokens" as the customer incentive.
- Single rollout pipeline as evidence base: the 2.5× / 3× / 30% numbers come from one named GPT-OSS batch-inference pipeline. The hit-ratio distribution across the model catalog and the other workload tiers is not disclosed.
Seen in¶
- sources/2026-05-22-databricks-accelerating-llm-inference-with-prompt-caching-for-open-source-models — canonical wiki instance. GA announcement of implicit prompt caching for OSS models on FMAPI; discloses the GPT-OSS-batch-inference numbers (2.5× / 3× / 30%) and the catalog-extension model list.
Related¶
- systems/databricks-foundation-model-api
- systems/databricks-model-serving
- systems/databricks-genie
- systems/databricks-ai-functions
- systems/gpt-oss / systems/gemma / systems/llama-3-1
- concepts/kv-cache
- concepts/implicit-prompt-caching
- concepts/volatile-only-prompt-cache-isolation
- concepts/prompt-cache-consistency
- concepts/session-affinity-prompt-caching
- patterns/implicit-prompt-cache-as-platform-default
- patterns/prompt-cache-aware-static-dynamic-ordering
- companies/databricks