Skip to content

SYSTEM Cited by 1 source

Databricks FMAPI Prompt Caching

Databricks FMAPI Prompt Caching is the named platform feature (GA on open-weights models 2026-05-22) that ships KV-cache reuse for repeated prompt prefixes as a default-on, zero-configuration capability of the Foundation Model APIs (FMAPIs). Customers do not opt in, do not configure cache keys, and do not change SDK calls — the platform automatically detects and reuses cached prefill state across requests sharing the same prompt prefix.

Prior to 2026-05-22, Databricks offered prompt caching only for the proprietary models (GPT, Gemini, Claude) routed through FMAPI. The 2026-05-22 announcement extends parity to open-weights models hosted directly on Databricks infrastructureGPT-OSS 20B + 120B, Gemma 3 12B, Llama 3.1 8B (including PEFT-served fine-tuned variants), and Llama 3.3 70B — across all three commercial workload tiers (batch inference, pay-per-token, provisioned-throughput). Source: sources/2026-05-22-databricks-accelerating-llm-inference-with-prompt-caching-for-open-source-models.

Three load-bearing design choices

1. Implicit — no customer configuration

"The caching is implicit: customers do not need to configure anything, our system has built to automatically run the prompt caching and reuse to improve throughput."

Contrasts with the explicit caching APIs of frontier-model providers: Anthropic's cache_control blocks, OpenAI's prompt_cache_key, Google's context-cache create / TTL APIs. The architectural consequence: caching is a substrate property of FMAPI rather than a feature of the request API. See concepts/implicit-prompt-caching.

2. Volatile-memory only, never persisted

"Prompt caches are isolated, only reside in volatile memory and are never persisted."

Three orthogonal properties compose the safety envelope: isolated (per-tenant cache pools), volatile-memory only (RAM-resident, not on disk), never persisted (replica restart wipes it). The composition lets Databricks ship caching as default-on on multi-tenant infrastructure without dragging in an encryption-at-rest / key-management story. See concepts/volatile-only-prompt-cache-isolation.

3. Inherited platform-wide

"It also applies to any and all higher-level services powered by a foundation model, e.g., Agent Bricks, Genie, AI Functions."

Because the caching is at the FMAPI substrate, every Databricks product layered on top inherits the throughput / latency / cost win without integration work. See patterns/implicit-prompt-cache-as-platform-default.

Disclosed numbers (GPT-OSS batch-inference rollout)

The 2026-05-22 post quantifies the GPT-OSS rollout in "one of the large-scale production batch-inference pipelines":

Metric Result
Per-replica input-token throughput +2.5×
P50 latency 3× reduction
Cache hit ratio 30% (described as "relatively low")

The asymmetry is structurally explained by prefill-skip economics: when a request hits the cache, it completely skips the prefill stage — typically the dominant cost on long-prefix prompts — so a 30% hit on a workload where prefill dominates per-request cost moves throughput dramatically.

Supported open-source models (as of 2026-05-22)

  • GPT-OSS 20B + 120B
  • Gemma 3 12B
  • Fine-tuned Llama 3.1 8B (via PEFT serving — caching survives the adapter layer because the adapter is consistent across requests on the same endpoint, preserving prefix KV reusability)
  • Llama 3.1 8B
  • Llama 3.3 70B

Forward commitment: "We will continue to roll out this feature across our other models."

Workload-tier coverage

All three FMAPI commercial workload modes get caching:

  • Batch inference — most cache-friendly: large batches over many documents typically share a static system / instruction prompt. The disclosed 2.5× / 3× / 30% numbers are from a batch-inference pipeline.
  • Pay-per-token — interactive single requests; hit ratio driven by traffic shape (repeated system prompts across users on the same product).
  • Provisioned-throughput — dedicated capacity workloads; caching is on by default at the same engine level.

Quality lever, not just cost lever

The post explicitly frames caching as a quality enabler:

"Prompt caching can be a powerful technique to raise a model's quality in specific domains without compromising the model's token throughput. Queries can share a large domain-specific system prompt, with the compute cost of that shared prompt amortized across all those queries. Frontier models, such as Claude, use system prompts that are many thousands of tokens long under the hood. Furthermore, in our recently published research we showed that automated prompt optimization allows open-source models to surpass frontier-model quality for enterprise tasks."

Without caching, a long domain-specific system prompt is a token-budget tax that pushes teams toward shorter prompts and lower quality. With caching, the long system prompt is prefilled once and reused — production teams can ship long, optimised prompts that previously wouldn't have paid back. This connects to Databricks' April-2026 automated prompt optimisation research as the serving primitive that makes long optimised prompts production-viable.

Relationship to other primitives

Caveats

  • Mechanism not disclosed: cache-key derivation (token-block- aligned vs full-prefix), eviction policy (LRU? size-based?), per- replica cache size, TTL, cross-replica or cross-region cache sharing — none are specified.
  • Tenant-isolation mechanism not disclosed: the post claims tenant-bounded caches but does not name the isolation primitive (process boundary? key-prefixed pool? VM boundary?).
  • No disclosed pricing differential for cached vs uncached input tokens. Compare: Cloudflare's x-session-affinity documentation explicitly markets "discounted cached tokens" as the customer incentive.
  • Single rollout pipeline as evidence base: the 2.5× / 3× / 30% numbers come from one named GPT-OSS batch-inference pipeline. The hit-ratio distribution across the model catalog and the other workload tiers is not disclosed.

Seen in

Last updated · 542 distilled / 1,571 read