DATABRICKS 2026-05-22 Tier 3

Databricks — Accelerating LLM Inference with Prompt Caching for Open-Source Models on Databricks

Summary

A short Databricks Blog post (2026-05-22, Tier 3) announcing that implicit prompt caching is now generally available for open-weights models served on the Foundation Model APIs (FMAPIs). It covers batch inference, pay-per-token, and provisioned-throughput workloads, and is inherited by every higher-level Databricks AI service (Agent Bricks, Genie, AI Functions).

The architectural news is three independent design choices, decided once at the platform layer. (1) Implicit caching: customers configure nothing. "the caching is implicit: customers do not need to configure anything, our system has built to automatically run the prompt caching and reuse to improve throughput". By contrast, most frontier-model providers expose explicit cache-control knobs or session keys. (2) Volatile-only cache isolation: "Prompt caches are isolated, only reside in volatile memory and are never persisted". The cache is per-tenant, in-memory, and discarded when the replica restarts, which sidesteps any persistence-layer threat model. (3) A catalog of supported open-weights models that extends caching beyond the proprietary tiers: GPT-OSS 20B + 120B, Gemma 3 12B, fine-tuned Llama 3.1 8B (via PEFT serving), Llama 3.1 8B, and Llama 3.3 70B. This makes Databricks one of the first hosted-OSS-LLM products to ship the caching primitive that closed-source vendors (Claude, GPT, Gemini) have offered for over a year.

The single quantitative outcome the post discloses is from the GPT-OSS rollout's flagship batch-inference pipeline: 2.5× per-replica input-token throughput and a 3× P50 latency reduction, "all this with a relatively low cache hit ratio of 30%".

The motivating frame is that LLM inference often repeats the same system or instruction prompt across thousands of requests. In the post's words: "Reprocessing that identical prefix for every call wastes compute cycles, inflates latency, and increases costs. Prompt caching eliminates this redundancy". It does so by skipping the prefill stage on a cache hit (KV tensors for the matched prefix are already in memory) and amortising the prefill compute of the shared prompt across all requests that share it.

The post also frames prompt caching as a quality lever, not just a cost lever: "Prompt caching can be a powerful technique to raise a model's quality in specific domains without compromising the model's token throughput. Queries can share a large domain-specific system prompt, with the compute cost of that shared prompt amortized across all those queries." The closing quality argument links to Databricks' own April-2026 research, Building State-of-the-Art Enterprise Agents 90× Cheaper with Automated Prompt Optimization, which argues that automated prompt optimisation lets open-source models surpass frontier-model quality for enterprise tasks. Prompt caching is the serving primitive that makes the long, optimised, domain-specific system prompts used by that research economically viable in production. This post is the canonical wiki instance for the implicit, volatile-only, platform-default prompt-caching shape as applied to OSS LLMs on a managed serving platform.

Key takeaways

  1. Implicit caching — the platform decides, not the customer. "The caching is implicit: customers do not need to configure anything, our system has built to automatically run the prompt caching and reuse to improve throughput." This is the load-bearing design choice. Most frontier-model providers expose explicit caching primitives — Anthropic's cache_control blocks, OpenAI's prompt_cache_key, Google's context caches with explicit creation/TTL. Databricks ships the same primitive as a platform-default capability: every request gets the benefit automatically, no SDK changes, no prompt-structure rewrites, no per-request flags. The architectural consequence is that prompt-cache consistency becomes a server-side optimisation responsibility, not a prompt-engineer-side one — though customers who structure their prompts static-prefix-first still benefit more, because hit rates depend on byte-exact prefix matches the system can find. Canonical wiki instance of concepts/implicit-prompt-caching (Source).

  2. Volatile-only, never-persisted cache as the security envelope. "Security is a first-class concern at Databricks. Prompt caches are isolated, only reside in volatile memory and are never persisted." Three orthogonal properties: (a) isolated — tenant boundary on the cache pool (the post does not detail the isolation mechanism, but the claim is that one customer's prompts cannot hit another's cache); (b) volatile-memory only — the cache is RAM-resident, not on disk, not in object storage; (c) never persisted — replica restart wipes it. The composition is the safety envelope that lets Databricks ship caching as a default-on feature on multi-tenant infrastructure: persistence would require a key-management + encryption-at-rest story; volatile isolation sidesteps that entirely. Canonical wiki instance of concepts/volatile-only-prompt-cache-isolation (Source).

  3. 30% cache hit ratio is enough to deliver 2.5× throughput / 3× P50 latency improvement. From the GPT-OSS batch-inference rollout: "Per-replica input-token throughput increased by 2.5x. P50 latency reduced by 3x. All this with a relatively low cache hit ratio of 30%." The disclosed numbers are notable for the asymmetry: the hit ratio is only 30% (the quoted passage does not say whether that is counted per request or per token), yet the latency win is 3× and the throughput win is 2.5×. The structural reason offered here: on a cache hit, the request skips prefill for the matched prefix, and prefill is typically the dominant cost for long-prefix prompts, so even a 30% hit ratio moves throughput substantially on a prefill-dominated workload. This matches the KV-cache amortisation argument: each hit saves a large amount of work because the skipped prefill is large, so a low hit rate can still be high-value (Source).

  4. Prompt caching is a quality lever, not just a cost lever. "Prompt caching can be a powerful technique to raise a model's quality in specific domains without compromising the model's token throughput. Queries can share a large domain-specific system prompt, with the compute cost of that shared prompt amortized across all those queries. Frontier models, such as Claude, use system prompts that are many thousands of tokens long under the hood. Furthermore, in our recently published research we showed that automated prompt optimization allows open-source models to surpass frontier-model quality for enterprise tasks." The architectural framing inverts the usual one. Without caching, a long domain-specific system prompt is a token-budget tax that pushes teams toward shorter prompts and lower quality. With caching, the long system prompt is prefilled once and reused — the per-request cost looks like the dynamic-suffix cost only — and teams can ship long, optimised prompts that previously wouldn't have paid back. This connects to Databricks' April-2026 automated-prompt-optimisation research: caching is the serving primitive that makes optimisation-discovered long prompts production-viable (Source).

  5. Coverage extended platform-wide and across the OSS-model catalog. Caching applies "to any and all higher-level services powered by a foundation model, e.g., Agent Bricks, Genie, AI Functions" — the FMAPI cache is at the platform substrate, so every product layered on top inherits it. The current OSS-model catalog with caching enabled: GPT-OSS 20B + 120B, Gemma 3 12B, fine-tuned Llama 3.1 8B (via PEFT serving), Llama 3.1 8B, and Llama 3.3 70B. Notable: this is the wiki's first disclosure of Databricks hosting fine-tuned Llama 3.1 8B served via PEFT serving with caching applied — i.e. caching survives the PEFT-adapter layer (the adapter itself is consistent across requests on the same endpoint, so the prompt-prefix KV state is reusable). The post commits to "continue to roll out this feature across our other models" — a platform-default caching pattern, not a per-model integration (Source).

  6. Three workload modes, one cache primitive. "We've now extended this capability to the open-weights models powering our Foundation Model APIs (FMAPIs) for batch inference, pay-per-token, and provisioned-throughput workloads." All three commercial workload modes get caching — the inference engine is shared. The most-quantified rollout is batch inference (the GPT-OSS pipeline cited for the 2.5× / 3× / 30% numbers), where the workload structure is most naturally cache-friendly: a large batch of requests over many documents typically shares a static system / instruction prompt. Pay-per-token (interactive single requests) and provisioned-throughput (dedicated capacity) get caching too, with hit rates determined by traffic shape rather than by per-batch structure (Source).

  7. The motivating cost framing — repeated prefixes are the dominant inefficiency. "Large language model (LLM) inference often involves repeated prompts—think of the same system or instruction prompt appearing in thousands of requests. Reprocessing that identical prefix for every call wastes compute cycles, inflates latency, and increases costs." The structural observation: in production LLM workloads, prompts are not uniformly random — they share large stable prefixes (system prompts, tool catalogs, few-shot examples, retrieval contexts, conversation histories). The prefix is billed and processed per request under naive serving, even though the work is redundant. Caching exploits this redundancy by storing the pre-computed KV-cache tensors for matched prefixes so subsequent requests can skip the prefill stage and begin decoding directly from the cached attention state. The structural speedup is bounded by the prefix-to-suffix ratio of the workload (Source).
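
The prefill-skip mechanism described in takeaway 7 can be sketched as a toy prefix cache. This is a minimal illustration, not Databricks' implementation: the post does not disclose cache-key derivation, so the chained block hashing here is an assumption borrowed from the block-level prefix caching common in open-source inference engines. `BLOCK_SIZE`, `block_keys`, and `ToyPrefixCache` are hypothetical names, and a string placeholder stands in for the stored KV tensors.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per cache block (hypothetical; chosen small for the demo)


def block_keys(tokens: List[int]) -> List[str]:
    """Chain-hash each full block so a key identifies the whole prefix up to that block."""
    keys: List[str] = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
    for start in range(0, full, BLOCK_SIZE):
        running.update(repr(tokens[start:start + BLOCK_SIZE]).encode())
        keys.append(running.hexdigest())
    return keys


class ToyPrefixCache:
    """Maps prefix-block keys to a placeholder standing in for stored KV tensors."""

    def __init__(self) -> None:
        self._blocks: Dict[str, str] = {}

    def serve(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (tokens reused from cache, tokens that still need prefill)."""
        keys = block_keys(tokens)
        matched = 0
        for key in keys:
            if key not in self._blocks:
                break  # the first miss ends the reusable prefix
            matched += 1
        # Store the unmatched blocks so later requests can reuse them.
        for key in keys[matched:]:
            self._blocks[key] = "kv-placeholder"
        reused = matched * BLOCK_SIZE
        return reused, len(tokens) - reused


cache = ToyPrefixCache()
system_prompt = list(range(100, 112))  # 12-token shared prefix
print(cache.serve(system_prompt + [1, 2, 3]))          # (0, 15): cold cache
print(cache.serve(system_prompt + [7, 8, 9, 10, 11]))  # (12, 5): prefix reused
```

In this sketch the second request only needs prefill for its 5-token suffix. A request that places even one different token ahead of the shared prompt reuses nothing, because every chained key after the first differing block changes.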

Architectural numbers + operational notes (from source)

  • Headline batch-inference numbers (GPT-OSS production pipeline): 2.5× per-replica input-token throughput; 3× P50 latency reduction; 30% cache hit ratio.
  • Caching mode: implicit — no customer configuration required; "automatically run the prompt caching and reuse to improve throughput".
  • Cache placement: volatile memory only; never persisted.
  • Tenant isolation: per-tenant cache pools (mechanism not detailed in the post).
  • Eligible workload tiers: batch inference, pay-per-token, provisioned-throughput.
  • Eligible platform features (caching inherited at no cost): Agent Bricks, Genie, AI Functions, any future service layered on FMAPI.
  • Currently supported OSS models: GPT-OSS 20B + 120B; Gemma 3 12B; fine-tuned Llama 3.1 8B (via PEFT serving); Llama 3.1 8B; Llama 3.3 70B.
  • Already shipping for proprietary models (prior to this post): GPT, Gemini, Claude on Databricks. The 2026-05-22 announcement extends parity to the open-weights tier.
  • Forward commitment: "We will continue to roll out this feature across our other models."
  • No disclosures on: cache-key derivation (full prefix vs token-block boundaries), eviction policy (LRU? size-based?), cache size per replica, TTL behaviour, cross-replica or cross-region cache sharing, any pricing differential for cached vs uncached input tokens (compare: Cloudflare's x-session-affinity post explicitly disclosed "discounted cached tokens" as the incentive lever; Databricks' post does not name a billing equivalent), measurement cadence for the headline numbers, baseline workload before the rollout, the hit-ratio distribution across the model catalog.
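
Because the post gives no breakdown behind the headline numbers, a back-of-envelope bound can help calibrate them. The sketch below is an Amdahl-style model under two loud assumptions that are not from the post: the hit ratio is read as the fraction of input tokens served from cache, and per-request cost is pure compute split into a prefill share and everything else. The function name and parameters are hypothetical.

```python
def prefill_compute_speedup(cached_token_fraction: float, prefill_share: float) -> float:
    """Amdahl-style bound: only the cached part of the prefill share of work is removed.

    cached_token_fraction: assumed fraction of input tokens served from cache.
    prefill_share: assumed fraction of per-request compute spent in prefill.
    """
    for value in (cached_token_fraction, prefill_share):
        if not 0.0 <= value <= 1.0:
            raise ValueError("both arguments must be fractions in [0, 1]")
    remaining = 1.0 - cached_token_fraction * prefill_share
    if remaining == 0.0:
        return float("inf")  # all work removed: the speedup is unbounded
    return 1.0 / remaining


# Even if prefill were 100% of the cost, a 30% cached fraction removes 30% of the work.
print(round(prefill_compute_speedup(0.3, 1.0), 2))  # 1.43
print(round(prefill_compute_speedup(0.3, 0.5), 2))  # 1.18
```

Under these assumptions a 30% cached fraction caps the compute-only speedup at roughly 1.43×. The reported 2.5× and 3× therefore cannot be derived from this model alone. They may reflect a different definition of hit ratio or effects the model omits, such as reduced queueing or better batching, but the post does not say.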

Systems extracted

New wiki page:

  • systems/databricks-fmapi-prompt-caching — the named platform feature: implicit, volatile-only, multi-tenant prompt caching shipped as a default-on capability of the Foundation Model APIs. Covers batch inference, pay-per-token, and provisioned-throughput workloads. Inherited by every product layered on FMAPI (Agent Bricks, Genie, AI Functions). Currently active on GPT-OSS, Gemma 3, Llama 3.1 / 3.3, and PEFT-served fine-tuned variants.

Extended (cross-link added):

  • systems/databricks-foundation-model-api — adds the implicit-prompt-caching capability as a platform-default substrate property. Reinforces FMAPI as the inference substrate the rest of the Databricks AI product surface layers on.
  • systems/databricks-model-serving — adds prompt caching as a capability inherited from the FMAPI feature for foundation-model endpoints; reinforces the platform-internals depth of this layer.
  • systems/gpt-oss — adds Databricks hosting context (20B + 120B served via FMAPI with prompt caching); the canonical OSS-model rollout that disclosed the 2.5× throughput / 3× P50 / 30% hit-ratio numbers.
  • systems/gemma — adds Databricks hosting context (Gemma 3 12B served via FMAPI with prompt caching).
  • systems/llama-3-1 — adds Databricks hosting context (Llama 3.1 8B and Llama 3.3 70B served via FMAPI with prompt caching, including PEFT-served fine-tuned Llama 3.1 8B variants).
  • systems/databricks-genie / systems/databricks-ai-functions — noted as inheritors of FMAPI prompt caching at no integration cost.

Concepts extracted

New wiki pages:

  • concepts/implicit-prompt-caching — the design choice that caching is automatic, platform-decided, and zero-configuration on the customer side. Contrasts with explicit caching primitives exposed by Anthropic (cache_control blocks), OpenAI (prompt_cache_key), or Google (context-cache create/TTL APIs). The architectural lever: caching becomes a substrate property of the serving platform rather than a feature of the request API, shifting cache-hit-rate-engineering responsibility from prompt engineers to the serving system.
  • concepts/volatile-only-prompt-cache-isolation — the multi-tenant security shape for prompt caches: per-tenant isolation + RAM-only residency + no persistence. Composing all three sidesteps the encryption-at-rest / key-management threat model that would otherwise gate caching as a default-on multi-tenant feature.
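
The three properties of the volatile-only isolation concept (per-tenant, RAM-only, never persisted) can be illustrated with a small in-process structure. This is a sketch under stated assumptions, not the platform's mechanism: the post does not describe how isolation is enforced or how entries are evicted, so the per-tenant dictionary, the LRU policy, and the name `VolatileTenantCache` are all hypothetical.

```python
from collections import OrderedDict
from typing import Dict, Optional


class VolatileTenantCache:
    """In-process, per-tenant cache; nothing is written to disk, so a restart loses it all."""

    def __init__(self, max_entries_per_tenant: int = 2) -> None:
        self._max = max_entries_per_tenant
        # One pool per tenant: lookups never consult another tenant's pool.
        self._pools: Dict[str, OrderedDict] = {}

    def put(self, tenant_id: str, prefix_key: str, kv_state: object) -> None:
        pool = self._pools.setdefault(tenant_id, OrderedDict())
        pool[prefix_key] = kv_state
        pool.move_to_end(prefix_key)  # mark as most recently used
        while len(pool) > self._max:
            pool.popitem(last=False)  # evict least recently used (assumed policy)

    def get(self, tenant_id: str, prefix_key: str) -> Optional[object]:
        pool = self._pools.get(tenant_id)
        if pool is None or prefix_key not in pool:
            return None
        pool.move_to_end(prefix_key)
        return pool[prefix_key]


cache = VolatileTenantCache()
cache.put("tenant-a", "prefix-1", "kv-state")
print(cache.get("tenant-a", "prefix-1"))  # kv-state
print(cache.get("tenant-b", "prefix-1"))  # None: same key, different tenant
```

Because the tenant identifier selects the pool before any key comparison happens, an identical prompt prefix from another tenant cannot produce a hit. Since the state lives only in process memory, there is no at-rest copy to encrypt or manage keys for.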

Extended (cross-link added):

  • concepts/kv-cache — adds Databricks FMAPI implicit caching as another canonical instance of KV-cache reuse across requests (the post does not say whether the cache is shared across replicas), with the specific numerical signature (30% hit ratio → 2.5× throughput / 3× P50) characterising the prefill-skip economics.
  • concepts/prompt-cache-consistency — sibling concept; the Databricks implicit-caching model still rewards prompt-cache-consistent prompt structures, but the caching engine itself is the enforcer of byte-exact prefix matching rather than the application.

Patterns extracted

New wiki page:

  • patterns/implicit-prompt-cache-as-platform-default — the rollout pattern: ship caching as an always-on, zero-configuration substrate property of the serving platform; extend it once, every layered product inherits the throughput / latency / cost win without integration work. Contrasts with the explicit-API pattern where every consumer integrates caching via SDK calls.

Extended (cross-link added):

  • patterns/prompt-cache-aware-static-dynamic-ordering — sibling pattern. With implicit caching, customers no longer have to opt into caching, but prompt structure still matters: byte-exact prefix matches are still the cache hit unit. Static-prefix-first ordering remains the highest-leverage prompt-engineering move even on a platform that caches by default.
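
A short sketch shows why static-prefix-first ordering matters even when caching is implicit. It compares the shared prefix of two prompts under both orderings. Character counts are used as a rough proxy for tokens, and the prompt text and function names are hypothetical; a real cache would match on tokens.

```python
import os

# Hypothetical shared system prompt; in practice this could be thousands of tokens.
SYSTEM_PROMPT = "You are a contracts analyst. Follow the review policy below.\n"


def static_first(document: str) -> str:
    """Stable content first, per-request content last."""
    return SYSTEM_PROMPT + "Document:\n" + document


def dynamic_first(document: str) -> str:
    """Per-request content first, which breaks the shared prefix early."""
    return "Document:\n" + document + "\n" + SYSTEM_PROMPT


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the longest common prefix of two prompts, in characters."""
    return len(os.path.commonprefix([a, b]))


good = shared_prefix_chars(static_first("Lease A"), static_first("NDA B"))
bad = shared_prefix_chars(dynamic_first("Lease A"), dynamic_first("NDA B"))
print(good > bad)  # True: the static-first prompts share the whole system prompt
```

With static-first ordering the two prompts share the entire system prompt plus the "Document:" label. With dynamic-first ordering they diverge as soon as the document text begins, so a prefix-matching cache has almost nothing to reuse.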

Caveats

  • Tier-3 source, short post (~300 words of body content). Marketing framing is heavy in the introduction, but the architectural facts (implicit + volatile + isolated, 30% / 2.5× / 3×, OSS-model catalog, three workload modes) are unambiguous and load-bearing.
  • No disclosure of internal mechanism: cache-key derivation (block-aligned vs full-prefix), eviction policy, cache size per replica, TTL, cross-replica sharing, billing differential — all unspecified. The numbers come from a single named pipeline (GPT-OSS batch inference); cache-hit distribution across model / workload variants is not given.
  • Headline numbers do not come from a controlled A/B comparison: the post says "we rolled out prompt caching to our GPT-OSS models first and immediately saw measurable gains in one of the large-scale production batch-inference pipelines", i.e. a before/after observation on one production pipeline. Reasonable signal, not causal proof.
  • No comparison to explicit-caching providers' baselines (Anthropic, OpenAI, Google) for hit-ratio / throughput / latency. The architectural-positioning statement ("Databricks already provides built-in prompt caching for proprietary models (GPT, Gemini, Claude)") is a feature-parity claim, not a quantitative comparison.

Source
