PATTERN Cited by 1 source
Implicit prompt cache as platform default¶
Pattern¶
Ship LLM-serving prompt caching as a default-on, zero-configuration substrate property of the serving platform — not as an opt-in request-API feature. Every request gets the KV-cache-reuse benefit automatically; every product layered on the platform inherits the win without integration work; the customer SDK surface is unchanged.
Forces¶
- Default-on captures wins customers wouldn't have configured for themselves. A meaningful fraction of LLM workloads have cache-friendly prefix structure (system prompts, tool catalogs, RAG contexts) but the prompt engineer never instrumented an explicit cache API. Default-on caching captures these wins automatically.
- Multi-tenant platforms benefit from substrate-level capabilities that don't require per-customer engagement. Default-on is the highest-leverage rollout shape for a feature that affects all customers similarly.
- Layered products inherit substrate properties: when caching lives at the FMAPI substrate, every higher-level service (Agent Bricks, Genie, AI Functions) inherits caching automatically. By contrast, feature-level integration would require each layer to thread caching through on its own.
- Quality-as-a-cost-lever: default-on caching makes long optimised system prompts economically viable, unlocking quality wins that would otherwise be gated on the prompt engineer knowing about caching.
- The safety envelope must hold by default. Default-on caching on multi-tenant infrastructure requires an isolation story strong enough to hold without per-customer configuration. See concepts/volatile-only-prompt-cache-isolation.
Mechanism¶
- Implement caching at the substrate layer, not the request API. Every request goes through the same cache-aware pipeline; the cache is checked on every prefill.
- Cache key on byte-exact prefix matches (typically token-block-aligned). The platform decides the cache-key boundary; the customer doesn't.
- Compose the safety envelope that holds without customer configuration. The Databricks shape: tenant-isolated + volatile-memory only + never persisted. This is the volatile-only-prompt-cache-isolation composition that lets default-on caching be shippable on multi-tenant infrastructure.
- Inherit downstream: every product layered on the substrate inherits the feature. No per-product integration cost.
- Document the feature as a property, not an API. Customers read about the feature in the platform docs; they do not call different SDK functions to use it.
- Commit to extending across the model catalog. As new models ship, caching is extended without per-model integration work.
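The keying and isolation steps above can be sketched as a toy in-memory prefix cache. This is a minimal illustration under stated assumptions, not the Databricks implementation: the names `VolatilePrefixCache`, `block_keys`, and `BLOCK_SIZE` are hypothetical, the block size of 4 tokens is chosen only to keep the example small, and a real serving stack would store KV tensors rather than bare keys.

```python
import hashlib
from typing import List, Set

BLOCK_SIZE = 4  # hypothetical block size in tokens, kept tiny for illustration


def block_keys(tenant_id: str, tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one chained hash per full token block, scoped to a tenant.

    Each key covers the tenant id plus every token up to the end of that
    block, so two requests share a key only if they share the exact prefix
    and belong to the same tenant.
    """
    keys = []
    running = hashlib.sha256(tenant_id.encode("utf-8"))
    for i in range(len(tokens) // block_size):
        block = tokens[i * block_size:(i + 1) * block_size]
        running.update(b"|" + ",".join(map(str, block)).encode("utf-8"))
        keys.append(running.copy().hexdigest())
    return keys


class VolatilePrefixCache:
    """Process-memory set only: nothing is persisted, contents vanish on restart."""

    def __init__(self) -> None:
        self._blocks: Set[str] = set()

    def lookup_and_fill(self, tenant_id: str, tokens: List[int]) -> int:
        """Return how many prefix tokens were already cached, then cache all full blocks."""
        keys = block_keys(tenant_id, tokens)
        hit_blocks = 0
        for key in keys:
            if key not in self._blocks:
                break  # prefix reuse stops at the first missing block
            hit_blocks += 1
        self._blocks.update(keys)
        return hit_blocks * BLOCK_SIZE


cache = VolatilePrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]  # stand-in for a tokenized system prompt
print(cache.lookup_and_fill("tenant-a", system_prompt + [9, 10, 11]))      # 0: cold cache
print(cache.lookup_and_fill("tenant-a", system_prompt + [20, 21, 22, 23]))  # 8: shared prefix reused
print(cache.lookup_and_fill("tenant-b", system_prompt + [9, 10, 11]))      # 0: other tenant, no sharing
```

The caller never passes a cache hint: the lookup runs on every request, and the tenant id is folded into the first hash so isolation holds without any customer configuration.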
Canonical Databricks shape¶
Databricks FMAPI Prompt Caching (GA 2026-05-22):
- Substrate: FMAPI serving infrastructure.
- Customer-facing API surface: unchanged from before the rollout.
- Workload tiers covered: batch inference, pay-per-token, provisioned-throughput.
- Layered products inheriting caching at no integration cost: Agent Bricks, Genie, AI Functions.
- Safety envelope: tenant-isolated + volatile-memory only + never persisted (the [[concepts/volatile-only-prompt-cache-isolation|volatile-only-prompt-cache-isolation]] composition).
- Disclosed numerical signature: 30% cache hit ratio → 2.5× per-replica throughput, 3× P50 latency reduction (GPT-OSS batch-inference rollout).
- Catalog extension: covers GPT-OSS 20B + 120B, Gemma 3 12B, Llama 3.1 8B (including PEFT-served fine-tuned variants), Llama 3.3 70B; commitment to extend to other models without per-model integration.
"The caching is implicit: customers do not need to configure anything, our system has built to automatically run the prompt caching and reuse to improve throughput. ... It also applies to any and all higher-level services powered by a foundation model, e.g., Agent Bricks, Genie, AI Functions." (Source: sources/2026-05-22-databricks-accelerating-llm-inference-with-prompt-caching-for-open-source-models)
Tradeoffs vs explicit-API caching¶
| Property | Default-on substrate | Explicit API |
|---|---|---|
| Hit-rate ceiling for an engaged customer | Lower (server picks cache boundaries) | Higher (caller pins boundaries) |
| Hit-rate floor for an unengaged customer | Higher (default capture) | Zero (must opt in) |
| Aggregate fleet hit rate | Generally higher | Skewed toward engaged customers |
| Cache invalidation predictability | Lower (server-controlled) | Higher (caller-controlled) |
| Pricing transparency | Typically opaque | Explicit cache-write / cache-read |
| Layered-product inheritance | Free | Per-product integration |
| Multi-tenant safety footprint | Must hold by default | Per-customer-configured |
The pattern is the right choice when the fleet-aggregate win matters more than the per-engaged-customer ceiling — i.e. when the platform owner wants to maximise the average customer's benefit, not the optimised customer's ceiling.
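The fleet-aggregate argument can be made concrete with a toy model. The sketch below assumes every customer sends equal traffic and that customers fall into just two groups, engaged and unengaged; all hit rates used in the example are hypothetical numbers chosen for illustration, not figures from the source.

```python
def fleet_hit_rate(engaged_fraction: float, engaged_rate: float, unengaged_rate: float) -> float:
    """Traffic-weighted hit rate for a fleet with two customer groups."""
    return engaged_fraction * engaged_rate + (1.0 - engaged_fraction) * unengaged_rate


def break_even_engaged_fraction(explicit_engaged: float,
                                default_engaged: float,
                                default_unengaged: float) -> float:
    """Engaged fraction at which explicit-API and default-on fleet hit rates are equal.

    Explicit API:  e * explicit_engaged            (unengaged customers get zero)
    Default-on:    e * default_engaged + (1 - e) * default_unengaged
    """
    denom = explicit_engaged - default_engaged + default_unengaged
    if denom <= 0:
        raise ValueError("default-on wins at every engaged fraction under these rates")
    return default_unengaged / denom


# Hypothetical rates: explicit callers reach 0.60, default-on gives 0.45 / 0.25.
explicit = fleet_hit_rate(0.10, engaged_rate=0.60, unengaged_rate=0.0)
default_on = fleet_hit_rate(0.10, engaged_rate=0.45, unengaged_rate=0.25)
print(round(explicit, 3), round(default_on, 3))
print(round(break_even_engaged_fraction(0.60, 0.45, 0.25), 3))
```

With these illustrative numbers and 10% of customers engaged, the default-on fleet rate is several times the explicit-API rate, and the explicit design only catches up once well over half the fleet engages.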
When to prefer this pattern¶
- The serving platform is multi-tenant with a large customer base, most of whom will not engage with caching APIs.
- The serving platform layers multiple products on top of a shared inference substrate; each product would otherwise need to integrate caching separately.
- The workload mix is prefix-heavy — system prompts, tool catalogs, RAG contexts, conversation histories — so even a modest hit ratio yields large per-hit prefill savings.
- A simple safety envelope (volatile-only + tenant-isolation) is acceptable, vs. a more sophisticated cross-replica / persistent caching design that would require per-customer threat-model engagement.
When not to prefer this pattern¶
- Customers need deterministic cache invalidation timing for cost or behaviour reasons. Explicit APIs offer this; implicit caching does not.
- Customers need to opt out of caching for compliance reasons on a per-request basis — implicit caching makes opt-out harder to express in the request API.
- The platform needs cross-replica or cross-region cache sharing for the workload to be viable. The volatile-only safety envelope precludes this; an explicit-API design with client-supplied affinity (e.g. x-session-affinity) is structurally better suited.
Relationship to other patterns¶
- patterns/prompt-cache-aware-static-dynamic-ordering — sibling pattern. With default-on substrate caching, customers no longer have to opt into caching — but byte-exact prefix matches still drive hit rate, so static-prefix-first prompt ordering remains the highest-leverage prompt-engineering move.
- concepts/implicit-prompt-caching — the design choice this pattern productionises.
- concepts/volatile-only-prompt-cache-isolation — the safety envelope this pattern's multi-tenant default-on shape depends on.
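Why static-prefix-first ordering still matters under default-on caching can be shown with a small comparison. This is a sketch under simplifying assumptions: whitespace-separated words stand in for tokens, and the helper `shared_prefix_len` and the prompt strings are hypothetical.

```python
from typing import List


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Static content reused across requests (system prompt, tool catalog).
static = "You are a support assistant . Tools : search , refund".split()
# Per-request dynamic content.
dynamic_1 = "user-123 asks about shipping".split()
dynamic_2 = "user-456 asks about returns".split()

# Static-first: both requests share the whole static block as a prefix.
static_first = shared_prefix_len(static + dynamic_1, static + dynamic_2)
# Dynamic-first: the very first token differs, so nothing is reusable.
dynamic_first = shared_prefix_len(dynamic_1 + static, dynamic_2 + static)

print(static_first, len(static))  # the full static block is a shared prefix
print(dynamic_first)              # 0
```

Because the cache matches exact prefixes, the same content in a different order changes the reusable span from the full static block to nothing.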
Seen in¶
- sources/2026-05-22-databricks-accelerating-llm-inference-with-prompt-caching-for-open-source-models — canonical wiki instance. Databricks ships FMAPI Prompt Caching as a default-on substrate capability; every higher-level service (Agent Bricks, Genie, AI Functions) inherits the benefit without integration work.
Related¶
- systems/databricks-fmapi-prompt-caching
- systems/databricks-foundation-model-api
- systems/databricks-genie
- systems/databricks-ai-functions
- concepts/implicit-prompt-caching
- concepts/volatile-only-prompt-cache-isolation
- concepts/kv-cache
- concepts/session-affinity-prompt-caching
- patterns/prompt-cache-aware-static-dynamic-ordering
- companies/databricks