Skip to content

CONCEPT Cited by 1 source

Implicit prompt caching

Definition

Implicit prompt caching is the design choice that an LLM-serving platform performs KV-cache reuse across requests automatically, without requiring the caller to opt in, configure cache keys, set TTLs, or change SDK calls. The platform decides what to cache, when, and for how long; the caller's request is byte-for-byte identical to a request on a non-caching platform.

Contrast: explicit prompt caching, where the API exposes caching as a first-class primitive the caller manipulates — Anthropic's cache_control blocks within message content, OpenAI's prompt_cache_key request parameter, Google's context-cache create / TTL APIs.

Canonical wiki instance — Databricks FMAPI

Databricks FMAPI Prompt Caching (GA on open-weights models 2026-05-22) is the canonical wiki instance:

"The caching is implicit: customers do not need to configure anything, our system has built to automatically run the prompt caching and reuse to improve throughput." (Source: sources/2026-05-22-databricks-accelerating-llm-inference-with-prompt-caching-for-open-source-models)

The customer-facing API is unchanged across the rollout — every existing FMAPI consumer (including Agent Bricks, Genie, AI Functions) inherited the caching benefit at no integration cost.

Forces that favour implicit over explicit

  1. Many caching wins go uncaptured under explicit caching because the prompt engineer has to know the API exists, learn the semantics, and instrument every call. Implicit caching captures the wins the caller would have missed.
  2. Cache-aware prompt structure is a server-side optimisation responsibility, not a caller-side one. The server has more information (live hit rates across the fleet, replica residency, eviction pressure) than any individual caller. Implicit caching moves the optimisation surface to where the information lives.
  3. Multi-tenant platforms benefit from default-on capabilities because the marginal customer doesn't engage with a knob. Default-on caching means the aggregate cache hit rate across the fleet is high even if no individual customer optimised for it.
  4. Caching is a quality lever, not just a cost lever (the Databricks framing) — long, optimised system prompts only pay back economically with caching. Implicit caching means quality gains are also default-on, not gated on opt-in.

Forces that favour explicit over implicit

  1. Caller controls cache invalidation timing. With explicit APIs, the caller can pin a specific prefix as cacheable and reason about cost / behaviour deterministically. Implicit caching trades determinism for hit-rate-by-default.
  2. Caller can structure prompts to match cache boundaries. Anthropic's cache_control blocks let the caller say "this block is the cacheable prefix; the rest is dynamic." This is especially useful when the prefix and suffix have unusual shapes that the server can't infer.
  3. Pricing transparency. Explicit caches can be billed at a discount (Cloudflare's "discounted cached tokens" under x-session-affinity, Anthropic's cache-write/cache-read pricing). Implicit caches typically don't expose a billing differential, so the cost win is invisible to the caller.

Hit-rate dependence on prompt structure

Even under implicit caching, prompts that share byte-exact prefixes hit the cache more often than prompts that don't. Implicit caching shifts the responsibility for [[concepts/prompt-cache- consistency|prompt-cache consistency]] to the server (the server defines the cache-key boundary, typically token-block-aligned prefix matching), but the caller's prompt structure still determines whether prefixes can match. The static- prefix-first ordering pattern remains the highest-leverage prompt-engineering move on an implicit-caching platform — it just no longer requires opting into a cache API.

Relationship to other caching primitives

  • concepts/kv-cache — the underlying transformer-inference primitive that prompt caching reuses. Implicit caching is a policy choice about when to expose KV-cache reuse to whom.
  • concepts/session-affinity-prompt-caching — explicit caching via a client-supplied header (x-session-affinity). Cloudflare's design hands the caller a routing-affinity hint — callable choose to send it; Databricks' design ships the hint inside the platform.
  • concepts/prompt-cache-consistency — the prompt-engineering discipline that maximises hit rates. Survives the implicit / explicit choice unchanged; the caller's prompt structure still matters.

Seen in

Last updated · 542 distilled / 1,571 read