Prompt-cache consistency

Definition

Prompt-cache consistency is the design constraint of keeping the prefix of a prompt stable across requests — even when parts of it must be dynamic — to preserve prompt-cache hits at the model provider. Cached prefixes let the provider skip the prefill computation for the shared portion, trading a small loss in per-request tailoring for a large reduction in cost and latency.

The mechanism prompt caches rely on

Most LLM providers cache the KV-tensors produced by the transformer for a given prompt prefix. A subsequent request whose prompt shares a byte-exact prefix with a cached one skips the prefill for that shared portion and starts generation directly from the cached state. Cache hits are measured at the byte level — a single mutation anywhere in the prefix invalidates the rest.
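A minimal sketch of the byte-exact matching this implies (illustrative prompts, not a real provider API): a single mutated byte early in the prefix discards everything after it.

```python
def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the longest common prefix, measured in bytes —
    the portion a byte-exact prompt cache could reuse."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

cached  = b"You are v0, an expert coding assistant.\n<sdk-docs v=4.2>...</sdk-docs>\nUser: "
same    = b"You are v0, an expert coding assistant.\n<sdk-docs v=4.2>...</sdk-docs>\nUser: hi"
mutated = b"You are v0, an expert coding assistant!\n<sdk-docs v=4.2>...</sdk-docs>\nUser: hi"

print(shared_prefix_len(cached, same))     # the full cached prefix is reusable
print(shared_prefix_len(cached, mutated))  # only the bytes before the '!' are reusable
```

Note that the mutation near the start of `mutated` costs the entire SDK-docs block downstream of it, even though those bytes are unchanged.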

Canonical Vercel framing

"We keep this injection consistent to maximize prompt-cache hits and keep token usage low."

(Source: sources/2026-01-08-vercel-how-we-made-v0-an-effective-coding-agent)

v0's dynamic system prompt is "dynamic between intent classes, stable within an intent class." Every AI-SDK-intent request gets the same version-pinned injection; every frontend-framework-intent request gets a different but equally-stable injection. The cache boundary is the intent class.
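The class-keyed structure can be sketched as follows (injection contents and class names are hypothetical, not v0's actual prompts): the injection depends only on the intent class, never on per-request data, so every request in a class shares a byte-identical prefix.

```python
# Hypothetical, version-pinned injection per intent class — "dynamic between
# intent classes, stable within an intent class".
INJECTIONS = {
    "ai-sdk":             "<docs>AI SDK v4.2 usage notes...</docs>",
    "frontend-framework": "<docs>Next.js App Router conventions...</docs>",
    "integration":        "<docs>Integration setup guides...</docs>",
}

SYSTEM = "You are a coding agent."  # stable across all classes

def build_prompt(intent: str, user_message: str) -> str:
    # Stable prefix first (system + class injection); the only
    # per-request bytes are the user message at the end.
    return f"{SYSTEM}\n{INJECTIONS[intent]}\n\nUser: {user_message}"

a = build_prompt("ai-sdk", "Add streaming to my chat route")
b = build_prompt("ai-sdk", "Why is my completion slow?")
# a and b share everything before "User: " — one cache slot per class.
```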

The tradeoff

A fully dynamic prompt (unique per request) optimises for tailoring at the cost of every request paying the full prefill latency. A fully static prompt caches perfectly but can't adapt to the request. The consistency-within-a-class design splits the difference: one cache slot per class (cheap to populate once) + class-appropriate tailoring.
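A back-of-envelope model of that spectrum (illustrative numbers, not from the source): if each class pays one cold prefill and every later request in the class hits, the hit rate collapses only when the class count approaches the request count.

```python
def warm_hit_rate(requests: int, classes: int) -> float:
    """Fraction of requests that reuse a cached prefix, assuming one
    cold miss per class and perfect reuse afterwards."""
    misses = min(classes, requests)
    return (requests - misses) / requests

print(warm_hit_rate(10_000, 1))       # fully static prompt: near-total reuse
print(warm_hit_rate(10_000, 5))       # a handful of intent classes: still near-total
print(warm_hit_rate(10_000, 10_000))  # unique prompt per request: no reuse at all
```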

Design heuristics for cache-friendly dynamic prompts

  1. Partition the dynamic space into a small number of coarse classes. More classes = more cache slots required = lower hit rate per slot. Vercel's classification is by intent (AI SDK, frontend framework, integration) — a handful of classes, not hundreds.

  2. Put the stable content first. A cache hit covers the prefix only; append-only changes downstream preserve the upstream cache. System-prompt-first, then dynamic-injection-second, then user-message-last maximises prefix reuse across user messages.

  3. Normalise dynamic content inside a class. If the injection is a templated version-pinned SDK block, pin the template (byte-exact) within a release of the SDK. Don't embed timestamps, request IDs, or per-user data in the cacheable portion.

  4. Version the cache key out of band. When you need to invalidate (library release, prompt rewrite), bump a build-version string at the start — forcing a cache miss is a one-time cost; everyone after that hits again.
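The four heuristics can be tied together in one sketch (all names and formats here are hypothetical): coarse classes, stable-first ordering, a normalised in-class block, and an out-of-band build version that is bumped to invalidate everything at once.

```python
BUILD_VERSION = "2026-01-08.1"  # heuristic 4: bump this string to force a one-time miss
SDK_BLOCK = "<sdk-docs version=4.2>...</sdk-docs>"  # heuristic 3: pinned, no timestamps or IDs

def assemble_prompt(intent_class: str, user_message: str) -> str:
    # Heuristic 2: stable content first; the only per-request bytes come last.
    return "\n".join([
        f"[build:{BUILD_VERSION}]",
        "You are a coding agent.",
        f"[class:{intent_class}]",  # heuristic 1: one of a handful of coarse classes
        SDK_BLOCK,
        f"User: {user_message}",
    ])

p1 = assemble_prompt("ai-sdk", "hello")
p2 = assemble_prompt("ai-sdk", "help me debug")
# p1 and p2 are byte-identical up to the final "User:" line, so the whole
# stable prefix stays cacheable across user messages within the class.
```

Bumping `BUILD_VERSION` changes the very first bytes, so every class misses exactly once and then repopulates.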

Failure modes

  • Per-request tokens in the prefix (user ID, timestamp, request ID) — total cache bust; every request prefills the full prompt.
  • Unstable whitespace / JSON field ordering — byte-level mismatches even when the content is logically the same.
  • Too-fine-grained dynamic classes — a class per library version fragments the cache; a class per intent with in-class version-pinning hits better.
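The JSON-ordering failure mode above can be guarded against with canonical serialisation — a sketch using Python's standard `json` module: logically identical objects produce different bytes unless key order and separators are pinned.

```python
import json

a = {"framework": "next", "sdk": "4.2"}
b = {"sdk": "4.2", "framework": "next"}  # same content, different insertion order

# Naive serialisation preserves insertion order -> byte-level mismatch, cache bust.
naive_a, naive_b = json.dumps(a), json.dumps(b)

def canonical(obj) -> str:
    # Sorted keys + fixed separators give exactly one byte
    # representation per logical value.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

assert naive_a != naive_b
assert canonical(a) == canonical(b)  # identical bytes, cache preserved
```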

Distinct from but interacts with concepts/context-engineering — context engineering asks what to put in the prompt; prompt-cache consistency asks how to order and stabilise it so the cache survives.
