# Prompt-cache consistency
## Definition
Prompt-cache consistency is the design constraint of keeping the prefix of a prompt stable across requests, even when parts of it must be dynamic, to preserve prompt-cache hits at the model provider. Cached prefixes let the provider skip the prefill computation for the shared portion, trading a small loss in per-request tailoring for a large reduction in cost and latency.
## The mechanism prompt caches rely on
Most LLM providers cache the KV tensors (attention keys and values) produced by the transformer for a given prompt prefix. A subsequent request whose prompt shares a byte-exact prefix with a cached one skips the prefill for that shared portion and starts generation directly from the cached state. Cache matching is byte-level: a single mutation invalidates everything from that point onward, so only the portion before the first differing byte is reused.
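A toy model makes the byte-exact rule concrete. The sketch below (illustrative only; provider internals differ, and the prompt strings are made up) stores one previously seen prompt and measures how much of a new request can be reused:

```python
# Toy model of byte-exact prefix caching: a new request "hits" only for
# however many leading bytes it shares with the cached prompt.

def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the longest common prefix of two byte strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

cached  = b"SYSTEM: You are v0.\nINJECTION: ai-sdk v5 docs...\nUSER: add auth"
request = b"SYSTEM: You are v0.\nINJECTION: ai-sdk v5 docs...\nUSER: fix css"

hit = shared_prefix_len(cached, request)
print(hit, "bytes reused; only", len(request) - hit, "bytes re-prefilled")
```

Because the two requests diverge only in the user message, almost the entire prompt is reused; move any per-request byte earlier and the reusable span shrinks to everything before it.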
## Canonical Vercel framing
"We keep this injection consistent to maximize prompt-cache hits and keep token usage low."
(Source: sources/2026-01-08-vercel-how-we-made-v0-an-effective-coding-agent)
v0's dynamic system prompt is "dynamic between intent classes, stable within an intent class." Every AI-SDK-intent request gets the same version-pinned injection; every frontend-framework-intent request gets a different but equally-stable injection. The cache boundary is the intent class.
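The "dynamic between intent classes, stable within an intent class" pattern can be sketched as a prompt builder. The class names and injection text below are hypothetical stand-ins, not v0's actual prompts:

```python
# Per-class injections: byte-identical for every request in the same class.
INJECTIONS = {
    "ai-sdk":   "## AI SDK guidance\n<version-pinned SDK docs here>",
    "frontend": "## Frontend framework guidance\n<pinned framework docs here>",
}

SYSTEM = "You are a coding agent."

def build_prompt(intent: str, user_message: str) -> str:
    # Stable prefix (system prompt + class injection) first;
    # only the user message varies within a class.
    return f"{SYSTEM}\n\n{INJECTIONS[intent]}\n\n{user_message}"

a = build_prompt("ai-sdk", "add streaming")
b = build_prompt("ai-sdk", "fix the tool call")
# a and b share the entire system+injection prefix, so the cache
# boundary falls at the intent class, not the individual request.
```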
## The tradeoff
A fully dynamic prompt (unique per request) optimises for tailoring at the cost of every request paying the full prefill latency. A fully static prompt caches perfectly but can't adapt to the request. The consistency-within-a-class design splits the difference: one cache slot per class (cheap to populate once) + class-appropriate tailoring.
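Back-of-envelope arithmetic shows the size of the win. The token counts, price, and cache discount below are assumptions for illustration, not any provider's real pricing:

```python
PREFIX_TOKENS = 8_000        # stable system prompt + class injection (assumed)
TAIL_TOKENS = 200            # per-request user message (assumed)
PRICE_PER_TOKEN = 3.0e-6     # $ per uncached input token (assumed)
CACHED_DISCOUNT = 0.10       # cached tokens billed at 10% of full price (assumed)

def input_cost(cache_hit: bool) -> float:
    """Input-token cost of one request, with or without a prefix cache hit."""
    prefix_rate = PRICE_PER_TOKEN * (CACHED_DISCOUNT if cache_hit else 1.0)
    return PREFIX_TOKENS * prefix_rate + TAIL_TOKENS * PRICE_PER_TOKEN

miss, hit = input_cost(False), input_cost(True)
print(f"miss: ${miss:.6f}  hit: ${hit:.6f}  saving: {1 - hit / miss:.0%}")
```

Under these assumed numbers, a hit costs roughly an eighth of a miss, and the first request per class is the only one that pays full freight.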
## Design heuristics for cache-friendly dynamic prompts
- Partition the dynamic space into a small number of coarse classes. More classes means more cache slots required, and a lower hit rate per slot. Vercel's classification is by intent (AI SDK, frontend framework, integration): a handful of classes, not hundreds.
- Put the stable content first. A cache hit covers the prefix only; append-only changes downstream preserve the upstream cache. System prompt first, dynamic injection second, user message last maximises prefix reuse across user messages.
- Normalise dynamic content inside a class. If the injection is a templated, version-pinned SDK block, pin the template byte-exactly within a release of the SDK. Don't embed timestamps, request IDs, or per-user data in the cacheable portion.
- Version the cache key out of band. When you need to invalidate (library release, prompt rewrite), bump a build-version string at the start of the prompt. Forcing a cache miss is a one-time cost; every request after it hits again.
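The four heuristics combine naturally in one prompt builder. This is a sketch; the build-version scheme, field names, and injection format are assumptions, not a provider API or v0's implementation:

```python
import json

BUILD_VERSION = "2026-01-08.1"   # heuristic 4: bump this to force a cache miss

def render_injection(intent: str, pinned: dict) -> str:
    # Heuristic 3: canonical serialisation (sorted keys, fixed separators),
    # no timestamps or request IDs, so the injection is byte-identical
    # for every request in the class.
    body = json.dumps(pinned, sort_keys=True, separators=(",", ":"))
    return f"[{intent}] {body}"

def build_prompt(intent: str, pinned: dict, user_message: str) -> str:
    # Heuristics 1 + 2: a coarse intent class selects the injection, and
    # stable content comes first, user message last.
    return "\n".join([
        f"build: {BUILD_VERSION}",
        "SYSTEM: You are a coding agent.",
        render_injection(intent, pinned),
        f"USER: {user_message}",
    ])
```

Two requests in the same class then differ only after `USER:`, so everything above that line stays cacheable until `BUILD_VERSION` is bumped.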
## Failure modes
- Per-request tokens in the prefix (user ID, timestamp, request ID) — total cache bust; every request prefills the full prompt.
- Unstable whitespace or JSON field ordering: byte-level mismatches even when the content is logically the same.
- Too-fine-grained dynamic classes — a class per library version fragments the cache; a class per intent with in-class version-pinning hits better.
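The JSON field-ordering failure mode is easy to reproduce and to fix. A minimal sketch, assuming the metadata keys shown are hypothetical:

```python
import json

meta_a = {"framework": "next", "sdk": "v5"}
meta_b = {"sdk": "v5", "framework": "next"}   # same content, different order

# Naive serialisation preserves insertion order: different bytes, cache bust.
naive_a, naive_b = json.dumps(meta_a), json.dumps(meta_b)

# Canonical serialisation (sorted keys, fixed separators) is byte-stable.
def canon(d: dict) -> str:
    return json.dumps(d, sort_keys=True, separators=(",", ":"))

print(naive_a == naive_b)              # False: logically equal, bytes differ
print(canon(meta_a) == canon(meta_b))  # True: prefix stays byte-identical
```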
## Related to
Distinct from but interacts with concepts/context-engineering — context engineering asks what to put in the prompt; prompt-cache consistency asks how to order and stabilise it so the cache survives.
## Seen in
- sources/2026-01-08-vercel-how-we-made-v0-an-effective-coding-agent — canonical first-party framing; v0's dynamic-prompt injection is kept consistent specifically to preserve prompt-cache hits.