Prompt-cache consistency

Definition

Prompt-cache consistency is the design constraint of keeping the prefix of a prompt stable across requests — even when parts of it must be dynamic — to preserve prompt-cache hits at the model provider. Cached prefixes let the provider skip the prefill computation for the shared portion, trading a small loss in per-request tailoring for a large reduction in cost and latency.

The mechanism prompt caches rely on

Most LLM providers cache the KV-tensors produced by the transformer for a given prompt prefix. A subsequent request whose prompt shares a byte-exact prefix with a cached one skips the prefill for that shared portion and starts generation directly from the cached state. Cache hits are measured at the byte level — a single mutation anywhere in the prefix invalidates the rest.
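A minimal sketch of the byte-exact matching this implies (illustrative prompts, not a real provider API): a single mutated byte early in the prefix discards everything after it.

```python
def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the longest common prefix, measured in bytes —
    the portion a byte-exact prompt cache could reuse."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

cached  = b"You are v0, an expert coding assistant.\n<sdk-docs v=4.2>...</sdk-docs>\nUser: "
same    = b"You are v0, an expert coding assistant.\n<sdk-docs v=4.2>...</sdk-docs>\nUser: hi"
mutated = b"You are v0, an expert coding assistant!\n<sdk-docs v=4.2>...</sdk-docs>\nUser: hi"

print(shared_prefix_len(cached, same))     # the full cached prefix is reusable
print(shared_prefix_len(cached, mutated))  # only the bytes before the '!' are reusable
```

Note that the mutation near the start of `mutated` costs the entire SDK-docs block downstream of it, even though those bytes are unchanged.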

Canonical Vercel framing

"We keep this injection consistent to maximize prompt-cache hits and keep token usage low."

(Source: sources/2026-01-08-vercel-how-we-made-v0-an-effective-coding-agent)

v0's dynamic system prompt is "dynamic between intent classes, stable within an intent class." Every AI-SDK-intent request gets the same version-pinned injection; every frontend-framework-intent request gets a different but equally-stable injection. The cache boundary is the intent class.
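The class-keyed structure can be sketched as follows (injection contents and class names are hypothetical, not v0's actual prompts): the injection depends only on the intent class, never on per-request data, so every request in a class shares a byte-identical prefix.

```python
# Hypothetical, version-pinned injection per intent class — "dynamic between
# intent classes, stable within an intent class".
INJECTIONS = {
    "ai-sdk":             "<docs>AI SDK v4.2 usage notes...</docs>",
    "frontend-framework": "<docs>Next.js App Router conventions...</docs>",
    "integration":        "<docs>Integration setup guides...</docs>",
}

SYSTEM = "You are a coding agent."  # stable across all classes

def build_prompt(intent: str, user_message: str) -> str:
    # Stable prefix first (system + class injection); the only
    # per-request bytes are the user message at the end.
    return f"{SYSTEM}\n{INJECTIONS[intent]}\n\nUser: {user_message}"

a = build_prompt("ai-sdk", "Add streaming to my chat route")
b = build_prompt("ai-sdk", "Why is my completion slow?")
# a and b share everything before "User: " — one cache slot per class.
```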

The tradeoff

A fully dynamic prompt (unique per request) optimises for tailoring at the cost of every request paying the full prefill latency. A fully static prompt caches perfectly but can't adapt to the request. The consistency-within-a-class design splits the difference: one cache slot per class (cheap to populate once) + class-appropriate tailoring.
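A back-of-envelope model of that spectrum (illustrative numbers, not from the source): if each class pays one cold prefill and every later request in the class hits, the hit rate collapses only when the class count approaches the request count.

```python
def warm_hit_rate(requests: int, classes: int) -> float:
    """Fraction of requests that reuse a cached prefix, assuming one
    cold miss per class and perfect reuse afterwards."""
    misses = min(classes, requests)
    return (requests - misses) / requests

print(warm_hit_rate(10_000, 1))       # fully static prompt: near-total reuse
print(warm_hit_rate(10_000, 5))       # a handful of intent classes: still near-total
print(warm_hit_rate(10_000, 10_000))  # unique prompt per request: no reuse at all
```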

Design heuristics for cache-friendly dynamic prompts

  1. Partition the dynamic space into a small number of coarse classes. More classes = more cache slots required = lower hit rate per slot. Vercel's classification is by intent (AI SDK, frontend framework, integration) — a handful of classes, not hundreds.

  2. Put the stable content first. A cache hit covers the prefix only; append-only changes downstream preserve the upstream cache. System-prompt-first, then dynamic-injection-second, then user-message-last maximises prefix reuse across user messages.

  3. Normalise dynamic content inside a class. If the injection is a templated version-pinned SDK block, pin the template (byte-exact) within a release of the SDK. Don't embed timestamps, request IDs, or per-user data in the cacheable portion.

  4. Version the cache key out of band. When you need to invalidate (library release, prompt rewrite), bump a build-version string at the start — forcing a cache miss is a one-time cost; everyone after that hits again.
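The four heuristics can be tied together in one sketch (all names and formats here are hypothetical): coarse classes, stable-first ordering, a normalised in-class block, and an out-of-band build version that is bumped to invalidate everything at once.

```python
BUILD_VERSION = "2026-01-08.1"  # heuristic 4: bump this string to force a one-time miss
SDK_BLOCK = "<sdk-docs version=4.2>...</sdk-docs>"  # heuristic 3: pinned, no timestamps or IDs

def assemble_prompt(intent_class: str, user_message: str) -> str:
    # Heuristic 2: stable content first; the only per-request bytes come last.
    return "\n".join([
        f"[build:{BUILD_VERSION}]",
        "You are a coding agent.",
        f"[class:{intent_class}]",  # heuristic 1: one of a handful of coarse classes
        SDK_BLOCK,
        f"User: {user_message}",
    ])

p1 = assemble_prompt("ai-sdk", "hello")
p2 = assemble_prompt("ai-sdk", "help me debug")
# p1 and p2 are byte-identical up to the final "User:" line, so the whole
# stable prefix stays cacheable across user messages within the class.
```

Bumping `BUILD_VERSION` changes the very first bytes, so every class misses exactly once and then repopulates.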

Failure modes

  • Per-request tokens in the prefix (user ID, timestamp, request ID) — total cache bust; every request prefills the full prompt.
  • Unstable whitespace / JSON field ordering — byte-level mismatches even when the content is logically the same.
  • Too-fine-grained dynamic classes — a class per library version fragments the cache; a class per intent with in-class version-pinning hits better.
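The JSON-ordering failure mode above can be guarded against with canonical serialisation — a sketch using Python's standard `json` module: logically identical objects produce different bytes unless key order and separators are pinned.

```python
import json

a = {"framework": "next", "sdk": "4.2"}
b = {"sdk": "4.2", "framework": "next"}  # same content, different insertion order

# Naive serialisation preserves insertion order -> byte-level mismatch, cache bust.
naive_a, naive_b = json.dumps(a), json.dumps(b)

def canonical(obj) -> str:
    # Sorted keys + fixed separators give exactly one byte
    # representation per logical value.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

assert naive_a != naive_b
assert canonical(a) == canonical(b)  # identical bytes, cache preserved
```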

Distinct from but interacts with concepts/context-engineering — context engineering asks what to put in the prompt; prompt-cache consistency asks how to order and stabilise it so the cache survives.
