
Prompt-cache-aware static/dynamic ordering

Pattern

For batch LLM pipelines where a large static context (system prompt, examples, reference material) precedes a smaller dynamic payload per request, order the prompt static-first, dynamic-last so the provider's prompt cache hits on the entire static prefix for the full batch. Prefill cost is paid once on the first request; subsequent requests in the batch skip it.

Forces

  • Per-request prefill dominates prompt latency for large prompts. A cached prefix can skip prefill entirely and start generating from the cached KV state — latency wins can be dramatic on 40K-token prefixes.
  • Prompt caches key on byte-exact prefix matches. A single mutation anywhere in the prefix invalidates the cache from that point forward.
  • Most providers cache for minutes, not hours. Batch the work in time to keep the cache warm.
  • Bulk code-migration workloads fit this shape naturally — the transformation rules are static, the files being transformed are dynamic.

Mechanism

  1. Partition the prompt into two contiguous regions:
       • Static prefix: system prompt, role, task description, reference material, examples. Byte-stable across every call in a batch.
       • Dynamic suffix: per-request payload — in Zalando's case, <file>{file_content}</file>.
  2. Emit the static prefix first, the dynamic suffix second.
  3. Batch calls that share a prefix in time. Cache lifetimes are minutes; spreading a batch across hours recomputes the prefill on each cache miss.
  4. Pin every token in the static region. No timestamps, request IDs, filename preambles, or any other per-request value.
  5. Version the prefix out of band (build-version string, git SHA in a comment). When the prompt needs to change, the version bump is a one-time cache miss; every call after that hits the new cache.
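The steps above can be sketched as a minimal prompt builder. This is illustrative only — the constant names and placeholder strings are assumptions, not Zalando's actual code:

```python
PROMPT_VERSION = "v7"  # version bump = a deliberate one-time cache miss

# Placeholder static material (assumption: in the real pipeline this comes
# from the component group's rules, interface details, and examples).
TRANSFORMATION_CONTEXT = "...transformation rules...\n"
EXAMPLES = "...before/after examples...\n"

# Static prefix: built once, byte-identical for every call in the batch.
STATIC_PREFIX = (
    f"<!-- prompt-version: {PROMPT_VERSION} -->\n"
    "## Transformation prompt (static)\n"
    + TRANSFORMATION_CONTEXT
    + EXAMPLES
)

def build_prompt(file_content: str) -> str:
    """Static prefix first, per-file dynamic suffix last."""
    return (
        STATIC_PREFIX
        + "## Content to be transformed\n"
        + f"<file>\n{file_content}\n</file>"
    )
```

Because `STATIC_PREFIX` is assembled once and only concatenated, every prompt in the batch shares the same byte-exact prefix, which is exactly what the provider cache keys on.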

Canonical Zalando shape

From sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries:

// Static (cacheable across every file in the group)
## Transformation prompt (static)
{transformation_context}
{For each component in group}
  • {interface_details}
  • {mapping_instruction}
  • {examples}

// Dynamic (per file)
## Content to be transformed
<file>
 {file_content}
</file>

For a component group with 30 files to transform, the static prefix is prefill-cached once on the first call and cache-hit on the remaining 29, "ensuring caching can be leveraged while transforming different files," as the source puts it.
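A toy model makes the batch economics concrete. Assumptions here: the cache keys on a byte-exact prefix match, and character counts stand in for token counts — real providers differ in cache granularity and pricing:

```python
class PrefixCacheModel:
    """Toy stand-in for provider-side prefix caching."""
    def __init__(self) -> None:
        self.cached_prefixes = set()
        self.prefill_units = 0  # units actually prefilled
        self.hit_units = 0      # units served from the cache

    def send(self, static_prefix: str, dynamic_suffix: str) -> None:
        if static_prefix in self.cached_prefixes:
            self.hit_units += len(static_prefix)
        else:
            self.prefill_units += len(static_prefix)
            self.cached_prefixes.add(static_prefix)
        self.prefill_units += len(dynamic_suffix)  # suffix is always prefilled

prefix = "S" * 45_000  # stand-in for a ~45K-token static prefix
cache = PrefixCacheModel()
for i in range(30):    # 30 files in one component group
    cache.send(prefix, f"<file>file_{i}</file>")

# The prefix is prefilled on the first call only; the other 29 hit the cache.
```

Under this model the batch prefills the 45K-unit prefix once and serves it from cache 29 times — the per-file cost collapses to the dynamic suffix.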

Grouping-as-cache-warming

Zalando's grouped-component batched-migration sub-pattern naturally keeps the cache warm: all files in one component group share one cacheable prefix (the group's interface + mapping + examples), so processing them contiguously stays in-cache. Cross-group context switches bust the cache; if cross-group work is necessary, it happens at the end.
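The scheduling discipline amounts to sorting the work queue by group before processing, so each group's prefix is warmed once and reused contiguously. A minimal sketch — field names are assumptions, not Zalando's schema:

```python
from itertools import groupby

work_queue = [
    {"path": "a.tsx", "group": "Button"},
    {"path": "b.tsx", "group": "Modal"},
    {"path": "c.tsx", "group": "Button"},
]

# One sort, then group-by-group iteration: one cacheable prefix per run,
# and no cross-group context switch until a group is fully drained.
work_queue.sort(key=lambda item: item["group"])
ordered = [item["path"] for item in work_queue]

for group, members in groupby(work_queue, key=lambda item: item["group"]):
    for item in members:
        pass  # transform(item) under the group's shared static prefix
```

Python's sort is stable, so within each group the original file order is preserved while all of a group's files become adjacent.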

Contrast with sibling pattern

patterns/dynamic-knowledge-injection-prompt (Vercel v0) achieves the same cache-hit discipline by partitioning the dynamic space into coarse intent classes — each class has a stable injection. This pattern achieves it by extracting a genuinely-static section (nothing dynamic in the prefix) and putting it first. Same goal, different shape: one for agents where every request has some per-request dynamic content, one for batch pipelines where the only dynamic content is the payload itself.

Consequences

Positive:

  • Cost savings scale with batch size. For a 30-file group with a 45K-token static prefix, prefill on the prefix is paid once instead of 30 times — a saving of roughly (N-1)/N of the prefix cost, modulo output tokens.
  • Latency reduction. Prefill on 45K tokens takes seconds; cached hits take tens of milliseconds.
  • Cache-friendly development. Once the shape is set, prompt tweaks during development are isolated to specific sections — versioning the prefix lets you invalidate one slot without nuking others.

Negative:

  • Cache is provider-managed, not client-visible. Zalando can't inspect cache-hit rate directly; they rely on per-request billing to infer it.
  • Cache lifetimes are short. Minutes, not hours. Long-running batches may miss across pauses.
  • Prompt-development cache churn. Every prompt change during development is a miss; temperature=0 plus regression tests keep development disciplined.
  • Byte-level fragility. Template interpolation has to produce byte-exact output every time — a trailing newline that appears sometimes is a cache bust.
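The byte-level fragility point suggests a cheap pipeline-start guard: render the static prefix twice and compare digests, failing fast if the template is not byte-deterministic (a stray timestamp or an intermittently emitted trailing newline shows up here). Illustrative sketch — `render_static_prefix` is a stand-in for the real template renderer:

```python
import hashlib

def render_static_prefix() -> str:
    # Stand-in for the real template rendering; must be byte-deterministic.
    return "## Transformation prompt (static)\n...rules and examples...\n"

def prefix_digest() -> str:
    # SHA-256 over the exact bytes the provider's cache will key on.
    return hashlib.sha256(render_static_prefix().encode("utf-8")).hexdigest()

# Two renders must be byte-identical, or every request in the batch
# will miss the cache from the first divergent byte onward.
assert prefix_digest() == prefix_digest()
```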

Seen in

sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries
