Skip to content

PATTERN Cited by 1 source

Prompt layer ordering for cache hits

Intent

Assemble every LLM prompt from most stable to most volatile layers so that the longest possible byte-identical prefix stays unchanged between consecutive calls, maximising prefix-cache hit rates.

Layer ordering

  1. Static system prompt — identical across every run.
  2. Stable session context — organisation, user, timezone, skill instructions; stable for the duration of a session.
  3. Conversation history — grows over time, but earlier turns are immutable once recorded.
  4. Turn-dependent context — current iteration's tool results and reasoning state.

Because each layer changes less frequently than the one after it, the shared prefix between iteration N and iteration N+1 is maximised.

Provider integration

  • OpenAI / Gemini: Implicit prefix cache — the provider automatically detects and reuses matching prefixes; no client-side action needed beyond correct ordering.
  • Anthropic: Explicit cache_control markers placed at the system, stable-context, and last-history boundaries to opt those prefixes into the cache.

Economics

On most iterations in a long-running loop, only the freshest tokens (a tool result + the model's next reasoning step) need processing from scratch. The longer a task runs, the more caching pays off. In a 150-iteration loop, the compounding savings are substantial.

Seen in

Last updated · 542 distilled / 1,571 read