Context Engineering
04·Context Engineering·updated 2026-04-19

Prompt caching

API-level exposure of the inference engine's KV cache: you pay once for the static prefix (tool defs, system prompt, project context) then read it back at 0.1x input price on all following requests. Claude Code shows 92% hit-rate and -81% cost on a session. It is not a toggle, it is an architectural discipline.

Prompt caching

TL;DR

API-level exposure of the inference engine's KV cache: you pay once for the static prefix (tool defs, system prompt, project context) then read it back at 0.1x input price on all following requests. Claude Code shows 92% hit-rate and -81% cost on a session. It is not a toggle, it is an architectural discipline.

The historical problem

An LLM agent sends back the FULL conversation history on every turn: system prompt, tool definitions, project context already read 3 turns ago, plus the new messages. Without a server-side cache, everything is retokenized and re-attended every turn.

Order of magnitude: a 20k token system prompt over 50 turns = 1M tokens of redundant compute, billed at full price, producing zero new value. On a production agent fleet, it is often the first cost line.

The fix existed server-side for a long time through the kv cache (vLLM PagedAttention, SGLang RadixAttention). But as long as the API did not expose it, the client still paid for the recompute. Anthropic (August 2024) then OpenAI (October 2024) opened this mechanism to the end user, with explicit pricing.

How it works

Static vs dynamic split

Every agent request has two parts:

Static vs dynamic split

  • Static prefix identical every turn: system instructions, tool definitions, project context, behavioral rules
  • Dynamic suffix that grows: user messages, assistant answers, tool outputs, terminal observations

This split is what makes caching possible. The infra stores the mathematical state of the prefix and lets following requests with the exact same prefix read it from memory instead of recomputing.

The 2 inference phases (reminder)

Prefill vs decode

  • Prefill phase: processes the whole input prompt. Dense matmul on all tokens to build the internal representation. Compute-bound, expensive.
  • Decode phase: generates tokens one by one. Each new token attends to the past. Memory-bound.

For each token in prefill, the Transformer computes 3 vectors: Query, Key, Value. K and V of a token only depend on tokens before it. Once computed, they never change. See kv cache for details.

Hash-based caching

Without cache: K/V thrown away after each request, full recompute on the next call.

With cache:

  1. K/V of the static prefix are persisted on the inference server, indexed by a cryptographic hash of the token sequence
  2. New request comes in -> hash the prefix -> if match, tensors loaded from memory, prefill skipped for these tokens
  3. Only the new tokens (dynamic suffix) are prefilled

Complexity: from O(n^2) per generated token to O(n). On a 20k token prefix repeated over 50 turns, huge reduction.

Economics (Anthropic, April 2026)

Pricing table

  • Cache read: 0.1x base input price = -90% per cached token
  • Cache write: 1.25x = +25% to write the K/V tensors
  • Extended cache (1h TTL): 2.0x

The math only works if the hit-rate stays high. Typical break-even: >=2 reads to amortize the write.

Claude Code: 92% hit-rate case study

Claude Code session cost

30 min coding session with Claude Code:

  • Minute 0: system prompt + tool defs + CLAUDE.md > 20k tokens. All new = most expensive moment of the session, but paid once.
  • Min 1-5: Explore Subagent navigates the code, greps, reads files. Appends to the dynamic suffix. The 20k prefix -> cached at $0.30/MTok instead of $3.00/MTok.
  • Min 6-15: Plan Subagent receives a summary brief (not the raw output, avoids bloating). Produces a plan. Every turn re-reads the prefix from the cache. Hit rate > 90%. Each hit resets the TTL (cache stays warm).
  • Min 16-25: Changes, more tool calls, more terminal output. Hundreds of thousands of tokens processed, but the 20k prefix stays in cache.
  • Min 28: /cost shows 2M tokens total. Without caching: $6. With 92% hit-rate: 1.84M in cache reads -> $1.15. -81% on the task.

Hash-based caching fragility

The most counter-intuitive thing: "1 + 2 = 3" hit, but "2 + 1" is a cache miss.

Hash fragility

The infra hashes the full sequence from the start. The smallest change (even the order of 2 elements) changes the hash and invalidates the full prefix -> full-price recompute.

Real production examples that broke the cache:

  • A timestamp injected in the system prompt -> unique hash every request
  • A JSON serializer sorting tool schema keys differently between requests -> invalidation
  • An AgentTool whose params were updated mid-session -> wipe of the 20k cached tokens

3 rules that follow from this:

  1. Do not modify tools during a session. Tool defs are part of the cached prefix. Add or remove a tool and you invalidate everything downstream.
  2. Never switch models mid-session. Caches are model-specific. Switching to a cheaper model mid-conversation = full rebuild.
  3. Never mutate the prefix to update state. Instead of editing the system prompt, Claude Code appends a reminder tag to the next user message to keep the prefix intact.

Relevance today (2026)

Prompt caching has become table stakes for any serious agent stack:

  • Anthropic: auto-caching on the API, breakpoint automatically advanced as the conversation grows. Gemini 2.5 and OpenAI also have their version.
  • 1M context Claude makes prompt caching even more critical: without it, an agent with full context pays $3/MTok x 1M = $3 per turn.
  • Client ecosystem: frameworks (LangGraph, Anthropic SDK, Vercel AI SDK) support explicit or auto cache breakpoints more and more.
  • Cross-session cache: Anthropic introduced the 1h-TTL (2.0x write) which keeps the cache warm across distinct users -> critical for multi-tenant SaaS.
  • Tension with context compression: you can no longer aggressively summarize history every turn without breaking the cache. New primitive: cache-safe forking (see next section).

Question to ask: is your agent stack designed around caching, or is caching an add-on ? If it is the latter, you leave 70-90% of your margin on the table.

Apply the pattern to your own agent

Prompt structure

Order of an optimized agent prompt:

  1. System instructions + behavioral rules first. Immutable during the session.
  2. Tool definitions loaded upfront. No addition or removal mid-session.
  3. Retrieved context / reference docs next. Stable for the session duration.
  4. Conversation history + tool outputs last = dynamic suffix.

Cache-safe forking for compaction

Cache-safe forking

When approaching the context limit:

  • Keep the same system prompt, tools, conversation history
  • Append the compaction instruction as a new message
  • The prefix stays hit, only the new instruction is billed

Anti-pattern: rewriting the summarized conversation = new prefix = full miss.

Monitoring

3 fields to track in every API response:

  • cache_creation_input_tokens: written to the cache
  • cache_read_input_tokens: served from the cache
  • input_tokens: processed without caching

Cache efficiency = cache_read / (cache_read + cache_creation).

Track it like you track uptime. Alert if hit-rate drops (signals a prefix stability regression).

Critical questions

  • Why an exact hash and not a more tolerant prefix match ? Answer: latency. Hash = O(1) lookup. Fuzzy prefix matching = trade-off infra folks did not want to make.
  • If my agent has 100 simultaneous users with the same system prompt, do they share the cache ? Answer: depends on the provider (Anthropic yes on the static prefix, isolation by org).
  • Explicit cache breakpoint vs auto: when to use which ? Auto for most cases, explicit when you have edge cases (multiple static segments) or need fine control.
  • How many breakpoints max per prompt ? Anthropic: 4 in April 2026. Use them strategically (e.g.: 1 = tools, 2 = system, 3 = retrieved docs, 4 = rolling suffix).
  • When is the 1h extended cache worth it ? Break-even: write cost 2.0x vs 1.25x -> the gains on multiple reads within 1h must exceed the diff. Typically yes as soon as >3 users hit the same prefix within the hour.
  • If the TTL expires (5 min by default), do we need to rebuild ? Yes but every hit resets the TTL -> an active agent stays warm indefinitely.
  • How to test caching impact in dev without budget ? Watch cache_read_input_tokens in API responses, no need for a perf benchmark.

Production pitfalls

  • Timestamp in the system prompt: silent killer. Every request = unique hash = 0% hit rate. First thing to audit when cache efficiency < 50%.
  • Non-deterministic JSON schemas: serializer reordering keys -> hash changes. Force alpha sort of keys.
  • Dynamically generated tool defs: if you build your tools from a registry that changes order based on registration order -> invalidation. Freeze the order at startup.
  • Wrongly placed breakpoint: if the cache breakpoint falls in the middle of the dynamic suffix, the boundary moves every turn -> miss. Place the breakpoint at the end of the static zone.
  • Mid-session model switching: a router flipping Haiku -> Sonnet after N turns breaks the whole cache.
  • Mutating the system prompt to update state (e.g.: "user is now in premium mode") -> invalidates. Pass the info in the next user message.
  • Cold start: first request without cache = expensive. Budget it separately for cost-per-session metrics.
  • Naive compaction: replacing history with a summary = new prefix = miss. Use cache-safe forking.
  • Cache invalidation from a middleware bug: a middleware adding a header logging the request-id to the prompt body = permanent wipe. To audit.
  • Multi-region: caches are region-specific. A load balancer that shuffles across regions breaks the hit rate.

Alternatives / Comparisons

Providers (April 2026)

ProviderAuto cacheExplicit breakpointRead discountTTL defaultNote
Anthropic (Claude)YesYes (max 4)0.1x5 min, extended 1h (2x write)Most mature, clearest API
OpenAIAuto onlyNo0.5x GPT-4o, 0.25x GPT-4o-mini5-10 minLess control, less aggressive discount
Google GeminiYes (explicit + implicit)Yes~0.25x5 minImplicit by default since 2.5
xAI GrokYesNo0.25x5 minMore recent
Amazon BedrockDepends on the modelYes on ClaudeInherits from provider-Exposes the feature upstream

Self-hosted side

For self-hosted stacks, caching is handled by the inference server (not the API):

  • vLLM: PagedAttention + automatic prefix sharing. See vllm.
  • SGLang: RadixAttention, excellent on tree-based prefix sharing (agents with branches).
  • TGI (HuggingFace): partial prefix caching.
  • LMCache: tiered (GPU -> CPU RAM -> disk) for massive contexts.

Versus alternatives

  • Context compression (LLMLingua, summarization): reduces token count but breaks the cache because it changes the prefix. Combine carefully (compress once then keep stable).
  • RAG retrieval every turn: if retrieved docs change every turn, the prefix is not stable -> no cache. Strategy: retrieve in batch at session start, frozen for the rest.
  • Fine-tuning: replaces the stable context with model-internal knowledge. Removes the need for caching but expensive upfront.

Mini-lab

To create: /lab prompt-caching

Learning goals:

  • Build a tool-use agent with the Anthropic SDK
  • Measure hit rate without a cache break-point
  • Add explicit cache_control on system prompt and tool defs
  • Compare latency + cost on 20 turns with and without cache
  • Intentionally break the cache (timestamp, key reorder) and observe the impact
  • Bonus: implement cache-safe compaction (fork with new instruction)

Further reading

prompt-cachingkv-cacheinferenceoptimizationcostagents