Prompt caching is a prefix hash, not memoization
Every agent engineer turns on prompt caching expecting 'same input → same output, cheap'. That's memoization. The Claude prompt cache is prefix hashing with TTL, a 20-block lookback, cascade invalidation, and writes that cost more than regular input. Here's the actual contract — with a six-scenario walker and a live cost calculator.
Step through six API calls. Cold write, warm hits, TTL expiry, cascade invalidation, lookback overflow — the failure modes you'll actually see in production:
"Cache" in programming usually means memoization: the same input returns the same pre-computed output at negligible cost. That mental model is wrong for LLM prompt caching. And it's wrong in ways that cost money. Anthropic's prompt cache is prefix hashing with a TTL, a 20-position lookback window, and a write cost 25% higher than regular input. Treating it like a memo table gives you bills that should have been smaller, races you can't debug, and surprise cold-start latency.
This post reads the primary-source contract and walks the scenarios where the cache quietly doesn't do what you think.
Anthropic's prompt cache stores hashes of prompt prefixes, ending at an explicit cache_control breakpoint. A hit requires byte-identical content up to the breakpoint and a matching hash found within a 20-position lookback window. The default TTL is 5 minutes; every hit refreshes it. Writes cost 1.25× base input tokens. Reads cost 0.1×. Changes to tools cascade-invalidate system; changes to system cascade-invalidate messages. At low hit rates, caching can cost more than not caching.
Prefix hashing, not key-value
Memoization says: same key in, stored value out. Prompt caching says: given a prefix identical to a previously-cached one, skip re-reading it. Different things.
From the Anthropic docs:
Cache writes happen only at your breakpoint. Marking a block with `cache_control` writes exactly one cache entry: a hash of the prefix ending at that block. On each request the system computes the prefix hash at your breakpoint and checks for a matching cache entry. If none exists, it walks backward one block at a time, checking whether the prefix hash at each earlier position matches something already in the cache.
Cache hits require 100% identical prompt segments, including all text and images up to and including the block marked with cache control.
Three things fall out of that:
- Byte identity. A space turned to a tab, a JSON re-serialized with different key order, an image re-encoded with different metadata — any of these changes the hash. The cache sees a new prefix and writes a new entry.
- Only the breakpoint is written. You don't get a cache entry per block. You get one entry per `cache_control` marker. That's why up to four explicit breakpoints are allowed: that's four separate cache entries you can maintain.
- Lookback walks backwards from the breakpoint. On lookup, the system first tries the hash at your breakpoint. If that's not a hit, it re-hashes the prefix one block shorter and tries again, up to 20 times.
That 20-block window is the cache's most surprising property. Long-lived conversations age out of the cache not by time alone but by how many messages you've accumulated past the original breakpoint.
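Byte identity is easier to violate than it sounds. A minimal sketch, using plain SHA-256 as a stand-in for whatever hashing the API actually uses, of how a harmless JSON re-serialization becomes a brand-new prefix:

```python
import hashlib
import json

def prefix_hash(text: str) -> str:
    # The cache keys on bytes, so any re-serialization is a new prefix.
    # (SHA-256 here is illustrative, not Anthropic's actual scheme.)
    return hashlib.sha256(text.encode()).hexdigest()

a = json.dumps({"tool": "search", "max_results": 5})
b = json.dumps({"max_results": 5, "tool": "search"})  # same dict, new key order

assert json.loads(a) == json.loads(b)    # semantically identical
assert prefix_hash(a) != prefix_hash(b)  # but two distinct cache entries
```

Anything that touches serialization between calls, such as a dict rebuild, a pretty-printer, or an image re-encode, has the same effect: a cold write where you expected a warm hit.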
The 20-block cliff
From the docs:
The system checks at most 20 positions per breakpoint, counting the breakpoint itself as the first.
In practice: put a `cache_control` on a 10 kB system prompt. Have a long conversation: 30 turns in, your prompt now has system + 30 accumulated messages past the breakpoint. The original cache entry (keyed at breakpoint position 1) is no longer reachable from position 31. The 20-position walk gives up. Cold miss. The cache gets re-written, at 1.25× input cost, even though the bytes of the prefix didn't change.
The fix in the docs: move your breakpoint. If your static system prompt is block 1 and your conversation grows in blocks 2…31, an additional `cache_control` somewhere past block 11 is what keeps the cache reachable. Up to four breakpoints are allowed for exactly this reason.
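A toy model makes the cliff concrete. This is not Anthropic's implementation, just the contract as stated: try the hash at the breakpoint, then walk back at most 20 positions:

```python
import hashlib

LOOKBACK = 20  # positions checked per breakpoint, counting the breakpoint itself

def prefix_hash(blocks):
    # Illustrative stand-in for the real prefix hashing.
    return hashlib.sha256("\x00".join(blocks).encode()).hexdigest()

def lookup(cache, blocks, breakpoint_pos):
    """Walk backward from the breakpoint, checking at most LOOKBACK prefixes."""
    for offset in range(LOOKBACK):
        end = breakpoint_pos - offset
        if end < 1:
            break
        if prefix_hash(blocks[:end]) in cache:
            return end  # hit: a cached prefix of this length exists
    return None  # cold miss: the entry is unreachable within the window

# Cache the 1-block system prompt, then let 30 turns accumulate.
cache = {prefix_hash(["system prompt"])}
convo = ["system prompt"] + [f"turn {i}" for i in range(30)]

assert lookup(cache, convo, breakpoint_pos=20) == 1    # 20th position: last one checked
assert lookup(cache, convo, breakpoint_pos=21) is None # one block further: aged out
```

Note what ages the entry out: not time, but block count. The bytes of the system prompt never changed.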
Cascade invalidation
Three layers. Changes cascade downward.
The cache follows: `tools` → `system` → `messages`. Changes at each level invalidate that level and all subsequent levels.
So a tool-definition change (new tool added, old tool renamed) doesn't just invalidate tools — it invalidates the system layer's cache too, and the messages layer. That's counter-intuitive: the bytes of your system prompt didn't change. But the prefix the hash covers includes tools; when tools shift, the system's hash shifts with it.
The practical consequence is that adding a tool to your agent silently voids every warm cache entry in every running conversation. The next request is a full write at 1.25× cost. If you hot-patch tools frequently (feature flags, per-user tools), you never reach a hit rate that pays for the caching overhead.
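The cascade falls out of each layer's hash covering everything above it. A sketch, again with SHA-256 as an illustrative stand-in, of why a tools-only change moves every layer's hash:

```python
import hashlib

def h(*parts: str) -> str:
    return hashlib.sha256("".join(parts).encode()).hexdigest()

def layer_hashes(tools: str, system: str, messages: str) -> dict:
    # Each layer's hash covers everything above it in the hierarchy,
    # so a change at any level shifts every hash below it.
    return {
        "tools": h(tools),
        "system": h(tools, system),
        "messages": h(tools, system, messages),
    }

before = layer_hashes("tool_v1", "You are an agent.", "hi")
after = layer_hashes("tool_v2", "You are an agent.", "hi")  # only tools changed

assert before["tools"] != after["tools"]
assert before["system"] != after["system"]      # cascaded: system bytes unchanged
assert before["messages"] != after["messages"]  # cascaded again
```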
Writes cost more than regular input
Memoization is always cheaper — the value was already computed, retrieving it is free. Prompt cache writes are more expensive than not caching:
- Base input: 1× (e.g. $5/MTok for Opus 4.7).
- Cache write (5-minute TTL): 1.25× ($6.25/MTok).
- Cache write (1-hour TTL): 2× ($10/MTok).
- Cache read: 0.1× ($0.50/MTok).
Cache reads are indeed an order of magnitude cheaper than regular input. But each write has to be amortized by enough reads to beat the break-even point. If your hit rate is too low — or your prefix is too small to clear the model's minimum-cacheable threshold — you're just paying 25% extra for no benefit.
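The break-even arithmetic is one line. Treating every miss as a 1.25× write and every hit as a 0.1× read (a simplification that ignores TTL expiry), the expected multiplier on the cached prefix crosses 1.0 at a hit rate of 0.25/1.15, about 22%:

```python
def expected_multiplier(hit_rate: float) -> float:
    """Expected cost multiplier on the cacheable prefix, per call:
    misses are billed as writes (1.25x), hits as reads (0.1x)."""
    return hit_rate * 0.10 + (1 - hit_rate) * 1.25

# Break-even: solve 0.1h + 1.25(1 - h) = 1  ->  h = 0.25 / 1.15 ~= 0.217
break_even = 0.25 / 1.15

assert abs(expected_multiplier(break_even) - 1.0) < 1e-9
assert expected_multiplier(0.90) < 1.0  # high hit rate: large win
assert expected_multiplier(0.10) > 1.0  # low hit rate: paying extra for nothing
```

At a 90% hit rate the prefix costs roughly 0.215× what it would uncached; at 10% it costs about 1.135×. The fresh suffix tokens cost the same either way, so they dilute the savings but never flip the sign.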
Play with the numbers. Cacheable prefix, fresh tokens per call, call volume, hit rate — see where the economics actually flip:
Two observations that jump out of the sliders:
- Low hit rate is a trap. Under the bare multipliers, break-even sits near a 22% hit rate (solve 0.1h + 1.25(1 − h) = 1 for h ≈ 0.217); once TTL expiry forces extra re-writes, the real threshold climbs higher. Below it, the 1.25× writes outweigh the 0.1× reads because you're writing too often.
- Short prefixes don't pay. If your cacheable section is below the model's minimum-cacheable threshold, the `cache_control` marker is silently ignored. You pay full input price, no warning.
Minimum length is a silent failure mode
From the docs:
Shorter prompts cannot be cached, even if marked with `cache_control`. Any requests to cache fewer than this number of tokens will be processed without caching, and no error is returned.
Read that last clause again: no error is returned. A request that asked for caching of a 2,000-token prompt behaves identically to one without `cache_control`. The usage metrics don't show a cache hit or a cache write, just `input_tokens` at base cost. A dashboard reporting a 0% hit rate and a dashboard in front of a silently ignored `cache_control` look exactly the same.
The way to catch this: check the `cache_creation_input_tokens` and `cache_read_input_tokens` fields in the usage response. If both are zero on a request that should have written or read, your `cache_control` didn't engage. Common culprits: you marked too short a prefix, or the model you're on has a higher minimum than you expected. Thresholds vary per model; Anthropic's docs list them explicitly, so check the current table before assuming.
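A sketch of that check. The field names are the ones the Messages API reports in its usage object; the dict shapes here are illustrative:

```python
def cache_engaged(usage: dict) -> bool:
    """True if this request either wrote to or read from the prompt cache."""
    return (usage.get("cache_creation_input_tokens", 0) > 0
            or usage.get("cache_read_input_tokens", 0) > 0)

# A silently ignored cache_control is indistinguishable from no caching at all:
assert not cache_engaged({"input_tokens": 2000,
                          "cache_creation_input_tokens": 0,
                          "cache_read_input_tokens": 0})

# A warm hit on a 15k-token prefix shows up in cache_read_input_tokens:
assert cache_engaged({"input_tokens": 50,
                      "cache_read_input_tokens": 15000})
```

Wiring an alert on "requests with `cache_control` where `cache_engaged` is false" is the cheapest way to catch a prefix that fell below the minimum.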
When caching actually wins
A summary from reading the spec plus a few hundred real-world requests:
- Long, stable system prompts — the canonical case. A 15k-token agent preamble + tool definitions + few-shot examples; user messages are the only thing changing. Every call after the first is a hit at 0.1× cost on the 15k prefix.
- Long document Q&A. User uploads a 40k-token PDF and asks a series of questions. First call writes the document; subsequent questions read at 0.1×. This is one of the highest-ROI caching scenarios in the whole API.
- Agent loops whose preamble outlives a turn. An orchestrator that makes 3–8 tool-call round trips per user request. The agent's preamble + tool schemas are identical across all round trips; caching the preamble at a breakpoint pays the write once per user-task and amortizes across every tool call.
When caching loses money:
- Highly personalized prompts (timestamps at the top of system, per-user instructions in the middle). Any per-user or per-request variation in the prefix — not the suffix — invalidates the cache.
- Low-volume endpoints (< 10 calls / 5min). The write may expire before enough reads arrive to amortize it.
- Frequently-mutated tool schemas. Every mutation cascades; every cascade costs a write.
The mental move: stop thinking about whether your prompt is "cacheable" and start thinking about whether your prefix is stable across the window of calls you're measuring. If the first 15k tokens of your prompt are byte-identical across 50 calls in the next 5 minutes, caching wins. If even one byte in those 15k changes on call 10, the cache writes again, and the next 40 calls have to re-amortize.
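In code, the move is structural: per-request data goes after the breakpoint, never before it. A sketch of a request builder following the Messages API shape (the system prompt and the timestamp placement are illustrative, not prescriptive):

```python
from datetime import datetime, timezone

STATIC_SYSTEM = "You are a helpful coding agent."  # byte-identical on every call

def build_request(user_msg: str) -> dict:
    """Keep per-request data (here, a timestamp) out of the cached prefix:
    it rides in the user turn, after the breakpoint."""
    return {
        "system": [{
            "type": "text",
            "text": STATIC_SYSTEM,
            "cache_control": {"type": "ephemeral"},  # breakpoint on stable bytes
        }],
        "messages": [{
            "role": "user",
            "content": f"[{datetime.now(timezone.utc).isoformat()}] {user_msg}",
        }],
    }

a = build_request("first question")
b = build_request("second question")

# The cacheable prefix (system) is identical across calls; only the suffix varies.
assert a["system"] == b["system"]
```

Putting the timestamp inside `STATIC_SYSTEM` instead would make every call a fresh 1.25× write: same tokens, same model, zero hits.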
Primary sources
- Anthropic — Prompt caching — the canonical contract: breakpoint semantics, 20-position lookback, cascade invalidation, minimum-cacheable thresholds, TTL options, pricing multipliers.
- Anthropic — Tool use / cache interaction — how tools participate in the cascade.
- Anthropic — API reference: Messages — the `cache_control` block, where it's legal, and the usage fields that report cache activity (`cache_creation_input_tokens`, `cache_read_input_tokens`).