Prompt Caching: How Anthropic, OpenAI, and Google Cut LLM Costs by 90%

April 5, 2026 · 3 min read

Prompt caching reuses the KV tensors from identical prompt prefixes across API requests instead of recomputing them from scratch. The result: up to 90% cheaper input tokens and up to 80% lower time-to-first-token latency. Every major LLM provider now supports it.

For coding agents, this matters more than for any other workload. Agents resend the same system prompt, tool definitions, and conversation history on every turn. The delta between consecutive API calls is often just a few lines of new content. Without caching, you pay full price to reprocess 10,000+ tokens of identical content on every single request.

This guide covers how each provider implements prompt caching, exact pricing, implementation patterns that maximize cache hit rates, specific optimizations for coding agents, and how to stack caching with token compaction for compounding savings.

- 90%: max input token cost reduction
- 80%: max latency reduction (TTFT)
- 0.1x: Anthropic cache read multiplier
- 45-80%: measured savings in agent workloads

What Is Prompt Caching

LLM inference has two phases: prefill and decode. During prefill, the model processes the entire input prompt, computing Query, Key, and Value tensors across all transformer layers. This produces the first output token. During decode, the model generates tokens one at a time, using the KV tensors from prefill to attend back to the input.

Prefill is the expensive phase. It scales with prompt length, involves dense matrix multiplications across every layer, and runs once before any output appears. On a 10,000-token prompt, prefill dominates the time-to-first-token measurement.

Prompt caching exploits a property of causal attention: the KV tensors for token N depend only on tokens 1 through N. If two requests share the same first 10,000 tokens, the KV tensors for those tokens are identical. Recomputing them is pure waste.

With caching, the provider stores the KV tensors from a previous prefill. When a new request arrives with the same prefix, it loads the stored tensors instead of recomputing them, then runs prefill only on the new tokens after the cached prefix. The latency savings are proportional to the fraction of the prompt that hits cache. The cost savings come from providers charging less for cached reads than fresh computation.

KV cache vs. prompt cache

KV cache is the general mechanism: storing Key-Value tensors to avoid recomputation during autoregressive decoding within a single request. Every LLM inference engine uses this internally.

Prompt cache (or prefix cache) extends this across requests. It stores the KV tensors from one request's prefill phase and serves them to subsequent requests that share the same prefix. This is the API-level feature that providers expose.

How KV Cache Reuse Works

Production implementations typically build on PagedAttention, the block-based KV memory manager popularized by vLLM. Instead of allocating one contiguous GPU memory block per request (which can waste ~70% of memory to fragmentation), the system breaks the KV cache into fixed-size blocks, typically 16 tokens each. Each block gets a hash based on its content and the hash of its parent block.

Prefix cache hash chain

# Each KV cache block is hashed with its parent:
hash(block_0) = sha256(tokens[0:16], metadata)
hash(block_1) = sha256(hash(block_0), tokens[16:32], metadata)
hash(block_2) = sha256(hash(block_1), tokens[32:48], metadata)

# If block_2's hash matches, blocks 0 and 1 are
# guaranteed identical. No need to verify each one.

# When a new request arrives:
# 1. Compute hashes for each block of the new prompt
# 2. Walk forward until a hash miss occurs
# 3. Load cached KV tensors for all matching blocks
# 4. Run prefill only on blocks after the miss

The hash chain is the key insight. Because each block's hash includes its parent's hash, a match at block N guarantees matches for blocks 0 through N-1. The system walks forward through hashes until it hits a miss, loads all cached blocks up to that point, and only computes the remainder. This is why static content must come first in your prompt: any change in the prefix invalidates the entire chain after that point.
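The chained hashing described above can be made concrete in a few lines. This is a toy sketch: real engines hash token IDs plus metadata such as model version, and `blockHash`, `chainHashes`, and `cachedBlocks` are illustrative names, not any provider's API.

```typescript
import { createHash } from "node:crypto";

const BLOCK_SIZE = 16; // tokens per KV cache block (typical)

// Hash one block of token IDs, chained to its parent's hash.
function blockHash(parentHash: string, tokens: number[]): string {
  return createHash("sha256")
    .update(parentHash)
    .update(tokens.join(","))
    .digest("hex");
}

// Compute the chained hash for every full block of a prompt.
function chainHashes(tokenIds: number[]): string[] {
  const hashes: string[] = [];
  let parent = "root";
  for (let i = 0; i + BLOCK_SIZE <= tokenIds.length; i += BLOCK_SIZE) {
    parent = blockHash(parent, tokenIds.slice(i, i + BLOCK_SIZE));
    hashes.push(parent);
  }
  return hashes;
}

// Walk forward until the first hash miss; every block before it is reusable.
function cachedBlocks(cached: string[], incoming: string[]): number {
  let n = 0;
  while (n < cached.length && n < incoming.length && cached[n] === incoming[n]) {
    n++;
  }
  return n;
}
```

Because each hash folds in its parent, a prompt that differs only in its third block still reuses the first two, while a single changed token in block 0 invalidates the entire chain.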

Multiple requests can share the same cached blocks simultaneously. Reference counting prevents premature deallocation. When the last request using a block finishes, the block returns to the free pool. This is why steady request streams maintain higher cache hit rates: the blocks stay allocated.

Why causal attention makes prefix caching possible

In causal (autoregressive) attention, token N can only attend to tokens 1 through N. The KV tensors for token N depend exclusively on the prefix up to position N. Changing any token before position N invalidates the KV tensors for N and everything after it. This is why only contiguous prefixes from the start of the prompt can be cached, and why providers require static content at the beginning.

Anthropic: Explicit Cache Control

Anthropic gives you direct control over cache boundaries through the cache_control parameter. You mark specific content blocks as cacheable, and the system caches the prompt prefix up to that point. This explicit approach lets you control exactly what gets cached and at what TTL.

Two Caching Modes

Automatic caching (recommended for conversations): add a single cache_control field at the top level of your request. The system automatically manages breakpoints, moving the cache boundary forward as the conversation grows.

Anthropic automatic caching

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  cache_control: { type: "ephemeral" },  // Auto-manages breakpoints
  system: "You are an expert code reviewer...", // 5,000+ tokens
  messages: [
    { role: "user", content: "Review this pull request..." }
  ],
});

// Turn 1: Full prompt written to cache
// Turn 2: System prompt read from cache; new messages written
// Turn 3: Previous turns read from cache; latest turn written
// Cache breakpoint automatically advances each turn

Explicit breakpoints (fine-grained control): place cache_control directly on individual content blocks. Up to 4 breakpoints per request. Use this when different parts of your prompt change at different frequencies.

Anthropic explicit breakpoints with mixed TTLs

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "You are an AI coding assistant...",  // Stable for hours
      cache_control: { type: "ephemeral", ttl: "1h" },
    },
    {
      type: "text",
      text: toolDefinitions,  // Changes when tools update
      cache_control: { type: "ephemeral" },  // 5-min default
    },
  ],
  messages: conversationHistory,
});

// Billing breakdown:
// - 1h cache write: 2x base input price (system prompt)
// - 5m cache write: 1.25x base input price (tools)
// - Cache reads: 0.1x base input price (both)

TTL Options

| TTL | Write Cost | Read Cost | Best For |
| --- | --- | --- | --- |
| 5 minutes (default) | 1.25x base input | 0.1x base input | Active conversations, rapid iteration |
| 1 hour | 2x base input | 0.1x base input | Shared system prompts, knowledge bases |

The math on when each TTL pays off: a 5-minute cache write costs 1.25x and reads cost 0.1x, so after a single read (1.25 + 0.1 = 1.35x vs. 2.0x uncached) you are already saving money. A 1-hour cache write costs 2x, so one read (2.1x vs. 2.0x) does not yet cover the premium, but two reads (2.0 + 0.2 = 2.2x vs. 3.0x uncached) put you ahead.
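That arithmetic can be written out directly, using the multipliers from the table above (the helper names are illustrative):

```typescript
// Multipliers relative to the base input token price (Anthropic pricing).
const WRITE_5M = 1.25;
const WRITE_1H = 2.0;
const READ = 0.1;

// Total prefix cost, in multiples of the base price: one cache write plus n reads.
const cachedCost = (writeMult: number, reads: number): number =>
  writeMult + READ * reads;

// The same prefix sent uncached: the first request plus n repeats, all at 1x.
const uncachedCost = (reads: number): number => 1 + reads;

// 5-minute TTL pays off after a single read:
const fiveMinOneRead = cachedCost(WRITE_5M, 1); // 1.35 vs uncachedCost(1) = 2.0
// 1-hour TTL needs two reads to come out ahead:
const oneHourOneRead = cachedCost(WRITE_1H, 1); // 2.1 vs 2.0, not yet
const oneHourTwoReads = cachedCost(WRITE_1H, 2); // 2.2 vs 3.0, ahead
```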

Minimum Cacheable Tokens

| Model | Minimum Tokens | Notes |
| --- | --- | --- |
| Claude Opus 4.6 / Opus 4.5 / Haiku 4.5 | 4,096 tokens | Larger prompts required |
| Claude Sonnet 4.6 / Haiku 3.5 | 2,048 tokens | Medium threshold |
| Claude Sonnet 4.5 / Opus 4.1 / Opus 4 / Sonnet 4 | 1,024 tokens | Lowest threshold |

Cache Invalidation

Anthropic's cache follows a strict hierarchy: tools → system → messages. Changing anything earlier in the hierarchy invalidates everything after it. Changing tool definitions invalidates system and message caches. Changing system prompts invalidates message caches. Changing thinking parameters or images in the system also invalidates message caches.

Monitoring cache performance

The response usage object includes three fields: cache_creation_input_tokens (tokens written to cache), cache_read_input_tokens (tokens served from cache), and input_tokens (tokens processed fresh after the last breakpoint). If both cache fields are 0, caching did not activate, likely because your prompt is below the minimum threshold.

OpenAI: Automatic Prefix Caching

OpenAI's approach is the opposite of Anthropic's: fully automatic, no configuration. Any prompt over 1,024 tokens is eligible. The system identifies the longest cached prefix, serves it, and processes only the new suffix. You see the savings in the usage.prompt_tokens_details.cached_tokens field.

OpenAI automatic prefix caching

import OpenAI from "openai";

const client = new OpenAI();

// No special configuration needed. Just send requests.
const response = await client.chat.completions.create({
  model: "gpt-5.4",
  messages: [
    { role: "system", content: longSystemPrompt },  // Cached automatically
    { role: "user", content: "What does the auth middleware do?" },
  ],
});

// Check cache hit in response:
// response.usage.prompt_tokens_details.cached_tokens
// Shows how many tokens were served from cache

How It Works Internally

Requests are routed based on a hash of the initial prompt prefix (typically first 256 tokens). This routes the request to a server that recently processed a similar prompt, increasing cache hit probability. The system then checks for the longest matching prefix in 128-token increments, starting from the 1,024-token minimum.

An optional prompt_cache_key parameter lets you influence routing. If you have multiple request streams with different system prompts, giving each stream a consistent cache key improves hit rates by ensuring similar prompts land on the same server.
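The routing idea can be sketched roughly as follows. The 256-token window comes from the description above, but the server count, `routeRequest`, and the exact hash inputs are assumptions for illustration, not OpenAI's actual implementation; the optional `cacheKey` argument plays the role of `prompt_cache_key`.

```typescript
import { createHash } from "node:crypto";

const SERVERS = 8; // hypothetical pool size

// Route a request by hashing its prompt prefix (plus an optional cache key),
// so requests that share a prefix land on the same server's warm cache.
function routeRequest(promptTokens: string[], cacheKey = ""): number {
  const prefix = promptTokens.slice(0, 256).join(" ");
  const digest = createHash("sha256").update(cacheKey + prefix).digest();
  return digest.readUInt32BE(0) % SERVERS;
}
```

Two requests with the same first 256 tokens route identically no matter how their suffixes differ, which is exactly what gives the prefix cache a chance to hit.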

Cache Retention

| Tier | Duration | Mechanism | Models |
| --- | --- | --- | --- |
| In-memory | 5-10 min (up to 1 hour) | Standard GPU RAM | All supported models |
| Extended | Up to 24 hours | GPU-local storage offload | GPT-5.4, GPT-5.2, GPT-5.1, GPT-5, GPT-4.1 |

Pricing

OpenAI charges no write premium for caching. Cached tokens simply cost less. The discount varies by model generation:

| Model | Input $/MTok | Cached Input $/MTok | Discount |
| --- | --- | --- | --- |
| GPT-5.4 | $2.50 | $0.25 | 90% |
| GPT-5.4-mini | $0.75 | $0.075 | 90% |
| GPT-5.4-nano | $0.20 | $0.02 | 90% |

The no-write-premium model means OpenAI caching is profitable from the very first cache hit. No break-even calculation needed. If any portion of your prompt is cached, you save money immediately. The trade-off is less control: you cannot set TTLs, force cache persistence, or place explicit breakpoints.

Google: Implicit and Explicit Caching

Google offers both approaches. Implicit caching is automatic and free, enabled by default on Gemini 2.5 and newer. Explicit caching gives you manual control with configurable TTLs and guaranteed cost savings, but charges for storage.

Implicit Caching

Since May 2025, all Gemini 2.5+ models automatically cache prompt prefixes. No developer configuration. When your request hits a cached prefix, you get the discounted rate. When it misses, you pay standard input pricing. No storage costs for implicit caching.

Explicit Caching

You explicitly declare content to cache and set a TTL (default: 1 hour, no minimum or maximum bounds). The cached content acts as a prefix to subsequent prompts. Unlike Anthropic, Google charges for storage duration in addition to the read discount.

Google Gemini explicit caching

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GOOGLE_API_KEY });

// Create a cache with explicit TTL
const cache = await ai.caches.create({
  model: "gemini-2.5-pro",
  config: {
    contents: [
      {
        role: "user",
        parts: [{ text: largeCodebaseContext }],  // 50,000+ tokens
      },
    ],
    systemInstruction: "You are a senior code reviewer...",
    ttl: "3600s",  // 1 hour
  },
});

// Use the cache in subsequent requests
const response = await ai.models.generateContent({
  model: "gemini-2.5-pro",
  contents: "What are the security vulnerabilities in this code?",
  config: { cachedContent: cache.name },
});

Pricing by Model

| Model | Standard Input | Cached Read | Storage (per hour) | Discount |
| --- | --- | --- | --- | --- |
| Gemini 2.5 Pro | $1.25 | $0.125 | $4.50/MTok/hr | 90% |
| Gemini 2.5 Flash | $0.30 | $0.03 | $1.00/MTok/hr | 90% |
| Gemini 3.1 Pro Preview | $2.00 | $0.20 | $4.50/MTok/hr | 90% |
| Gemini 3 Flash Preview | $0.50 | $0.05 | $1.00/MTok/hr | 90% |

Storage costs matter for explicit caching

Google is the only major provider that charges per-hour storage for explicit caches. A 100,000-token cache on Gemini 2.5 Pro costs $0.45/hour to store. If you are not reading from it frequently enough, the storage cost can exceed the savings. Implicit caching has no storage cost and is usually the better default.
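That break-even point falls straight out of the pricing table above (the function name is illustrative):

```typescript
// Gemini 2.5 Pro pricing, from the table above.
const STANDARD = 1.25; // $/MTok standard input
const CACHED = 0.125;  // $/MTok cached read
const STORAGE = 4.5;   // $/MTok/hr storage for explicit caches

// Reads per hour needed before read savings cover the storage cost.
function breakEvenReadsPerHour(cacheTokens: number): number {
  const savingsPerRead = (cacheTokens / 1e6) * (STANDARD - CACHED);
  const storagePerHour = (cacheTokens / 1e6) * STORAGE;
  return storagePerHour / savingsPerRead;
}

breakEvenReadsPerHour(100_000); // 4: need more than 4 reads/hour to come out ahead
```

Note that the token count cancels: at these rates the break-even is always STORAGE / (STANDARD - CACHED) = 4 reads per hour, regardless of cache size.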

Minimum Token Thresholds

| Model | Minimum Tokens |
| --- | --- |
| Gemini 2.5 Flash / 3 Flash | 1,024 tokens |
| Gemini 2.5 Pro / 3 Pro | 4,096 tokens |

Pricing Comparison Across Providers

The three providers have fundamentally different pricing models for caching. Anthropic charges a write premium but offers the cheapest reads. OpenAI charges nothing extra for writes. Google charges for storage time. The right choice depends on your access pattern.

| Dimension | Anthropic (Claude) | OpenAI (GPT) | Google (Gemini) |
| --- | --- | --- | --- |
| Configuration | Explicit cache_control | Fully automatic | Implicit (auto) + explicit (manual) |
| Write cost | 1.25x (5 min) / 2x (1 hr) | No surcharge | Standard input price |
| Read discount | 90% off (0.1x) | 50-90% off (model-dependent) | 75-90% off (model-dependent) |
| TTL control | 5 min or 1 hour | No control (5 min-24 hr auto) | Configurable (no min/max) |
| Storage cost | None | None | $1-$4.50/MTok/hr (explicit only) |
| Min tokens | 1,024-4,096 | 1,024 | 1,024-4,096 |
| Max breakpoints | 4 explicit | N/A (auto) | N/A (1 cache object) |
| Break-even | 1 read (5 min) / 2 reads (1 hr) | Immediate | Depends on storage duration |

Cost Example: 20-Turn Agent Conversation

Consider a coding agent with a 12,000-token stable prefix (system prompt + tool definitions) over a 20-turn conversation. Without caching, you process 12,000 tokens fresh on each turn: 240,000 total input tokens for the prefix alone.

| Provider / Model | Without Caching | With Caching | Savings |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 ($3/MTok) | $0.72 | $0.11 | 85% |
| GPT-5.4 ($2.50/MTok) | $0.60 | $0.063 | 89% |
| Gemini 2.5 Pro ($1.25/MTok) | $0.30 | $0.033 | 89% |

These savings compound. A team running 1,000 agent sessions per day with a 12K prefix saves roughly $270-$610 per day on prefix costs alone, depending on provider. Over a month, that is roughly $8,000-$18,000 in reduced input token charges.
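The Claude Sonnet 4.6 row can be reproduced directly, assuming a 5-minute cache: one 1.25x write on turn 1 and 0.1x reads on the remaining 19 turns.

```typescript
const PREFIX = 12_000; // stable prefix tokens (system prompt + tools)
const TURNS = 20;
const BASE = 3;        // $/MTok base input price (Claude Sonnet 4.6)
const WRITE = BASE * 1.25; // 5-minute cache write price, $/MTok
const READ = BASE * 0.1;   // cache read price, $/MTok

// Without caching: the full prefix is reprocessed fresh on every turn.
const uncached = (PREFIX * TURNS / 1e6) * BASE; // $0.72

// With caching: one write on turn 1, then reads on the other 19 turns.
const cached =
  (PREFIX / 1e6) * WRITE +
  (PREFIX * (TURNS - 1) / 1e6) * READ; // ~$0.113, i.e. ~85% savings
```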

Implementation Patterns

The core rule for maximizing cache hits is simple: put static content first, dynamic content last. The cache matches from the beginning of the prompt. Any change in the prefix invalidates everything after it. Structure your prompts accordingly.

Optimal Prompt Structure

Prompt structure for maximum cache reuse

// GOOD: Static content first, dynamic content last
{
  tools: [...],              // Most stable (changes with deploys)
  system: [
    systemInstructions,      // Stable across all conversations
    knowledgeBase,           // Stable for hours/days
    // --- cache breakpoint here ---
  ],
  messages: [
    ...conversationHistory,  // Grows each turn
    latestUserMessage,       // Changes every request
  ]
}

// BAD: Dynamic content early breaks the cache chain
{
  system: [
    `Current time: ${new Date().toISOString()}`,  // Changes every request!
    systemInstructions,  // Never cached because timestamp precedes it
  ],
  messages: [...],
}

Common Mistakes

Timestamps in system prompts

Embedding the current time or date in your system prompt invalidates the entire prefix on every request. Move timestamps to the latest user message instead.

Randomized few-shot examples

Shuffling example order across requests defeats caching. Fix the example order and append any dynamic examples after the cached set.

Caching dynamic tool results

Research found that caching tool call results increased GPT-4o latency by 8.8%. Tool results are session-specific and unlikely to be reused. Exclude them from cache boundaries.

Ignoring the minimum threshold

If your prompt is below 1,024-4,096 tokens (model-dependent), caching silently does nothing. Check the response usage fields to verify cache activation.
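The first of these mistakes has a one-line fix: move the volatile value into the latest user message so the cached prefix never changes. A minimal sketch (`buildMessages`, `SYSTEM_PROMPT`, and the request shape are illustrative):

```typescript
// Stable system prompt: identical on every request, so the prefix cache hits.
const SYSTEM_PROMPT = "You are an expert code reviewer...";

// Put the timestamp in the newest user message, not the system prompt.
function buildMessages(userText: string) {
  return {
    system: SYSTEM_PROMPT,
    messages: [
      {
        role: "user" as const,
        content: `Current time: ${new Date().toISOString()}\n\n${userText}`,
      },
    ],
  };
}
```

The model still sees the current time on every turn, but the cacheable prefix (the system prompt) is byte-identical across requests.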

Multi-Turn Conversation Pattern

For multi-turn agents, Anthropic's automatic caching is the simplest approach. It advances the cache breakpoint forward with each turn, so the growing conversation history becomes cached content for subsequent turns.

Multi-turn cache progression

// Turn 1: Everything written to cache
// Billed: cache_write(system + tools + msg_1)
// cache_read: 0

// Turn 2: Previous content read from cache, new content written
// Billed: cache_read(system + tools + msg_1 + response_1)
//       + cache_write(msg_2)

// Turn 3: Previous content read from cache, new content written
// Billed: cache_read(system + tools + msg_1..msg_2 + responses)
//       + cache_write(msg_3)

// By turn 10, you're reading 90%+ of the prompt from cache.
// Only the latest user message is fresh computation.

Prompt Caching for Coding Agents

Coding agents are the ideal workload for prompt caching. They have large, stable prefixes (system prompt + tool definitions), long conversation histories (multi-turn debugging), and high request frequency (multiple API calls per minute during active sessions). All three factors maximize cache hit rates.

What Gets Cached in a Coding Agent

| Component | Tokens | Stability | Caching Impact |
| --- | --- | --- | --- |
| System prompt | 4,000-8,000 | Stable for days/weeks | Highest value cache target |
| Tool definitions | 2,000-5,000 | Stable between deploys | High value, cache with system prompt |
| CLAUDE.md / rules | 1,000-3,000 | Stable per project | High value for project-scoped sessions |
| Conversation history | 5,000-50,000+ | Grows each turn | Automatically cached on subsequent turns |
| Tool results (file reads, greps) | 500-5,000 per result | Unique per call | Do not cache (session-specific) |
| Latest user message | 50-500 | Changes every turn | Never cached (always new) |

Measured Results

A comprehensive evaluation across 500+ agent sessions using 10,000-token system prompts measured both API cost and time-to-first-token on the DeepResearchBench benchmark:

- 79.6% cost reduction (GPT-5.2)
- 78.5% cost reduction (Claude Sonnet 4.5)
- 22.9% TTFT improvement (Claude Sonnet 4.5)
- 30.9% TTFT improvement (GPT-4o)

The key finding: system-prompt-only caching delivered the most consistent benefits across both cost and latency. Full-context caching (including tool results) sometimes hurt performance. Caching volatile content wastes write budget and can increase latency from cache write overhead on content that will never be read again.

The system-prompt-only rule

For coding agents, the optimal strategy is to cache the system prompt and tool definitions explicitly, let conversation history be cached automatically as it grows, and exclude tool results from cache boundaries. This pattern delivered 45-80% cost savings across all tested providers without the latency regressions seen with full-context caching.

Stacking Caching with Compaction

Prompt caching and token compaction solve different halves of the cost equation. Caching reduces the per-token cost of your static prefix. Compaction reduces the total number of tokens in your dynamic content. The two multiply together.

Cost math: caching alone vs. caching + compaction

// Baseline: 20-turn conversation, Claude Sonnet 4.6 ($3/MTok input)
// System + tools: 12,000 tokens (stable)
// Conversation history: 40,000 tokens (dynamic, growing)
// Total input per turn: ~52,000 tokens

// --- Caching alone ---
// Cached prefix: 12,000 tokens × $0.30/MTok = $0.0036/turn
// Fresh input: 40,000 tokens × $3/MTok = $0.12/turn
// Total: $0.1236/turn

// --- Caching + Morph Compact ---
// Cached prefix: 12,000 tokens × $0.30/MTok = $0.0036/turn
// Compacted history: 40,000 → ~16,000 tokens (60% reduction)
// Fresh input: 16,000 tokens × $3/MTok = $0.048/turn
// Total: $0.0516/turn

// Combined savings: 58% cheaper than caching alone
// Combined savings: 90%+ cheaper than no optimization

Morph Compact strips noise from conversation history at 3,300+ tok/s without summarization loss. It removes duplicated context, dead-end exploration traces, and verbose tool outputs, keeping only the information the model needs for subsequent turns. The compacted history is then what gets cached for the next turn.

Prompt Caching

Reduces per-token cost on stable prefixes. 90% savings on system prompts and tool definitions. Provider-level optimization, no code changes to prompt content.

Morph Compact

Reduces total token count in dynamic content. 3,300+ tok/s. Verbatim cleanup without summarization loss. Removes noise before it compounds across turns.

Combined Effect

Cache the static portion at 0.1x cost. Compact the dynamic portion by 60%. Net result: 90%+ reduction in effective input token spend across a full agent session.

This is the same principle behind context rot prevention. Compaction keeps the signal-to-noise ratio high, which means the model reasons over cleaner context and makes fewer mistakes. Caching ensures you do not pay to reprocess the clean context you already have. Together, they reduce both cost and error rate.

When Caching Hurts

Prompt caching is not universally beneficial. There are specific patterns where it adds cost or latency instead of saving it.

| Pattern | Why It Hurts | What to Do Instead |
| --- | --- | --- |
| Highly dynamic prompts | Prefix changes every request, so cache writes are wasted (1.25x cost, 0 reads) | Skip cache_control or move dynamic content after the breakpoint |
| Short prompts (< 1,024 tokens) | Below minimum threshold, caching silently does nothing | Batch multiple requests or increase prompt length above threshold |
| Infrequent requests | Cache expires before the next request arrives (5-10 min TTL) | Use 1-hour TTL on Anthropic, or accept standard pricing |
| Full-context caching with tool results | Cache writes on unique content increase latency by ~8.8% (GPT-4o) | Cache only the system prompt and tool definitions |
| Long explicit caches on Google | Storage costs ($4.50/MTok/hr on Pro) can exceed read savings | Use implicit caching or calculate break-even frequency |

Full-context caching paradox

The Don't Break the Cache study found that caching everything, including dynamic tool results, actually increased GPT-4o latency by 8.8% compared to no caching. The overhead of writing cache entries for content that will never be read again exceeded the savings on the stable prefix. The lesson: cache selectively. Static content only.

Frequently Asked Questions

What is prompt caching?

Prompt caching reuses previously computed Key-Value tensors for identical prompt prefixes across API requests. Instead of running the full prefill computation on every call, the provider stores the KV tensors and serves them on subsequent requests that share the same prefix. This reduces input token costs by up to 90% and time-to-first-token latency by up to 80%.

How does Anthropic's prompt caching work?

Anthropic uses explicit cache_control breakpoints. You mark content blocks as cacheable with either a 5-minute TTL (1.25x write cost) or 1-hour TTL (2x write cost). Cache reads cost 0.1x the base input price. Up to 4 explicit breakpoints per request, or use automatic caching which manages breakpoints as conversations grow. The cache follows a strict hierarchy: tools, then system, then messages. Changing anything earlier invalidates everything later.

How does OpenAI's prompt caching work?

Fully automatic. Any prompt over 1,024 tokens is eligible. The system caches the longest matching prefix in 128-token increments. No write surcharge. Cached tokens get a 50-90% discount depending on model (GPT-5.4 and GPT-4.1 get 90%). Caches persist 5-10 minutes in memory, with extended 24-hour retention via GPU-local storage on newer models.

How does Google's context caching work?

Two modes. Implicit caching is automatic on Gemini 2.5+, requires no configuration, and has no storage cost. Explicit caching lets you set custom TTLs and guarantees cost savings but charges for storage ($1-4.50/MTok/hr depending on model). Cached reads cost 75-90% less than standard input. Minimum thresholds: 1,024 tokens (Flash) or 4,096 tokens (Pro).

Which provider has the cheapest caching?

Depends on the access pattern. Anthropic offers the deepest read discount (90% off, 0.1x) across all models but charges a write premium. OpenAI has no write premium, making it profitable from the first cache hit, with 90% discounts on GPT-5.4 and GPT-4.1. Google's Gemini 2.5 models match the 90% read discount but charge for storage on explicit caches. For high-frequency agent workloads, Anthropic's model tends to win. For lower-frequency or diverse prompts, OpenAI's automatic approach has less downside risk.

How much does prompt caching save for coding agents?

Research across 500+ agent sessions measured 45-80% cost reductions and 13-31% latency improvements. The savings scale with the size of your stable prefix and the number of turns. A coding agent with a 12,000-token system prompt over 20 turns saves 85-89% on prefix costs. Combined with Morph Compact for conversation history, total savings exceed 90% of baseline input token spend.

Does prompt caching affect output quality?

No. All three providers confirm that cached and non-cached requests produce identical outputs. The KV tensors are mathematically the same whether computed fresh or loaded from cache. Caching only affects the computation path during prefill. The decoding phase is unaffected.

Can I combine prompt caching with batch API discounts?

Yes. Anthropic, OpenAI, and Google all allow stacking batch API discounts (typically 50% off) with prompt caching discounts. On Anthropic, the multipliers compose: batch input at 0.5x times cache read at 0.1x gives 0.05x the base price. That is a 95% discount on cached batch reads.

What is the minimum prompt length for caching?

Varies by provider and model. Anthropic: 1,024-4,096 tokens depending on model. OpenAI: 1,024 tokens across all models. Google: 1,024 (Flash) or 4,096 (Pro). Prompts below these thresholds are processed normally. The response usage fields will show zero cached tokens when caching does not activate.

When does caching not help?

When prompts are highly dynamic with no shared prefix. When prompts are below the minimum cacheable length. When requests are too infrequent to hit cache before TTL expiration. When you are caching dynamic tool results that will never be reused. In these cases, caching adds write overhead without corresponding read savings.

Related Resources

Cut Your Agent Token Costs by 90%+

Prompt caching handles the static prefix. Morph Compact handles the dynamic history. Together, they reduce effective input token spend by over 90% across full agent sessions. 3,300+ tok/s verbatim cleanup, no summarization loss.