KV Cache Explained: Why It's the Most Important Optimization in LLM Inference

What the KV cache is, why it makes LLM inference 5x faster, how prefix caching and paged attention work, KV cache quantization (FP8, INT4), memory formulas per model size, and how fewer input tokens shrink the cache for faster, cheaper generation.

April 5, 2026 · 2 min read

Every time a transformer generates a token, it recomputes attention over the entire preceding sequence. For a 70B model at 128K context, that means redoing billions of floating-point operations at every step. The KV cache stores the results once and reuses them, turning a quadratic problem into a linear one. Without it, real-time token streaming would not exist.

4.7x
Speedup from KV caching (T4 benchmark)
40 GB
KV cache for 70B model at 128K context
<4%
Memory waste with paged attention
10x
Cost reduction from prefix caching

What the KV Cache Is

Transformer models generate text one token at a time. To decide what comes next, the model runs an attention computation where each new token "looks back" at all previous tokens. This attention step produces three matrices from each token: a Query (Q), a Key (K), and a Value (V).

The Query represents what the current token is looking for. The Keys represent what each previous token offers. The Values hold the actual information that gets passed forward. Attention computes a weighted combination of all Values, where the weights come from matching the current Query against all stored Keys.

The KV cache stores the Key and Value tensors from every previous token, across every layer of the model. When token 501 is being generated, the model does not recompute the Keys and Values for tokens 1 through 500. It pulls them from the cache, computes only the new token's Q, K, and V, appends the new K and V to the cache, and runs the attention lookup. One computation per token per layer instead of recomputing the full sequence.
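That lookup is simple enough to sketch directly. The following is a toy single-head, single-layer version in plain Python (real models run this per head, per layer, as batched tensor operations on the GPU): the new token's K and V are appended to the cache, and the query attends over every cached position without recomputing anything for earlier tokens.

```python
import math

def attention_step(q, k_cache, v_cache, k_new, v_new):
    """One decode step with a KV cache (single head, toy scale).

    Appends the new token's K and V, then attends the query over every
    cached position: O(cache length) work, no recomputation of earlier
    tokens' keys or values.
    """
    k_cache.append(k_new)
    v_cache.append(v_new)
    d = len(q)
    # Scaled dot-product score of the query against each cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in k_cache]
    # Numerically stable softmax over cache positions.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # Weighted combination of cached values.
    return [sum(w * v[i] for w, v in zip(weights, v_cache)) for i in range(d)]
```

The mutation of `k_cache` and `v_cache` in place mirrors what serving engines do: the cache persists across steps, and each step contributes exactly one new entry.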

The core tradeoff

The KV cache trades memory for compute. It eliminates redundant calculation by storing intermediate results, but those results consume GPU memory that grows linearly with sequence length. Every optimization technique covered in this article is a different approach to managing this tradeoff.

How Attention Creates the Problem

Without a KV cache, generating the 501st token requires recomputing Keys and Values for all 500 preceding tokens. Then the 502nd token recomputes for 501 tokens. The 503rd for 502. Each generation step repeats work that the previous step already did.

The cost grows quadratically. Generating N tokens from scratch requires approximately N²/2 total attention computations across all steps. For a 1,000-token generation, that is roughly 500,000 attention operations. With KV caching, the same generation requires approximately 1,000 operations, one per step, each attending to the growing cache.

Benchmarks on a Tesla T4 show the concrete difference: generating 1,000 tokens takes 11.9 seconds with KV caching and 56.2 seconds without. That 4.7x gap widens with longer sequences because the quadratic cost accelerates while the cached cost stays linear.

Without KV cache: O(n²) per generation

Each new token recomputes attention for the full sequence. Token 500 processes all 499 predecessors. Token 1000 processes all 999. Total work scales quadratically with output length.

With KV cache: O(n) per generation

Each new token computes one Q/K/V set, appends K and V to the cache, and attends to cached entries. The cache grows by a constant amount per token. Total work scales linearly.
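The operation counts above can be tallied directly. A minimal sketch of the arithmetic, matching the ~500,000-vs-1,000 figures for a 1,000-token generation:

```python
def attention_ops_without_cache(n):
    """Step t recomputes K/V for all earlier tokens before attending,
    so total work across n steps is 1 + 2 + ... + n ~ n^2 / 2."""
    return sum(range(1, n + 1))

def attention_ops_with_cache(n):
    """Each step performs one attention lookup over the growing cache."""
    return n
```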

How the Cache Solves It

LLM inference has two distinct phases, and the KV cache plays a different role in each.

Prefill phase

The model processes the entire input prompt in a single forward pass. All tokens in the prompt are visible simultaneously, so the model computes Q, K, and V for every token in parallel. The resulting K and V tensors are stored in the cache. For a 10,000-token prompt on an H100, prefill takes 200-400ms.

Decode phase

The model generates output tokens one at a time. For each new token, it computes a single Q vector, matches it against the full set of cached Keys, retrieves the weighted Values, and produces the next token. The new token's K and V get appended to the cache. Decode typically runs at 30-150 tokens per second depending on model size and hardware.

The distinction matters for optimization. Prefill is compute-bound (many tokens processed in parallel). Decode is memory-bandwidth-bound (the model reads the entire cache on every step to compute attention, but only processes one token). This is why the KV cache size directly affects decode speed: a larger cache means more data the GPU must read from memory on every single token.
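The two phases can be sketched as a single generation loop. Here `forward` is a hypothetical stand-in for one model step (it is not a real API); the point is the control flow: prefill populates the cache from the prompt, then decode appends exactly one K/V pair per output token while reading the whole cache.

```python
def generate(prompt_ids, forward, n_new):
    """Two-phase generation sketch. `forward` is a hypothetical stand-in:
    forward(token_id, k_cache, v_cache) -> (next_id, k, v)."""
    k_cache, v_cache = [], []
    last = None
    # Prefill: build the cache from every prompt token. Real engines run
    # this as one parallel pass over the whole prompt (compute-bound).
    for tok in prompt_ids:
        last, k, v = forward(tok, k_cache, v_cache)
        k_cache.append(k)
        v_cache.append(v)
    # Decode: one token at a time; every step reads the entire cache
    # (memory-bandwidth-bound) and appends exactly one new K/V pair.
    out = []
    for _ in range(n_new):
        out.append(last)
        last, k, v = forward(last, k_cache, v_cache)
        k_cache.append(k)
        v_cache.append(v)
    return out, len(k_cache)
```

Note that by the end, the cache holds one entry per prompt token plus one per generated token, which is exactly the `seq_length` term in the memory formula below.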

Memory Formula

The KV cache size for a single sequence is calculated as:

KV cache (bytes) = 2 × num_layers × num_kv_heads × head_dim × seq_length × bytes_per_element

Breaking this down:

  • 2 accounts for both the Key and Value tensors.
  • num_layers is the transformer depth (32 for Llama 3 8B, 80 for Llama 3.1 70B).
  • num_kv_heads is the number of key-value attention heads, which may be smaller than the query heads if the model uses Grouped Query Attention.
  • head_dim is typically 128 in modern models.
  • seq_length is the total number of tokens (prompt + generated).
  • bytes_per_element is 2 for BF16/FP16, 1 for FP8, 0.5 for INT4.

Multiply by batch_size for concurrent sequences. The formula shows why KV cache memory is the primary constraint on serving capacity: it scales with sequence length, model depth, and batch size simultaneously.
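The formula translates directly into code. A small calculator, checked against the Llama 3.1 70B numbers used throughout this article (80 layers, 8 KV heads, head_dim 128, BF16):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_length,
                   bytes_per_element=2, batch_size=1):
    """KV cache size per the formula above.

    The leading 2 covers the Key and the Value tensors; the default
    bytes_per_element of 2 corresponds to BF16/FP16 (use 1 for FP8,
    0.5 for INT4).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_length * bytes_per_element * batch_size)
```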

Memory per Model Size

Concrete numbers for common models at BF16 precision, single sequence:

| Model | Layers | KV Heads | 4K Context | 32K Context | 128K Context |
|---|---|---|---|---|---|
| Llama 3 8B | 32 | 8 | 0.5 GB | 4 GB | 16 GB |
| Llama 3.1 70B | 80 | 8 | 1.25 GB | 10 GB | ~40 GB |
| Llama 3.1 405B | 126 | 8 | 2 GB | 16 GB | ~64 GB |
| GPT-4 class (est.) | ~120 | ~16 | ~4 GB | ~32 GB | ~128 GB |

At 128K context, the 70B model's KV cache alone consumes 40 GB, which is more than the model weights at INT4 quantization (~35 GB). For the 405B model, the cache at full context (~64 GB) fills most of a single H100's 80 GB before a single weight is loaded. This is why long-context inference requires either multi-GPU setups or aggressive cache optimization.

Per-token cost: Llama 3.1 70B consumes approximately 0.31 MB per token in the KV cache at BF16. Each additional 1,000 tokens of context adds ~310 MB of GPU memory. For the 8B model, the figure is roughly 0.13 MB per token.

Grouped Query Attention

Standard Multi-Head Attention (MHA) maintains separate Key and Value heads for each Query head. A model with 64 attention heads stores 64 sets of Keys and 64 sets of Values in the cache. Grouped Query Attention (GQA) reduces this by sharing KV heads across groups of Query heads.

Llama 3.1 70B has 64 query heads but only 8 KV heads. Each group of 8 query heads shares one KV head. This cuts KV cache memory by 8x compared to standard MHA while retaining near-equivalent accuracy.

Multi-Head Attention (MHA)

Every query head has its own KV head. Maximum representational power, maximum cache size. Used in GPT-2, original GPT-3.

Multi-Query Attention (MQA)

All query heads share a single KV head. Minimum cache size, but measurable accuracy degradation on some tasks. Used in PaLM, Falcon.

Grouped Query Attention (GQA)

Query heads are grouped, each group sharing one KV head. 30-40% faster than MHA, near-equivalent accuracy. Used in Llama 3, Mistral, Gemma.

GQA is the default attention mechanism in modern LLMs. It exists specifically because the KV cache memory cost at long context lengths made standard MHA impractical for production serving.
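The head-sharing scheme reduces to simple integer arithmetic. A minimal sketch using the Llama 3.1 70B configuration (64 query heads, 8 KV heads) from above:

```python
def kv_head_for(query_head, num_q_heads, num_kv_heads):
    """Which cached KV head a query head reads under GQA.

    Consecutive groups of num_q_heads // num_kv_heads query heads share
    one KV head. MHA is the num_kv_heads == num_q_heads case; MQA is
    the num_kv_heads == 1 case.
    """
    assert num_q_heads % num_kv_heads == 0
    return query_head // (num_q_heads // num_kv_heads)

def gqa_cache_ratio(num_q_heads, num_kv_heads):
    """KV cache size relative to MHA with one KV head per query head."""
    return num_kv_heads / num_q_heads
```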

Paged Attention

Traditional serving frameworks pre-allocate a contiguous block of GPU memory for each sequence's KV cache, sized for the maximum possible context length. Since most requests use far less than the maximum, 60-80% of allocated KV cache memory sits unused. The fragmentation problem mirrors what operating systems faced before virtual memory: large contiguous allocations leave unusable gaps.

PagedAttention, introduced by vLLM, applies the same solution. Instead of one contiguous allocation per sequence, the KV cache is divided into fixed-size blocks (typically 16 tokens each). Blocks are allocated on demand as the sequence grows. A block table maps logical positions to physical memory locations, just like a page table maps virtual addresses to physical RAM.

60-80%
Memory wasted without paged attention
<4%
Memory wasted with paged attention
2-4x
Throughput improvement
16 tok
Typical page (block) size

The practical impact: vLLM with PagedAttention achieves near-optimal memory utilization. More sequences fit in GPU memory simultaneously, which translates to 2-4x higher serving throughput at equivalent latency. The original vLLM paper reported up to 24x throughput improvement over FasterTransformer on certain workloads.

Paged attention also enables KV cache sharing. When multiple requests use the same system prompt, the physical blocks storing that prefix can be shared across sequences through copy-on-write semantics. This is the mechanism that makes prefix caching practical at the serving layer.
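The block-table idea can be sketched in a few lines. This is a toy allocator, not vLLM's actual API, and it omits copy-on-write sharing and eviction; it shows only the core move: pages are handed out on demand, and a per-sequence table maps a logical token position to a (physical block, offset) slot.

```python
BLOCK_TOKENS = 16  # typical page size, as noted above

class PagedKVCache:
    """Toy block-table allocator (a sketch, not vLLM's implementation)."""

    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids

    def slot_for(self, seq_id, pos):
        """Physical (block, offset) where token `pos` of `seq_id` lives."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_TOKENS == len(table):
            # Sequence grew past its last block: allocate one more page.
            table.append(self.free_blocks.pop())
        return table[pos // BLOCK_TOKENS], pos % BLOCK_TOKENS
```

Because allocation happens one 16-token page at a time, the worst-case waste per sequence is a single partially filled block, which is where the under-4% figure comes from.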

Prefix Caching

When two requests share the same system prompt, tool definitions, or conversation prefix, prefix caching reuses the KV cache from the shared portion instead of recomputing it. The first request pays the full prefill cost. Every subsequent request with the same prefix skips that computation entirely and begins from where the shared context ends.

This works because the KV cache for any given token depends only on all preceding tokens. If two requests have identical tokens at positions 1 through 5,000, their KV caches at position 5,000 are bit-for-bit identical. There is no need to recompute them.
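A sketch of the matching logic: only the leading span of identical tokens is reusable, and engines that manage the cache in pages round that span down to a block boundary.

```python
def shared_prefix_len(tokens_a, tokens_b):
    """Number of leading tokens two requests have in common. Only this
    span's KV entries can be reused, because every cache entry depends
    on all tokens before it."""
    n = 0
    for a, b in zip(tokens_a, tokens_b):
        if a != b:
            break
        n += 1
    return n

def reusable_blocks(tokens_a, tokens_b, block_tokens=16):
    """Paged engines reuse whole cache pages, so the shared span is
    rounded down to a block boundary."""
    return shared_prefix_len(tokens_a, tokens_b) // block_tokens
```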

Provider implementations

Anthropic, OpenAI, and Google all offer prompt caching via their APIs. Anthropic's implementation charges cached tokens at $0.30/M versus $3/M for uncached input, a 10x cost difference. OpenAI's automatic caching applies a 50% discount on cached tokens. For contexts over 10K tokens, cached portions see 80-90% latency reduction because the prefill phase is skipped.

Self-hosted implementations

vLLM's Automatic Prefix Caching (APC) detects shared prefixes across requests and reuses their KV cache pages. SGLang's RadixAttention organizes cached prefixes as a radix tree, enabling efficient lookup and sharing across requests with partially overlapping prefixes, not just identical ones. Both operate automatically with no application-level changes required.

Cache hit rate depends on prompt structure

Prefix caching only helps when requests share identical token sequences from the start. Moving variable content (user queries, dynamic data) after the shared prefix maximizes cache hits. If your system prompt changes between requests, or if you reorder tool definitions, the cache misses on every call. Prompt structure is a first-class performance surface.

KV Cache Quantization

Model weight quantization is well understood: compress weights from FP16 to INT8/INT4, trade a small accuracy hit for 2-4x memory savings. KV cache quantization applies the same idea to the cached attention tensors, which can consume more memory than the weights at long context lengths.

| Format | Bytes/Element | Memory vs. FP16 | Accuracy Impact | Support |
|---|---|---|---|---|
| FP16 / BF16 | 2 | 1x (baseline) | None | All frameworks |
| FP8 (E4M3) | 1 | 2x reduction | Minimal | vLLM, TensorRT-LLM, SGLang |
| INT8 | 1 | 2x reduction | Negligible | lmdeploy, TensorRT-LLM |
| INT4 | 0.5 | ~4x reduction | Slight loss at high batch | lmdeploy, vLLM |
| NVFP4 | 0.5 | ~4x reduction (2x vs FP8) | Minimal | TensorRT-LLM (Blackwell) |

FP8 is the practical default for production deployments. It halves KV cache memory with accuracy impact so small that most benchmarks cannot distinguish it from FP16. INT4 pushes memory savings further (roughly 4x vs FP16) but introduces slight accuracy degradation at high batch sizes and may reduce generation speed in some configurations.

NVIDIA's NVFP4 format, designed for Blackwell GPUs, achieves a further 2x reduction over FP8, enabling longer context windows and higher concurrency on the same hardware. These gains compound with paged attention: quantized caches in paged blocks mean each physical block holds more tokens, further increasing effective GPU memory capacity.
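The store-low-precision, dequantize-on-read pattern is the same across all of these formats. A minimal INT8 sketch with a single symmetric per-tensor scale (production engines typically use per-head or per-channel scales, and FP8 replaces the integer rounding with an FP8 cast):

```python
def quantize_kv_int8(values):
    """Symmetric per-tensor INT8 quantization of one cached K or V slice:
    1 byte per element instead of 2 for FP16/BF16."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    return [round(v / scale) for v in values], scale

def dequantize_kv_int8(quantized, scale):
    """Reconstruct approximate FP values when attention reads the cache."""
    return [q * scale for q in quantized]
```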

Multi-Head Latent Attention (MLA)

DeepSeek-V2 introduced a more aggressive approach to KV cache compression. Instead of reducing the number of KV heads (GQA) or quantizing them (FP8/INT4), Multi-Head Latent Attention projects Keys and Values into a low-dimensional latent space before caching. When attention is computed, the latent vector is decompressed back to the full-dimensional space.

The result: a 93.3% reduction in KV cache memory compared to standard MHA. DeepSeek-V2 reported 5.76x higher maximum generation throughput versus its predecessor model, with accuracy that matched or exceeded standard multi-head attention.

MLA is architecturally different from GQA. Where GQA shares heads, MLA compresses the representation itself. The two approaches are not mutually exclusive in principle, but in practice MLA achieves a compression ratio that makes further head-sharing unnecessary. The tradeoff is added compute during decompression, which MLA offsets by absorbing the decompression into the attention computation itself.
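The caching side of the idea can be sketched with hypothetical projection matrices (the matrices and class here are illustrative, not DeepSeek's implementation): only the low-dimensional latent is stored, and full-width Keys and Values are reconstructed at attention time.

```python
def matvec(matrix, vec):
    """Plain-Python matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

class LatentKVCache:
    """Toy MLA-style cache: store c = W_down @ h per token instead of
    full K and V, and up-project on read. (DeepSeek further absorbs the
    up-projections into the attention matmuls so the decompressed K/V
    are never materialized separately.)"""

    def __init__(self, w_down, w_up_k, w_up_v):
        self.w_down, self.w_up_k, self.w_up_v = w_down, w_up_k, w_up_v
        self.latents = []  # one low-dimensional vector per cached token

    def append(self, hidden):
        self.latents.append(matvec(self.w_down, hidden))

    def keys_and_values(self):
        keys = [matvec(self.w_up_k, c) for c in self.latents]
        values = [matvec(self.w_up_v, c) for c in self.latents]
        return keys, values
```

With a latent of dimension r and heads of dimension d, the cache shrinks from 2d floats per token (K plus V) to r floats, which is where MLA's order-of-magnitude compression comes from.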

Why Fewer Tokens = Faster Inference

Understanding the KV cache makes the connection between token count and performance concrete. Fewer input tokens produce three compounding effects:

Faster prefill

The prefill phase processes all input tokens to build the initial KV cache. Cutting a 10,000-token prompt to 4,000 tokens reduces prefill time by roughly 60%. This directly lowers time to first token (TTFT).

Faster decode

Every generated token reads the entire KV cache during attention. A smaller cache means less memory bandwidth per decode step. The effect is modest per-token but compounds over hundreds of output tokens.

Higher throughput

Smaller KV caches leave more GPU memory for concurrent requests. A prompt reduced from 10K to 4K tokens frees 60% of that sequence's cache memory, which the serving engine can allocate to additional requests.

This is why application-level optimizations like context compression and prompt caching have outsized impact. They operate at the input, before the KV cache is built. Every token removed from the input removes one KV entry from every layer of the model for the entire duration of generation.

For coding agents, the effect is especially pronounced. An agent running a multi-turn session accumulates tool outputs, file contents, and search results across turns. By turn 30, the context window may contain 50,000+ tokens, most of which are stale. Sending all 50,000 tokens builds a 50,000-entry KV cache per layer. Compacting to 20,000 tokens before the call builds a cache 60% smaller, saving memory, speeding up both prefill and decode, and leaving room for the provider to batch more requests on the same GPU.

Morph Compact and the KV cache

Morph Compact reduces context 50-70% through verbatim deletion at 33,000 tok/s. Every token it removes is one fewer KV entry per layer, across the entire generation. For a 70B model, compacting 10K tokens from the prompt frees roughly 3.1 GB of KV cache memory per sequence. That is memory the serving engine can use for longer outputs, more concurrent users, or both.

Frequently Asked Questions

What is the KV cache in LLMs?

The KV cache stores the Key and Value tensors computed during the attention step of each transformer layer. Without it, every new token would require recomputing attention over the entire preceding sequence, making generation quadratically expensive. With the cache, each new token only computes its own key-value pair and looks up the cached values, reducing per-token cost from O(n²) to O(n).

How much memory does the KV cache use?

KV cache memory equals 2 × num_layers × num_kv_heads × head_dim × seq_length × bytes_per_element. For Llama 3.1 70B at 128K context in BF16, that is approximately 40 GB for a single sequence, often exceeding the memory used by the model weights themselves at INT4 quantization (~35 GB). The 8B model at 32K context uses roughly 4 GB.

What is paged attention and why does it matter?

PagedAttention, introduced by vLLM, borrows virtual memory concepts from operating systems to manage KV cache memory. Instead of pre-allocating contiguous memory for the maximum sequence length (which wastes 60-80% of GPU memory), it allocates fixed-size blocks on demand. This reduces memory waste to under 4% and enables 2-4x higher serving throughput because more requests fit in GPU memory simultaneously.

How does prefix caching reduce LLM cost?

When multiple requests share the same system prompt or conversation prefix, prefix caching reuses the KV cache from the shared portion instead of recomputing it. Anthropic charges cached tokens at $0.30/M versus $3/M for uncached input. Latency drops 80-90% for the cached portion because the prefill phase is skipped entirely.

What is KV cache quantization?

KV cache quantization reduces the precision of cached key-value tensors from FP16/BF16 to lower formats. FP8 halves cache memory with minimal accuracy impact. INT4 achieves roughly 4x memory savings. These techniques allow longer context windows and larger batch sizes on the same hardware, and they compound with paged attention and prefix caching.

How does Grouped Query Attention reduce KV cache size?

Standard multi-head attention maintains separate KV heads for each query head. GQA shares a smaller set of KV heads across groups. Llama 3.1 70B uses 8 KV heads instead of 64 query heads, reducing KV cache memory by 8x. GQA is now the default in Llama 3, Mistral, Gemma, and most production models because it provides 30-40% faster inference with near-equivalent accuracy.

Why do fewer input tokens mean faster LLM inference?

Fewer input tokens build a smaller KV cache. A smaller cache means faster prefill (less initial computation), faster decode (less memory bandwidth per output token), and higher throughput (more concurrent sequences fit in GPU memory). Reducing a 10,000-token prompt to 4,000 tokens cuts prefill time by roughly 60% and frees KV cache memory proportionally.

Shrink the Cache Before It's Built

Morph Compact removes 50-70% of context tokens at 33,000 tok/s through verbatim deletion. Every token cut is one fewer KV entry per layer, across every generated token. Smaller cache, faster prefill, lower cost.