Over 90% of total LLM operational cost is inference, not training. Training happens once. Inference runs on every request, forever. A 10x improvement in inference efficiency compounds across every API call, every user session, every agentic loop. This guide covers the mechanics, the optimization techniques that actually move the needle, and the cost math for 2026.
What Is LLM Inference
LLM inference is the forward pass through a transformer model that converts input tokens into output tokens. Unlike training, which updates model weights through backpropagation across billions of examples, inference uses frozen weights to generate predictions. You send a prompt, the model produces a completion.
The process breaks into two distinct phases: prefill and decode. During prefill, the model processes all input tokens in parallel, building the key-value (KV) cache. During decode, it generates output tokens one at a time, each conditioned on every previous token. This autoregressive nature is why output tokens cost 3-10x more than input tokens across every major API provider.

For a coding agent running Claude Sonnet 4, a single agentic loop might involve 50,000 input tokens (code context, tool results, conversation history) and generate 2,000 output tokens (code edits, explanations). At $3/$15 per million tokens, that single loop costs $0.18. Run 100 loops per task across a team of developers and inference cost becomes the dominant line item.
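The arithmetic behind that figure is worth making explicit. A minimal sketch, using the $3/$15 per-million-token prices and token counts quoted above:

```python
def loop_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost in dollars of one agentic loop at per-million-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# One loop: 50K input at $3/M plus 2K output at $15/M.
single = loop_cost(50_000, 2_000, 3.00, 15.00)   # 0.15 + 0.03 = 0.18
print(f"${single:.2f} per loop, ${single * 100:.2f} per 100-loop task")
```

Note that even though output tokens are 5x more expensive per token, the input side dominates here because agentic context is so large. That asymmetry is why the context-reduction strategies later in this guide matter.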
Prefill vs Decode: The Two Phases
Every token you see streaming back from an LLM goes through both phases. Understanding where each phase bottlenecks is the foundation for every optimization technique that follows.
Prefill (Compute-Bound)
Processes all input tokens in parallel via massive matrix multiplications. Maxes out GPU tensor cores. Determines time to first token (TTFT). A 10K token prompt takes 200-400ms on an H100.
Decode (Memory-Bound)
Generates output tokens one at a time autoregressively. Reads the full KV cache and model weights each step but computes little per token. Determines inter-token latency (TPOT). Typical: 30-150 tok/s per request.
The performance metrics map directly to these phases. Time to first token (TTFT) measures prefill latency. Time per output token (TPOT) measures decode latency. Throughput in tokens per second measures how many decode steps the system completes across all concurrent requests.
The key insight: prefill and decode compete for the same GPU resources but have opposite computational profiles. Prefill wants compute. Decode wants memory bandwidth. This tension drives the architecture of modern serving systems.
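Both metrics fall out of token arrival timestamps. A minimal sketch (the function and the example numbers are illustrative, not taken from any serving framework):

```python
def latency_metrics(request_start, token_timestamps):
    """Derive TTFT and TPOT from wall-clock token arrival times (seconds)."""
    ttft = token_timestamps[0] - request_start            # prefill latency
    if len(token_timestamps) > 1:
        decode_time = token_timestamps[-1] - token_timestamps[0]
        tpot = decode_time / (len(token_timestamps) - 1)  # mean inter-token gap
    else:
        tpot = None
    return ttft, tpot

# 300ms prefill, then one token every 20ms (= 50 tok/s decode).
stamps = [0.3 + 0.02 * i for i in range(5)]
ttft, tpot = latency_metrics(0.0, stamps)
print(f"TTFT={ttft*1000:.0f}ms, TPOT={tpot*1000:.0f}ms, {1/tpot:.0f} tok/s")
```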
Disaggregated serving
Production systems at Meta and major cloud providers now separate prefill and decode onto different GPU pools. Compute-optimized GPUs handle prefill. Memory-bandwidth-optimized GPUs handle decode. This eliminates interference and improves both latency and throughput.
KV Cache: Why It Matters
The key-value cache stores computed attention keys and values from all previous tokens so they are not recomputed at each decode step. Without it, generating token N would require reprocessing all N-1 previous tokens, making inference O(n²) per output token. The KV cache reduces this to O(n).
The cost: memory. For a 70B parameter model with a 200K token context window, the KV cache alone can consume 40-80 GB of GPU memory. On an 80 GB H100, that leaves almost nothing for model weights and activations. KV cache optimization is therefore one of the most active research areas in inference.
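That 40-80 GB range falls directly out of the cache-size formula. A sketch assuming Llama-70B-like geometry (80 layers, 8 KV heads under grouped-query attention, head dimension 128; these are assumed values, not stated in the text):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x precision."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Assumed Llama-70B-like geometry: 80 layers, 8 KV heads (GQA), head_dim 128.
gb = kv_cache_bytes(80, 8, 128, 200_000) / 1e9
print(f"{gb:.1f} GB for one 200K-token sequence at BF16")
```

At 65.5 GB for a single full-context sequence, the formula also shows why 4-bit cache quantization (halving `bytes_per_value` twice) and token eviction (shrinking `seq_len`) are the two levers everyone pulls.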
PagedAttention
Introduced by vLLM, PagedAttention manages KV cache memory like virtual memory pages in an operating system. Instead of pre-allocating contiguous memory for each sequence's maximum possible length, it allocates small blocks on demand and recycles them when sequences finish. This eliminates memory fragmentation and enables 2-4x higher batch sizes on the same GPU.
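A toy sketch of the paging idea (this is not vLLM's actual API, just the block-allocation concept it pioneered):

```python
class BlockAllocator:
    """Toy sketch of PagedAttention-style KV block management."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of physical block IDs
        self.tables = {}                      # seq_id -> list of block IDs

    def append_token(self, seq_id, pos):
        # Allocate a new physical block only when the sequence crosses a
        # block boundary, instead of reserving max-length memory up front.
        if pos % self.block_size == 0:
            self.tables.setdefault(seq_id, []).append(self.free.pop())

    def release(self, seq_id):
        # Finished sequences return their blocks to the shared pool.
        self.free.extend(self.tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=64)
for pos in range(40):                 # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("req-1", pos)
print(len(alloc.tables["req-1"]))     # 3
alloc.release("req-1")
print(len(alloc.free))                # 64
```

Because no sequence holds memory it is not using, the same pool serves far more concurrent sequences, which is where the 2-4x batch-size gain comes from.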
KV Cache Compression
Three approaches are reducing cache memory in 2026:
NVFP4 Quantization
Quantizes cache values to 4-bit floating point. Less than 1% accuracy loss versus BF16. Cuts cache memory to roughly a quarter of BF16, with hardware-accelerated FP4 on NVIDIA's Blackwell generation.
ChunkKV
Compresses at the semantic chunk level rather than individual tokens. Preserves linguistic structure and contextual integrity under aggressive compression ratios.
Token Eviction
Drops low-importance tokens from the cache based on attention scores. Reduces memory linearly with eviction rate. Works best with structured or repetitive inputs.
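A minimal sketch of score-based eviction (the function and scoring are illustrative; real eviction policies track running attention statistics across decode steps):

```python
def evict_low_importance(kv_cache, attn_scores, keep_ratio=0.5):
    """Keep the cache entries with the highest cumulative attention scores.

    kv_cache: list of cached K/V vectors, one per token.
    attn_scores: cumulative attention each token has received (assumed given).
    Returns the surviving entries and their original positions, order preserved.
    """
    keep = max(1, int(len(attn_scores) * keep_ratio))
    top = sorted(range(len(attn_scores)), key=lambda i: attn_scores[i])[-keep:]
    idx = sorted(top)                       # restore original token order
    return [kv_cache[i] for i in idx], idx

cache = [[float(i)] * 4 for i in range(8)]
scores = [0.9, 0.1, 0.8, 0.05, 0.7, 0.02, 0.6, 0.5]
kept, idx = evict_low_importance(cache, scores, keep_ratio=0.5)
print(idx)   # [0, 2, 4, 6]
```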
Optimization Techniques
The techniques below are not theoretical. They are running in production across every major inference provider in 2026. Each addresses a different bottleneck.
Speculative Decoding
A small draft model (1-7B parameters) generates 3-12 candidate tokens per step. The large target model verifies all candidates in a single parallel forward pass. Correct predictions (70-90% on domain-specific tasks) yield multiple tokens for the compute cost of one target model step. Rejected tokens are resampled from the target distribution, so output quality is mathematically identical. Result: 2-3x speedup on generation-heavy workloads.
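The draft-then-verify loop can be sketched as follows. Heavily simplified: a real implementation compares draft and target token probabilities and resamples rejected positions from the adjusted target distribution, while here verification is a boolean stub:

```python
def speculative_step(draft_next, target_accepts, context, k=4):
    """One speculative decoding step with a boolean verifier stub.

    draft_next(ctx) -> next token proposed by the small draft model.
    target_accepts(ctx, token) -> True if the target model agrees.
    Returns the tokens committed this step; cost is one target forward pass.
    """
    proposed = []
    ctx = list(context)
    for _ in range(k):                       # draft model guesses k tokens ahead
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    accepted = []
    for tok in proposed:                     # target verifies all k in parallel
        if not target_accepts(context + accepted, tok):
            break                            # first rejection ends the run
        accepted.append(tok)
    return accepted

# Toy models: draft proposes consecutive integers; target accepts tokens < 3.
draft = lambda ctx: (ctx[-1] + 1) if ctx else 0
target_ok = lambda ctx, tok: tok < 3
print(speculative_step(draft, target_ok, [0], k=4))   # [1, 2]
```

Two tokens commit for the price of one target forward pass; the better the draft model matches the target on your domain, the longer the accepted runs.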
Quantization
Reducing weight precision from FP16/BF16 to lower bit widths trades minimal accuracy for major memory and speed gains.
| Method | Precision | Memory Savings | Speed Impact | Accuracy Loss |
|---|---|---|---|---|
| FP8 | 8-bit float | 2x reduction | 2x on H100/H200 | <0.5% |
| GPTQ | 4-bit int | 4x reduction | 2.6x w/ Marlin kernel | 1-3% |
| AWQ | 4-bit int | 4x reduction | 10.9x w/ Marlin kernel | 1-2% |
| GGUF Q4_K_M | 4-bit mixed | 4x reduction | CPU-friendly | 1-3% |
The kernel matters more than the quantization method. AWQ without optimized kernels runs at 68 tok/s. AWQ with the Marlin kernel hits 741 tok/s. Same quantized weights, 10.9x difference.
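The memory column of the table is simple arithmetic (weights only, ignoring KV cache and activations):

```python
def weight_memory_gb(params_billion, bits):
    """Memory for model weights alone at a given precision."""
    return params_billion * 1e9 * bits / 8 / 1e9

# A 70B model at each precision from the table above.
for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"70B @ {name}: {weight_memory_gb(70, bits):.0f} GB")
```

At BF16 a 70B model does not fit on one 80 GB H100; at INT4 it fits with room for a substantial KV cache, which is often the practical reason to quantize.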
Flash Attention
Flash Attention rewrites the attention computation to be IO-aware, fusing operations to minimize reads and writes to GPU high-bandwidth memory. Flash Attention 2 delivers 2x speedup over the original. Flash Attention 3 leverages H100 TMA (Tensor Memory Accelerator) units for further gains. Every major inference engine now uses Flash Attention by default.
Continuous Batching
Traditional static batching waits for all sequences in a batch to finish before accepting new requests. Continuous batching (also called iteration-level batching) inserts new requests as soon as any sequence completes, keeping GPU utilization near 100%. Batching 32 requests together cuts per-token costs by approximately 85% with minor latency impact.
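A toy simulation of iteration-level scheduling (illustrative only; real schedulers also weigh KV cache capacity and prefill cost when admitting requests):

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Count decode iterations when new requests join as soon as a slot frees.

    requests: remaining output lengths, in arrival order.
    """
    waiting = deque(requests)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())   # join mid-flight, no batch barrier
        running = [r - 1 for r in running]      # one decode step for the whole batch
        running = [r for r in running if r > 0] # finished sequences free their slot
        steps += 1
    return steps

# Static batching runs each wave until its longest member finishes: 10 + 5 = 15.
print(continuous_batching([10, 2, 2, 10, 5, 5, 5, 5], max_batch=4))   # 12
```

Even in this tiny example the short sequences stop stalling behind the long ones; at production batch sizes and arrival rates the utilization gap is far larger.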
The compounding effect
These techniques stack. FP8 quantization + Flash Attention 3 + continuous batching + speculative decoding on an H100 delivers 5-8x better cost-efficiency than naive FP16 inference with static batching. The gap between optimized and unoptimized inference is wider than the gap between GPU generations.
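The stacking is roughly multiplicative. With illustrative per-technique gains (assumed numbers for the sketch, not benchmark results):

```python
# Assumed individual speedups over naive FP16 + static batching.
gains = {"FP8 quantization": 2.0, "Flash Attention 3": 1.3,
         "continuous batching": 2.0, "speculative decoding": 1.5}
total = 1.0
for technique, g in gains.items():
    total *= g
print(f"combined: {total:.1f}x over the naive baseline")
```

In practice the techniques interact (speculative decoding changes batch composition, quantization changes the memory headroom batching relies on), so measured gains land below the pure product, in the 5-8x range cited above.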
Inference Engines Compared
Three engines dominate production LLM serving in 2026. All support the core optimizations above, but they differ in compilation strategy, cold start, and workload affinity.
| Engine | Output tok/s | TTFT (p50) | Cold Start | Best For |
|---|---|---|---|---|
| TensorRT-LLM | 2,780 | 680ms | ~28 min | Peak throughput |
| SGLang | 2,460 | 710ms | ~58s | Shared-prefix workloads |
| vLLM | 2,400 | 740ms | ~62s | Fast iteration, broad model support |
vLLM
Fastest path to production. PagedAttention pioneered here. Broad model support, 60-second cold start. Use when you need model flexibility and quick deploys.
TensorRT-LLM
Peak throughput after compilation. 13% faster than vLLM at high concurrency. 28-minute cold start is the tradeoff. Use for single-model, long-running production deployments.
SGLang
Automatic prefix caching for shared-prefix workloads (chatbots, RAG, multi-turn). RadixAttention reuses common prompt prefixes across requests. Use for conversation-heavy applications.
Other notable engines: llama.cpp for CPU and Apple Silicon inference, Ollama for local development, LMDeploy, which matches SGLang (~16,200 tok/s) in high-batch scenarios, and TGI (Text Generation Inference) from Hugging Face for quick experimentation.
Inference Cost Analysis
API prices have fallen roughly 80% since 2024, but total inference spend keeps rising because token volume grows faster than prices drop. Understanding the cost structure is the first step to controlling it.
| Model | Input | Output | Cached Input |
|---|---|---|---|
| Claude Sonnet 4 | $3.00 | $15.00 | $0.30 |
| Claude Opus 4 | $15.00 | $75.00 | $1.50 |
| GPT-5.2 | $1.75 | $14.00 | $0.175 |
| Gemini 2.5 Pro | $1.25 | $10.00 | $0.315 |
| DeepSeek V3.2 | $0.14 | $0.28 | N/A |
Output tokens cost 3-10x more than input tokens across every provider. This reflects the computational reality: each output token requires a full decode step, while input tokens are processed in parallel during prefill.
Cost Reduction Strategies
These techniques compound. Prompt caching on frequently reused system prompts and few-shot examples eliminates 90% of input token costs for repeat patterns. Batch endpoints (like OpenAI's) halve costs for latency-tolerant workloads. Context compression reduces the tokens you send in the first place.
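Applying the caching math to the earlier coding-agent example (Sonnet prices from the table; the 40K cached-prefix split is an assumption for illustration):

```python
def request_cost(input_tokens, cached_tokens, output_tokens,
                 in_price, cached_price, out_price):
    """Per-request cost with prompt caching; prices are $ per million tokens."""
    uncached = input_tokens - cached_tokens
    return (uncached * in_price + cached_tokens * cached_price
            + output_tokens * out_price) / 1e6

# Assume 40K of the 50K-token prompt is a cached system prompt + few-shot examples.
cold = request_cost(50_000, 0, 2_000, 3.00, 0.30, 15.00)
warm = request_cost(50_000, 40_000, 2_000, 3.00, 0.30, 15.00)
print(f"cold ${cold:.3f} vs warm ${warm:.3f} per request")
```

The warm request pays full price only for the 10K uncached tokens, cutting input cost from $0.150 to $0.042, a 72% reduction before batching or compression are even applied.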
Coding Agent Inference
Coding agents are the most inference-intensive consumer of LLM APIs. A typical agentic coding session involves dozens of tool calls, each with growing context from file reads, search results, and conversation history. The problem is not the model. The problem is what you feed it.
An agent reading 20 files to find one relevant function is sending 20 files worth of tokens through inference. Each subsequent tool call reprocesses the entire conversation history. By message 30 in a session, the context window is full of stale search results the model no longer needs.
Why Context Compression Changes the Math
The highest-leverage optimization for agentic inference is not a faster GPU or a cheaper model. It is sending fewer tokens. Morph Compact compresses context before it reaches the model, removing redundant code, stale search results, and irrelevant file content while preserving the information the model actually needs.
When an agent spends 60% fewer tokens per action, the downstream effects compound: lower cost per action, more headroom before hitting rate limits, faster TTFT because prefill processes fewer tokens, and less context degradation from irrelevant information crowding the attention window.
Compact for Context Reduction
Compresses agent context by 40-70% before inference. The model sees less noise, generates better outputs, and costs less per action. Works with any LLM provider.
WarpGrep for Targeted Search
Runs code search in a separate context window. Returns only relevant line ranges instead of entire files. Eliminates the 60-80% token waste on search-then-read patterns.
The real inference optimization
Faster inference engines and cheaper APIs help. But when 60-80% of your tokens are wasted on context the model does not need, the biggest win is not processing those tokens at all. Compress first, infer second.
Frequently Asked Questions
What is LLM inference?
LLM inference is the process of generating output tokens from a trained large language model given an input prompt. It has two phases: prefill (processing input tokens in parallel to build the KV cache) and decode (generating output tokens one at a time). Inference accounts for over 90% of total LLM operational cost because training happens once but inference runs on every request.
What is the difference between prefill and decode in LLM inference?
Prefill processes all input tokens in parallel, building the key-value cache. It is compute-bound and determines time to first token (TTFT). Decode generates output tokens one at a time, reusing the KV cache. It is memory-bandwidth-bound and determines inter-token latency. A 10,000 token prompt takes 200-400ms to prefill on an H100, while decode typically runs at 30-150 tokens per second.
How much does LLM inference cost per token in 2026?
API prices dropped roughly 80% from 2024 to 2026. Claude Sonnet 4 costs $3/M input and $15/M output tokens. GPT-5.2 costs $1.75/$14. Gemini 2.5 Pro costs $1.25/$10. DeepSeek V3.2 costs $0.14/$0.28. Output tokens cost 3-10x more than input tokens because each requires a full decode step. Prompt caching reduces input costs by up to 90%.
What is speculative decoding and how much does it speed up inference?
Speculative decoding uses a small draft model (1-7B parameters) to generate 3-12 candidate tokens per step, which the target model verifies in one parallel forward pass. Correct predictions (70-90% on domain-specific tasks) yield multiple tokens for the cost of one step. This gives 2-3x speedups with zero quality loss because rejected tokens are resampled from the target model's distribution.
Which LLM inference engine is fastest in 2026?
On H100 GPUs at 100 concurrent requests, TensorRT-LLM leads with 2,780 output tok/s, followed by SGLang at 2,460 and vLLM at 2,400. TensorRT-LLM requires 28 minutes of compilation cold start versus about 60 seconds for vLLM and SGLang. Use vLLM for fastest path to production, TensorRT-LLM for peak throughput, SGLang for shared-prefix workloads like chatbots and RAG.
What is the KV cache and why does it matter for inference?
The KV cache stores computed attention keys and values from previous tokens so they are not recomputed at each decode step. Without it, inference requires O(n²) compute per token. The KV cache reduces this to O(n). For a 70B model with 200K context, the cache alone can consume 40-80 GB of GPU memory, making cache optimization critical for throughput and batch size.
How do coding agents waste inference tokens?
Coding agents consume 60-80% of their tokens searching for relevant code rather than solving problems. A single query can burn 12,000 tokens when 800 would suffice. Each subsequent tool call reprocesses the full conversation history, including stale results. Context compression tools like Morph Compact reduce this waste by 40-95% depending on query type.
Cut Inference Costs at the Source
Compact compresses agent context by 40-70% before it hits the model. Fewer tokens in means lower cost, faster TTFT, and better outputs.