LLM Inference: The Complete Engineering Guide (2026)

How LLM inference actually works, from prefill and decode phases to KV cache optimization, speculative decoding, quantization, and inference engine selection. With cost benchmarks.

March 27, 2026 · 6 min read

Over 90% of total LLM operational cost is inference, not training. Training happens once. Inference runs on every request, forever. A 10x improvement in inference efficiency compounds across every API call, every user session, every agentic loop. This guide covers the mechanics, the optimization techniques that actually move the needle, and the cost math for 2026.

80% · API price drop since 2024
2,780 · Output tok/s (TensorRT-LLM, H100)
60-80% · Tokens wasted by coding agents
2-3x · Speedup from speculative decoding

What Is LLM Inference

LLM inference is the forward pass through a transformer model that converts input tokens into output tokens. Unlike training, which updates model weights through backpropagation across billions of examples, inference uses frozen weights to generate predictions. You send a prompt, the model produces a completion.

The process breaks into two distinct phases: prefill and decode. During prefill, the model processes all input tokens in parallel, building the key-value (KV) cache. During decode, it generates output tokens one at a time, each conditioned on every previous token. This autoregressive nature is why output tokens cost 3-10x more than input tokens across every major API provider.

[Figure: LLM inference pipeline showing prompt input flowing through tokenization, prefill, and decode phases to output]

For a coding agent running Claude Sonnet 4, a single agentic loop might involve 50,000 input tokens (code context, tool results, conversation history) and generate 2,000 output tokens (code edits, explanations). At $3/$15 per million tokens, that single loop costs $0.18. Run 100 loops per task across a team of developers and inference cost becomes the dominant line item.
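The arithmetic behind that $0.18 figure is worth making explicit. A minimal sketch, using the Claude Sonnet 4 rates from the pricing table later in this guide:

```python
def loop_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one agentic loop, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# One loop: 50K input at $3/M, 2K output at $15/M
cost = loop_cost(50_000, 2_000, 3.00, 15.00)
print(f"${cost:.2f} per loop, ${100 * cost:.0f} per 100-loop task")
# $0.18 per loop, $18 per 100-loop task
```

Note the asymmetry: the 2,000 output tokens account for $0.03 of the $0.18, but at 5x the per-token rate. As context grows across loops, the input side dominates.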

Prefill vs Decode: The Two Phases

Every token you see streaming back from an LLM goes through both phases. Understanding where each phase bottlenecks is the foundation for every optimization technique that follows.

Prefill (Compute-Bound)

Processes all input tokens in parallel via massive matrix multiplications. Maxes out GPU tensor cores. Determines time to first token (TTFT). A 10K token prompt takes 200-400ms on an H100.

Decode (Memory-Bound)

Generates output tokens one at a time autoregressively. Reads the full KV cache and model weights each step but computes little per token. Determines inter-token latency (TPOT). Typical: 30-150 tok/s per request.

The performance metrics map directly to these phases. Time to first token (TTFT) measures prefill latency. Time per output token (TPOT) measures decode latency. Throughput in tokens per second measures how many decode steps the system completes across all concurrent requests.
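These metrics fall directly out of token arrival timestamps. A sketch of how TTFT and TPOT are typically computed from a streamed response (the 300ms/50 tok/s numbers below are illustrative, not benchmarks):

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """TTFT = first token arrival minus request start (prefill latency).
    TPOT = mean gap between consecutive output tokens (decode latency)."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft_s": ttft, "tpot_s": tpot,
            "decode_tok_per_s": 1.0 / tpot if tpot else float("inf")}

# Simulated stream: 300ms prefill, then steady 50 tok/s decode
times = [0.3 + i * 0.02 for i in range(5)]
m = latency_metrics(0.0, times)
print(f"TTFT {m['ttft_s'] * 1000:.0f}ms, {m['decode_tok_per_s']:.0f} tok/s")
# TTFT 300ms, 50 tok/s
```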

The key insight: prefill and decode compete for the same GPU resources but have opposite computational profiles. Prefill wants compute. Decode wants memory bandwidth. This tension drives the architecture of modern serving systems.

Disaggregated serving

Production systems at Meta and major cloud providers now separate prefill and decode onto different GPU pools. Compute-optimized GPUs handle prefill. Memory-bandwidth-optimized GPUs handle decode. This eliminates interference and improves both latency and throughput.

KV Cache: Why It Matters

The key-value cache stores computed attention keys and values from all previous tokens so they are not recomputed at each decode step. Without it, generating token N would require reprocessing all N-1 previous tokens, making inference O(n²) per output token. The KV cache reduces this to O(n).

The cost: memory. For a 70B parameter model with a 200K token context window, the KV cache alone can consume 40-80 GB of GPU memory. On an 80 GB H100, that leaves almost nothing for model weights and activations. KV cache optimization is therefore one of the most active research areas in inference.
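The 40-80 GB figure follows from the cache size formula: one key and one value vector per layer, per KV head, per token. A sketch, assuming a Llama-70B-class configuration (80 layers, 8 grouped-query KV heads, head dimension 128, BF16 values) — exact numbers vary by architecture:

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Per-sequence KV cache size: one K and one V vector of size
    num_kv_heads * head_dim, per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len

# 70B-class model, 200K-token context, BF16 cache
size = kv_cache_bytes(200_000, num_layers=80, num_kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # 65.5 GB, inside the 40-80 GB range cited above
```

Grouped-query attention is already doing heavy lifting here: with full multi-head attention (64 KV heads instead of 8), the same context would need 8x more cache.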

PagedAttention

Introduced by vLLM, PagedAttention manages KV cache memory like virtual memory pages in an operating system. Instead of pre-allocating contiguous memory for each sequence's maximum possible length, it allocates small blocks on demand and recycles them when sequences finish. This eliminates memory fragmentation and enables 2-4x higher batch sizes on the same GPU.
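The block-table idea can be shown with a toy allocator — a simplified sketch of the concept, not vLLM's actual implementation. Blocks are handed out lazily, one per `block_size` tokens, and recycled the moment a sequence finishes:

```python
class PagedKVAllocator:
    """Toy paged KV cache allocator: fixed-size physical blocks are allocated
    on demand per sequence and recycled when the sequence completes."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # free-list of physical block IDs
        self.block_tables: dict[int, list[int]] = {} # seq_id -> logical block table
        self.lengths: dict[int, int] = {}            # seq_id -> tokens cached

    def append_token(self, seq_id: int) -> None:
        """Grow the sequence by one token; allocate a block only when full."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:                 # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted: preempt or swap a sequence")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def free(self, seq_id: int) -> None:
        """Sequence finished: return its blocks to the pool for other requests."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

The contrast with contiguous pre-allocation is the point: a sequence that might reach 200K tokens but finishes at 500 only ever held ceil(500/16) blocks, and those blocks are immediately available to the next request.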

KV Cache Compression

Three approaches are reducing cache memory in 2026:

NVFP4 Quantization

Quantizes cache values to 4-bit floating point. Less than 1% accuracy loss versus BF16. Halves cache memory with native H100/H200 hardware support.

ChunkKV

Compresses at the semantic chunk level rather than individual tokens. Preserves linguistic structure and contextual integrity under aggressive compression ratios.

Token Eviction

Drops low-importance tokens from the cache based on attention scores. Reduces memory linearly with eviction rate. Works best with structured or repetitive inputs.
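Of the three, token eviction is the simplest to sketch. A toy version under common assumptions (always retain the most recent tokens, since they are attended to most; fill the remaining budget by attention score) — real policies like H2O-style heavy-hitter tracking are more involved:

```python
def evict(kv_entries: list, scores: list[float],
          budget: int, keep_recent: int = 2) -> list:
    """Toy attention-score eviction: always keep the last `keep_recent`
    entries, then fill the budget with the highest-scoring older entries.
    Assumes budget >= keep_recent."""
    n = len(kv_entries)
    recent = set(range(max(0, n - keep_recent), n))
    older = sorted(range(max(0, n - keep_recent)),
                   key=lambda i: scores[i], reverse=True)
    kept = sorted(recent | set(older[:budget - len(recent)]))
    return [kv_entries[i] for i in kept]

# Five cached tokens, budget for four: drop the lowest-attention older token
kept = evict(["a", "b", "c", "d", "e"], [0.9, 0.1, 0.5, 0.2, 0.3], budget=4)
print(kept)  # ['a', 'c', 'd', 'e']
```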

Optimization Techniques

The techniques below are not theoretical. They are running in production across every major inference provider in 2026. Each addresses a different bottleneck.

Speculative Decoding

A small draft model (1-7B parameters) generates 3-12 candidate tokens per step. The large target model verifies all candidates in a single parallel forward pass. Correct predictions (70-90% on domain-specific tasks) yield multiple tokens for the compute cost of one target model step. Rejected tokens are resampled from the target distribution, so output quality is mathematically identical. Result: 2-3x speedup on generation-heavy workloads.
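The draft-then-verify loop can be sketched for the greedy-decoding case, where "resampling" reduces to taking the target's own token at the first mismatch. This is a simplification: with sampling, rejected positions are resampled from an adjusted target distribution, which is what makes the output distribution provably identical. The `draft_next`/`target_next` callables here stand in for model forward passes:

```python
def speculative_decode_step(draft_next, target_next, prefix: list, k: int = 4) -> list:
    """One greedy speculative step: draft proposes k tokens, target verifies.
    Output matches pure target decoding exactly."""
    # Draft phase: k cheap sequential steps with the small model.
    ctx = list(prefix)
    proposal = []
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # Verify phase: in a real engine this is ONE batched target forward pass.
    ctx = list(prefix)
    accepted = []
    for tok in proposal:
        target_tok = target_next(ctx)
        if tok == target_tok:
            accepted.append(tok)          # draft guessed right: token is free
            ctx.append(tok)
        else:
            accepted.append(target_tok)   # first mismatch: take target's token
            return accepted
    accepted.append(target_next(ctx))     # all k accepted: one bonus token
    return accepted
```

When the draft and target agree, one target-model step yields up to k+1 tokens; when they disagree immediately, you still get one correct token, so the worst case matches ordinary decoding.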

Quantization

Reducing weight precision from FP16/BF16 to lower bit widths trades minimal accuracy for major memory and speed gains.

| Method | Precision | Memory Savings | Speed Impact | Accuracy Loss |
|---|---|---|---|---|
| FP8 | 8-bit float | 2x reduction | 2x on H100/H200 | <0.5% |
| GPTQ | 4-bit int | 4x reduction | 2.6x w/ Marlin kernel | 1-3% |
| AWQ | 4-bit int | 4x reduction | 10.9x w/ Marlin kernel | 1-2% |
| GGUF Q4_K_M | 4-bit mixed | 4x reduction | CPU-friendly | 1-3% |

The kernel matters more than the quantization method. AWQ without optimized kernels runs at 68 tok/s. AWQ with the Marlin kernel hits 741 tok/s. Same quantized weights, 10.9x difference.
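The storage scheme behind 4-bit methods is simple even though the kernels are not. A toy pure-Python sketch of group-wise symmetric quantization (the scheme GPTQ/AWQ-style formats build on; real implementations add calibration, packing, and fused kernels like Marlin, which is where the speed comes from):

```python
def quantize_int4(weights: list[float], group_size: int = 8):
    """Group-wise symmetric 4-bit quantization: each group of weights shares
    one floating-point scale; values are stored as ints in [-8, 7]."""
    groups = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # avoid zero scale
        q = [max(-8, min(7, round(w / scale))) for w in group]
        groups.append((scale, q))
    return groups

def dequantize_int4(groups) -> list[float]:
    """Reconstruct approximate weights: integer value times group scale."""
    return [v * scale for scale, q in groups for v in q]
```

Per weight you store 4 bits plus a small amortized share of the group scale, versus 16 bits for BF16 — the 4x memory reduction in the table. The reconstruction error is bounded by half the group's quantization step, which is why outlier-aware grouping (AWQ's contribution) matters.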

Flash Attention

Flash Attention rewrites the attention computation to be IO-aware, fusing operations to minimize reads and writes to GPU high-bandwidth memory. Flash Attention 2 delivers 2x speedup over the original. Flash Attention 3 leverages H100 TMA (Tensor Memory Accelerator) units for further gains. Every major inference engine now uses Flash Attention by default.

Continuous Batching

Traditional static batching waits for all sequences in a batch to finish before accepting new requests. Continuous batching (also called iteration-level batching) inserts new requests as soon as any sequence completes, keeping GPU utilization near 100%. Batching 32 requests together cuts per-token costs by approximately 85% with minor latency impact.
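The scheduling difference is easy to see in a toy simulation — a sketch of the iteration-level idea, not a real scheduler (no prefill modeling, one token per step):

```python
from collections import deque

def continuous_batching(requests: list[tuple[int, int]], max_batch: int) -> dict:
    """Toy iteration-level scheduler: each step decodes one token for every
    active sequence and backfills freed slots from the queue immediately.
    `requests` is a list of (request_id, tokens_to_generate)."""
    queue = deque(requests)
    active: dict[int, int] = {}        # request_id -> tokens remaining
    finished_at: dict[int, int] = {}   # request_id -> step it completed
    step = 0
    while queue or active:
        while queue and len(active) < max_batch:   # backfill freed slots NOW
            rid, n = queue.popleft()
            active[rid] = n
        step += 1
        for rid in list(active):                   # one decode step per sequence
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finished_at[rid] = step
    return finished_at

# Batch of 2: request 2 starts at step 3, the moment request 0's slot frees up.
print(continuous_batching([(0, 2), (1, 5), (2, 1)], max_batch=2))
# {0: 2, 2: 3, 1: 5}
```

Under static batching, request 2 could not even start until both 0 and 1 finished at step 5; here it completes at step 3, and the GPU never runs a partially empty batch while work is queued.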

The compounding effect

These techniques stack. FP8 quantization + Flash Attention 3 + continuous batching + speculative decoding on an H100 delivers 5-8x better cost-efficiency than naive FP16 inference with static batching. The gap between optimized and unoptimized inference is wider than the gap between GPU generations.

Inference Engines Compared

Three engines dominate production LLM serving in 2026. All support the core optimizations above, but they differ in compilation strategy, cold start, and workload affinity.

| Engine | Output tok/s | TTFT (p50) | Cold Start | Best For |
|---|---|---|---|---|
| TensorRT-LLM | 2,780 | 680ms | ~28 min | Peak throughput |
| SGLang | 2,460 | 710ms | ~58s | Shared-prefix workloads |
| vLLM | 2,400 | 740ms | ~62s | Fast iteration, broad model support |

vLLM

Fastest path to production. PagedAttention pioneered here. Broad model support, 60-second cold start. Use when you need model flexibility and quick deploys.

TensorRT-LLM

Peak throughput after compilation. 13% faster than vLLM at high concurrency. 28-minute cold start is the tradeoff. Use for single-model, long-running production deployments.

SGLang

Automatic prefix caching for shared-prefix workloads (chatbots, RAG, multi-turn). RadixAttention reuses common prompt prefixes across requests. Use for conversation-heavy applications.

Other notable engines: llama.cpp for CPU and Apple Silicon inference, Ollama for local development, LMDeploy (which matches SGLang at ~16,200 tok/s in high-batch scenarios), and TGI (Text Generation Inference) from Hugging Face for quick experimentation.

Inference Cost Analysis

API prices have fallen roughly 80% since 2024, but total inference spend keeps rising because token volume grows faster than prices drop. Understanding the cost structure is the first step to controlling it.

| Model | Input ($/M tok) | Output ($/M tok) | Cached Input ($/M tok) |
|---|---|---|---|
| Claude Sonnet 4 | $3.00 | $15.00 | $0.30 |
| Claude Opus 4 | $15.00 | $75.00 | $1.50 |
| GPT-5.2 | $1.75 | $14.00 | $0.175 |
| Gemini 2.5 Pro | $1.25 | $10.00 | $0.315 |
| DeepSeek V3.2 | $0.14 | $0.28 | N/A |

Output tokens cost 3-10x more than input tokens across every provider. This reflects the computational reality: each output token requires a full decode step, while input tokens are processed in parallel during prefill.

Cost Reduction Strategies

90% · Savings from prompt caching
85% · Savings from batch processing
50% · Savings from batch API endpoints
40-95% · Savings from context compression

These techniques compound. Prompt caching on frequently reused system prompts and few-shot examples eliminates 90% of input token costs for repeat patterns. Batch endpoints (like OpenAI's) halve costs for latency-tolerant workloads. Context compression reduces the tokens you send in the first place.
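Here is the caching math worked through on an illustrative request shape, using the Claude Sonnet 4 rates from the pricing table ($3/M input, $0.30/M cached input, $15/M output). The request shape is an assumption for the example, not a benchmark:

```python
def request_cost(fresh_in: int, cached_in: int, out: int,
                 in_rate: float = 3.00, cached_rate: float = 0.30,
                 out_rate: float = 15.00) -> float:
    """Dollars per request at per-million-token rates (Claude Sonnet 4 pricing)."""
    return (fresh_in * in_rate + cached_in * cached_rate + out * out_rate) / 1e6

# 8K-token system prompt + few-shot examples (cacheable), 2K fresh input, 1K output
baseline = request_cost(fresh_in=10_000, cached_in=0, out=1_000)
cached = request_cost(fresh_in=2_000, cached_in=8_000, out=1_000)
print(f"${baseline:.4f} -> ${cached:.4f}: {1 - cached / baseline:.0%} saved")
# $0.0450 -> $0.0234: 48% saved
```

Note the 90% discount applies only to the cached fraction; total savings scale with how much of each request is a repeated prefix. Agents with large static system prompts benefit most.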

Coding Agent Inference

Coding agents are the most inference-intensive consumer of LLM APIs. A typical agentic coding session involves dozens of tool calls, each with growing context from file reads, search results, and conversation history. The problem is not the model. The problem is what you feed it.

60-80% · Tokens spent searching, not solving
93% · Token waste in a real Claude Code query
12,000 · Tokens consumed (800 needed)
40-95% · Reduction from context compression

An agent reading 20 files to find one relevant function is sending 20 files worth of tokens through inference. Each subsequent tool call reprocesses the entire conversation history. By message 30 in a session, the context window is full of stale search results the model no longer needs.

Why Context Compression Changes the Math

The highest-leverage optimization for agentic inference is not a faster GPU or a cheaper model. It is sending fewer tokens. Morph Compact compresses context before it reaches the model, removing redundant code, stale search results, and irrelevant file content while preserving the information the model actually needs.

When an agent spends 60% fewer tokens per action, the downstream effects compound: lower cost per action, more headroom before hitting rate limits, faster TTFT because prefill processes fewer tokens, and less context degradation from irrelevant information crowding the attention window.

Compact for Context Reduction

Compresses agent context by 40-70% before inference. The model sees less noise, generates better outputs, and costs less per action. Works with any LLM provider.

WarpGrep for Targeted Search

Runs code search in a separate context window. Returns only relevant line ranges instead of entire files. Eliminates the 60-80% token waste on search-then-read patterns.

The real inference optimization

Faster inference engines and cheaper APIs help. But when 60-80% of your tokens are wasted on context the model does not need, the biggest win is not processing those tokens at all. Compress first, infer second.

Frequently Asked Questions

What is LLM inference?

LLM inference is the process of generating output tokens from a trained large language model given an input prompt. It has two phases: prefill (processing input tokens in parallel to build the KV cache) and decode (generating output tokens one at a time). Inference accounts for over 90% of total LLM operational cost because training happens once but inference runs on every request.

What is the difference between prefill and decode in LLM inference?

Prefill processes all input tokens in parallel, building the key-value cache. It is compute-bound and determines time to first token (TTFT). Decode generates output tokens one at a time, reusing the KV cache. It is memory-bandwidth-bound and determines inter-token latency. A 10,000 token prompt takes 200-400ms to prefill on an H100, while decode typically runs at 30-150 tokens per second.

How much does LLM inference cost per token in 2026?

API prices dropped roughly 80% from 2024 to 2026. Claude Sonnet 4 costs $3/M input and $15/M output tokens. GPT-5.2 costs $1.75/$14. Gemini 2.5 Pro costs $1.25/$10. DeepSeek V3.2 costs $0.14/$0.28. Output tokens cost 3-10x more than input tokens because each requires a full decode step. Prompt caching reduces input costs by up to 90%.

What is speculative decoding and how much does it speed up inference?

Speculative decoding uses a small draft model (1-7B parameters) to generate 3-12 candidate tokens per step, which the target model verifies in one parallel forward pass. Correct predictions (70-90% on domain-specific tasks) yield multiple tokens for the cost of one step. This gives 2-3x speedups with zero quality loss because rejected tokens are resampled from the target model's distribution.

Which LLM inference engine is fastest in 2026?

On H100 GPUs at 100 concurrent requests, TensorRT-LLM leads with 2,780 output tok/s, followed by SGLang at 2,460 and vLLM at 2,400. TensorRT-LLM requires 28 minutes of compilation cold start versus about 60 seconds for vLLM and SGLang. Use vLLM for fastest path to production, TensorRT-LLM for peak throughput, SGLang for shared-prefix workloads like chatbots and RAG.

What is the KV cache and why does it matter for inference?

The KV cache stores computed attention keys and values from previous tokens so they are not recomputed at each decode step. Without it, inference requires O(n²) compute per token. The KV cache reduces this to O(n). For a 70B model with 200K context, the cache alone can consume 40-80 GB of GPU memory, making cache optimization critical for throughput and batch size.

How do coding agents waste inference tokens?

Coding agents consume 60-80% of their tokens searching for relevant code rather than solving problems. A single query can burn 12,000 tokens when 800 would suffice. Each subsequent tool call reprocesses the full conversation history, including stale results. Context compression tools like Morph Compact reduce this waste by 40-95% depending on query type.

Cut Inference Costs at the Source

Compact compresses agent context by 40-70% before it hits the model. Fewer tokens in means lower cost, faster TTFT, and better outputs.