vLLM is the default inference engine for most teams self-hosting LLMs. PagedAttention, continuous batching, an OpenAI-compatible API, and broad model support make it the fastest path from "download a model" to "serve 1,000 concurrent users." This page collects the benchmarks that matter: throughput, TTFT, ITL, and P99 latency across H100 and A100 hardware. Plus comparisons to SGLang and TensorRT-LLM, production deployment patterns, and configuration tuning that actually moves the numbers.
Throughput Benchmarks
Throughput, measured in output tokens per second, determines how many concurrent users your deployment can serve before latency degrades. All numbers below come from third-party benchmarks published in 2025-2026.
| Model | Hardware | Throughput | Config |
|---|---|---|---|
| Llama 3.1 8B (BF16) | 1x H100 80GB | ~12,500 tok/s | FlashInfer, 0.8 GPU util |
| GPT-OSS-120B | 2x H100 80GB | 4,741 tok/s | 100 concurrent requests |
| DeepSeek-R1-Distill-Qwen-7B | 1x A100 80GB | 3,363 tok/s | Default config |
| DeepSeek-R1-Distill-Qwen-14B | 1x A100 80GB | 3,004 tok/s | Default config |
| DeepSeek-R1-Distill-Qwen-32B | 1x A100 80GB | 577 tok/s | Default config |
| Llama 3.1 70B (FP8) | 1x H100 80GB | 460 tok/s | Batch size 64 |
The pattern is consistent: vLLM saturates memory bandwidth on smaller models (8B-14B), delivering 3,000-12,500 tok/s depending on hardware. Larger models (70B+) shift from memory-bound to compute-bound, and throughput drops accordingly. Tensor parallelism across multiple GPUs recovers throughput proportionally.
Why throughput varies so much
Published throughput numbers are shaped by batch size, input/output token ratio, quantization, and whether prefix caching hits. A benchmark showing 12,500 tok/s with Llama 8B on H100 uses different conditions than one showing 3,363 tok/s with DeepSeek-R1 7B on A100. Always compare within the same hardware and workload.
Latency Benchmarks: TTFT, ITL, P99
Three latency metrics define the user experience. TTFT (Time to First Token) is how long before the first token streams back. ITL (Inter-Token Latency) is the gap between consecutive tokens during generation. P99 is the latency that 99% of requests complete within — the tail your slowest users experience.
| Metric | Value | Conditions |
|---|---|---|
| Mean TTFT | 72 ms | Llama 8B, 1xH100, low concurrency |
| P99 TTFT | 79 ms | Llama 8B, 1xH100, low concurrency |
| Mean TTFT | 123 ms | Llama 3.1 70B FP8, 1xH100 |
| Mean TTFT | 261 ms | GPT-OSS-120B, H100, 32 concurrent |
| Median ITL | 11 ms | GPT-OSS-120B, H100, 32 concurrent |
| Mean ITL | 17 ms | GPT-OSS-120B, H100, 32 concurrent |
| P99 Latency | 80 ms | Peak throughput vs Ollama (673 ms) |
TTFT scales with model size and concurrency. An 8B model on H100 delivers sub-80ms TTFT. A 120B model under 32 concurrent requests pushes TTFT to 261ms. ITL stays remarkably stable at 11-21ms across workloads, because decode is memory-bound and predictable once prefill completes.
For interactive applications (chat, coding assistants), TTFT under 200ms and ITL under 30ms produce a fluid streaming experience. vLLM consistently hits these targets on H100 hardware for models up to 70B parameters.
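These metrics fall straight out of per-token arrival timestamps. A minimal sketch using the nearest-rank percentile method (the timestamps below are illustrative, not benchmark data):

```python
import math

def latency_metrics(request_start, token_times):
    """TTFT and per-token ITLs from token arrival timestamps (seconds)."""
    ttft = token_times[0] - request_start
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, itls

def p99(samples):
    """Nearest-rank P99: the value 99% of samples fall at or below."""
    ordered = sorted(samples)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]

# A request that starts at t=0 and streams four tokens
ttft, itls = latency_metrics(0.0, [0.072, 0.089, 0.105, 0.122])
print(round(ttft * 1000), [round(x * 1000) for x in itls])  # 72 [17, 16, 17]
```

Collect these per request under realistic load: the mean hides tail behavior, so track the P99 of TTFT samples, not just the average.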
vLLM vs SGLang vs TensorRT-LLM
Three inference engines dominate production LLM serving in 2026. Each optimizes for different tradeoffs. The benchmarks below use H100 hardware with comparable configurations.
| Dimension | vLLM | SGLang | TensorRT-LLM |
|---|---|---|---|
| Throughput (Llama 8B) | ~12,500 tok/s | ~16,200 tok/s | ~14,400 tok/s* |
| TTFT (lowest) | Fastest across concurrency levels | Competitive | Slowest (compilation overhead) |
| ITL stability | Good | Best (4-21ms range) | Good at low concurrency |
| Prefix caching | PagedAttention, APC | RadixAttention (75-95% hit rates) | Limited |
| Model swap time | Seconds | Seconds | 10-30 min compilation |
| Hardware support | NVIDIA, AMD, TPU, Trainium, Gaudi | NVIDIA, AMD | NVIDIA only |
| Model coverage | Broadest (+ Transformers fallback) | Good | Limited (requires conversion) |
| Contributor base | 3x larger | Growing | NVIDIA-maintained |
* TensorRT-LLM throughput estimated from compiled FP8 benchmarks. Actual numbers depend on compilation target.
GPT-OSS-120B on 2xH100 (Clarifai Benchmark)
Clarifai benchmarked all three engines serving GPT-OSS-120B on 2x H100 GPUs across concurrency levels from 1 to 100:
- vLLM reached 4,741 tok/s at 100 concurrent requests and had the fastest TTFT at every concurrency level.
- SGLang showed the most stable ITL (4-21ms) and strong performance at moderate concurrency (50 requests).
- TensorRT-LLM had the best single-request throughput but scaled worse at high concurrency and showed the slowest TTFT.
When to pick which
Pick vLLM
Fast iteration, multi-model serving, broad hardware support, or production without a compilation step. Best default for most teams.
Pick SGLang
Multi-turn chat, RAG pipelines, or any workload with shared prefixes. RadixAttention's 75-95% cache hit rates add 10-20% throughput over vLLM for these patterns.
Pick TensorRT-LLM
Single model in long-term production on NVIDIA hardware where you can absorb a 10-30 minute compilation step. Highest peak throughput when model won't change.
How vLLM Works: PagedAttention + Continuous Batching
Two innovations explain why vLLM outperforms naive serving by up to 24x. Understanding them is necessary for tuning.
PagedAttention
Traditional serving engines pre-allocate contiguous GPU memory for each sequence based on maximum possible length. If max context is 128K tokens but the average request uses 2K, 98% of allocated memory sits empty. This internal fragmentation wastes 60-80% of KV cache memory in practice.
PagedAttention borrows the idea of virtual memory from operating systems. It breaks the KV cache into fixed-size blocks (typically 16 tokens each) that can be stored anywhere in GPU memory. A block table maps logical positions to physical memory, and blocks are allocated on demand as the sequence grows. The result: under 4% memory waste. More sequences fit in GPU memory, which means larger effective batch sizes and higher throughput.
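The bookkeeping is easy to picture with a toy block table — a sketch of the idea, not vLLM's implementation:

```python
BLOCK_SIZE = 16  # tokens per KV block, vLLM's typical default

class BlockTable:
    """Toy model of the logical-to-physical block mapping."""
    def __init__(self, free_blocks):
        self.free = list(free_blocks)  # pool of physical block ids
        self.table = []                # logical block index -> physical id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:   # current block full (or first token)
            self.table.append(self.free.pop())  # allocate on demand, anywhere in memory
        self.num_tokens += 1

seq = BlockTable(free_blocks=range(100))
for _ in range(40):
    seq.append_token()
print(len(seq.table))  # 3: a 40-token sequence holds ceil(40/16) blocks
```

Waste is bounded by one partially filled block per sequence, which is where the under-4% figure comes from.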
Continuous Batching
Static batching groups N requests, runs them to completion, then accepts the next batch. If one request generates 1,000 tokens while the others generate 10, the GPU idles waiting for the slow request.
vLLM schedules at the token level. After each generation step, it evicts finished sequences and inserts waiting requests into the batch. No GPU cycles wasted waiting for the slowest request. All sequences are flattened into a single "super sequence" with position indices and attention masks ensuring each sequence only attends to its own tokens.
The combination of PagedAttention (more sequences fit) and continuous batching (no idle GPU time) is what produces vLLM's throughput advantage. The system maintains large effective batch sizes while keeping latency low for individual requests.
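A toy simulation makes the difference concrete (this is the refill-every-step policy only, not vLLM's scheduler):

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Token-level scheduling: after every decode step, finished sequences
    leave the batch and waiting requests take their slots immediately.
    `requests` maps request id -> tokens it will generate."""
    waiting, running, steps = deque(requests.items()), {}, 0
    while waiting or running:
        while waiting and len(running) < max_batch:   # refill free slots
            rid, tokens = waiting.popleft()
            running[rid] = tokens
        steps += 1                                    # one token per running sequence
        running = {rid: n - 1 for rid, n in running.items() if n > 1}
    return steps

# Static batching would run the first four to completion (1000 steps),
# then spend 10 more on "e": 1010 total. Continuous batching overlaps them.
print(continuous_batching({"a": 1000, "b": 10, "c": 10, "d": 10, "e": 10}))  # 1000
```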
vLLM V1: The Architectural Reset
Released in January 2025 and the default since v0.8.0, V1 is a ground-up rewrite of vLLM's core engine. The team revisited key design decisions from 1.5 years of production experience.
Key V1 improvements
- Incremental state updates. V1 caches request states on workers and transmits only diffs at each step, reducing scheduler-worker communication overhead.
- Near-free prefix caching. V1's prefix caching causes less than 1% throughput decrease even at 0% cache hit rate. At high hit rates, it multiplies throughput.
- FlashAttention 3 integration. Handles mixing prefill and decode within the same batch efficiently, supporting chunked prefill without separate kernel launches.
- Vision-language model support. An "encoder cache" stores vision embeddings, allowing text inputs to be chunked and processed across steps without regenerating vision encodings.
- Async scheduling with spec decode. Zero-bubble overlap between scheduling and speculative decoding, added in v0.19.
v0.19.0 (April 2026)
The latest release adds Gemma 4 support (MoE, multimodal, reasoning, tool use), piecewise CUDA graphs for pipeline parallelism, full CUDA graph capture for vision encoders (ViT), and Intel XPU CUDA graph support. Pooling throughput improved 13.9%.
Supported Models
vLLM supports the broadest set of model architectures among open inference engines. Any model using a supported architecture loads directly from HuggingFace. Models using unsupported architectures fall back to the Transformers backend with less than 5% performance penalty.
| Family | Architectures | Notable Models |
|---|---|---|
| Llama | LlamaForCausalLM | Llama 3.1, Llama 3.3, Llama 4, Code Llama |
| DeepSeek | DeepseekV2ForCausalLM, DeepseekV3ForCausalLM | DeepSeek-V3, DeepSeek-R1, DeepSeek-Coder-V2 |
| Qwen | Qwen2ForCausalLM, Qwen3ForCausalLM | Qwen 2.5, Qwen 3, Qwen-Coder, Qwen-VL |
| Mistral / Mixtral | MistralForCausalLM, MixtralForCausalLM | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B |
| Gemma | Gemma2ForCausalLM, Gemma4ForCausalLM | Gemma 2, Gemma 4 (MoE, multimodal) |
| Phi | PhiForCausalLM, Phi3ForCausalLM | Phi-3, Phi-3.5, Phi-4 |
| Command-R | CohereForCausalLM, Cohere2ForCausalLM | Command-R, Command-R+ |
| Multimodal | LlavaForConditionalGeneration, etc. | LLaVA, Pixtral, InternVL, Qwen-VL |
Beyond text generation, vLLM supports embedding models (E5-Mistral), classification, reward models, and pooling tasks. The full list is maintained at docs.vllm.ai/en/latest/models/supported_models.
Production Deployment
vLLM ships as a Docker image with an OpenAI-compatible HTTP server. Single-node deployment is one command. Cluster-wide deployment uses the official production-stack Helm chart.
Docker: Single-Node
Run vLLM with Docker
```shell
docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc=host \
  --shm-size=16g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_API_KEY=your-secret-key \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --served-model-name llama-70b
```

Key flags: --ipc=host enables NCCL shared memory for multi-GPU communication. --shm-size=16g allocates shared memory. --gpu-memory-utilization 0.90 leaves headroom for the CUDA context. --max-model-len caps context length, freeing memory for larger batch sizes.
Kubernetes: Production Stack
The vLLM production-stack (released January 2025) provides Kubernetes-native cluster deployment with Helm charts. It adds request routing, multi-model support, KV cache offloading, and autoscaling.
Deploy with Helm
```shell
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
helm install vllm-stack vllm/vllm-stack \
  --set models[0].name=llama-70b \
  --set models[0].modelId=meta-llama/Llama-3.1-70B-Instruct \
  --set models[0].tensorParallelSize=4 \
  --set models[0].gpuMemoryUtilization=0.90 \
  --set monitoring.enabled=true \
  --set autoscaling.enabled=true \
  --set autoscaling.minReplicas=2 \
  --set autoscaling.maxReplicas=8
```

Monitoring
vLLM exposes a Prometheus-compatible /metrics endpoint. Critical metrics to monitor:
- vllm:num_requests_running and vllm:num_requests_waiting for queue depth
- vllm:gpu_cache_usage_perc for KV cache utilization
- vllm:time_to_first_token_seconds for TTFT distribution
- vllm:inter_token_latency_seconds for ITL tracking
- vllm:gpu_prefix_cache_hit_rate for prefix caching effectiveness
Autoscaling with KEDA
KEDA scales vLLM replicas based on per-replica queue depth from Prometheus. Scale on vllm:num_requests_waiting / num_replicas rather than CPU or memory. This reacts to actual inference demand, not proxy metrics. Prerequisites: Kubernetes 1.27+, NVIDIA GPU Operator, KEDA v2.x, cert-manager.
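The scaling rule reduces to a ceiling division plus clamping. A sketch, where the target of 8 waiting requests per replica is an illustrative threshold, not a KEDA or vLLM default:

```python
import math

def desired_replicas(waiting_total, target_per_replica=8,
                     min_replicas=2, max_replicas=8):
    """Scale so no replica carries more than target_per_replica waiting
    requests, clamped to the autoscaler's min/max bounds."""
    needed = math.ceil(waiting_total / target_per_replica)
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(40))   # 5 replicas for 40 queued requests
print(desired_replicas(0))    # 2: clamped to minReplicas
```

Tune the target by load-testing one replica: the right threshold is the queue depth at which TTFT P99 starts to violate your SLO.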
Configuration Tuning
Default vLLM settings work. But tuning five parameters can double throughput for your specific workload. Benchmark with benchmark_serving.py under realistic traffic before and after each change.
Tensor Parallelism
--tensor-parallel-size N shards the model across N GPUs within a node. Each GPU holds 1/N of the weights and gets more memory for KV cache. Set to the minimum number of GPUs that fit your model in memory. Going higher adds synchronization overhead.
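A quick way to estimate that minimum — a back-of-envelope sketch assuming BF16 weights and a fixed memory reserve for KV cache and activations (the 35% reserve is an assumption, not a vLLM rule):

```python
def min_tp_size(params_b, bytes_per_param=2, gpu_mem_gb=80, reserve_frac=0.35):
    """Smallest power-of-two TP degree whose per-GPU weight shard fits,
    keeping reserve_frac of memory free for KV cache and activations."""
    weights_gb = params_b * bytes_per_param     # BF16: 2 bytes per parameter
    budget = gpu_mem_gb * (1 - reserve_frac)
    tp = 1
    while weights_gb / tp > budget:
        tp *= 2
    return tp

print(min_tp_size(8))    # 1: a BF16 8B model fits a single 80GB GPU
print(min_tp_size(70))   # 4: 140 GB of BF16 weights needs four-way sharding
```

This matches the Docker example above, which serves Llama 3.1 70B with --tensor-parallel-size 4.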
Chunked Prefill
Enabled by default in V1. Splits large prefill operations into chunks and batches them with decode requests. This prevents a single long-context request from blocking all decode operations. The --max-num-batched-tokens parameter controls the total token budget per scheduler step. Higher values favor throughput. Lower values favor latency.
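A sketch of the budget split per scheduler step, assuming each running decode costs one token of the budget (a simplification of the real scheduler):

```python
def schedule_step(num_decodes, prefill_remaining, max_num_batched_tokens=2048):
    """One scheduler step under a token budget: each running decode costs
    one token, and the leftover budget becomes the next prefill chunk."""
    budget = max_num_batched_tokens - num_decodes
    chunk = min(prefill_remaining, max(budget, 0))
    return chunk, prefill_remaining - chunk

# A 10K-token prompt prefills over several steps while 64 decodes keep
# flowing, instead of one monolithic pass that stalls every decode.
remaining, steps = 10_000, 0
while remaining:
    chunk, remaining = schedule_step(64, remaining)
    steps += 1
print(steps)  # 6 = ceil(10000 / (2048 - 64))
```

Raising max-num-batched-tokens finishes the prefill in fewer steps (throughput); lowering it keeps each step short so decodes stay responsive (latency).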
Speculative Decoding
Uses a small draft model to propose multiple tokens, then verifies them in a single forward pass of the target model. Reduces latency by 1.5-2.8x depending on workload. Methods supported: draft models, EAGLE/EAGLE-3, P-EAGLE, n-gram matching, suffix decoding, MLP speculators.
| Method | Speedup | Best For |
|---|---|---|
| Draft model | 1.5x | General workloads, ShareGPT-like traffic |
| Prompt lookup | Up to 2.8x | Summarization, repetitive patterns |
| EAGLE-3 | Up to 2.1x | RAG, math reasoning (training-data-similar tasks) |
| P-EAGLE (B200) | 1.69x over EAGLE-3 | HumanEval, SPEED-Bench (latest NVIDIA hardware) |
| Suffix decoding | ~91% of theoretical max | CPU-based, no draft model needed, 20µs/token |
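The draft-then-verify loop, reduced to a greedy toy model (real verification scores all drafted tokens in one batched forward pass and samples from the target distribution; the stand-in models here are arbitrary):

```python
def speculative_step(draft_model, target_model, prefix, k=4):
    """One draft-then-verify round in a greedy toy model: accept drafted
    tokens while they match the target's pick, then add one target token."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):                      # cheap draft proposes k tokens
        tok = draft_model(ctx)
        proposed.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(prefix)
    for tok in proposed:                    # one batched target pass in practice
        if target_model(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_model(ctx))      # target always contributes one token
    return accepted

# Stand-in "models": the draft agrees with the target on short contexts only.
target = lambda ctx: len(ctx) + 1
draft = lambda ctx: len(ctx) + 1 if len(ctx) < 4 else 0
print(speculative_step(draft, target, [0, 0]))  # [3, 4, 5]: 2 drafted + 1 target
```

Every round emits at least one token, so speculative decoding never slows generation below baseline; speedup depends entirely on the draft's acceptance rate.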
Prefix Caching (APC)
Automatic Prefix Caching caches KV blocks across requests that share token prefixes. A system prompt shared by all requests gets computed once, then reused. Multi-turn conversations reuse the entire chat history from prior turns.
Enable with --enable-prefix-caching. In V1, the overhead at 0% hit rate is under 1%. At high hit rates, TTFT drops dramatically because shared prefixes skip prefill entirely. Particularly effective for RAG (shared document context), multi-turn chat, and coding agents (shared system prompts + file context).
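The cache key idea: each full block is chain-hashed so its key depends on everything before it. A sketch of the scheme, not vLLM's exact hashing:

```python
import hashlib

BLOCK = 16  # tokens per KV block

def block_hashes(token_ids):
    """Chain-hash each full block so its key covers the entire prefix
    before it; only full blocks are cacheable."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, full, BLOCK):
        prev = hashlib.sha256(prev + repr(token_ids[i:i + BLOCK]).encode()).digest()
        hashes.append(prev)
    return hashes

# First request warms the cache with a 64-token prompt (4 full blocks).
cache = set(block_hashes(list(range(64))))

# A second request shares the first 48 tokens: 3 of its 4 blocks hit,
# so prefill only needs to run for the final, divergent block.
req = list(range(48)) + [999] * 16
print(sum(h in cache for h in block_hashes(req)))  # 3
```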
GPU Memory Utilization
--gpu-memory-utilization controls what fraction of GPU memory vLLM claims for KV cache. Default is 0.9. Setting it higher (0.95) squeezes more concurrent requests but risks OOM under memory pressure. Setting it lower (0.8) provides headroom for spiky workloads. Benchmark your specific model to find the right value.
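What that fraction buys can be estimated from KV cache arithmetic. A sketch with Llama-3.1-8B-like shapes (16 GB of BF16 weights, 32 layers, 8 KV heads of dim 128; activation overhead ignored):

```python
def kv_cache_tokens(gpu_mem_gb=80, gpu_mem_util=0.9, weights_gb=16,
                    layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV-cache token capacity of one GPU after weights are loaded."""
    cache_bytes = (gpu_mem_gb * gpu_mem_util - weights_gb) * 1024**3
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
    return int(cache_bytes // per_token)

print(kv_cache_tokens())                   # 458752 tokens at util=0.9
print(kv_cache_tokens(gpu_mem_util=0.95))  # 491520: ~7% more, less OOM headroom
```

The marginal tokens from 0.90 to 0.95 are real but small; whether they are worth the OOM risk depends on how spiky your traffic is.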
Quantization: FP8, AWQ, GPTQ
Quantization reduces model memory footprint, allowing larger models on fewer GPUs and increasing throughput by reducing memory bandwidth requirements. vLLM supports multiple quantization formats, each with different speed-quality tradeoffs.
| Format | Throughput Impact | Quality Impact | Best For |
|---|---|---|---|
| FP8 (W8A8) | Up to 1.6x improvement | Minimal accuracy loss | H100/B200 with native FP8 support |
| AWQ (Marlin) | 741 tok/s (10.9x speedup with Marlin kernels) | 51.8% Pass@1 on HumanEval | Best speed-quality tradeoff for coding |
| GPTQ (Marlin) | ~3x vs BF16 (2.6x speedup with Marlin) | Better on real-world benchmarks | Coding tasks, complex reasoning |
| BF16 (baseline) | 1x (baseline) | Full precision | When accuracy matters most |
The key insight: AWQ and GPTQ with Marlin kernels deliver 3x+ throughput improvements over full-precision BF16, because Marlin fuses dequantization into the matrix multiplication kernel. Without Marlin, weight-only quantization can actually decrease throughput due to dequantization overhead. Always benchmark with and without Marlin.
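The memory side of the tradeoff is simple arithmetic: bits per weight times parameter count.

```python
def weight_gb(params_b, bits):
    """Weight memory in GB for a model of params_b billion parameters."""
    return params_b * bits / 8

for name, bits in [("BF16", 16), ("FP8", 8), ("AWQ/GPTQ 4-bit", 4)]:
    print(f"Llama 70B {name}: {weight_gb(70, bits):.0f} GB")
# 140 GB needs multi-GPU; 70 GB fits one H100; 35 GB leaves room for KV cache
```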
Serve a quantized model
```shell
# FP8 quantization (H100/B200 native)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 2

# AWQ quantization
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --enable-prefix-caching

# GPTQ with Marlin kernels (auto-detected)
vllm serve TheBloke/Llama-2-70B-GPTQ \
  --quantization gptq
```

FP8 is the production default on Hopper+
On H100 and B200 GPUs with native FP8 tensor cores, FP8 quantization is nearly free: 2x memory reduction, up to 1.6x throughput gain, minimal accuracy loss. No model conversion needed. vLLM handles dynamic FP8 quantization at serving time. For static FP8 (higher throughput), pre-quantize with tools like AMD Quark or NVIDIA TensorRT Model Optimizer.
When to Use vLLM
vLLM is the right choice for most self-hosted inference deployments. But "most" is not "all." Here is where vLLM excels and where alternatives are stronger.
vLLM excels at
Multi-model serving (swap models without recompilation). Broad hardware support (NVIDIA, AMD, TPU, Trainium, Gaudi, CPU). Fast iteration (no compilation step). Teams that need production today with the widest model compatibility. OpenAI API compatibility for drop-in replacement.
Consider alternatives when
Multi-turn workloads with heavy prefix sharing (SGLang's RadixAttention wins). Single-model, maximum-throughput on NVIDIA hardware (TensorRT-LLM after compilation). Local development on consumer hardware (Ollama/llama.cpp). Heavily quantized small models on CPU (llama.cpp).
A common production pattern: use vLLM as the default serving engine, benchmark SGLang for specific high-traffic multi-turn endpoints, and deploy TensorRT-LLM only for the one model that handles 80% of your traffic and hasn't changed in months.
vLLM for Coding Agents
Coding agents are one of the most demanding inference workloads. Each agent turn involves a system prompt (often 10K+ tokens of codebase context), tool calls, and multi-turn conversation with the model. A single coding task might require 10-50 model calls. The inference backend needs high throughput, low TTFT, and effective prefix caching.
vLLM handles this well. The OpenAI-compatible API means any agent framework (LangChain, PydanticAI, custom) that uses the OpenAI Python client works out of the box by changing the base URL. Prefix caching reuses the system prompt and accumulated context across turns, reducing TTFT by 60-90% for the shared prefix. Tool use parsing is supported with configurable tool parsers.
Point a coding agent at vLLM
```python
from openai import OpenAI

# Point to your vLLM server instead of OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-vllm-api-key",
)

response = client.chat.completions.create(
    model="llama-70b",
    messages=[
        {"role": "system", "content": system_prompt},  # cached after first call
        {"role": "user", "content": "Refactor the auth middleware to use JWT"},
    ],
    tools=tool_definitions,
    stream=True,
)
```

Self-hosted vLLM gives you full control over the inference stack: model choice, quantization, hardware allocation, and cost. But it requires GPU procurement, operational overhead, and performance tuning. For teams that want the agent capabilities without managing infrastructure, Morph's APIs provide managed inference with optimizations purpose-built for coding workflows, including the Fast Apply model that handles LLM-generated code edits at 10,500+ tok/s.
Self-hosted vs managed: the tradeoff
Self-hosted vLLM gives you cost control at scale and data sovereignty. Managed APIs (Morph, Together, Fireworks) give you zero operational overhead and features like optimized code editing models. Many teams run both: managed APIs for development and low-volume endpoints, self-hosted vLLM for high-traffic production workloads where GPU costs are predictable.
FAQ
What is vLLM?
A high-throughput, memory-efficient inference engine for LLMs. Uses PagedAttention for GPU memory management, continuous batching for hardware utilization, and exposes an OpenAI-compatible API. Open-source, maintained by the vLLM team with 1,000+ contributors.
How fast is vLLM on H100?
Approximately 12,500 tok/s for Llama 3.1 8B (BF16) on a single H100 80GB. For larger models: 4,741 tok/s for GPT-OSS-120B on 2xH100 at 100 concurrent requests. TTFT ranges from 72ms (8B, low concurrency) to 261ms (120B, 32 concurrent).
How does vLLM compare to SGLang?
SGLang delivers ~29% higher throughput on H100 (16,200 vs 12,500 tok/s for Llama 8B) thanks to RadixAttention. vLLM wins on hardware support (TPUs, Trainium, Gaudi), model coverage, and ecosystem maturity. Pick SGLang for multi-turn workloads with shared prefixes. Pick vLLM for everything else.
How does vLLM compare to TensorRT-LLM?
TensorRT-LLM achieves 15-30% higher peak throughput after a 10-30 minute compilation step. NVIDIA-only. Use it for a single model that won't change. Use vLLM when you need model flexibility, multi-hardware support, or fast deployment.
What models does vLLM support?
Most open-source architectures: Llama, DeepSeek, Qwen, Mistral, Mixtral, Gemma, Phi, Command-R, and many more. Plus multimodal models (LLaVA, Pixtral, Qwen-VL) and embedding models. Unsupported architectures fall back to Transformers with under 5% performance penalty.
What quantization should I use?
FP8 on H100/B200 (native support, minimal quality loss, up to 1.6x throughput). AWQ with Marlin kernels for the best speed-quality tradeoff on older hardware. GPTQ with Marlin for coding tasks where GPTQ shows better real-world accuracy than AWQ.
How do I deploy vLLM in production?
Docker for single-node: docker run --gpus all vllm/vllm-openai --model your-model. Kubernetes for clusters: use the vLLM production-stack Helm chart with KEDA autoscaling and Prometheus monitoring. Monitor TTFT P99, KV cache utilization, and request queue depth.
What is vLLM V1?
A rewrite of vLLM's core architecture, default since v0.8.0. Delivers 1.7x throughput over V0, integrates FlashAttention 3, and makes prefix caching nearly free (<1% overhead at 0% hit rate). The latest release is v0.19.0 (April 2026).
Inference for Coding Agents
vLLM serves the model. Morph handles what comes after: Fast Apply merges LLM-generated code edits into your codebase at 10,500+ tok/s with deterministic behavior. WarpGrep provides semantic code search purpose-built for agent tool calls. The layer between your inference engine and your repository.