vLLM Benchmarks: Throughput, Latency, and the Production Stack That Handles 10K req/s

Comprehensive vLLM benchmarks covering throughput, TTFT, ITL, and P99 latency on H100 and A100 GPUs. Comparison to SGLang and TensorRT-LLM, production deployment with Docker and Kubernetes, supported models, configuration tuning, and when vLLM is the right inference engine for your stack.

April 5, 2026

vLLM is the default inference engine for most teams self-hosting LLMs. PagedAttention, continuous batching, an OpenAI-compatible API, and broad model support make it the fastest path from "download a model" to "serve 1,000 concurrent users." This page collects the benchmarks that matter: throughput, TTFT, ITL, and P99 latency across H100 and A100 hardware. Plus comparisons to SGLang and TensorRT-LLM, production deployment patterns, and configuration tuning that actually moves the numbers.

Throughput Benchmarks

Throughput, measured in output tokens per second, is the number that determines how many concurrent users your deployment can serve before latency degrades. All numbers below are from third-party benchmarks published in 2025-2026.

  • 12.5K tok/s: Llama 3.1 8B, 1x H100
  • 4,741 tok/s: GPT-OSS-120B, 2x H100
  • 3,363 tok/s: DeepSeek-R1 7B, 1x A100
  • 1.7x: V1 vs V0 throughput gain
| Model | Hardware | Throughput | Config |
|---|---|---|---|
| Llama 3.1 8B (BF16) | 1x H100 80GB | ~12,500 tok/s | FlashInfer, 0.8 GPU util |
| GPT-OSS-120B | 2x H100 80GB | 4,741 tok/s | 100 concurrent requests |
| DeepSeek-R1-Distill-Qwen-7B | 1x A100 80GB | 3,363 tok/s | Default config |
| DeepSeek-R1-Distill-Qwen-14B | 1x A100 80GB | 3,004 tok/s | Default config |
| DeepSeek-R1-Distill-Qwen-32B | 1x A100 80GB | 577 tok/s | Default config |
| Llama 3.1 70B (FP8) | 1x H100 80GB | 460 tok/s | Batch size 64 |

The pattern is consistent: vLLM saturates memory bandwidth on smaller models (8B-14B), delivering 3,000-12,500 tok/s depending on hardware. Larger models (70B+) shift from memory-bound to compute-bound, and throughput drops accordingly. Tensor parallelism across multiple GPUs recovers throughput proportionally.

Why throughput varies so much

Published throughput numbers are shaped by batch size, input/output token ratio, quantization, and whether prefix caching hits. A benchmark showing 12,500 tok/s with Llama 8B on H100 uses different conditions than one showing 3,363 tok/s with DeepSeek-R1 7B on A100. Always compare within the same hardware and workload.

Latency Benchmarks: TTFT, ITL, P99

Three latency metrics define the user experience. TTFT (Time to First Token) is how long before the first token streams back. ITL (Inter-Token Latency) is the gap between consecutive tokens during generation. P99 is the latency that 99% of requests stay under; the slowest 1% define the tail.

| Metric | Value | Conditions |
|---|---|---|
| Mean TTFT | 72 ms | Llama 8B, 1x H100, low concurrency |
| P99 TTFT | 79 ms | Llama 8B, 1x H100, low concurrency |
| Mean TTFT | 123 ms | Llama 3.1 70B FP8, 1x H100 |
| Mean TTFT | 261 ms | GPT-OSS-120B, H100, 32 concurrent |
| Median ITL | 11 ms | GPT-OSS-120B, H100, 32 concurrent |
| Mean ITL | 17 ms | GPT-OSS-120B, H100, 32 concurrent |
| P99 Latency | 80 ms | Peak throughput vs Ollama (673 ms) |

TTFT scales with model size and concurrency. An 8B model on H100 delivers sub-80ms TTFT. A 120B model under 32 concurrent requests pushes TTFT to 261ms. ITL stays remarkably stable at 11-21ms across workloads, because decode is memory-bound and predictable once prefill completes.

For interactive applications (chat, coding assistants), TTFT under 200ms and ITL under 30ms produce a fluid streaming experience. vLLM consistently hits these targets on H100 hardware for models up to 70B parameters.
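All three metrics fall out of streamed token timestamps. A minimal sketch (illustrative timestamps and a nearest-rank percentile, not a vLLM API):

```python
import math
import statistics

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile over a list of latency samples."""
    ranked = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ranked)))
    return ranked[rank - 1]

def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """TTFT is first-token arrival minus request start; ITL is the gap
    between consecutive streamed tokens."""
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft": token_times[0] - request_start,
        "mean_itl": statistics.mean(itls),
        "p99_itl": percentile(itls, 99),
    }
```

Feeding in a request sent at t=0 whose tokens arrive at 72 ms and then every 11 ms reproduces the 8B-on-H100 profile from the table above.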

vLLM vs SGLang vs TensorRT-LLM

Three inference engines dominate production LLM serving in 2026. Each optimizes for different tradeoffs. The benchmarks below use H100 hardware with comparable configurations.

| Dimension | vLLM | SGLang | TensorRT-LLM |
|---|---|---|---|
| Throughput (Llama 8B) | ~12,500 tok/s | ~16,200 tok/s | ~14,400 tok/s* |
| TTFT (lowest) | Fastest across concurrency levels | Competitive | Slowest (compilation overhead) |
| ITL stability | Good | Best (4-21ms range) | Good at low concurrency |
| Prefix caching | PagedAttention, APC | RadixAttention (75-95% hit rates) | Limited |
| Model swap time | Seconds | Seconds | 10-30 min compilation |
| Hardware support | NVIDIA, AMD, TPU, Trainium, Gaudi | NVIDIA, AMD | NVIDIA only |
| Model coverage | Broadest (+ Transformers fallback) | Good | Limited (requires conversion) |
| Contributor base | 3x larger | Growing | NVIDIA-maintained |

* TensorRT-LLM throughput estimated from compiled FP8 benchmarks. Actual numbers depend on compilation target.

GPT-OSS-120B on 2xH100 (Clarifai Benchmark)

Clarifai benchmarked all three engines serving GPT-OSS-120B on 2x H100 GPUs across concurrency levels from 1 to 100:

  • vLLM reached 4,741 tok/s at 100 concurrent requests and had the fastest TTFT at every concurrency level.
  • SGLang showed the most stable ITL (4-21ms) and strong performance at moderate concurrency (50 requests).
  • TensorRT-LLM had the best single-request throughput but scaled worse at high concurrency and showed the slowest TTFT.

When to pick which

Pick vLLM

Fast iteration, multi-model serving, broad hardware support, or production without a compilation step. Best default for most teams.

Pick SGLang

Multi-turn chat, RAG pipelines, or any workload with shared prefixes. RadixAttention's 75-95% cache hit rates add 10-20% throughput over vLLM for these patterns.

Pick TensorRT-LLM

Single model in long-term production on NVIDIA hardware where you can absorb a 10-30 minute compilation step. Highest peak throughput when model won't change.

How vLLM Works: PagedAttention + Continuous Batching

Two innovations explain why vLLM outperforms naive serving by up to 24x. Understanding them is necessary for tuning.

PagedAttention

Traditional serving engines pre-allocate contiguous GPU memory for each sequence based on maximum possible length. If max context is 128K tokens but the average request uses 2K, 98% of allocated memory sits empty. This internal fragmentation wastes 60-80% of KV cache memory in practice.

PagedAttention borrows the idea of virtual memory from operating systems. It breaks the KV cache into fixed-size blocks (typically 16 tokens each) that can be stored anywhere in GPU memory. A block table maps logical positions to physical memory, and blocks are allocated on demand as the sequence grows. The result: under 4% memory waste. More sequences fit in GPU memory, which means larger effective batch sizes and higher throughput.
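The block-table bookkeeping can be sketched in a few lines. This is a toy model of the idea, not vLLM's allocator; the 16-token block size matches the typical default:

```python
BLOCK_SIZE = 16  # tokens per KV block, vLLM's typical default

class BlockTable:
    """Logical-to-physical block mapping for one sequence. Physical
    blocks come from a shared free pool and are handed out on demand,
    so nothing is reserved for context the request never uses."""
    def __init__(self, free_pool: list[int]):
        self.free_pool = free_pool      # shared pool of physical block ids
        self.blocks: list[int] = []     # logical index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a fresh physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_pool.pop())
        self.num_tokens += 1

pool = list(range(100))   # 100 physical KV blocks of GPU memory
seq = BlockTable(pool)
for _ in range(40):       # a 40-token sequence
    seq.append_token()
# 40 tokens occupy ceil(40/16) = 3 blocks; waste is at most one partial block.
```

Because every sequence draws from the same pool, the only fragmentation left is the partially filled last block of each sequence, which is where the under-4% waste figure comes from.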

Continuous Batching

Static batching groups N requests, runs them to completion, then accepts the next batch. If one request generates 1,000 tokens while the others generate 10, the GPU idles waiting for the slow request.

vLLM schedules at the token level. After each generation step, it evicts finished sequences and inserts waiting requests into the batch. No GPU cycles wasted waiting for the slowest request. All sequences are flattened into a single "super sequence" with position indices and attention masks ensuring each sequence only attends to its own tokens.
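The policy can be sketched as a loop that admits and evicts between steps (a toy simulation with hypothetical request lengths, not vLLM's scheduler):

```python
from collections import deque

def continuous_batching(requests: dict[str, int], max_batch: int) -> int:
    """Token-level scheduling sketch. `requests` maps request id to the
    number of tokens it will generate; returns total decode steps."""
    waiting = deque(requests.items())
    running: dict[str, int] = {}
    steps = 0
    while waiting or running:
        while waiting and len(running) < max_batch:  # admit waiting requests
            rid, n = waiting.popleft()
            running[rid] = n
        for rid in list(running):                    # one decode step per sequence
            running[rid] -= 1
            if running[rid] == 0:                    # evict finished sequences
                del running[rid]
        steps += 1
    return steps
```

With requests of 10, 1, and 1 tokens at batch size 2, this finishes in 10 steps; static batching would take 11, because the second short request could only start after the 10-token request completed.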

The combination of PagedAttention (more sequences fit) and continuous batching (no idle GPU time) is what produces vLLM's throughput advantage. The system maintains large effective batch sizes while keeping latency low for individual requests.

vLLM V1: The Architectural Reset

Released in January 2025 and the default since v0.8.0, V1 is a ground-up rewrite of vLLM's core engine. The team revisited key design decisions from 1.5 years of production experience.

  • 1.7x: throughput vs V0
  • <1%: prefix caching overhead at 0% hits
  • FA3: FlashAttention 3 integration
  • v0.19: latest release (Apr 2026)

Key V1 improvements

  • Incremental state updates. V1 caches request states on workers and transmits only diffs at each step, reducing scheduler-worker communication overhead.
  • Near-free prefix caching. V1's prefix caching causes less than 1% throughput decrease even at 0% cache hit rate. At high hit rates, it multiplies throughput.
  • FlashAttention 3 integration. Handles mixing prefill and decode within the same batch efficiently, supporting chunked prefill without separate kernel launches.
  • Vision-language model support. An "encoder cache" stores vision embeddings, allowing text inputs to be chunked and processed across steps without regenerating vision encodings.
  • Async scheduling with spec decode. Zero-bubble overlap between scheduling and speculative decoding, added in v0.19.

v0.19.0 (April 2026)

The latest release adds Gemma 4 support (MoE, multimodal, reasoning, tool use), piecewise CUDA graphs for pipeline parallelism, full CUDA graph capture for vision encoders (ViT), and Intel XPU CUDA graph support. Pooling throughput improved 13.9%.

Supported Models

vLLM supports the broadest set of model architectures among open inference engines. Any model using a supported architecture loads directly from HuggingFace. Models using unsupported architectures fall back to the Transformers backend with less than 5% performance penalty.

| Family | Architectures | Notable Models |
|---|---|---|
| Llama | LlamaForCausalLM | Llama 3.1, Llama 3.3, Llama 4, Code Llama |
| DeepSeek | DeepseekV2ForCausalLM, DeepseekV3ForCausalLM | DeepSeek-V3, DeepSeek-R1, DeepSeek-Coder-V2 |
| Qwen | Qwen2ForCausalLM, Qwen3ForCausalLM | Qwen 2.5, Qwen 3, Qwen-Coder, Qwen-VL |
| Mistral / Mixtral | MistralForCausalLM, MixtralForCausalLM | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B |
| Gemma | Gemma2ForCausalLM, Gemma4ForCausalLM | Gemma 2, Gemma 4 (MoE, multimodal) |
| Phi | PhiForCausalLM, Phi3ForCausalLM | Phi-3, Phi-3.5, Phi-4 |
| Command-R | CohereForCausalLM, Cohere2ForCausalLM | Command-R, Command-R+ |
| Multimodal | LlavaForConditionalGeneration, etc. | LLaVA, Pixtral, InternVL, Qwen-VL |

Beyond text generation, vLLM supports embedding models (E5-Mistral), classification, reward models, and pooling tasks. The full list is maintained at docs.vllm.ai/en/latest/models/supported_models.

Production Deployment

vLLM ships as a Docker image with an OpenAI-compatible HTTP server. Single-node deployment is one command. Cluster-wide deployment uses the official production-stack Helm chart.

Docker: Single-Node

Run vLLM with Docker

docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc=host \
  --shm-size=16g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_API_KEY=your-secret-key \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --served-model-name llama-70b

Key flags: --ipc=host enables NCCL shared memory for multi-GPU communication. --shm-size=16g allocates shared memory. --gpu-memory-utilization 0.90 leaves headroom for CUDA context. --max-model-len caps context length, freeing memory for larger batch sizes.

Kubernetes: Production Stack

The vLLM production-stack (released January 2025) provides Kubernetes-native cluster deployment with Helm charts. It adds request routing, multi-model support, KV cache offloading, and autoscaling.

Deploy with Helm

helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update

helm install vllm-stack vllm/vllm-stack \
  --set models[0].name=llama-70b \
  --set models[0].modelId=meta-llama/Llama-3.1-70B-Instruct \
  --set models[0].tensorParallelSize=4 \
  --set models[0].gpuMemoryUtilization=0.90 \
  --set monitoring.enabled=true \
  --set autoscaling.enabled=true \
  --set autoscaling.minReplicas=2 \
  --set autoscaling.maxReplicas=8

Monitoring

vLLM exposes a Prometheus-compatible /metrics endpoint. Critical metrics to monitor:

  • vllm:num_requests_running and vllm:num_requests_waiting for queue depth
  • vllm:gpu_cache_usage_perc for KV cache utilization
  • vllm:time_to_first_token_seconds for TTFT distribution
  • vllm:inter_token_latency_seconds for ITL tracking
  • vllm:gpu_prefix_cache_hit_rate for prefix caching effectiveness
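The endpoint serves standard Prometheus text format, so any of these gauges can be read with a few lines of Python. A minimal sketch that ignores label sets; the sample payload is illustrative:

```python
def parse_metric(metrics_text: str, name: str) -> float:
    """Pull the first sample of one metric out of a Prometheus
    text-format payload (comment lines start with '#')."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])
    raise KeyError(name)

sample = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting 7.0
"""
```

In production you would scrape the endpoint with Prometheus rather than parse it by hand, but this is handy for smoke tests and quick dashboards.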

Autoscaling with KEDA

KEDA scales vLLM replicas based on per-replica queue depth from Prometheus. Scale on vllm:num_requests_waiting / num_replicas rather than CPU or memory. This reacts to actual inference demand, not proxy metrics. Prerequisites: Kubernetes 1.27+, NVIDIA GPU Operator, KEDA v2.x, cert-manager.
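The scaling rule amounts to a ceiling division on total queue depth. A sketch; the target number of waiting requests per replica is an assumption you tune for your latency budget:

```python
import math

def desired_replicas(waiting_per_replica: list[float], target: float,
                     min_replicas: int, max_replicas: int) -> int:
    """KEDA-style rule: size the deployment so queued requests per
    replica stay at or below `target`. Scaling on
    vllm:num_requests_waiting reacts to inference demand directly,
    unlike CPU or memory proxies."""
    total_waiting = sum(waiting_per_replica)
    desired = math.ceil(total_waiting / target)
    return max(min_replicas, min(max_replicas, desired))
```

With two replicas each holding 20 waiting requests and a target of 10 per replica, the rule scales to 4; an idle cluster falls back to the configured minimum.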

Configuration Tuning

Default vLLM settings work. But tuning five parameters can double throughput for your specific workload. Benchmark with benchmark_serving.py under realistic traffic before and after each change.

Tensor Parallelism

--tensor-parallel-size N shards the model across N GPUs within a node. Each GPU holds 1/N of the weights and gets more memory for KV cache. Set to the minimum number of GPUs that fit your model in memory. Going higher adds synchronization overhead.
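Why the minimum is easy to find: the weight math is simple division. A back-of-envelope sketch that counts only weights; activations and KV cache need headroom on top:

```python
def per_gpu_weight_gb(params_billion: float, bytes_per_param: float,
                      tp_size: int) -> float:
    """Weight memory per GPU under tensor parallelism: each of the
    tp_size GPUs holds roughly 1/tp_size of the parameters."""
    return params_billion * bytes_per_param / tp_size

# Llama 3.1 70B in BF16 (2 bytes/param) is ~140 GB of weights:
# too big for one 80 GB H100, but ~70 GB per GPU at tensor-parallel-size 2.
```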

Chunked Prefill

Enabled by default in V1. Splits large prefill operations into chunks and batches them with decode requests. This prevents a single long-context request from blocking all decode operations. The --max-num-batched-tokens parameter controls the total token budget per scheduler step. Higher values favor throughput. Lower values favor latency.
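The budget policy can be sketched as: decodes get their one token each, and whatever budget remains goes to a chunk of the pending prefill (illustrative of the policy, not vLLM's scheduler code):

```python
def schedule_step(decode_seqs: int, prefill_remaining: int,
                  max_num_batched_tokens: int) -> tuple[int, int]:
    """One scheduler step under chunked prefill. Decodes cost one token
    each; the leftover budget is spent on a slice of the prefill.
    Returns (prefill tokens this step, prefill tokens still pending)."""
    budget = max_num_batched_tokens - decode_seqs
    chunk = min(prefill_remaining, max(budget, 0))
    return chunk, prefill_remaining - chunk
```

A 10,000-token prompt arriving while 64 sequences are decoding gets processed in ~2,000-token slices, so decode ITL stays flat instead of stalling for the whole prefill.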

Speculative Decoding

Uses a small draft model to propose multiple tokens, then verifies them in a single forward pass of the target model. Reduces latency by 1.5-2.8x depending on workload. Methods supported: draft models, EAGLE/EAGLE-3, P-EAGLE, n-gram matching, suffix decoding, MLP speculators.

| Method | Speedup | Best For |
|---|---|---|
| Draft model | 1.5x | General workloads, ShareGPT-like traffic |
| Prompt lookup | Up to 2.8x | Summarization, repetitive patterns |
| EAGLE-3 | Up to 2.1x | RAG, math reasoning (training-data-similar tasks) |
| P-EAGLE (B200) | 1.69x over EAGLE-3 | HumanEval, SPEED-Bench (latest NVIDIA hardware) |
| Suffix decoding | ~91% of theoretical max | CPU-based, no draft model needed, 20µs/token |
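The speedup comes from emitting several tokens per target-model forward pass. The standard estimate, assuming an i.i.d. per-token acceptance rate (the rates below are illustrative assumptions, not benchmark results):

```python
def expected_tokens_per_step(k: int, alpha: float) -> float:
    """Expected tokens emitted per target-model forward pass with a
    draft of length k and per-token acceptance probability alpha:
    (1 - alpha^(k+1)) / (1 - alpha), the standard speculative-decoding
    estimate (the +1 is the bonus token from the verification pass)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Four draft tokens at 70% acceptance gives ~2.8 tokens per step
# instead of 1, which is where the 1.5-2.8x latency reductions come from.
```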

Prefix Caching (APC)

Automatic Prefix Caching caches KV blocks across requests that share token prefixes. A system prompt shared by all requests gets computed once, then reused. Multi-turn conversations reuse the entire chat history from prior turns.

Enable with --enable-prefix-caching. In V1, the overhead at 0% hit rate is under 1%. At high hit rates, TTFT drops dramatically because shared prefixes skip prefill entirely. Particularly effective for RAG (shared document context), multi-turn chat, and coding agents (shared system prompts + file context).
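Conceptually, each cached block is keyed by a hash of the full token prefix up to and including that block, so a block is only reused on an exact prefix match. A sketch of the keying scheme, not vLLM's internal hashing:

```python
import hashlib

BLOCK = 16  # tokens per KV block

def block_hashes(token_ids: list[int]) -> list[str]:
    """One hash per full block, computed over ALL tokens up to and
    including that block, so reuse implies the whole prefix matches."""
    hashes = []
    for end in range(BLOCK, len(token_ids) + 1, BLOCK):
        hashes.append(hashlib.sha256(str(token_ids[:end]).encode()).hexdigest())
    return hashes

system = list(range(32))                    # shared 32-token system prompt
a = block_hashes(system + [101] * 16)       # request 1
b = block_hashes(system + [202] * 16)       # request 2
# Both requests share their first two block hashes (the system prompt)
# and diverge on the third, so prefill is skipped for the shared blocks.
```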

GPU Memory Utilization

--gpu-memory-utilization controls what fraction of GPU memory vLLM claims for KV cache. Default is 0.9. Setting it higher (0.95) squeezes more concurrent requests but risks OOM under memory pressure. Setting it lower (0.8) provides headroom for spiky workloads. Benchmark your specific model to find the right value.
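The tradeoff is concrete arithmetic. A rough sketch; the per-token KV size below is computed from Llama 3.1 8B's GQA shape and is an estimate:

```python
def kv_cache_tokens(gpu_mem_gb: float, util: float, weight_gb: float,
                    kv_bytes_per_token: int) -> int:
    """KV-cache capacity estimate: vLLM claims util * total GPU memory;
    weights come out of that claim and the remainder holds KV blocks.
    Ignores activation workspace, so treat it as an upper bound."""
    free_gb = gpu_mem_gb * util - weight_gb
    return int(free_gb * 1e9 / kv_bytes_per_token)

# Llama 3.1 8B (BF16): ~16 GB of weights; KV is ~128 KB/token
# (32 layers * 8 KV heads * 128 head dim * 2 tensors K,V * 2 bytes).
# On an 80 GB H100 at util 0.90, ~56 GB remains for the KV cache.
tokens = kv_cache_tokens(80, 0.90, 16, 131072)
```

Pushing util from 0.90 to 0.95 buys roughly 4 GB more cache on this card; dropping to 0.80 gives up 8 GB in exchange for OOM headroom.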

Quantization: FP8, AWQ, GPTQ

Quantization reduces model memory footprint, allowing larger models on fewer GPUs and increasing throughput by reducing memory bandwidth requirements. vLLM supports multiple quantization formats, each with different speed-quality tradeoffs.

| Format | Throughput Impact | Quality Impact | Best For |
|---|---|---|---|
| FP8 (W8A8) | Up to 1.6x improvement | Minimal accuracy loss | H100/B200 with native FP8 support |
| AWQ (Marlin) | 741 tok/s (10.9x speedup with Marlin kernels) | 51.8% Pass@1 on HumanEval | Best speed-quality tradeoff for coding |
| GPTQ (Marlin) | ~3x vs BF16 (2.6x speedup with Marlin) | Better on real-world benchmarks | Coding tasks, complex reasoning |
| BF16 (baseline) | 1x (baseline) | Full precision | When accuracy matters most |

The key insight: AWQ and GPTQ with Marlin kernels deliver 3x+ throughput improvements over full-precision BF16, because Marlin fuses dequantization into the matrix multiplication kernel. Without Marlin, weight-only quantization can actually decrease throughput due to dequantization overhead. Always benchmark with and without Marlin.

Serve a quantized model

# FP8 quantization (H100/B200 native)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 2

# AWQ quantization
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --enable-prefix-caching

# GPTQ with Marlin kernels (auto-detected)
vllm serve TheBloke/Llama-2-70B-GPTQ \
  --quantization gptq

FP8 is the production default on Hopper+

On H100 and B200 GPUs with native FP8 tensor cores, FP8 quantization is nearly free: 2x memory reduction, up to 1.6x throughput gain, minimal accuracy loss. No model conversion needed. vLLM handles dynamic FP8 quantization at serving time. For static FP8 (higher throughput), pre-quantize with tools like AMD Quark or NVIDIA TensorRT Model Optimizer.

When to Use vLLM

vLLM is the right choice for most self-hosted inference deployments. But "most" is not "all." Here is where vLLM excels and where alternatives are stronger.

vLLM excels at

Multi-model serving (swap models without recompilation). Broad hardware support (NVIDIA, AMD, TPU, Trainium, Gaudi, CPU). Fast iteration (no compilation step). Teams that need production today with the widest model compatibility. OpenAI API compatibility for drop-in replacement.

Consider alternatives when

Multi-turn workloads with heavy prefix sharing (SGLang's RadixAttention wins). Single-model, maximum-throughput on NVIDIA hardware (TensorRT-LLM after compilation). Local development on consumer hardware (Ollama/llama.cpp). Heavily quantized small models on CPU (llama.cpp).

A common production pattern: use vLLM as the default serving engine, benchmark SGLang for specific high-traffic multi-turn endpoints, and deploy TensorRT-LLM only for the one model that handles 80% of your traffic and hasn't changed in months.

vLLM for Coding Agents

Coding agents are one of the most demanding inference workloads. Each agent turn involves a system prompt (often 10K+ tokens of codebase context), tool calls, and multi-turn conversation with the model. A single coding task might require 10-50 model calls. The inference backend needs high throughput, low TTFT, and effective prefix caching.

vLLM handles this well. The OpenAI-compatible API means any agent framework (LangChain, PydanticAI, custom) that uses the OpenAI Python client works out of the box by changing the base URL. Prefix caching reuses the system prompt and accumulated context across turns, reducing TTFT by 60-90% for the shared prefix. Tool use parsing is supported with configurable tool parsers.

Point a coding agent at vLLM

from openai import OpenAI

# Point to your vLLM server instead of OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-vllm-api-key",
)

response = client.chat.completions.create(
    model="llama-70b",
    messages=[
        {"role": "system", "content": system_prompt},  # cached after first call
        {"role": "user", "content": "Refactor the auth middleware to use JWT"},
    ],
    tools=tool_definitions,
    stream=True,
)

Self-hosted vLLM gives you full control over the inference stack: model choice, quantization, hardware allocation, and cost. But it requires GPU procurement, operational overhead, and performance tuning. For teams that want the agent capabilities without managing infrastructure, Morph's APIs provide managed inference with optimizations purpose-built for coding workflows, including the Fast Apply model that handles LLM-generated code edits at 10,500+ tok/s.

Self-hosted vs managed: the tradeoff

Self-hosted vLLM gives you cost control at scale and data sovereignty. Managed APIs (Morph, Together, Fireworks) give you zero operational overhead and features like optimized code editing models. Many teams run both: managed APIs for development and low-volume endpoints, self-hosted vLLM for high-traffic production workloads where GPU costs are predictable.

FAQ

What is vLLM?

A high-throughput, memory-efficient inference engine for LLMs. Uses PagedAttention for GPU memory management, continuous batching for hardware utilization, and exposes an OpenAI-compatible API. Open-source, maintained by the vLLM team with 1,000+ contributors.

How fast is vLLM on H100?

Approximately 12,500 tok/s for Llama 3.1 8B (BF16) on a single H100 80GB. For larger models: 4,741 tok/s for GPT-OSS-120B on 2xH100 at 100 concurrent requests. TTFT ranges from 72ms (8B, low concurrency) to 261ms (120B, 32 concurrent).

How does vLLM compare to SGLang?

SGLang delivers ~29% higher throughput on H100 (16,200 vs 12,500 tok/s for Llama 8B) thanks to RadixAttention. vLLM wins on hardware support (TPUs, Trainium, Gaudi), model coverage, and ecosystem maturity. Pick SGLang for multi-turn workloads with shared prefixes. Pick vLLM for everything else.

How does vLLM compare to TensorRT-LLM?

TensorRT-LLM achieves 15-30% higher peak throughput after a 10-30 minute compilation step. NVIDIA-only. Use it for a single model that won't change. Use vLLM when you need model flexibility, multi-hardware support, or fast deployment.

What models does vLLM support?

Most open-source architectures: Llama, DeepSeek, Qwen, Mistral, Mixtral, Gemma, Phi, Command-R, and many more. Plus multimodal models (LLaVA, Pixtral, Qwen-VL) and embedding models. Unsupported architectures fall back to Transformers with under 5% performance penalty.

What quantization should I use?

FP8 on H100/B200 (native support, minimal quality loss, up to 1.6x throughput). AWQ with Marlin kernels for the best speed-quality tradeoff on older hardware. GPTQ with Marlin for coding tasks where GPTQ shows better real-world accuracy than AWQ.

How do I deploy vLLM in production?

Docker for single-node: docker run --gpus all vllm/vllm-openai --model your-model. Kubernetes for clusters: use the vLLM production-stack Helm chart with KEDA autoscaling and Prometheus monitoring. Monitor TTFT P99, KV cache utilization, and request queue depth.

What is vLLM V1?

A rewrite of vLLM's core architecture, default since v0.8.0. Delivers 1.7x throughput over V0, integrates FlashAttention 3, and makes prefix caching nearly free (<1% overhead at 0% hit rate). The latest release is v0.19.0 (April 2026).

Inference for Coding Agents

vLLM serves the model. Morph handles what comes after: Fast Apply merges LLM-generated code edits into your codebase at 10,500+ tok/s with deterministic behavior. WarpGrep provides semantic code search purpose-built for agent tool calls. The layer between your inference engine and your repository.