Tokens per second is the most quoted LLM benchmark. It is also the most misunderstood. The number on a leaderboard rarely predicts what you experience in production, because tok/s conflates three different measurements. This guide covers what tok/s actually means, where every major provider lands in 2026, and why coding agents care about a speed axis that most benchmarks ignore entirely.
What Tokens Per Second Actually Measures
A token is the atomic unit of LLM input and output. One token maps to roughly 3-4 characters of English text or 2-3 characters of code. The word “function” is one token. A line of Python like `def calculate_total(items):` is about 7 tokens.
Tokens per second measures the rate at which a model generates output tokens during the decode phase of inference. After the model processes your input prompt (the prefill phase), it begins generating output tokens one at a time, each conditioned on every token that came before it. The speed of this autoregressive generation is what tok/s captures.
A model running at 100 tok/s produces roughly 75 words per second, or about 4,500 words per minute. That is more than 100x the average human typing speed (~40 wpm) and roughly 18x comfortable reading speed (~250 wpm). At 50 tok/s, the model already outruns any human consumer of its output. So why does speed beyond 50 tok/s matter? Because not every consumer is human.
Three Metrics, One Name
When someone says “tokens per second,” they could mean any of three distinct measurements. Confusing them leads to bad provider decisions.
Time to First Token (TTFT)
How long you wait before the first token appears. Determined by the prefill phase, where the model processes all input tokens in parallel. Under 200ms feels instant. Over 1 second feels sluggish. Scales linearly with input length.
Output Speed (tok/s)
How fast tokens stream after the first one arrives. This is what most people mean by “tokens per second.” Determined by the decode phase. Memory-bandwidth-bound on GPUs. Ranges from 30 tok/s (local) to 2,100 tok/s (Cerebras).
Throughput (total tok/s)
Total tokens generated per second across all concurrent requests. A system running 10 requests at 80 tok/s each has 800 tok/s throughput. High throughput does not mean fast individual responses. This is the number providers optimize for.
A provider quoting “800 tok/s” might mean 800 tok/s throughput across 10 concurrent users, each experiencing 80 tok/s. Or it might mean 800 tok/s on a single request with a tiny 0.8B model. Or it might mean 800 tok/s on a batch of pre-cached prompts with artificially low TTFT. The number without the measurement conditions is meaningless.
How Artificial Analysis standardizes measurement
Artificial Analysis benchmarks every model using OpenAI's tiktoken tokenizer (o200k_base) so the same text counts as the same number of tokens regardless of the model's native tokenizer. They test 8 times per day at varying intervals, report output speed as tokens received per second after the first chunk, and publish live p50 metrics across 72-hour windows. When comparing providers, use a benchmark that standardizes tokenization, concurrency, and measurement intervals.
The Latency Equation
Total response latency for a single request breaks down to:
Latency = TTFT + (output_tokens / output_speed)
A request generating 500 tokens with 0.5s TTFT and 100 tok/s output speed takes 5.5 seconds total. The same request with 0.2s TTFT and 200 tok/s takes 2.7 seconds. Both TTFT and output speed contribute, but their relative importance depends on your use case. Streaming chatbots prioritize TTFT. Batch processing prioritizes throughput. Coding agents prioritize output speed on specific subtasks, as we will see.
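The equation and the two scenarios above can be checked with a small helper (a sketch; the function name is illustrative):

```python
def total_latency(ttft_s: float, output_tokens: int, output_tps: float) -> float:
    """Total single-request latency: prefill wait plus decode time."""
    return ttft_s + output_tokens / output_tps

# The two scenarios from the text:
slow = total_latency(0.5, 500, 100)  # 5.5 s
fast = total_latency(0.2, 500, 200)  # 2.7 s
```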
Why Coding Agents Care About a Different Speed
Human users perceive speed through TTFT and reading speed. Once tokens arrive faster than ~250 words per minute, a faster model feels the same. The bottleneck shifts to human cognition.
Coding agents have no cognition bottleneck. They consume output at the speed it arrives and immediately take the next action. Every millisecond of generation latency delays the next tool call, the next file edit, the next test run. In a 30-step agentic task, a 5-second delay per step means 2.5 minutes of pure waiting.
The Three Speed Bottlenecks
Agent task latency decomposes into three bottlenecks, and most optimization ignores the biggest one:
Search Latency
Finding relevant code in the codebase. Agents spend 60-80% of their tokens on search-then-read patterns. Semantic search tools reduce this by returning relevant ranges instead of entire files.
Reasoning Latency
The frontier model deciding what to change. Determined by model output speed (40-130 tok/s for API models). Reasoning models like o3 or Gemini 2.5 Pro add thinking time. Hard to optimize without switching models.
Apply Latency
Writing the change into the file. A 1,000-line file rewrite means generating 3,500-4,500 tokens; at 80 tok/s that is roughly 45 seconds. Most agents use the same frontier model for this step, wasting both time and money on a task that doesn't need frontier intelligence.
Reasoning is the step that needs a frontier model. Apply is the step that needs speed. When a coding agent asks Claude or GPT to rewrite an entire file to apply a 5-line change, it is paying frontier-model prices for a mechanical merge operation. The fix is to route the apply step to a model built for speed on that specific task.
The compounding latency problem
Agentic workflows chain steps sequentially. Each step waits for the previous one. A 20-step coding task where each apply takes ~45 seconds at 80 tok/s spends 15 minutes on file writes alone. At 10,500 tok/s, the same 20 applies finish in under 30 seconds total. Pipeline latency directly impacts developer velocity and, at scale, deployment frequency.
Provider Speed Comparison (April 2026)
The tables below show output speed (single-request tok/s) for notable providers and models: API providers serving open-source models, frontier API models, local setups, and task-specific models. Numbers are from Artificial Analysis, BenchLM, and provider-reported benchmarks. All measurements use standardized tokenization.
API providers serving open-source models

| Provider | Model | Output tok/s | TTFT | Notes |
|---|---|---|---|---|
| Cerebras | Llama 3.1 70B | 2,100 | ~0.3s | Wafer-scale engine, fastest raw speed |
| Cerebras | Llama 3.1 405B | 969 | ~0.4s | Largest model at near-1K tok/s |
| SambaNova | Llama 3.1 70B | 580 | ~0.2s | Best TTFT among fast providers |
| Groq | Llama 3.3 70B | 276 | ~0.3s | Consistent speed across context lengths |
| Together AI | Llama 3.1 70B | ~200 | ~0.4s | 2x faster after late-2025 upgrades |
| Fireworks AI | Llama 3.3 70B | ~150 | ~0.6s | Optimized for breadth of model support |
| DeepInfra | Llama 3.1 70B | ~120 | ~0.5s | Cost-optimized, reliable throughput |
Frontier API models

| Model | Provider | Output tok/s | TTFT | Notes |
|---|---|---|---|---|
| Gemini 2.5 Flash | Google | 238 | 0.5s | Fastest frontier-class model |
| GPT-4.1 nano | OpenAI | 181 | 0.6s | Fast but limited capability |
| Grok 4.1 Fast | xAI | 138 | 0.5s | Best speed-to-quality ratio |
| GPT-4o | OpenAI | 131 | 0.8s | Workhorse model, now deprecated |
| Gemini 2.5 Pro | Google | 117 | 21s | Reasoning model, high TTFT |
| GPT-5.2 | OpenAI | 73 | 130s | Reasoning model, very high TTFT |
| Claude Sonnet 4 | Anthropic | 40-55 | 1.3s | Best coding quality, moderate speed |
| Claude Opus 4.6 | Anthropic | 40 | 1.8s | Highest quality, lowest speed |
| DeepSeek V3.2 | DeepSeek | 35 | 3.8s | Cheapest frontier model |
Local inference setups

| Setup | Model | Output tok/s | TTFT | Notes |
|---|---|---|---|---|
| RTX 4090, llama.cpp | Llama 3.1 8B (Q4) | ~70 | <0.2s | Best consumer GPU performance |
| RTX 4090, Ollama | Llama 3.1 8B (Q4) | ~65 | ~0.25s | Slightly slower than raw llama.cpp |
| RTX 4060 Ti | Llama 3.1 8B (Q4) | ~40 | ~0.3s | Mid-range GPU, usable speed |
| M3 Max, MLX | Llama 3.1 8B (Q4) | ~55 | ~0.2s | Apple Silicon optimized |
| CPU only (32-core) | Llama 3.1 8B (Q4) | ~12 | ~1.5s | Usable for development, not production |
Task-specific models

| Model | Task | Output tok/s | How | Notes |
|---|---|---|---|---|
| Morph Fast Apply (v3-fast) | Code merging | 10,500 | 7B + speculative decoding | 98% accuracy on code edits |
| Morph Fast Apply (v3-large) | Code merging | 5,000 | 14B model | Higher accuracy on complex merges |
| Mercury 2 (Inception) | General text | 928 | Optimized architecture | Fastest on Artificial Analysis |
Two patterns stand out. First, specialized inference hardware (Cerebras, Groq, SambaNova) runs 5-20x faster than GPU-based API providers on the same open-source models. The tradeoff is model selection: these providers only support a handful of models. Second, task-specific models demolish general-purpose speed records. Morph Fast Apply at 10,500 tok/s is not competing with Claude at 40 tok/s on the same task. It is doing a narrower task (code merging) fast enough that the step becomes negligible in the overall pipeline.
What Affects Your Tokens Per Second
Six factors determine the tok/s you actually experience. Understanding them explains why the same model at the same provider can feel fast one day and slow the next.
1. Model Size
The dominant factor. Decode speed is memory-bandwidth-bound: each token requires reading the full model weights from memory. Larger models have more weights to read. On the same GPU, a 7B model runs 3-5x faster than a 70B. The relationship is not perfectly linear because larger models also have more layers of KV cache to read, and cache access patterns interact with memory bandwidth differently.
| Model | Parameters | Approx. Output tok/s | Relative Speed |
|---|---|---|---|
| Llama 3.1 8B | 8B | ~150 | 1x (baseline) |
| Llama 3.1 70B | 70B | ~40 | 0.27x |
| Llama 3.1 405B | 405B (multi-GPU) | ~15 | 0.10x |
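Because decode is memory-bandwidth-bound, you can sketch a speed ceiling from bandwidth alone. The bandwidth figure below is an assumption (H100 SXM class HBM); KV cache and activation traffic make real speeds lower than this bound:

```python
def decode_tps_ceiling(params_billions: float, bytes_per_param: float,
                       mem_bandwidth_gbs: float) -> float:
    """Upper bound on single-stream decode speed: every output token
    requires streaming all model weights from memory once."""
    weights_gb = params_billions * bytes_per_param
    return mem_bandwidth_gbs / weights_gb

# Assuming ~3,350 GB/s of HBM bandwidth (H100 SXM class):
ceiling_8b = decode_tps_ceiling(8, 2, 3350)    # ~209 tok/s at FP16
ceiling_70b = decode_tps_ceiling(70, 2, 3350)  # ~24 tok/s at FP16
```

The 8B/70B ratio of this bound (~8.75x) exceeds the observed 3-5x because smaller models spend relatively more time on non-weight traffic and kernel overhead.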
2. Quantization
Reducing weight precision shrinks the bytes read per token. FP8 halves reads relative to FP16 and gives roughly a 33% real-world speedup with near-zero accuracy loss. INT4 quarters the reads; methods like AWQ and GPTQ push speed further but introduce 1-3% accuracy degradation. The kernel implementation matters more than the quantization method: AWQ with Marlin kernels runs at 741 tok/s versus 68 tok/s without, a 10.9x difference from the same quantized weights.
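The precision arithmetic, as a sketch (weight bytes only; real speedups are smaller because activations, KV cache, and kernel overheads do not shrink with weight precision):

```python
def weight_read_gb(params_billions: float, bits_per_param: int) -> float:
    """GB of weights streamed from memory per decoded token."""
    return params_billions * bits_per_param / 8

fp16 = weight_read_gb(70, 16)  # 140.0 GB per token
fp8 = weight_read_gb(70, 8)    # 70.0 GB: half the reads of FP16
int4 = weight_read_gb(70, 4)   # 35.0 GB: a quarter of the FP16 reads
```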
3. Context Length
Longer input prompts increase TTFT linearly because prefill processes all input tokens. They also grow the KV cache, which the model reads at every decode step. On a 70B model, a 2K-token prompt might yield 80 tok/s output speed while a 100K-token prompt drops to 50 tok/s as the KV cache exceeds on-chip memory and spills to slower storage.
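A rough KV-cache size estimate shows why long prompts hurt. The sketch below uses Llama 3.1 70B's published configuration (80 layers, 8 grouped-query KV heads, head dimension 128):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_value: int = 2) -> float:
    """KV cache footprint: keys and values (the factor of 2) for every
    layer, KV head, head dimension, and cached token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 1e9

short = kv_cache_gb(80, 8, 128, 2_000)    # ~0.66 GB for a 2K prompt
long = kv_cache_gb(80, 8, 128, 100_000)   # ~32.8 GB for a 100K prompt
```

At 100K tokens the cache alone rivals the quantized weights in size, which is why decode speed sags on long contexts.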
4. Batch Size
Batching multiple requests together increases GPU utilization and system throughput. Continuous batching (used by vLLM, SGLang, and TensorRT-LLM) can push total throughput to 2,400-2,780 tok/s on an H100 with 100 concurrent requests. But batching trades individual speed for collective throughput. Each user gets a fraction of the total.
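The tradeoff sketched in one line (assuming throughput is shared evenly across the batch, which is optimistic under real schedulers):

```python
def per_request_tps(total_throughput_tps: float, concurrency: int) -> float:
    """Per-user decode speed when system throughput is split across a batch."""
    return total_throughput_tps / concurrency

each = per_request_tps(2600, 100)  # 26.0 tok/s per user at 100 concurrent
```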
5. Hardware Architecture
Cerebras wafer-scale engines and Groq LPUs are designed for sequential token generation. They keep model weights in on-chip SRAM instead of HBM, eliminating the memory bandwidth bottleneck that limits GPU decode speed. The result: 5-20x faster per-request speeds on supported models. NVIDIA GPUs compensate with software optimizations (speculative decoding, continuous batching, Flash Attention) that narrow but do not close the gap.
6. Speculative Decoding
A small draft model (1-7B parameters) generates 3-12 candidate tokens per step. The target model verifies them all in one parallel forward pass. At 70-90% acceptance rates on domain-specific tasks, this yields 2-3x speedup with zero quality loss. Now standard in vLLM, SGLang, and TensorRT-LLM. Morph Fast Apply uses speculative decoding on its 7B model to reach 10,500 tok/s, because code merge patterns are highly predictable for a well-tuned draft model.
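A simplified model of the expected speedup (an illustration, not a benchmark; the per-token acceptance probability and relative draft cost are modeling assumptions):

```python
def spec_decode_speedup(k: int, accept_prob: float,
                        draft_cost: float = 0.1) -> float:
    """Simplified speculative-decoding speedup model.

    The draft proposes k tokens per step, each accepted with probability
    `accept_prob` until the first rejection; the verify pass contributes
    one more token, so expected tokens per target forward pass is
    (1 - p**(k+1)) / (1 - p). `draft_cost` is the draft model's cost
    relative to one target forward pass.
    """
    expected_tokens = (1 - accept_prob ** (k + 1)) / (1 - accept_prob)
    step_cost = 1 + k * draft_cost
    return expected_tokens / step_cost

speedup = spec_decode_speedup(5, 0.8)  # ~2.5x, inside the 2-3x range above
```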
How to Measure Tokens Per Second Correctly
Most published benchmarks are useless for predicting your production experience. A model that runs at 200 tok/s in a controlled test can drop to 60 tok/s under moderate concurrency, or spike its TTFT from 0.5s to 5s during peak load.
Standardize Tokenization
Different models use different tokenizers. The same English sentence might be 10 tokens for GPT-4 and 12 tokens for Llama. Use a common tokenizer like tiktoken (o200k_base) when comparing. Artificial Analysis does this. Most vendor benchmarks do not.
Separate TTFT from Output Speed
A model with 20s TTFT and 200 tok/s output speed will feel slower than one with 0.5s TTFT and 80 tok/s for short requests. Always report both. TTFT matters for interactive use. Output speed matters for long generations.
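A minimal way to measure both from a streaming response (the chunk iterator is a stand-in for any provider's streaming API; real harnesses separate the first chunk's tokens more carefully):

```python
import time

def measure_stream(chunks):
    """Return (ttft_s, output_tps) for an iterable of token lists."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for chunk in chunks:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += len(chunk)
    decode_s = time.perf_counter() - first_token_at
    ttft = first_token_at - start
    # Guard against a stream that arrives in a single chunk
    tps = n_tokens / decode_s if decode_s > 0 else float("inf")
    return ttft, tps
```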
Test at Real Concurrency
Single-request benchmarks measure peak per-user speed. Production systems handle 10-1000+ concurrent requests. Speed under load is typically 30-70% of single-request speed. Test at your actual traffic patterns.
Measure Over Days, Not Minutes
Provider speeds fluctuate with load. A 5-minute benchmark captures a snapshot, not reality. Artificial Analysis publishes 72-hour rolling averages. Run your own benchmarks for at least 24 hours, sampling every 3 hours, before making provider decisions.
Benchmarking Tools
| Tool | Type | What It Measures | Best For |
|---|---|---|---|
| NVIDIA GenAI-Perf | Open-source CLI | TTFT, ITL, throughput, p50/p95/p99 | Self-hosted inference engines |
| Artificial Analysis | Live leaderboard | Output speed, TTFT, price, quality | Comparing API providers |
| BenchLM | Live leaderboard | Tok/s, TTFT by provider and model | Quick provider comparison |
| llm-benchmark (Baseten) | Open-source | Output TPS, latency percentiles | Custom benchmarks on your workload |
Report percentiles, not averages
Average latency hides outliers that destroy user experience. If your p50 is 80 tok/s but p99 is 15 tok/s, 1 in 100 requests crawls. For coding agents that chain 20+ steps, a p99 slowdown on any single step delays the entire task. Report p50, p95, and p99. If a provider only shows averages, that is a red flag.
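A sketch of tail-aware reporting with Python's `statistics` module (latencies here are in seconds; for tok/s the bad tail is the low percentiles instead):

```python
import statistics

def tail_report(latencies_s: list[float]) -> dict[str, float]:
    """p50/p95/p99 of per-request latencies (inclusive quantile method)."""
    q = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

# 95 fast requests and 5 slow ones: the tail shows what the mean hides
report = tail_report([1.0] * 95 + [30.0] * 5)
```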
The Apply Speed Gap
The speed tables above compare models on general text generation. For coding agents, a different benchmark matters: how fast can the model merge an edit into an existing file?
Most coding agents use their frontier model (Claude, GPT, Gemini) for every step, including the mechanical step of applying edits. The frontier model rewrites the entire file. For a 1,000-line file, that means generating 3,500-4,500 tokens at 40-130 tok/s. That is 27-112 seconds per edit, and each edit costs frontier-model prices ($3-15 per million output tokens).
Morph Fast Apply is a 7B model trained specifically on code merging. It takes the original file, a code edit snippet with “existing code” markers for unchanged sections, and an instruction. It returns the complete merged file. At 10,500 tok/s, a 500-line file merges in 0.8 seconds. A 1,000-line file takes 1.3 seconds. Compare that to 22-45 seconds with a frontier model.
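Given the inputs described above, a request to an OpenAI-compatible apply endpoint might be assembled like this (the model name, tag format, and payload shape are illustrative assumptions for this sketch, not official API documentation):

```python
def build_apply_request(instruction: str, original_file: str,
                        edit_snippet: str,
                        model: str = "morph-v3-fast") -> dict:
    """Assemble a chat-completions-style payload for an OpenAI-compatible
    apply endpoint. Field names and tags here are assumptions."""
    content = (
        f"<instruction>{instruction}</instruction>"
        f"<code>{original_file}</code>"
        f"<update>{edit_snippet}</update>"
    )
    return {"model": model,
            "messages": [{"role": "user", "content": content}]}

payload = build_apply_request(
    instruction="Guard against an empty list",
    original_file="def total(items):\n    return sum(items)\n",
    edit_snippet="def total(items):\n    if not items:\n        return 0\n    # ... existing code ...\n",
)
```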
The speed comes from three factors: a small model (7B parameters), speculative decoding tuned for highly predictable code merge patterns, and custom CUDA kernels. Code merging is a narrow, well-defined task, so a specialized model can beat general-purpose models on both speed and accuracy. Benchmarks show 100% merge success rate when pairing Fast Apply with GPT-5, Claude Sonnet 4, and DeepSeek. Search-and-replace with the same models drops to 84-96%.
| Metric | Frontier Model (80 tok/s) | Morph Fast Apply (10,500 tok/s) |
|---|---|---|
| 500-line file edit | ~22 seconds | 0.8 seconds |
| 1,000-line file edit | ~45 seconds | 1.3 seconds |
| Tokens per edit | 3,500-4,500 | 700-1,400 |
| Cost per edit (output) | $0.053-0.068 | $0.0010-0.0017 |
| 20 edits in a task | ~15 minutes | ~20 seconds |
The cost reduction compounds with the speed gain. Fast Apply uses 50-60% fewer tokens per edit (because it does not rewrite unchanged code) at a per-token price that is 10-18x cheaper than frontier models. A 20-edit coding task drops from $1.06-1.36 in apply costs to $0.02-0.03.
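The per-edit arithmetic compounds like this (the 3,600-token count and $15/M output price are illustrative picks from the ranges quoted above):

```python
def edit_seconds(output_tokens: int, tps: float) -> float:
    """Decode time for one apply step, ignoring TTFT."""
    return output_tokens / tps

def edit_cost_usd(output_tokens: int, price_per_m_tokens: float) -> float:
    """Output-token cost for one apply step."""
    return output_tokens * price_per_m_tokens / 1e6

# Frontier full-file rewrite: ~3,600 tokens at 80 tok/s and $15/M output
t_frontier = edit_seconds(3600, 80)      # 45.0 s per edit
c_frontier = edit_cost_usd(3600, 15.0)   # ~$0.054 per edit
task_minutes = 20 * t_frontier / 60      # 15.0 minutes for a 20-edit task
```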
The subagent architecture
The fastest coding agent is not one fast model. It is a system that routes each subtask to the model built for it: frontier models for reasoning, specialized models for apply, compact models for context compression. The agent orchestrates; each subagent operates in its own context window at its optimal speed. Anthropic found that multi-agent architectures improve performance by up to 90% on complex tasks, and the speed gains from dedicated apply models are a major reason why.
Frequently Asked Questions
What does tokens per second mean for LLMs?
Tokens per second measures the rate at which a model generates output during inference. One token is about 3-4 characters. The metric has three variants: TTFT (time to first token), output speed (generation rate after the first token), and throughput (total tokens across all concurrent requests). A model at 100 tok/s generates about 75 words per second.
What is a good tokens per second speed?
Depends on the use case. For chatbots, 50+ tok/s is sufficient since it outruns human reading speed. For coding agents, faster is meaningfully better because the agent executes on output immediately. Frontier API models run at 40-130 tok/s. Specialized inference providers like Cerebras reach 2,100 tok/s on open-source models. For code apply operations, Morph Fast Apply runs at 10,500 tok/s.
Which LLM inference provider is fastest?
As of April 2026: Cerebras leads on open-source model speed at 2,100 tok/s for Llama 3.1 70B. SambaNova follows at 580 tok/s with the best TTFT (0.2s). Groq delivers 276 tok/s with the most consistent speed across context lengths. Among frontier API models, Gemini 2.5 Flash leads at 238 tok/s. See the full comparison tables above.
What is the difference between TTFT and output tokens per second?
TTFT measures the delay before the first token appears, determined by the prefill phase. Output tok/s measures generation speed after the first token, determined by the decode phase. A model can have fast TTFT and slow generation, or vice versa. Reasoning models (o3, GPT-5.2, Gemini 2.5 Pro) have especially high TTFT (20-130 seconds) because they do extensive thinking before generating.
What affects tokens per second speed?
Model size (7B runs 3-5x faster than 70B), quantization (INT4 roughly doubles speed vs FP16), context length (longer contexts slow both TTFT and output speed), batch size (increases system throughput, can slow individual requests), hardware (Cerebras and Groq chips vs NVIDIA GPUs), and software optimizations (speculative decoding gives 2-3x speedup).
Why do coding agents need faster tok/s than chatbots?
Chatbot users read at 4-5 words per second. A model at 50 tok/s already outruns them. Coding agents do not read. They execute. An agent waiting for a 1,000-line file rewrite at 80 tok/s stalls for roughly 45 seconds. At 10,500 tok/s, the same edit completes in 1.3 seconds. That gap compounds across every edit in a multi-step task.
How do I benchmark LLM speed correctly?
Standardize tokenization (use tiktoken o200k_base for cross-model comparison). Separate TTFT from output speed. Test at your actual concurrency, not single-request. Run for at least 24 hours, ideally 72, to capture load variance. Report p50, p95, and p99 latencies. Tools: NVIDIA GenAI-Perf for self-hosted engines, Artificial Analysis and BenchLM for API providers.
10,500 tok/s on Code Edits
Morph Fast Apply merges AI-generated edits into files at 10,500 tok/s with 98% accuracy. A 1,000-line file in 1.3 seconds. OpenAI-compatible API.