Autoregressive decoding generates one token per forward pass. On a 70B model, each pass takes ~30ms but uses less than 5% of the GPU's compute. Speculative decoding breaks this pattern: a small draft model proposes tokens, the target model verifies them in parallel, and accepted tokens are mathematically identical to standard output. 2-3x faster, zero quality loss.
Why Autoregressive Decoding Is Slow
Every transformer-based LLM generates tokens sequentially. Token 5 depends on token 4, which depends on token 3. Each token requires loading the full model weights from GPU memory to the compute units, running the forward pass, then writing the KV cache entry for the new token. On an H100 with a 70B model, this cycle takes roughly 30ms per token.
The problem is utilization. The decode phase is memory-bandwidth-bound, not compute-bound. The GPU's tensor cores sit mostly idle while memory bandwidth ferries weights back and forth. Larger batch sizes help because multiple sequences share the weight loads, but for single-request latency, the arithmetic intensity is too low to keep the hardware busy.
Speculative decoding exploits this gap. The GPU has spare compute during decode. If you can feed it K tokens to verify in one pass instead of generating them one at a time, you amortize the memory bandwidth cost across K tokens. The forward pass takes slightly longer than single-token decode (more KV cache entries to attend over), but far less than K sequential passes.
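A back-of-envelope roofline makes the gap concrete. The numbers below are illustrative assumptions (70B fp16 weights, H100 SXM peak specs), not measurements; achieved utilization narrows the gap, but memory time still dominates by a wide margin.

```python
# Roofline sketch for single-request decode on one H100 (assumed peak specs).
PARAMS = 70e9              # 70B parameters
BYTES_PER_PARAM = 2        # fp16/bf16 weights
HBM_BW = 3.35e12           # assumed H100 SXM memory bandwidth, bytes/s
PEAK_FLOPS = 989e12        # assumed H100 dense bf16 throughput, FLOP/s

# Decoding one token must stream every weight through the compute units.
memory_time = (PARAMS * BYTES_PER_PARAM) / HBM_BW   # seconds per token
# A forward pass does roughly 2 FLOPs per parameter per token.
compute_time = (2 * PARAMS) / PEAK_FLOPS            # seconds per token

print(f"memory-bound lower bound: {memory_time * 1e3:.1f} ms/token")
print(f"compute time at peak:     {compute_time * 1e3:.2f} ms/token")
# Verifying K tokens in one pass reuses the same weight load,
# so K extra tokens cost ~K * compute_time, not K * memory_time.
```

This is why verifying K tokens in one pass is nearly free: the dominant cost (streaming the weights) is paid once either way.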
The core insight
During decode, the GPU has 10-20x more compute available than it uses. Speculative decoding converts that idle compute into accepted tokens. The speedup ceiling is determined by how many draft tokens the target model accepts per verification step.
How Speculative Decoding Works
The algorithm has three steps that repeat until generation is complete:
1. Draft
A small model (or other mechanism) generates K candidate tokens autoregressively. Because the draft model is 10-100x smaller than the target, this takes a fraction of the time.
2. Verify
The target model processes the original context plus all K draft tokens in a single forward pass. It computes the probability distribution at each draft position.
3. Accept/Reject
Starting from the first draft token, each is accepted if the target model agrees. On the first mismatch, all subsequent tokens are rejected and the target model generates a correction.
The verification step is where the speedup comes from. One forward pass through the target model processes K+1 tokens (K draft plus the correction position) instead of 1. If the per-token acceptance rate is alpha, the expected number of tokens per verification step is (1 - alpha^(K+1)) / (1 - alpha), which approaches 1/(1-alpha) as K grows. At alpha=0.7 with 5 speculative tokens, you average ~2.9 tokens per target model forward pass.
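The expected-tokens arithmetic can be checked with a short simulation. This sketch assumes a constant per-token acceptance probability, which is a simplification; real acceptance rates vary by position and content.

```python
import random

def expected_tokens(alpha: float, k: int) -> float:
    """Closed form: expected tokens per verification step with k draft tokens."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def simulate(alpha: float, k: int, steps: int = 100_000) -> float:
    """Monte Carlo: accept drafts until the first rejection; the rejected
    position (or the bonus position after k accepts) still yields one token."""
    rng = random.Random(0)
    total = 0
    for _ in range(steps):
        accepted = 0
        while accepted < k and rng.random() < alpha:
            accepted += 1
        total += accepted + 1
    return total / steps

print(round(expected_tokens(0.7, 5), 2))  # → 2.94
print(round(simulate(0.7, 5), 2))
```

The closed form and the simulation agree: every verification step yields at least one token, so speculation never produces fewer tokens per step than plain decoding, only more target-model work per token when drafts are rejected.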
Mathematical guarantee
The acceptance criterion uses a modified rejection sampling scheme. The target model samples from its own distribution, adjusted to reject tokens where the draft model over-predicted. This guarantees the output distribution is identical to what the target model would produce without speculation. Not approximately identical. Identical. The proof appears in the original Leviathan et al. (2023) and Chen et al. (2023) papers.
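The acceptance rule can be sketched in a few lines. This is a toy illustration of the Leviathan-style scheme with made-up three-token distributions, not vLLM's implementation: accept draft token x with probability min(1, p(x)/q(x)); on rejection, resample from the normalized residual max(0, p - q).

```python
import numpy as np

def speculative_accept(p_target, p_draft, x, rng):
    """Accept draft token x with prob min(1, p/q); on reject, resample
    from the residual distribution max(0, p - q), renormalized."""
    p, q = p_target[x], p_draft[x]
    if rng.random() < min(1.0, p / q):
        return x, True
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual), False

# Sanity check: the resulting marginal equals the target distribution.
rng = np.random.default_rng(0)
p_target = np.array([0.6, 0.3, 0.1])
p_draft  = np.array([0.3, 0.5, 0.2])
counts = np.zeros(3)
for _ in range(100_000):
    x = rng.choice(3, p=p_draft)              # draft proposes
    tok, _ = speculative_accept(p_target, p_draft, x, rng)
    counts[tok] += 1
print(np.round(counts / counts.sum(), 2))     # ≈ [0.6, 0.3, 0.1]
```

However badly the draft distribution mismatches the target, the output marginal works out to exactly p_target; a worse draft only lowers the acceptance rate.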
vLLM Methods Compared
vLLM ships five speculative decoding methods. Each trades off differently between speedup magnitude, memory overhead, and setup complexity.
| Method | Speedup | Extra VRAM | Best For | Limitation |
|---|---|---|---|---|
| Draft Model | 1.5-3x | Draft model size | General-purpose serving | Requires VRAM for draft model |
| EAGLE-3 | 3-6.5x | ~5% of target | Chat, reasoning, RAG | Trained on chat; poor on translation |
| Medusa | 2.2-3.6x | ~2% of target | Simple deployment | Lower acceptance rates than EAGLE |
| N-gram Lookup | Up to 2.8x | None | Summarization, code editing | Fails on open-ended generation |
| MLP/LSTM Speculator | 1.5-3.1x | Minimal | Mixed workloads | Requires training pipeline |
Draft model decoding
The most straightforward approach. Pair a small model (68M-1B parameters) from the same family as the target. The draft model runs autoregressively to produce K candidates. vLLM handles the verification and accept/reject loop internally.
Practical speedup: 1.5x on ShareGPT chat benchmarks with Llama-68M drafting for Llama-2-70B. The bottleneck is the draft model's latency, not its accuracy. Recent benchmarks show little correlation between draft model accuracy and end-to-end throughput. Draft latency is the stronger determinant.
EAGLE-3
EAGLE operates at the feature level. Instead of a separate draft model generating tokens, a lightweight prediction head ingests hidden states from the target model's transformer layers. EAGLE-3 fuses features from three layers (low, middle, high), concatenates them into a 12,288-dimensional vector, compresses through a fully connected layer back to model dimension, and feeds that into a single-layer draft decoder.
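A shape-level sketch of the fusion step described above, assuming a 4096-dimensional Llama-8B-class target. The random weights are placeholders; this shows only the tensor plumbing, not the trained head.

```python
import numpy as np

d_model = 4096                      # hidden size of a Llama-8B-class target
rng = np.random.default_rng(0)

# Hidden states tapped from low, middle, and high target layers (one token).
h_low, h_mid, h_high = (rng.standard_normal(d_model) for _ in range(3))

# EAGLE-3-style fusion: concatenate -> 12,288 dims -> FC back to d_model.
fused = np.concatenate([h_low, h_mid, h_high])        # shape (12288,)
W_fc = rng.standard_normal((d_model, fused.size)) * 0.01
g = W_fc @ fused                                      # shape (4096,)

# The compressed feature feeds a single-layer draft decoder, together with
# the embedding of the previously sampled token (omitted here).
print(fused.shape, g.shape)   # (12288,) (4096,)
```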
The training innovation is training-time testing: simulating actual inference conditions during training so the draft head learns from the distribution it will actually see at serving time. This addresses the mismatch between training data and inference distributions that limits other approaches.
EAGLE-3 achieves 3x-6.5x speedup over vanilla autoregressive generation, with 20-40% improvement over EAGLE-2. P-EAGLE (AWS, 2025) extends this with parallel draft generation, producing all K draft tokens in one forward pass for up to 1.69x additional speedup over vanilla EAGLE-3 on B200 GPUs.
Medusa
Medusa adds multiple decoding heads on top of the target model's last hidden state. Head 1 predicts token t+1, head 2 predicts t+2, and so on. Each head is a single-layer feed-forward network with a residual connection. During inference, Medusa constructs a tree of candidate continuations and verifies them simultaneously using tree-based attention.
Medusa-1 freezes the base model and only trains the heads: 2.2x speedup. Medusa-2 trains the full model alongside the heads: 2.3-3.6x. The trade-off vs EAGLE: Medusa's heads only see the previous token's feature vector, not the sampled result. EAGLE includes the sampled token's embedding, giving it a unique target for each input and higher acceptance rates.
N-gram prompt lookup
No draft model at all. vLLM matches n-grams in the prompt to predict future tokens. If the output is likely to repeat portions of the input (code editing, summarization, translation), this works with zero VRAM overhead and zero training cost. Prompt lookup achieved 2.8x speedup on CNN/DailyMail summarization in vLLM benchmarks. For open-ended generation, acceptance rates drop to near zero and it should be disabled.
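A minimal pure-Python version of the idea, not vLLM's implementation: if the most recent n tokens appeared earlier in the sequence, propose whatever followed that earlier occurrence as the draft.

```python
def ngram_propose(tokens, n=3, k=5):
    """If the last n tokens appeared earlier in the sequence, propose the
    k tokens that followed that earlier occurrence as draft candidates."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Search most-recent match first, excluding the trailing suffix itself.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == suffix:
            return tokens[i + n:i + n + k]
    return []

# A repeated phrase lets the matcher "predict" the continuation for free.
seq = "the quick brown fox jumps over the lazy dog . the quick brown".split()
print(ngram_propose(seq, n=3, k=4))  # → ['fox', 'jumps', 'over', 'the']
```

When the output never echoes the input, the matcher returns nothing useful, which is why acceptance collapses on open-ended generation.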
MLP/LSTM speculator
Snowflake's Arctic Inference trains lightweight MLP and LSTM networks as speculators. Their hybrid LSTM-suffix approach combines neural prediction with n-gram matching, outperforming either alone. Suffix decoding runs at 20 microseconds per speculated token on CPU, and the end-to-end system reduces decoding time by 2.3-6.3x on SWE-Bench coding tasks.
Configuration Guide
vLLM supports speculative decoding in both offline (Python API) and online (server) modes. The configuration interface changed in v0.8+ to use speculative_config as a dictionary.
Offline mode: draft model
Draft model speculative decoding

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "method": "draft_model",
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "num_speculative_tokens": 5,
    },
)

outputs = llm.generate(
    ["Explain speculative decoding in three sentences."],
    SamplingParams(temperature=0.7, top_p=0.95),
)
```

Offline mode: EAGLE-3
EAGLE-3 speculative decoding

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 5,
        "draft_tensor_parallel_size": 1,
        "max_model_len": 4096,
    },
)

outputs = llm.generate(
    ["Write a Python function to merge two sorted lists."],
    SamplingParams(temperature=0.0),
)
```

Offline mode: n-gram prompt lookup
N-gram prompt lookup (no draft model needed)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)

# Best for summarization and code editing where output mirrors input
outputs = llm.generate(
    ["Summarize the following article: " + long_article],
    SamplingParams(temperature=0.0),
)
```

Server mode
vllm serve with speculative decoding

```shell
# Simple: model with bundled speculator config
vllm serve RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3

# Advanced: explicit configuration
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --speculative-config '{
    "method": "eagle",
    "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-70B",
    "num_speculative_tokens": 5
  }'

# N-gram prompt lookup (no draft model)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{
    "method": "ngram",
    "num_speculative_tokens": 5,
    "prompt_lookup_max": 4
  }'
```

Speculators library
The vllm-project/speculators library standardizes speculative decoding model packaging. Models published with a speculators_config can be served with a single vllm serve command, no manual configuration required. Red Hat, Snowflake, and the vLLM community publish pre-trained speculator models in this format.
Draft Model Selection
The draft model determines whether speculative decoding helps or hurts. A bad draft model wastes compute on tokens the target model rejects. A good one turns idle GPU cycles into accepted tokens.
Same family, smaller size
Llama-3.2-1B for Llama-3.1-70B. Qwen2.5-0.5B for Qwen2.5-72B. Models from the same training pipeline share distribution characteristics, producing higher acceptance rates than arbitrary small models.
Acceptance rate above 0.6
Below 0.5, speculative decoding hurts. Between 0.5-0.6, it barely breaks even. Above 0.6 with 5+ speculative tokens, expect 2-3x speedup. Measure on your actual workload, not benchmarks.
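These thresholds follow from a simple cost model in the style of the Leviathan et al. analysis. The sketch below is an idealized estimator: `c` is an assumed draft-model cost as a fraction of one target forward pass, and the model ignores scheduling and batching overheads, so real crossover points sit higher than it predicts.

```python
def estimated_speedup(alpha: float, k: int, c: float) -> float:
    """Idealized speedup over plain autoregressive decoding.

    alpha: per-token acceptance rate
    k:     speculative tokens per step
    c:     draft cost as a fraction of one target forward pass
    Assumes verification costs roughly one target forward pass.
    """
    expected = (1 - alpha ** (k + 1)) / (1 - alpha)   # tokens per step
    step_cost = k * c + 1                              # k draft passes + 1 verify
    return expected / step_cost

for alpha in (0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"alpha={alpha:.1f}  speedup≈{estimated_speedup(alpha, 5, 0.1):.2f}x")
```

With a draft costing 10% of the target, the estimator lands in the 1.6-2.5x range for alpha between 0.6 and 0.8, consistent with the 2-3x rule of thumb above once overheads are added back.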
Draft latency over accuracy
Recent benchmarks show draft model latency matters more than accuracy for end-to-end throughput. A faster draft model with slightly lower acceptance rate can outperform a slower, more accurate one.
Domain matters
EAGLE achieves up to 2.1x improvement on RAG and math reasoning but performs poorly on translation. N-gram lookup excels at summarization and code editing but fails on open-ended chat. Always benchmark your workload.
Fine-tuning draft models
Off-the-shelf draft models work for general workloads. For domain-specific tasks, fine-tuning a draft model on your data distribution can boost acceptance rates significantly. Snowflake's Arctic Training pipeline produces MLP/LSTM speculators with 3.1x higher acceptance rates than generic alternatives. The speculators library (v0.3.0) supports offline training data generation through a hidden states generator, producing the training data needed to build custom speculators from standard text datasets.
When It Helps vs. When It Hurts
Speculative decoding is not universally beneficial. The speedup depends on a specific set of conditions, and violating any of them can make inference slower.
| Condition | Helps | Hurts |
|---|---|---|
| QPS / Batch Size | Low QPS, batch 1-32 | High QPS, batch 32+ |
| Output Length | Long generation (100+ tokens) | Short completions (<50 tokens) |
| GPU Utilization | Memory-bandwidth-bound (decode) | Compute-bound (large batches) |
| Output Predictability | Structured: code, summaries, JSON | Open-ended: creative, chat |
| VRAM Budget | Room for draft model | Target uses 90%+ VRAM |
| Acceptance Rate | Above 0.6 | Below 0.5 |
The memory-bound vs compute-bound boundary
This is the critical insight from MagicDec (2024): for every model and hardware pair, there exists a critical sequence length beyond which inference becomes memory-bound even at large batch sizes. The KV cache becomes the primary memory bottleneck and, unlike model parameters, it scales with batch size. Beyond this boundary, speculative decoding helps even at high batch sizes because verification is memory-bandwidth-limited, not compute-limited.
For moderate to long sequences (2K+ tokens), speculative decoding improved throughput by up to 2.51x for Llama-3.1-8B across batch sizes from 32 to 256. Short sequences with large batches remain the scenario where speculative decoding consistently hurts.
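The batch-scaling argument is easy to quantify. This sketch assumes the published Llama-3.1-8B geometry (32 layers, 8 KV heads under GQA, head dimension 128, fp16); the point is that KV cache traffic grows with batch size and sequence length while the weight-streaming cost stays fixed.

```python
def kv_cache_bytes(batch, seq_len, layers=32, kv_heads=8,
                   head_dim=128, dtype_bytes=2):
    """KV cache size: 2 tensors (K and V) per layer per token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * batch * seq_len

weights = 8e9 * 2   # ~8B params in fp16: a fixed cost per forward pass

for batch, seq in [(32, 2048), (128, 2048), (256, 8192)]:
    kv = kv_cache_bytes(batch, seq)
    print(f"batch={batch:>3} seq={seq:>5}  KV={kv / 1e9:6.1f} GB  "
          f"KV/weights={kv / weights:.2f}x")
```

Once the KV cache dwarfs the weights, each decode step is memory-bound regardless of batch size, which is the regime MagicDec identifies as favorable for speculation.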
Quick decision rule
Benchmark your specific workload before deploying. Run 100 requests with and without speculative decoding, measuring P50 and P99 latency. If acceptance rate is below 0.5 on your traffic, disable it. If TTFT increases more than the decode speedup saves, disable it. The overhead is not theoretical: vLLM's speculative decoding underperforms the non-speculative baseline at higher batch sizes in some configurations.
Production Tuning
Getting speculative decoding working is easy. Getting it to help in production requires tuning three variables.
num_speculative_tokens
Start at 3-5 for general workloads. For code models, 6-8 works well because code has more predictable token patterns. Increasing beyond 8 tokens rarely helps: the probability of a long accepted sequence decreases exponentially with length, and each rejected token after the first is wasted compute.
The spec factor in vLLM controls adaptive speculation length: `max_spec_tokens = max_spec_factor * prefix_match_length`. This dynamically adjusts the number of draft tokens based on how well the draft model is performing on the current sequence.
Monitoring acceptance rate
Acceptance rate varies with traffic patterns. A model that achieves 0.7 acceptance on benchmarks may drop to 0.4 on real production traffic with diverse prompt distributions. Monitor it continuously. If it drifts below 0.5, speculative decoding is costing you throughput. vLLM exposes acceptance rate metrics through its Prometheus endpoint.
VRAM management
Draft models share GPU memory with the target model's KV cache. On VRAM-constrained setups, the draft model may reduce the maximum batch size or context length the system can handle. For EAGLE and Medusa, the overhead is small (2-5% of target model size). For separate draft models, plan for the draft model's full memory footprint. If the target model already uses 90%+ VRAM, consider n-gram lookup or MLP speculators instead.
Start conservative
3-5 speculative tokens, benchmark against no speculation. If latency improves, increase tokens. If it doesn't, the workload may not benefit.
Monitor in production
Acceptance rate changes with traffic. A 0.7 rate on benchmarks can drop to 0.4 on real traffic. Track via Prometheus and set alerts below 0.5.
Batch size crossover
Speculative decoding helps at low batch sizes and can hurt at high batch sizes. Find your crossover point experimentally. Beyond batch 64, test both configurations.
Speculative Edits: Cursor and Morph
Code editing is where speculative decoding delivers its largest gains, because the output is highly predictable. When rewriting a file, most tokens in the output match the original file exactly. The model only needs to change a few lines. Traditional autoregressive decoding wastes time regenerating tokens it could have copied.
Cursor's approach: speculative edits
Cursor fine-tuned Llama-3-70B for code editing and deployed it on Fireworks with a variant called speculative edits. The key insight: instead of using a draft model, the caller supplies the original file contents as the speculative draft. The model verifies which tokens to keep and which to change. Because 80-95% of tokens in a typical code edit match the original file, the acceptance rate is extremely high.
The result: 1,000 tok/s on a 70B model. A 13x speedup over vanilla inference and 9x over their previous GPT-4 deployment. This is the logical extreme of speculative decoding for a task where the caller has a near-perfect prior on the output.
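The mechanism can be illustrated with a toy sketch. Everything here is hypothetical stand-in code, not Cursor's implementation: the "model" is a function that returns one greedy token per call (in a real system, all draft positions are verified in one batched forward pass), and the original file serves as the draft.

```python
def speculative_edit_step(model_next_token, context, draft_tokens):
    """Greedy verification: accept draft tokens while they match the model's
    argmax prediction; stop at the first disagreement and take the model's
    correction token instead."""
    accepted = []
    for tok in draft_tokens:
        predicted = model_next_token(context + accepted)
        if predicted != tok:
            accepted.append(predicted)   # correction token from the model
            break
        accepted.append(tok)
    return accepted

# Toy "model" that wants to rename x -> y in a line of code.
original = list("x = x + 1")
target   = list("y = y + 1")
def toy_model(ctx):
    return target[len(ctx)]

out = []
while len(out) < len(target):
    out += speculative_edit_step(toy_model, out, original[len(out):])
print("".join(out))  # → y = y + 1
```

Only two positions disagree with the draft, so almost the entire output is produced by verification rather than generation; that is the source of the high acceptance rate on code edits.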
Morph Fast Apply: beyond speculation
Morph takes a different path. Instead of speculating from the original file at serving time, Fast Apply is a model trained specifically for code edits. The model's internal distribution is sharp enough that it functions as its own speculation: for any given (file, instruction) pair, there are very few correct outputs, so the top token wins by a large margin at every position.
No separate draft model. No speculation overhead. No VRAM split. No acceptance rate variability. The model achieves the speed that speculative decoding targets by training on a narrow enough task that prediction confidence is inherently high.
When to use which
Use vLLM speculative decoding when you need to accelerate a general-purpose model for diverse tasks. Use Morph Fast Apply when you need maximum speed on code edits specifically. The approaches are complementary: speculative decoding accelerates the reasoning model, Fast Apply handles the edit application step.
Frequently Asked Questions
What is speculative decoding in vLLM?
Speculative decoding in vLLM uses a small, fast draft model to propose multiple tokens ahead, then verifies them against the target model in a single parallel forward pass. Accepted tokens are mathematically identical to standard autoregressive output. vLLM supports five methods: draft model, EAGLE, Medusa, n-gram prompt lookup, and MLP speculator. Typical speedups range from 1.5x to 2.8x, with EAGLE-3 reaching 6.5x.
How do I enable speculative decoding in vLLM?
In offline mode, pass a speculative_config dictionary to the LLM constructor with keys for method, model (the draft model path), and num_speculative_tokens. In server mode, use --speculative-config with a JSON string. For models with bundled speculator configs, a simple vllm serve model-name is sufficient.
What is the difference between EAGLE and Medusa?
EAGLE adds a lightweight prediction head that ingests hidden states from the target model, including the embedding of the sampled token. EAGLE-3 fuses features from three layers, achieving 3x-6.5x speedup. Medusa adds multiple decoding heads that predict future positions simultaneously, achieving 2.2-3.6x. EAGLE achieves higher acceptance rates because it conditions on the sampled result. Medusa is simpler to deploy since it requires no separate model.
How do I choose a draft model?
Use a model from the same family as your target (Llama draft for Llama target). Use the smallest model maintaining above 0.6 acceptance rate. Benchmark on your actual workload, not published benchmarks. Draft latency matters more than draft accuracy for end-to-end throughput. Start with 3-5 speculative tokens and increase only if acceptance rates stay high.
When does speculative decoding hurt performance?
At batch sizes above 32 when the GPU is compute-bound. On short completions under 50 tokens where setup cost exceeds savings. On VRAM-constrained GPUs where the draft model causes OOM or cache fragmentation. And whenever acceptance rate drops below 0.5, at which point the draft model overhead exceeds the benefit of accepted tokens.
What speedup should I expect?
1.5x with off-the-shelf draft models on general chat. 2.8x with prompt lookup on summarization tasks. 3x-6.5x with EAGLE-3 on optimized configurations. 1.8-4.5x with suffix decoding on coding tasks. The key variable is acceptance rate: above 0.6 with 5+ tokens, expect 2-3x. Below 0.5, expect negative impact.
How does Cursor achieve 1,000 tok/s?
Cursor fine-tuned Llama-3-70B for code editing and uses the original file contents as the speculative draft instead of a separate model. Because 80-95% of tokens in a code edit match the original, acceptance rates are extremely high, delivering 13x speedup over vanilla inference.
How does Morph Fast Apply compare?
Morph Fast Apply achieves 10,500 tok/s for code edits without a separate draft model or speculation overhead. The model is trained on the narrow task of applying edits, producing confidence levels high enough that the top token wins at every position. No VRAM split, no acceptance rate variability, no draft model to maintain.
Code Edits at 10,500 tok/s
Morph Fast Apply delivers the speed that speculative decoding targets, without the draft model, VRAM split, or acceptance rate tuning. One model, trained for one task.