Speculative Decoding: How Draft Models Multiply LLM Inference Speed

Speculative decoding samples from an autoregressive LLM 2-3x faster without changing its output distribution. A small draft model proposes K tokens; the large target model verifies all of them in one forward pass, and accepted tokens are nearly free. Because decoding is memory-bandwidth-bound, that single verification pass costs almost the same as generating one token. The output distribution is provably identical to the target model's. Morph runs ngram speculative decoding at k=64 to serve morph-v3-fast at ~10,500 tok/s.

2-3x: typical lossless speedup

10,500 tok/s: morph-v3-fast with ngram spec decoding (k=64)

1 pass: verifies K draft tokens at once

Lossless: output distribution preserved

Why Decoding Is Slow (And Why That Helps)

Autoregressive text generation produces one token per forward pass, and each forward pass is dominated by the cost of loading the model's layer weights from memory into the compute cores, not by the arithmetic itself. Decoding is memory-bandwidth-bound. This is the central fact that speculative decoding is built on.

Because the bottleneck is loading weights and not computing, the GPU has spare compute during every decode step. Verifying several proposed tokens in a single forward pass uses that spare compute and reads the model weights only once, instead of once per token. That is why scoring a short K-token draft costs about the same as sampling one token from the target.
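
A back-of-the-envelope estimate makes the memory-bound claim concrete. The sketch below uses assumed, illustrative figures (a 70B-parameter model in 16-bit weights, about 2 TB/s of memory bandwidth, about 1,000 TFLOP/s of peak compute); none of these numbers come from the sources cited here, and the model ignores KV-cache reads, attention cost, and multi-GPU effects.

```typescript
// Rough per-pass cost model at batch size 1.
// All hardware numbers are illustrative assumptions, not measurements.
interface Hardware {
  memBandwidthBytesPerSec: number; // how fast weights stream from memory
  peakFlopsPerSec: number;         // peak arithmetic throughput
}

interface Model {
  params: number;        // parameter count
  bytesPerParam: number; // 2 for fp16/bf16
}

// Time to read every weight once (paid once per forward pass).
function weightLoadSeconds(m: Model, hw: Hardware): number {
  return (m.params * m.bytesPerParam) / hw.memBandwidthBytesPerSec;
}

// Time for the arithmetic: roughly 2 FLOPs per parameter per token scored.
function computeSeconds(m: Model, hw: Hardware, tokens: number): number {
  return (2 * m.params * tokens) / hw.peakFlopsPerSec;
}

// A pass takes roughly as long as the slower of the two.
function passSeconds(m: Model, hw: Hardware, tokens: number): number {
  return Math.max(weightLoadSeconds(m, hw), computeSeconds(m, hw, tokens));
}

const model: Model = { params: 70e9, bytesPerParam: 2 };
const hw: Hardware = { memBandwidthBytesPerSec: 2e12, peakFlopsPerSec: 1e15 };

const oneToken = passSeconds(model, hw, 1);
const eightTokens = passSeconds(model, hw, 8);
console.log({ oneToken, eightTokens });
```

Under these assumed numbers the weight read (about 70 ms) dwarfs the arithmetic for eight tokens (about 1 ms), so both passes take the same time in this model. Real kernels, larger batches, and long contexts narrow that gap.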

HuggingFace frames this directly: the latency of verifying a short draft in one pass is comparable to generating a single token, so provided the guesses are correct often enough, assisted generation yields roughly 2-3x faster inference because LLMs are memory bound (per HuggingFace's assisted-generation writeup).

The core observation

Leviathan et al. (arXiv:2211.17192) state it precisely: the latency of parallel scoring of short continuations from a faster but less powerful draft model is comparable to that of sampling a single token from the larger target model. Speculative decoding turns that comparison into free tokens.

The Draft-Then-Verify Mechanism

The algorithm has two models. The draft model is small and fast. The target model is the large model whose output you actually want. The draft proposes a chunk of K candidate tokens autoregressively, which is cheap because the draft is small. The target then verifies all K candidates in one forward pass.

Verification compares the candidates left to right against what the target would have produced. The first mismatch invalidates every candidate after it, and the run continues from the corrected token. Every candidate accepted before that mismatch is a token the target effectively generated without spending its own forward pass on it.

Concretely: if the draft proposes 5 tokens and the target accepts the first 3, the 4th was wrong, so it is replaced by the target's own token, and the next draft round starts from there. That round yields 4 tokens (3 accepted plus 1 corrected) for one target forward pass instead of four. Across many rounds, the average accepted length per pass determines the speedup.

Draft-then-verify loop (conceptual)

// One round of speculative decoding
// 1. Draft model proposes K candidate tokens (cheap, it is small)
const draft = draftModel.generate(context, { numTokens: K })  // e.g. K = 64

// 2. Target model verifies ALL K candidates in ONE forward pass
//    (this single pass costs ~the same as generating 1 target token)
const targetLogits = targetModel.forward(context, draft)

// 3. Accept candidates left-to-right; first mismatch stops the run.
//    Modified rejection sampling keeps the target's exact distribution.
//    verify returns the accepted prefix plus the target's own next token.
const { accepted, correctedToken } = verify(draft, targetLogits)   // e.g. accepts 41 of 64

// 4. Append accepted tokens + one corrected target token, then repeat.
//    Accepted tokens were generated for ~free.
context.push(...accepted, correctedToken)
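
For greedy decoding, the verify step reduces to a prefix comparison. The sketch below is a minimal illustration, not any library's API: `verifyGreedy` and its argument shapes are hypothetical, and `targetArgmax[i]` is assumed to be the target's top token at position i, taken from the single verification pass (so it has one more entry than the draft).

```typescript
type Token = number;

interface VerifyResult {
  accepted: Token[]; // longest draft prefix the target agrees with
  nextToken: Token;  // target's own token at the first disagreement,
                     // or its bonus token if every draft token matched
}

// draft:        K proposed tokens
// targetArgmax: K + 1 target choices; entry i is what the target would
//               emit after context + draft[0..i)
function verifyGreedy(draft: Token[], targetArgmax: Token[]): VerifyResult {
  if (targetArgmax.length !== draft.length + 1) {
    throw new Error("expected one target choice per draft position plus one");
  }
  let n = 0;
  while (n < draft.length && draft[n] === targetArgmax[n]) n++;
  return { accepted: draft.slice(0, n), nextToken: targetArgmax[n] };
}

// The 5-token example: the first 3 match, the 4th does not.
const result = verifyGreedy([11, 12, 13, 99, 98], [11, 12, 13, 14, 15, 16]);
console.log(result); // accepted = [11, 12, 13], nextToken = 14
```

Each round therefore emits `accepted.length + 1` tokens for one target pass: four tokens in this example. Target choices after the first mismatch are conditioned on rejected tokens and are discarded.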

Speculative Decoding Is Lossless

The verification step is not a heuristic accept/reject. It uses a modified rejection sampling scheme that provably produces the same output distribution as standard decoding from the target model alone, within hardware numerics. This is the property that makes speculative decoding safe to run in production: it is a speed optimization, not a quality tradeoff.
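
A single-token version of the accept/reject rule, as described by Leviathan et al. and Chen et al., can be sketched as follows. Here `p` is the target's distribution and `q` the draft's distribution over a toy vocabulary; the function names and the injectable `rand` are assumptions made for illustration.

```typescript
type Dist = number[]; // probabilities over a small vocabulary, summing to 1

// Residual distribution used after a rejection: normalize max(0, p - q).
function residual(p: Dist, q: Dist): Dist {
  const diff = p.map((pi, i) => Math.max(0, pi - q[i]));
  const total = diff.reduce((a, b) => a + b, 0);
  return diff.map((d) => d / total);
}

function sampleFrom(dist: Dist, u: number): number {
  let cumulative = 0;
  for (let i = 0; i < dist.length; i++) {
    cumulative += dist[i];
    if (u < cumulative) return i;
  }
  return dist.length - 1; // guard against rounding
}

// Accept the drafted token x with probability min(1, p[x] / q[x]);
// otherwise resample from the residual distribution.
function acceptOrResample(
  x: number, p: Dist, q: Dist, rand: () => number = Math.random
): { token: number; accepted: boolean } {
  if (rand() < Math.min(1, p[x] / q[x])) return { token: x, accepted: true };
  return { token: sampleFrom(residual(p, q), rand()), accepted: false };
}

// Exact distribution of the emitted token when x is drawn from q.
function emittedDistribution(p: Dist, q: Dist): Dist {
  const acceptMass = p.map((pi, i) => Math.min(pi, q[i])); // q[i] * min(1, p/q)
  const rejectProb = 1 - acceptMass.reduce((a, b) => a + b, 0);
  // If p and q are (numerically) identical, nothing is ever rejected.
  const res = rejectProb > 1e-12 ? residual(p, q) : p.map(() => 0);
  return acceptMass.map((a, i) => a + rejectProb * res[i]);
}

const p: Dist = [0.6, 0.3, 0.1];
const q: Dist = [0.3, 0.5, 0.2];
console.log(emittedDistribution(p, q)); // matches p up to rounding
```

The last function is the reason the scheme is lossless: the probability of emitting each token, summed over the accept and resample paths, works out to the target's own probability for that token.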

The technique requires no retraining and no architecture changes to off-the-shelf models. You take an existing target model, pair it with a draft, and the sampled text is statistically indistinguishable from the target model run by itself. Leviathan et al. demonstrated this on T5-XXL with identical outputs and a 2x-3x acceleration.

DeepMind's speculative sampling (Chen et al., arXiv:2302.01318) confirmed the same result at scale, benchmarking on the 70-billion-parameter Chinchilla model and reaching a 2-2.5x decoding speedup on XSum and 100-shot HumanEval without compromising sample quality or modifying the model.

Lossless is the whole point

Quantization, distillation, and pruning all trade some quality for speed. Speculative decoding does not. Because the modified rejection sampling preserves the target distribution, you get the speedup with the same model quality. EAGLE makes the same lossless guarantee while drafting at the feature level inside the target.

Acceptance Rate Drives Speedup

Speedup is governed by the acceptance rate: the fraction of proposed tokens the target accepts before the first mismatch. Because the first wrong token invalidates everything after it, a higher average accepted length per pass means more free tokens and a larger speedup.

Speedup favors three conditions. The draft (assistant) model should be at least an order of magnitude smaller than the target, so drafting stays cheap. The task should be input-grounded (summarization, translation, ASR, code editing), where the next tokens are often predictable from the input. And sampling should be greedy or low-temperature, because high temperature makes the target's choices less predictable and lowers acceptance.

The numbers cluster around 2-3x for well-matched pairs. EAGLE-3 reports an average accepted length up to 7.5 tokens per pass on HumanEval code generation, which translates to a speedup ratio up to 6.5x. The accepted length is the lever: more tokens per verification pass, more speed.
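
Leviathan et al. give a closed form linking acceptance rate to speedup, under the simplifying assumption that each drafted token is accepted independently with probability α. The sketch below evaluates it; `alpha`, `gamma` (draft length), and `c` (cost of one draft step relative to one target pass) follow the paper's symbols, and the sample values are illustrative assumptions rather than measurements.

```typescript
// Expected tokens emitted per target pass, including the corrected or bonus
// token: (1 - alpha^(gamma + 1)) / (1 - alpha).
function expectedTokensPerPass(alpha: number, gamma: number): number {
  if (alpha === 1) return gamma + 1;
  return (1 - Math.pow(alpha, gamma + 1)) / (1 - alpha);
}

// Expected wall-clock speedup over plain decoding: tokens per round divided
// by the round's cost (gamma draft steps at cost c each, plus one target pass).
function expectedSpeedup(alpha: number, gamma: number, c: number): number {
  return expectedTokensPerPass(alpha, gamma) / (gamma * c + 1);
}

for (const alpha of [0.5, 0.7, 0.9]) {
  const tokens = expectedTokensPerPass(alpha, 5).toFixed(2);
  const speedup = expectedSpeedup(alpha, 5, 0.05).toFixed(2);
  console.log(`alpha=${alpha} tokens/pass=${tokens} speedup=${speedup}x`);
}
```

With these assumed inputs (5 drafted tokens, draft step at 5% of a target pass), the modeled speedup moves from about 1.6x at α = 0.5 to about 3.7x at α = 0.9, which is the sense in which accepted length is the lever.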

2-3x: vanilla draft-model speedup (T5-XXL)

2-2.5x: speculative sampling on Chinchilla 70B

up to 7.5: EAGLE-3 accepted length on HumanEval

up to 6.5x: EAGLE-3 speedup ratio

The Variants

All speculative decoding methods share the draft-then-verify loop. They differ in where the draft tokens come from, whether that drafting requires training, and the acceptance rate it achieves. The table below maps the four main families.

Variant	How it drafts	Needs training	Typical speedup
Draft model (vanilla)	A separate small model proposes tokens autoregressively	No (uses an existing small model)	2-3x (T5-XXL); 2-2.5x on Chinchilla 70B
Ngram / prompt-lookup	String-matches recent tokens against earlier ngrams in the prompt; no model	No (no model at all)	2x-4x, ~2.4x avg on summarization
Medusa	Extra decoding heads on the target predict multiple tokens, verified with tree attention	Yes (trains the heads)	Medusa-1 over 2.2x; Medusa-2 2.3-3.6x
EAGLE / EAGLE-2 / EAGLE-3	Drafts at the feature level inside the target with a dynamic draft tree	Yes (trains a lightweight draft head)	2.7-3.5x; EAGLE-2 3.05-4.26x; EAGLE-3 up to 6.5x

Self-speculative methods are the broader family that EAGLE and Medusa belong to: the draft signal comes from inside the target model rather than from a separate network, which removes the need to load and run a second model. The tradeoff is that these methods require a training step to add and tune the extra heads or draft layers.

No training, no second model

Ngram / prompt-lookup needs nothing beyond the target model. It matches recent tokens against earlier ngrams in the prompt and returns the continuation as the draft. Best when output reuses input text: editing, summarization, code transforms.

No training, second model

Vanilla draft-model decoding pairs the target with an existing smaller model from the same family. No new training, but you load and run two models. The draft must be small enough that its overhead does not outweigh the verification savings.

Trained heads on the target

Medusa adds extra decoding heads and verifies candidate continuations with tree attention. One model, but the heads must be trained. Medusa-1 exceeds 2.2x; Medusa-2 reaches 2.3-3.6x.

Trained feature-level draft

EAGLE autoregresses at the second-to-top-layer feature level and stays lossless. EAGLE-2 adds a context-aware dynamic draft tree (3.05-4.26x); EAGLE-3 predicts tokens directly via multi-layer fusion (up to 6.5x).

Ngram / Prompt-Lookup Decoding

Ngram speculation, also called prompt lookup decoding, removes the draft model entirely. Instead of running a second network, it matches the last few generated tokens against earlier ngrams in the prompt. When it finds a match, the tokens that followed that ngram in the prompt become the draft candidates. The target verifies them the same way it would verify any draft.

This works because many tasks reuse text from the input. In summarization, the summary copies phrases from the source. In code editing, the edited file repeats large spans of the original. When the model is about to regenerate text it has already seen, prompt lookup hands it that text as a free draft, and the acceptance rate on those spans is high.

Prompt lookup decoding claims 2x-4x speedups on input-grounded tasks, averaging about 2.4x on summarization (CNN/DailyMail) and context-based QA (HAGRID) versus a greedy decoding baseline, with no change to output quality (per the prompt-lookup-decoding repository). Because there is no draft model to load, the only cost is the string match, which is negligible.
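
The lookup itself is a few lines. The sketch below is a simplified illustration rather than the prompt-lookup-decoding repository's code: `promptLookupDraft`, `ngramSize`, and `maxDraft` are hypothetical names, and it searches the whole token history for the most recent earlier occurrence of the trailing ngram.

```typescript
type Token = number;

// Find the most recent earlier occurrence of the last `ngramSize` tokens and
// return up to `maxDraft` tokens that followed it. Empty if there is no match.
function promptLookupDraft(
  tokens: Token[], ngramSize: number, maxDraft: number
): Token[] {
  if (ngramSize <= 0 || tokens.length <= ngramSize) return [];
  const tailStart = tokens.length - ngramSize;

  // Scan candidate start positions right to left, skipping the tail itself.
  for (let start = tailStart - 1; start >= 0; start--) {
    let matches = true;
    for (let j = 0; j < ngramSize; j++) {
      if (tokens[start + j] !== tokens[tailStart + j]) {
        matches = false;
        break;
      }
    }
    if (matches) {
      const from = start + ngramSize;
      return tokens.slice(from, from + maxDraft);
    }
  }
  return [];
}

// Toy ids standing in for a prompt followed by the first generated tokens.
const history: Token[] = [5, 6, 7, 8, 9, 10, 1, 2, 6, 7];
console.log(promptLookupDraft(history, 2, 3)); // [8, 9, 10]
```

When no earlier match exists the function returns an empty draft, and the step falls back to ordinary one-token decoding, so a miss costs only the scan.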

Why ngram fits code editing

Code transformation is one of the most input-grounded tasks there is: the output file is mostly the input file with a few spans changed. Every unchanged span is text the model already has in context, so ngram lookup proposes long, high-acceptance drafts. This is why Morph serves its fast-apply model with ngram speculative decoding rather than a separate draft model.

Medusa and EAGLE

Medusa avoids a separate draft model by adding extra decoding heads to the target that predict multiple subsequent tokens in parallel. It then uses a tree-based attention mechanism to construct and verify multiple candidate continuations simultaneously at each step. Medusa-1 achieves over 2.2x speedup without compromising generation quality, and Medusa-2 reaches 2.3-3.6x.

EAGLE takes a different approach: it performs autoregression at the second-to-top-layer feature level rather than at the token level, and resolves feature-level uncertainty by incorporating a token sequence advanced by one time step. The acceleration is lossless and preserves the target model's output distribution. For LLaMA2-Chat 70B, EAGLE reached a latency speedup of 2.7x-3.5x and doubled throughput.

EAGLE-2 replaces the static draft tree with a context-aware dynamic one. It exploits the fact that the draft model is well calibrated: its confidence scores approximate token acceptance rates, so they can guide how the tree is grown. EAGLE-2 reaches speedup ratios of 3.05x-4.26x, 20%-40% faster than EAGLE-1. EAGLE-3 abandons feature prediction for direct token prediction with multi-layer feature fusion, reaching up to 6.5x with an average accepted length up to 7.5 on HumanEval.

Method	Benchmark / model	Reported result
Vanilla draft model	T5-XXL (Leviathan et al.)	2x-3x, identical outputs
Speculative sampling	Chinchilla 70B, XSum / HumanEval	2-2.5x, no quality loss
Medusa-1 / Medusa-2	Target with extra heads	Over 2.2x / 2.3-3.6x
EAGLE	LLaMA2-Chat 70B	2.7x-3.5x, throughput doubled
EAGLE-2	Dynamic draft tree	3.05x-4.26x
EAGLE-3	HumanEval code generation	Up to 6.5x, accepted length up to 7.5

On MT-bench, EAGLE is reported as 3x faster than vanilla decoding, 2x faster than Lookahead, and 1.6x faster than Medusa, with no fine-tuning of the target LLM. SGLang reports concrete throughput on LLaMA-Instruct 3.1 8B: 158.34 tok/s baseline, 244.10 tok/s with EAGLE-2, and 373.25 tok/s with EAGLE-3.
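
Converting those throughput figures to ratios puts them on the same scale as the other speedups in this article. The snippet only divides the three reported values; the variable names are arbitrary.

```typescript
// Throughput figures quoted above (tokens per second).
const baseline = 158.34;
const eagle2 = 244.10;
const eagle3 = 373.25;

const speedup = (tokPerSec: number): string =>
  (tokPerSec / baseline).toFixed(2) + "x";

console.log("EAGLE-2:", speedup(eagle2)); // 1.54x
console.log("EAGLE-3:", speedup(eagle3)); // 2.36x
```

These ratios are lower than the "up to" figures reported for EAGLE-2 and EAGLE-3, which is expected when a single serving configuration is compared with a best-case benchmark result.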

On Morph's Fleet: k=64 at ~10,500 tok/s

Morph builds inference infrastructure for AI coding agents. The fast-apply model, morph-v3-fast, applies code edits, and code editing is among the most input-grounded workloads there is: the output file is overwhelmingly the input file with a few changed spans. That structure makes ngram drafting nearly ideal.

On Morph's fleet, morph-v3-fast runs ngram speculative decoding with k=64, proposing up to 64 candidate tokens per verification pass from ngrams already present in the input. Combined with continuous batching, this serves morph-v3-fast at ~10,500 tok/s. The k=64 draft length is large precisely because the acceptance rate on unchanged code spans is high enough to keep most of those 64 tokens.

This is a real production datapoint, not a benchmark: ~10,500 tok/s on a model whose primary job is to copy large spans of input verbatim. The same ngram approach that averages 2.4x on summarization pays off more here because code edits reuse even more of the input than a summary does.

Why ngram over a draft model here

A separate draft model would add a second model to load and run, plus its own quality ceiling. Ngram lookup adds nothing: the drafts come from the prompt itself, the only cost is a string match, and on apply workloads the acceptance rate is high because the output is mostly the input. For code transformation, ngram is the higher-acceptance, lower-overhead choice.

See LLM Inference Optimization for how speculative decoding combines with batching, KV-cache management, and quantization, and AI Inference for the broader serving picture.

When Speculative Decoding Does Not Help

Speculative decoding is not free in every regime. It adds draft work to every step, and that work is only repaid when the target accepts enough tokens. When acceptance is low, the draft work is wasted and the system can be slower than plain decoding.

Low acceptance rate

If the draft rarely agrees with the target, most proposed tokens are rejected and the verification passes produce few free tokens. The draft overhead then outweighs the savings. Poor-quality assistants and out-of-distribution inputs cause this.

High-temperature sampling

High temperature makes the target's next-token choices less predictable, so the draft is right less often and acceptance drops. Speculative decoding favors greedy or low-temperature sampling; creative high-temperature generation benefits least.

Draft too large

As a rule of thumb, the draft (assistant) model should be at least an order of magnitude smaller than the target. A large draft makes the proposal step expensive, so even high acceptance does not pay for the drafting. The drafting cost must stay small relative to one target forward pass.

The HuggingFace assisted-generation analysis names the failure modes directly: benefits shrink with poor assistant quality, with high-temperature sampling that causes frequent rejection, and when assistant overhead outweighs validation savings. Input-grounded tasks at low temperature are the sweet spot; open-ended, high-temperature generation is where the gains erode.
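
The break-even point can be estimated with the cost model from Leviathan et al., which assumes each drafted token is accepted independently with probability α. The sketch below searches for the smallest α at which a round is no slower than plain decoding; `gamma` is the draft length and `c` the cost of one draft step relative to one target pass. The function names and sample values are assumptions for illustration.

```typescript
// Expected speedup under the independent-acceptance model:
// (1 - alpha^(gamma + 1)) / ((1 - alpha) * (gamma * c + 1)).
function modelSpeedup(alpha: number, gamma: number, c: number): number {
  const tokens =
    alpha === 1 ? gamma + 1 : (1 - Math.pow(alpha, gamma + 1)) / (1 - alpha);
  return tokens / (gamma * c + 1);
}

// Smallest acceptance rate with speedup >= 1, found by bisection.
// Speedup increases with alpha, so bisection is valid.
// Returns null if even alpha = 1 cannot break even.
function breakEvenAlpha(gamma: number, c: number): number | null {
  if (modelSpeedup(1, gamma, c) < 1) return null;
  let lo = 0;
  let hi = 1;
  for (let i = 0; i < 60; i++) {
    const mid = (lo + hi) / 2;
    if (modelSpeedup(mid, gamma, c) >= 1) hi = mid;
    else lo = mid;
  }
  return hi;
}

for (const c of [0.02, 0.1, 0.3]) {
  const alpha = breakEvenAlpha(5, c);
  console.log(`c=${c} break-even alpha=${alpha?.toFixed(2)}`);
}
```

With 5 drafted tokens, a draft step costing 2% of a target pass breaks even at roughly α = 0.09 in this model, while one costing 30% needs roughly α = 0.62. Below the break-even rate, speculative decoding is slower than plain decoding.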

There is also a tooling consideration. The serving stack must support the variant you want. vLLM, SGLang, and TensorRT-LLM each support a different subset of draft-model, ngram, Medusa, and EAGLE drafting, so the choice of variant is partly a choice of serving runtime.

Frequently Asked Questions

What is speculative decoding?

Speculative decoding is an algorithm that samples from an autoregressive LLM faster without changing its outputs. A small, fast draft model proposes K candidate tokens, and the large target model verifies all of them in a single forward pass. Accepted tokens are generated at nearly no extra cost. Because LLM decoding is memory-bandwidth-bound, verifying a draft in one pass is much cheaper than generating the same tokens one at a time, which yields roughly 2-3x faster inference.

Does speculative decoding change output quality?

No. It is lossless. The verification step uses a modified rejection sampling scheme that provably preserves the target model's output distribution within hardware numerics, so the generated text is statistically identical to running the target alone. It needs no retraining or architecture changes. Leviathan et al. demonstrated identical outputs on T5-XXL with a 2x-3x speedup.

What is ngram or prompt-lookup speculative decoding?

Prompt lookup decoding (ngram speculation) replaces the draft model with string matching in the prompt. When the last few generated tokens match an earlier ngram, the subsequent tokens become candidates, so no separate model is needed. It is best on input-grounded tasks, claiming 2x-4x speedups and averaging about 2.4x on summarization and context-based QA versus a greedy baseline, with no change to output quality.

What is the difference between EAGLE, Medusa, and a draft model?

A draft model is a separate smaller model. Medusa adds extra decoding heads to the target and verifies with tree attention (Medusa-1 over 2.2x, Medusa-2 2.3-3.6x). EAGLE drafts at the feature level inside the target and stays lossless (2.7x-3.5x on LLaMA2-Chat 70B); EAGLE-2 adds a dynamic draft tree (3.05x-4.26x) and EAGLE-3 predicts tokens directly for up to 6.5x. On MT-bench, EAGLE is 3x faster than vanilla decoding and 1.6x faster than Medusa.

How much speedup does speculative decoding give?

Typical published speedups are 2-3x. Leviathan et al. reached 2x-3x on T5-XXL; DeepMind's speculative sampling reached 2-2.5x on the 70B Chinchilla model. Variant-specific numbers go higher: SGLang reports LLaMA-Instruct 3.1 8B going from 158.34 tok/s baseline to 244.10 tok/s with EAGLE-2 and 373.25 tok/s with EAGLE-3. The exact figure depends on the acceptance rate.

When does speculative decoding not help?

Speedup favors an assistant at least an order of magnitude smaller than the target, input-grounded tasks, and greedy or low-temperature sampling. Benefits shrink with a poor-quality assistant, high-temperature sampling that causes frequent rejection, or when assistant overhead outweighs verification savings. Low acceptance or a draft model that is too large can make the system slower than plain decoding.

What does Morph use?

Morph serves morph-v3-fast with ngram speculative decoding at k=64, combined with continuous batching, for ~10,500 tok/s. Ngram drafting fits code editing because the output file mostly repeats the input, so the acceptance rate on unchanged spans is high and the drafts come free from the prompt with no second model to run.

Speculative Decoding, Running in Production

Morph serves morph-v3-fast with ngram speculative decoding at k=64 plus continuous batching for ~10,500 tok/s. Lossless: output is provably identical to the target model. Built for AI coding agents at api.morphllm.com, OpenAI-compatible.
