Grouped-Query Attention (GQA) vs Multi-Query and Multi-Head Attention

Grouped-query attention (GQA) splits a Transformer's query heads into G groups that each share one key-value head. It sits between multi-head attention, where every query head has its own KV head, and multi-query attention, where all query heads share one. Llama 3 70B uses 64 query heads with 8 KV heads, an 8x KV-cache reduction over MHA, recovering near-MHA quality at near-MQA speed. Morph serves long-context models up to 262k context where this cache footprint is the binding constraint.

8x

KV-cache reduction (Llama 3 70B, 64:8 heads)

64:8

Query heads : KV heads in Llama 3 70B

~5%

Pre-training compute to uptrain MHA to GQA

262k

Max context on Morph's fleet

The Problem: KV Cache Dominates Decode Memory

Autoregressive decoder inference is a severe bottleneck for Transformer models due to the memory-bandwidth overhead of loading the decoder weights and all attention keys and values at every decoding step. The compute is cheap. Reloading data from GPU memory is the cost.

A KV cache stores the key and value vectors computed for already-processed tokens so they are not recomputed on each step. Without it, the model would redo the same key and value projections for the whole prefix every single step. With it, the cache grows to store an increasing number of keys and values as generation progresses, and at long context it becomes a bottleneck.
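The saving can be made concrete by counting key/value projections during decoding. This is a minimal counting sketch, not a model implementation; the function name and the example prompt length and step count are illustrative assumptions.

```python
def kv_projections(prompt_len, num_steps, use_cache):
    """Count how many token positions get a key/value projection during decoding."""
    total = 0
    cached = 0  # number of positions whose keys/values are already stored
    for step in range(num_steps):
        prefix_len = prompt_len + step
        if use_cache:
            total += prefix_len - cached  # only positions not yet in the cache
            cached = prefix_len
        else:
            total += prefix_len  # the whole prefix is re-projected every step
    return total

# Hypothetical workload: 1,000-token prompt, 100 decode steps.
print(kv_projections(1_000, 100, use_cache=False))  # 104950
print(kv_projections(1_000, 100, use_cache=True))   # 1099
```

The cached count grows linearly with the sequence, while the uncached count grows quadratically; the price is the memory needed to hold every stored key and value.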

The numbers are large. For a Llama-2 7B model in float16 at a 10,000-token context, the KV cache alone needs roughly 5 GB, about one-third of the memory taken by the model's half-precision weights (per Hugging Face's KV-cache analysis). Unlike the fixed-size weights, the cache grows with every generated token and its final length is not known ahead of time, which is why it comes to dominate memory at long context.
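The roughly 5 GB figure can be reproduced from the model shape. The sketch below assumes Llama-2 7B's published configuration of 32 layers, 32 KV heads (MHA), and a head dimension of 128; those values are not stated in the text above, and the function name is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Total KV-cache size: keys and values for every layer, KV head, and token."""
    return 2 * bytes_per_elem * num_layers * num_kv_heads * head_dim * num_tokens

# Assumed Llama-2 7B shape: 32 layers, 32 KV heads, head_dim 128, float16 (2 bytes).
cache = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, num_tokens=10_000)
print(f"{cache / 1e9:.2f} GB")  # 5.24 GB
```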

The KV cache stores one key and one value vector per key-value head per layer per token. Reduce the number of key-value heads and the cache shrinks proportionally. That is the entire idea behind multi-query and grouped-query attention.

Why decode is memory-bound

The bottleneck in the forward pass is loading layer weights and the KV cache into the compute cores, not the arithmetic itself. Anything that reduces bytes-loaded-per-step (a smaller KV cache, fewer KV heads) directly raises decode throughput. This is the same reason speculative decoding works: verifying a batch of draft tokens in one pass loads the weights once instead of once per token.
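A back-of-envelope floor on decode latency illustrates the point: if each step must stream the weights and the KV cache once, the step can be no faster than bytes divided by bandwidth. Every number below is an illustrative assumption (a hypothetical 2 TB/s accelerator, a roughly 7B-parameter float16 model, the ~5 GB cache quoted earlier), not a measurement, and the function name is made up for this sketch.

```python
def decode_step_floor_seconds(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step: time to stream weights and KV cache once.

    Ignores arithmetic, kernel overheads, and any overlap; a rough sketch only.
    """
    return (weight_bytes + kv_cache_bytes) / bandwidth_bytes_per_s

# Illustrative inputs, not measurements.
WEIGHT_BYTES = 14e9                      # roughly a 7B-parameter model in float16
BANDWIDTH = 2e12                         # hypothetical 2 TB/s memory bandwidth
MHA_CACHE_BYTES = 5.24e9                 # the ~5 GB MHA cache at 10k tokens
GQA_CACHE_BYTES = MHA_CACHE_BYTES / 8    # hypothetical: same model with 8x fewer KV heads

mha_floor = decode_step_floor_seconds(WEIGHT_BYTES, MHA_CACHE_BYTES, BANDWIDTH)
gqa_floor = decode_step_floor_seconds(WEIGHT_BYTES, GQA_CACHE_BYTES, BANDWIDTH)
print(f"MHA floor: {mha_floor * 1e3:.2f} ms, fewer-KV-head floor: {gqa_floor * 1e3:.2f} ms")
```

With batching, the weight term is shared across sequences while each sequence brings its own cache, so the cache term matters more as batch size grows.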

MHA, MQA, and GQA in One Picture

All three are the same attention mechanism with a different number of key-value heads. The query heads stay the same. Only how many distinct KV heads they read from changes.

Multi-head attention (MHA) is the standard formulation: the input is projected into query, key, and value tensors using H separate attention heads, each with its own key and value projection. Every query head reads its own dedicated KV pair.

Multi-query attention (MQA) shares a single key head and a single value head across all query heads. The query is still projected to its full per-head shape, but the key and value projections collapse to one. This is the "one write-head" design.

Grouped-query attention (GQA) divides the query heads into G groups, each of which shares one key head and one value head. GQA-1 (a single group) is equivalent to MQA. GQA with G equal to the number of query heads is equivalent to MHA. Any G in between interpolates.
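The interpolation can be sketched as a single decode-step attention function in which the number of KV heads is the only thing that varies. This is a dependency-free illustration using plain lists, assuming contiguous groups (query head h reads KV head h // group_size); the function names are hypothetical and real implementations use batched tensor kernels.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gqa_decode_step(queries, keys, values):
    """Attention output for one new token.

    queries: H vectors, one per query head.
    keys, values: G lists (one per KV head), each holding T cached vectors.
    G == 1 behaves as MQA, G == H as MHA, anything between as GQA.
    """
    num_q, num_kv = len(queries), len(keys)
    assert num_q % num_kv == 0, "query heads must divide evenly into groups"
    group_size = num_q // num_kv
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for h, q in enumerate(queries):
        k_head = keys[h // group_size]    # KV head shared by this query's group
        v_head = values[h // group_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_head]
        weights = softmax(scores)
        dim = len(v_head[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, v_head)) for d in range(dim)])
    return outputs

# Toy example: 4 query heads, 2 cached tokens, head_dim 2.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cached_k = [[1.0, 0.0], [0.0, 1.0]]
cached_v = [[1.0, 2.0], [3.0, 4.0]]

mqa_out = gqa_decode_step(queries, [cached_k], [cached_v])          # G = 1
mha_out = gqa_decode_step(queries, [cached_k] * 4, [cached_v] * 4)  # G = H
print(mqa_out == mha_out)  # True
```

In the toy example the MHA call is fed four copies of the same keys and values, so both calls agree; the point is that the computation is identical and only the number of distinct KV heads stored differs.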

MHA vs MQA vs GQA

Scheme	KV heads	KV cache size	Quality	Decode speed	Used by
Multi-head (MHA)	One per query head	Largest (1x baseline)	Highest	Slowest	Llama 2 7B/13B
Grouped-query (GQA)	G groups (e.g. 8)	Reduced (e.g. 8x smaller)	Close to MHA	Near MQA	Llama 3, Mistral 7B
Multi-query (MQA)	One shared	Smallest (divided by H)	Can degrade	Fastest	PaLM, Falcon

The progression is monotonic. Fewer KV heads means a smaller cache, less data loaded per step, faster decode, and more room for batch or context, traded against the risk of quality loss when you collapse too far toward a single shared head.

Head Grouping, Concretely

Take a model with 64 query heads. Under MHA there are 64 key heads and 64 value heads, one per query head. Under MQA there is 1 key head and 1 value head, shared by all 64 query heads. Under GQA with 8 groups, there are 8 key heads and 8 value heads, and each KV head is shared by a group of 8 query heads.

Head grouping for a 64-query-head model

# MHA: 64 query heads, 64 KV heads (1 KV head per query head)
query heads:  q0  q1  q2  ... q62 q63
key  heads:   k0  k1  k2  ... k62 k63   # 64 keys
value heads:  v0  v1  v2  ... v62 v63   # 64 values

# GQA-8: 64 query heads grouped into 8 groups, 8 KV heads
group 0: [q0..q7]   -> k0, v0
group 1: [q8..q15]  -> k1, v1
group 2: [q16..q23] -> k2, v2
...
group 7: [q56..q63] -> k7, v7
# 8 keys + 8 values cached instead of 64 + 64

# MQA: 64 query heads, 1 shared KV head
all query heads [q0..q63] -> k0, v0
# 1 key + 1 value cached

The query projection is unchanged across all three. What shrinks is the number of distinct key and value projections, and therefore the number of key and value vectors written into the cache for every token. Llama 3 picks 8 groups: 32 query heads with 8 KV heads on the 8B model, 64 query heads with 8 KV heads on the 70B model.
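The grouping in the diagram reduces to one integer division. The helper below is a hypothetical illustration that assumes contiguous groups, as drawn above; a given implementation may lay heads out differently.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the KV head read by a query head, assuming contiguous groups."""
    assert num_query_heads % num_kv_heads == 0
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

# 64:8 (Llama 3 70B): groups of 8 query heads.
print([kv_head_for(q, 64, 8) for q in (0, 7, 8, 63)])   # [0, 0, 1, 7]
# 32:8 (Llama 3 8B): groups of 4 query heads.
print([kv_head_for(q, 32, 8) for q in (0, 3, 4, 31)])   # [0, 0, 1, 7]
```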

The KV-Cache Reduction Math

The memory required to store the KV cache of one token is roughly:

Per-token KV-cache size

bytes_per_token = 2 * 2 * num_layers * num_key_value_heads * head_dim

#   first 2  -> keys AND values are both stored
#   second 2 -> bytes per element in float16
#   only num_key_value_heads differs between MHA, GQA, MQA

Every term except num_key_value_heads is identical between MHA, GQA, and MQA. The cache therefore shrinks in direct proportion to how many KV heads you keep. Going from MHA to MQA reduces H key-value heads to a single one, shrinking the cache (and the key-value data loaded per step) by a factor of H, the number of query heads.

KV-cache reduction by scheme (64-query-head model)

Scheme	KV heads	Cache vs MHA	Effect
MHA	64	1x (baseline)	Largest cache, highest quality
GQA-8	8	8x smaller	Llama 3 70B configuration
GQA-2	2	32x smaller	Aggressive grouping
MQA	1	64x smaller	Maximum reduction, quality risk
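The table follows directly from the per-token formula. The sketch below assumes 80 layers and a head dimension of 128 (Llama 3 70B's published shape, not stated in the text above) to put absolute byte counts next to the ratios; the ratios themselves do not depend on those two values.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Per-token KV-cache size: keys and values across all layers and KV heads."""
    return 2 * bytes_per_elem * num_layers * num_kv_heads * head_dim

# Assumed shape for the 64-query-head model: 80 layers, head_dim 128, float16.
NUM_LAYERS, HEAD_DIM, NUM_QUERY_HEADS = 80, 128, 64

mha_bytes = kv_bytes_per_token(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM)
reduction = {}
for scheme, kv_heads in [("MHA", 64), ("GQA-8", 8), ("GQA-2", 2), ("MQA", 1)]:
    size = kv_bytes_per_token(NUM_LAYERS, kv_heads, HEAD_DIM)
    reduction[scheme] = mha_bytes // size
    print(f"{scheme:6s} {kv_heads:3d} KV heads  {size:>9,d} bytes/token  {reduction[scheme]}x smaller")
```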

Llama 2 70B and Llama 3 70B both use 64 query attention heads and 8 key-value heads, an 8x reduction in KV-cache size versus a 64-KV-head MHA model. Mistral 7B uses 32 query heads with 8 KV heads, a 4x reduction. The model author chooses G to trade cache footprint against quality.

8x

Llama 3 70B (64:8 heads)

4x

Mistral 7B (32:8 heads)

64x

MQA on a 64-head model

~5 GB

Llama-2 7B MHA cache at 10k tokens

Why Smaller KV Cache Means Longer Context and Higher Throughput

The KV cache competes with two things for GPU memory: the context window and the batch size. A larger cache per token means either fewer tokens of context or fewer concurrent sequences. Shrinking the cache with GQA buys back both.

Longer context: the cache scales linearly with sequence length. If MHA fills VRAM at 16k tokens, an 8x-smaller GQA cache fits roughly 8x more tokens of context in the same memory before hitting the same ceiling, holding everything else constant. This is why long-context models lean on GQA.

Higher throughput: multi-query and grouped-query attention read only a fraction of the key-value data from memory and reduce the cache size, allowing room for larger batch sizes (per NVIDIA's inference-optimization guidance). More sequences batched together means more tokens generated per unit of memory-bandwidth, which is the bound resource. NVIDIA frames GQA explicitly as balancing the memory requirement against model quality.
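Both effects fall out of dividing a fixed cache budget by the per-token size. The sketch below uses hypothetical per-token sizes for a 64-query-head model (80 layers, head dimension 128, float16) and picks the budget so that MHA fills it at 16,384 tokens; the function names and the 4,096-token request length are illustrative assumptions.

```python
def max_context_tokens(cache_budget_bytes, bytes_per_token):
    """Longest single sequence whose KV cache fits in the budget."""
    return cache_budget_bytes // bytes_per_token

def max_batch_size(cache_budget_bytes, bytes_per_token, context_tokens):
    """Number of equal-length sequences whose caches fit together."""
    return cache_budget_bytes // (bytes_per_token * context_tokens)

# Hypothetical per-token sizes: 2 (K and V) * 2 bytes * layers * KV heads * head_dim.
MHA_BYTES_PER_TOKEN = 2 * 2 * 80 * 64 * 128
GQA8_BYTES_PER_TOKEN = 2 * 2 * 80 * 8 * 128

# Budget chosen so that MHA fills it at exactly 16,384 tokens.
budget = 16_384 * MHA_BYTES_PER_TOKEN

print(max_context_tokens(budget, MHA_BYTES_PER_TOKEN))     # 16384
print(max_context_tokens(budget, GQA8_BYTES_PER_TOKEN))    # 131072
print(max_batch_size(budget, MHA_BYTES_PER_TOKEN, 4_096))  # 4
print(max_batch_size(budget, GQA8_BYTES_PER_TOKEN, 4_096)) # 32
```

The same 8x factor can be spent entirely on context, entirely on batch size, or split between the two.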

The two benefits compound with serving-system optimizations. PagedAttention limits KV-cache memory waste to under 4% (prior systems wasted 60-80% to fragmentation and over-reservation), and continuous batching inserts new sequences into the batch the moment one finishes. GQA reduces the per-token cost; these systems pack what remains efficiently. See LLM inference optimization for how these layers stack.

The Quality Tradeoff

Reducing KV heads is not free. The MQA extreme, a single shared key-value head, is where quality risk appears. The GQA paper found that multi-query attention can lead to quality degradation and training instability, in particular when combined with long-input tasks. The original MQA paper itself reports only minor quality degradation, but the instability under fine-tuning is the practical concern.

GQA is the answer to that risk. An intermediate number of groups yields an interpolated model that is higher quality than MQA but faster than MHA. In the paper's framing, uptrained GQA achieves quality close to multi-head attention while being almost as fast as multi-query attention. You keep most of the speed without paying most of the quality cost.

Conversion is cheap. The GQA paper shows that existing MHA checkpoints can be uptrained to use GQA (or MQA) with about 5% of the original pre-training compute. A model author does not need to train from scratch to adopt GQA; they convert an existing MHA model and fine-tune briefly. All experiments in the paper are based on the T5.1.1 architecture.
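The GQA paper describes the conversion step, which the text above does not spell out, as mean-pooling the key and value projection matrices of the original heads within each group, followed by the short uptraining run. The sketch below shows only that pooling step on plain nested lists; the function name and toy weights are hypothetical, and a real conversion would operate on the checkpoint's actual tensors.

```python
def mean_pool_kv_heads(head_weights, num_groups):
    """Average per-head projection matrices within each contiguous group.

    head_weights: list of H matrices (each a list of rows), one per original head.
    Returns num_groups pooled matrices to initialize the grouped KV heads.
    """
    num_heads = len(head_weights)
    assert num_heads % num_groups == 0, "heads must divide evenly into groups"
    group_size = num_heads // num_groups
    pooled = []
    for g in range(num_groups):
        group = head_weights[g * group_size:(g + 1) * group_size]
        rows, cols = len(group[0]), len(group[0][0])
        pooled.append([
            [sum(w[r][c] for w in group) / group_size for c in range(cols)]
            for r in range(rows)
        ])
    return pooled

# Toy example: 4 original key heads, each a 1x2 projection matrix.
key_heads = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
print(mean_pool_kv_heads(key_heads, num_groups=2))  # [[[2.0, 3.0]], [[6.0, 7.0]]]
print(mean_pool_kv_heads(key_heads, num_groups=1))  # [[[4.0, 5.0]]]
```

Pooling with one group gives the MQA initialization and pooling with as many groups as heads leaves the MHA weights unchanged, mirroring the GQA-1 and GQA-H endpoints.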

The tradeoff stated plainly

MQA is fastest and smallest but can degrade quality and destabilize training on long inputs. MHA is highest quality but carries the full KV cache. GQA picks a G in between to get most of MQA's memory and speed with most of MHA's quality. There is no free lunch: a more aggressive grouping (smaller G) means a smaller cache and higher quality risk. Llama 3 and Mistral converged on 8 KV heads as a practical operating point.

Which Models Use GQA

GQA became standard for open-weight models after Llama 2 demonstrated it on large models. The configurations below are read from the models' published configs.

Attention configuration by model

Model	Query heads	KV heads	Scheme	KV reduction
Llama 2 7B / 13B	32	32	MHA	1x
Llama 2 34B / 70B	64	8	GQA	8x
Llama 3 8B	32	8	GQA	4x
Llama 3 70B	64	8	GQA	8x
Mistral 7B	32	8	GQA	4x
PaLM 540B	48	1	MQA	48x
Falcon 7B	71	1	MQA	71x
Falcon 40B	128	8	MQA variant (8 KV heads)	16x

Llama 2 adopted GQA specifically to improve inference scalability for its larger models: the 34B and 70B use GQA while the 7B and 13B keep standard MHA. Llama 3 extended GQA to every size, including the 8B. Mistral 7B uses GQA from the start with 32 query heads and 8 KV heads.

The full-MQA models are older or take a different bet. PaLM uses multi-query attention and reports a neutral effect on quality and training speed with significant decode-time cost savings. Falcon-7B and Falcon-40B use multi-query attention together with FlashAttention, with Falcon-40B using an internal variant that keeps independent keys and values per tensor-parallel degree.

GQA in Production Serving

On a serving fleet, KV-cache footprint sets the ceiling on both the context window you can offer and the number of concurrent requests you can batch. A smaller cache, via GQA, is what makes long-context serving economically viable rather than a feature you bolt on later.

Morph builds inference infrastructure for AI coding agents and runs long-context models up to 262k context on its production fleet, where the KV cache is the binding constraint on context length and batch size. The served models include glm52-744b and minimax27-230b (~140 tok/s); at those parameter counts and context lengths the per-token KV-cache footprint, set by the model's KV-head count, directly determines how many sequences fit on a node and therefore the cost per request.

GQA is one layer of the memory-bandwidth story. It cuts the per-token cache. PagedAttention cuts fragmentation waste to under 4%. FP8 KV-cache quantization halves the bytes per element relative to float16. Speculative decoding amortizes weight loads across multiple tokens. Each attacks the same memory-bandwidth bottleneck from a different angle, and they stack. For how a long-context window is sized against this memory budget, see LLM context window.
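Because head count and bytes per element are separate factors in the per-token formula, their reductions multiply. A minimal sketch, with a hypothetical function name, assuming a float16 baseline (2 bytes) and an FP8 cache (1 byte):

```python
def combined_kv_reduction(num_query_heads, num_kv_heads,
                          baseline_bytes_per_elem=2, cache_bytes_per_elem=1):
    """Per-token cache reduction versus a float16 MHA baseline."""
    head_factor = num_query_heads / num_kv_heads
    precision_factor = baseline_bytes_per_elem / cache_bytes_per_elem
    return head_factor * precision_factor

print(combined_kv_reduction(64, 8))                          # 16.0 (GQA 64:8 plus FP8)
print(combined_kv_reduction(64, 8, cache_bytes_per_elem=2))  # 8.0  (GQA 64:8 alone)
```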

Frequently Asked Questions

What is grouped-query attention?

Grouped-query attention (GQA) divides a Transformer's query heads into G groups, each sharing a single key head and value head. It uses more than one key-value head but fewer than the number of query heads, an intermediate point between MHA (one KV head per query head) and MQA (one shared KV head). GQA-1 equals MQA; GQA with groups equal to the head count equals MHA.

What is the difference between GQA, MQA, and MHA?

MHA gives every query head its own key and value head: largest KV cache, highest quality, slowest decode. MQA shares one key-value head across all query heads: smallest cache, fastest, but risks quality degradation and training instability. GQA is the middle ground, G groups each sharing one KV head, recovering quality close to MHA while running almost as fast as MQA.

How much does GQA reduce the KV cache?

The cache scales directly with the number of key-value heads, since per-token cache is roughly 2 * 2 * num_layers * num_key_value_heads * head_dim bytes. Llama 3 70B uses 64 query heads with 8 KV heads, an 8x reduction versus the 64-KV-head MHA equivalent. Full MQA (one KV head) reduces the cache by a factor of H, the total head count.

Does GQA hurt model quality?

GQA recovers quality close to multi-head attention while running almost as fast as multi-query attention. The risk lives at the MQA extreme: the GQA paper found multi-query attention can cause quality degradation and training instability, especially on long-input tasks. Keeping several KV-head groups rather than one shared head is what restores near-MHA quality.

Which models use grouped-query attention?

Llama 2 adopted GQA for its larger models (34B and 70B use 64 query heads with 8 KV heads; 7B and 13B use MHA). Llama 3 8B uses 32 query heads with 8 KV heads, and Llama 3 70B uses 64 query heads with 8 KV heads. Mistral 7B uses GQA with 32 query heads and 8 KV heads. PaLM and Falcon use the related multi-query attention.

Why does GQA enable longer context?

The KV cache grows with context length and at long context dominates GPU memory, since it stores keys and values for every prior token. Keeping fewer key-value heads shrinks the per-token cache proportionally, freeing memory the model can spend on a longer context window or on larger batches for higher throughput. A smaller KV cache is what makes long-context serving affordable.

Can an existing MHA model be converted to GQA?

Yes. The GQA paper shows existing MHA checkpoints can be uptrained to GQA or MQA with about 5% of the original pre-training compute. Uptrained GQA achieves quality close to multi-head attention while running almost as fast as multi-query attention, so model authors can gain the KV-cache reduction without training from scratch.

Long-Context Serving, Built on a Smaller KV Cache

Grouped-query attention is why long-context models are affordable to serve. Morph runs models up to 262k context on its production fleet, where KV-cache footprint sets the ceiling on context length and batch size. Access them through one OpenAI-compatible API at api.morphllm.com.
