LLM Context Window Comparison (2026): Every Model, Priced and Benchmarked

Complete LLM context window comparison table for 2026. Every major model's context length, max output, and per-token pricing. Plus the data most comparison pages skip: pricing surcharges above 200K tokens, effective vs advertised context (RULER benchmarks), and context rot across 18 models.

February 27, 2026 · 3 min read

Context windows have grown from 4K tokens to 10 million. But the number on a model card does not tell you what you actually get. Effective context degrades long before you fill the window. Pricing surcharges kick in at hidden thresholds. And more context often produces worse output, not better. This is every model, compared honestly.

10M
Largest context window (Llama 4 Scout)
20+
Models compared with pricing
2x
Hidden pricing surcharge above 200K
-63.9
Worst RULER score drop (Mixtral)

LLM Context Window Comparison Table (February 2026)

Every major model, sorted by context window size. Pricing is per million tokens. Where models have tiered pricing (e.g., different rates above 128K or 200K tokens), the base rate is listed first with the surcharge rate after the slash.

Model | Provider | Context | Max Output | Input $/M | Output $/M
Llama 4 Scout | Meta | 10M | - | Free | Free
Grok 4 | xAI | 2M | - | $3.00 | $15.00
Grok 4.1 Fast | xAI | 2M | - | $0.20 | $0.50
Gemini 2.5 Pro | Google | 1M (2M beta) | 64K | $1.25 / $2.50 | $10 / $15
Gemini 2.5 Flash | Google | 1M | 8K | $0.15 | $0.60
GPT-4.1 | OpenAI | 1M | 32K | $2.00 | $8.00
GPT-4.1 Mini | OpenAI | 1M | 32K | $0.40 | $1.60
Llama 4 Maverick | Meta | 1M | - | Free | Free
GPT-5.2 | OpenAI | 400K | 128K | $1.75 | $14.00
GPT-5 | OpenAI | 400K | 128K | $1.25 | $10.00
GPT-5 Nano | OpenAI | 400K | 128K | $0.05 | $0.40
o3 | OpenAI | 200K | 100K | $2.00 | $8.00
Claude Opus 4.6 | Anthropic | 200K (1M beta) | 64K | $5.00 | $25.00
Claude Sonnet 4.6 | Anthropic | 200K (1M beta) | 64K | $3.00 | $15.00
Claude Haiku 4.5 | Anthropic | 200K | - | $1.00 | $5.00
DeepSeek R1 | DeepSeek | 128K | 64K | $0.55 | $2.19
DeepSeek V3 | DeepSeek | 128K | 8K | $0.14 | $0.28
Mistral Large 3 | Mistral | 128K | - | $2.00 | $6.00
Qwen3-235B | Alibaba | 128K | - | ~$0.30-0.70 | ~$3-8
GPT-4o | OpenAI | 128K | 16K | $2.50 | $10.00

About this table

Pricing reflects API rates from each provider as of February 2026. Free models (Llama 4 Scout, Maverick) require self-hosting or third-party inference. "Max Output" marked as "-" means the provider does not publish a separate output cap. Gemini and Claude tiered pricing: the rate after the slash applies when the request exceeds 200K tokens.

Understanding Context Windows

A context window is the total number of tokens an LLM can process in a single request. This includes both your input (the prompt, system instructions, conversation history, files) and the model's output (its response). The two share the same budget.

Input vs. Output Tokens

Most models set separate limits. GPT-5.2 has a 400K context window with a 128K max output, meaning your input can be up to 272K tokens. Claude Opus 4.6 offers 200K context with 64K output. The gap between total context and max output is your input budget.

This distinction matters for coding agents. A model with 1M context but 8K max output (like Gemini 2.5 Flash) can ingest a massive codebase but can only generate a short response per turn. A model with 400K context but 128K max output (like GPT-5) can generate much longer responses, which matters for multi-file edits.
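The input budget falls directly out of those two numbers. A minimal sketch, using figures from the comparison table above:

```python
def input_budget(context_window, max_output):
    """Usable input tokens: total context minus tokens reserved for output."""
    return context_window - max_output

K = 1_000  # figures below are from the comparison table above
print(input_budget(400 * K, 128 * K))   # GPT-5.2: 272000
print(input_budget(200 * K, 64 * K))    # Claude Opus 4.6: 136000
print(input_budget(1_000 * K, 8 * K))   # Gemini 2.5 Flash: 992000
```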

How Tokenization Works

Tokens are not words. They are sub-word units determined by each model's tokenizer. A rough rule of thumb: 1 token is about 0.75 English words, or roughly 4 characters. Code typically tokenizes less efficiently than prose because of special characters, indentation, and variable names.

Practical equivalents for a 128K context window:

  • ~96,000 English words (~300 pages of text)
  • ~50,000-70,000 lines of code (depending on language)
  • A medium-sized codebase, or a large codebase with selective file inclusion
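Those equivalents follow from the rule of thumb above. A rough estimator, assuming the ~4-characters-per-token and ~0.75-words-per-token heuristics (real counts require each model's own tokenizer):

```python
def estimate_tokens(text):
    """Very rough estimate: ~4 characters per token for English prose.
    Code usually tokenizes worse; use the model's tokenizer for real counts."""
    return max(1, round(len(text) / 4))

def tokens_from_words(word_count):
    """Alternative heuristic: 1 token is about 0.75 words."""
    return round(word_count / 0.75)

print(tokens_from_words(96_000))  # 128000 -- matches the ~96K-words figure
```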

Context window != memory

A context window is not persistent memory. Every API call starts fresh. If you send 100K tokens in one request, the model does not "remember" them on the next request. Conversation history must be re-sent each time, which is why long sessions accumulate tokens fast and why context compression becomes essential.
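A sketch of why sessions accumulate tokens fast. The loop below is a hypothetical stateless chat client, not any provider's real API; the point is that the full history is re-sent on every call:

```python
# Hypothetical stateless chat loop (illustration only, not a real API):
# every request re-sends the full conversation history, so the tokens
# sent per call grow with every turn.
history = []

def send(user_msg, reply):
    """Record one turn; return the tokens re-sent on this request,
    using the rough 4-characters-per-token estimate."""
    history.append({"role": "user", "content": user_msg})
    payload = "".join(m["content"] for m in history)
    tokens_sent = len(payload) // 4
    history.append({"role": "assistant", "content": reply})
    return tokens_sent

for turn in range(3):
    sent = send("question " * 100, "answer " * 100)
    print(f"turn {turn}: ~{sent} tokens re-sent")
```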

Long-Context Pricing: The Hidden Surcharges

Most comparison pages list per-token pricing without mentioning the surcharges that activate at longer context lengths. These surcharges can double your actual cost.

Provider | Surcharge Threshold | Input Multiplier | Output Multiplier | Scope
Anthropic | 200K tokens | 2x | 1.5x | ALL tokens (not just overage)
Google | 200K tokens | 2x | 2x | ALL tokens (not just overage)
OpenAI | None | 1x | 1x | No surcharge at any length
xAI | None published | 1x | 1x | No surcharge documented
DeepSeek | None published | 1x | 1x | No surcharge documented

The surcharge applies to ALL tokens

Anthropic and Google do not charge extra only on the tokens above the threshold. When a request crosses 200K, the higher rate applies to every token in the request. A 199K-token request to Claude Sonnet 4.6 costs $3.00/M on input and $15.00/M on output. A 201K-token request costs $6.00/M on input and $22.50/M on output, applied to the entire request. This cliff-edge pricing means crossing 200K by even one token doubles your input cost.

What This Means in Practice

Consider a coding agent that routinely sends 250K-token requests to Claude Sonnet 4.6. Without the surcharge, input would cost $0.75 per request. With the surcharge (2x on all tokens), it costs $1.50 per request. Over 1,000 requests, that is an extra $750. The surcharge is not a rounding error. It is a line item in your infrastructure budget.
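The cliff-edge behavior is easy to model. A sketch using the Sonnet 4.6 rates from the table, assuming the surcharge triggers strictly above 200K tokens:

```python
def input_cost(tokens, base_rate, threshold=200_000, multiplier=2.0):
    """Input cost in dollars for one request. Past the threshold, the
    multiplied rate applies to EVERY token, not just the overage.
    (Assumes the surcharge triggers strictly above 200K tokens.)"""
    rate = base_rate * multiplier if tokens > threshold else base_rate
    return tokens / 1_000_000 * rate

# Claude Sonnet 4.6 at $3.00/M input (rates from the table above):
print(round(input_cost(199_000, 3.00), 3))  # 0.597 -- base rate
print(round(input_cost(250_000, 3.00), 3))  # 1.5   -- matches the $1.50 example
print(round(input_cost(201_000, 3.00), 3))  # 1.206 -- one step over the cliff
```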

OpenAI's decision not to surcharge at any length is a meaningful competitive advantage for workloads that consistently exceed 200K tokens. GPT-4.1 at $2.00/M stays at $2.00/M whether you send 10K or 900K tokens.

Effective Context vs. Advertised Context

A model's advertised context window is its theoretical capacity. Effective context is how well the model actually performs at that length. The gap between the two is often enormous.

RULER Benchmark Results

The RULER benchmark tests model performance at increasing context lengths with tasks like needle-in-a-haystack retrieval, multi-key-value lookup, and pattern matching. Performance at 4K tokens serves as the baseline.

Model | Claimed Context | Score @ 4K | Score @ 128K | Performance Drop
Gemini 1.5 Pro | 1M | 96.7 | 94.4 | -2.3 pts
GPT-4-1106 | 128K | 96.6 | 81.2 | -15.4 pts
Llama 3.1-70B | 128K | 96.5 | 66.6 | -29.9 pts
Mixtral-8x22B | 64K | 95.6 | 31.7 | -63.9 pts

Gemini 1.5 Pro is the clear outlier. It loses only 2.3 points going from 4K to 128K, meaning it uses nearly its full context window effectively. Every other model tested shows significant degradation. Mixtral-8x22B, despite advertising a 64K window, produces near-random results at 128K with a 63.9-point drop.

The practical implication: a model advertising 128K context might give you GPT-4-level quality at 4K tokens but mid-tier quality at 128K. You are not buying the same model at every context length. Performance at your typical input size matters more than the maximum advertised number.

-2.3
Gemini 1.5 Pro drop (4K to 128K)
-15.4
GPT-4-1106 drop (4K to 128K)
-29.9
Llama 3.1-70B drop (4K to 128K)
-63.9
Mixtral-8x22B drop (4K to 128K)

Context Rot: Why More Context Means Worse Output

Context rot is the degradation in LLM output quality as input length grows. It is not a theoretical concern. Chroma tested 18 models and measured it directly.

Chroma's Findings Across 18 Models

Every model tested showed degradation as context grew. No exceptions. But the most counterintuitive finding was about text ordering:

Condition | Performance | Why
Shuffled text | Higher accuracy | No narrative structure to create positional bias
Coherent text | Lower accuracy | Recency bias causes models to over-weight later passages

Models performed better on shuffled text than on coherent text. This is not a bug. Coherent text creates stronger positional patterns. The model develops recency bias, attending disproportionately to passages near the end of the input and neglecting earlier content. Shuffled text disrupts this bias, forcing more uniform attention across the input.

The implication for real-world use: ordering matters. If you put critical information at the start of a long prompt and less important content at the end, the model may still prioritize the later material. This makes naive "stuff everything into the context" approaches unreliable. It is not just a matter of fitting within the window. It is a matter of what the model actually attends to within that window.

Context rot is not about context limits

Context rot happens well before you hit the model's context window limit. A model with 200K tokens of capacity can start degrading at 50K. The window tells you what fits. It does not tell you what the model will actually use effectively. This is why context compression improves output quality, not just cost.

Cost at Scale: What Context Actually Costs

Per-token rates look similar in isolation. At scale, the gaps are massive. Here is what 1 billion input tokens per month costs for each tier of model.

$100
Gemini 2.5 Flash-Lite (1B tok/mo)
$100
GPT-4.1 Nano (1B tok/mo)
$140
DeepSeek V3 (1B tok/mo)
$3,000
Claude Sonnet 4.6 (1B tok/mo)
$5,000
Claude Opus 4.6 (1B tok/mo)
35x
Gap: cheapest to most expensive

The budget tier (DeepSeek V3, Gemini Flash-Lite, GPT-4.1 Nano) clusters around $100-140 per billion input tokens. The premium tier (Claude Sonnet, Opus) runs $3,000-5,000. That is a 35x spread. And this is before accounting for Anthropic's 2x surcharge above 200K tokens, which would push Opus to $10,000 per billion tokens for long-context workloads.
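These figures are straightforward arithmetic: 1 billion tokens is 1,000 million, so monthly cost is the per-million rate times 1,000. A quick sketch using rates from the comparison table and the 2x surcharge from the pricing section:

```python
def monthly_cost(rate_per_million, tokens_per_month=1_000_000_000):
    """Dollars per month: millions of tokens times the per-million rate."""
    return tokens_per_month / 1_000_000 * rate_per_million

# Rates from the comparison table above:
print(f"${monthly_cost(0.14):,.0f}")      # $140    DeepSeek V3
print(f"${monthly_cost(3.00):,.0f}")      # $3,000  Claude Sonnet 4.6
print(f"${monthly_cost(5.00):,.0f}")      # $5,000  Claude Opus 4.6
print(f"${monthly_cost(5.00 * 2):,.0f}")  # $10,000 Opus with the 2x surcharge
```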

Where Context Costs Compound

Coding agents are the worst case for context costs. A single agentic session might make 50-100 API calls, each carrying an expanding conversation history. If the agent reads files, runs commands, and accumulates tool outputs, the 50th call might include 150K tokens of conversation history. Multiply by thousands of users and the costs scale nonlinearly.

This is where context compression pays for itself. Reducing token count by 50% does not save 50% on a single request. It saves 50% on every subsequent request in the session, because the compressed history is carried forward. For a 100-call agent session, compressing context early can cut total session cost by 60-70%.
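A toy model of that compounding, assuming each call adds 2K tokens of history and compaction halves whatever is re-sent from call 10 onward. The exact savings depend on session shape and how aggressively you compact; this simple version lands near 50%, while the 60-70% figure assumes compaction keeps the carried-forward history itself small:

```python
def session_tokens(calls, growth_per_call, compress_at=None, ratio=0.5):
    """Total input tokens over an agent session in which every call
    re-sends the accumulated history. From call `compress_at` onward,
    only `ratio` of the history is re-sent (toy compaction model)."""
    total, history = 0, 0
    for call in range(calls):
        history += growth_per_call
        sent = history
        if compress_at is not None and call >= compress_at:
            sent = int(history * ratio)
        total += sent
    return total

full = session_tokens(100, 2_000)  # 10,100,000 tokens without compaction
compressed = session_tokens(100, 2_000, compress_at=10)
print(f"savings: {1 - compressed / full:.0%}")  # savings: 49%
```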

The Alternative: Less Context, Better Results

The race to bigger context windows assumes more context is always better. The research says otherwise.

Compression Outperforms Raw Context

CompLLM research demonstrated that 2x compressed context actually surpasses uncompressed performance on very long sequences. The mechanism is straightforward: removing noise improves the signal-to-noise ratio. The model has less to process and more of what remains is relevant.

The retrieve-then-solve approach produced an even starker result. By selecting only relevant context instead of feeding the full input, Mistral improved from 35.5% to 66.7% accuracy. Nearly double. Not by giving the model more context, but by giving it better context.

Approach | Strategy | Result
Full context (Mistral) | Send everything | 35.5% accuracy
Retrieve-then-solve (Mistral) | Select relevant context | 66.7% accuracy
2x compressed (CompLLM) | Compress before sending | Surpasses uncompressed

Morph Compact: Verbatim Context Reduction

Morph Compact reduces context by 50-70% while keeping every surviving sentence word-for-word identical to the original. No paraphrasing. No summarization. No hallucination risk in the compressed output. It runs at 3,300+ tokens per second with 98% verbatim accuracy.

For teams paying $3,000-5,000 per billion tokens on premium models, compacting context before sending it can cut that to $1,000-2,500 while also improving output quality by reducing noise. The cost savings and quality improvements work in the same direction.

50-70%
Context reduction (Morph Compact)
98%
Verbatim accuracy
3,300+
Tokens per second
0%
Hallucination risk

Frequently Asked Questions

Which LLM has the largest context window in 2026?

Llama 4 Scout from Meta holds the largest context window at 10 million tokens. Grok 4 from xAI follows at 2 million. Gemini 2.5 Pro, GPT-4.1, GPT-4.1 Mini, Gemini 2.5 Flash, and Llama 4 Maverick all support 1 million tokens. But the largest window does not mean the best performance. RULER benchmark tests show most models degrade significantly before reaching their advertised limits.

How much does it cost to use a 1 million token context window?

Costs vary dramatically. Filling a 1M context with GPT-4.1 costs $2.00 per request. Gemini 2.5 Flash costs $0.15 for the same input. Llama 4 Scout and Maverick are free to use (but require self-hosting or third-party inference). Anthropic and Google charge 2x surcharges above 200K tokens, which can double effective cost for long-context requests.

What is the difference between context window and max output tokens?

The context window is the total token budget for input plus output. Max output tokens is the ceiling on how much the model can generate. GPT-5.2 has a 400K context window with 128K max output, so your input can be up to 272K tokens. A model with large context but small max output (like Gemini 2.5 Flash at 1M/8K) can read a lot but writes short responses per turn.

Do LLMs actually use their full context window effectively?

No. RULER benchmark testing shows significant degradation at longer contexts. GPT-4-1106 drops from 96.6 at 4K to 81.2 at 128K. Llama 3.1-70B drops from 96.5 to 66.6. Gemini 1.5 Pro is the exception, holding at 94.4 at 128K with only a 2.3-point drop. Performance at your typical input size matters more than the maximum advertised number.

What is context rot and why does it matter?

Context rot is the degradation in LLM output quality as input length grows. Chroma tested 18 models and found every one degrades. Models performed better on shuffled text than coherent text because coherent text creates stronger recency bias. Filling a large context window with ordered information can produce worse results than a shorter, more focused input.

Is it better to use a larger context window or compress the context?

Research increasingly favors compression. CompLLM showed 2x compressed context surpasses uncompressed on long sequences. Morph Compact achieves 50-70% reduction with 98% verbatim accuracy at 3,300+ tok/s. The retrieve-then-solve approach improved Mistral from 35.5% to 66.7% by selecting relevant context. For most applications, focused input outperforms stuffed input.

Stop Paying for Wasted Context

Morph Compact reduces context by 50-70% while keeping every surviving sentence verbatim. Cut your token costs and improve output quality at the same time. 3,300+ tok/s, 98% verbatim accuracy, zero hallucination risk.