Claude Context Window: 200K Tokens, 1M Beta, and What the Research Shows

Claude models have a 200K token context window (1M in beta). But context window size and effective context are different things. This guide covers exact specs for every Claude model, how Claude compares to GPT-4.1 and Gemini, prompt caching, context rot research, and practical token budgets for code.

March 12, 2026 · 2 min read

Claude's context window is 200,000 tokens. Opus 4.6 and Sonnet 4.6 support 1 million tokens in beta. Those are the specs. The more useful question: how much of that context does the model actually use well? Research from Chroma, NVIDIA, and Anthropic's own benchmarks tells a more nuanced story about context window size vs. effective context quality.

200K: Default context window (tokens)
1M: Beta context window (Opus/Sonnet 4.6)
76%: Opus 4.6 MRCR v2 score (vs 18.5% prior)
90%: Cost reduction with prompt caching

Every Claude Model's Context Window

Every current Claude model has a 200K token context window. The differences are in max output tokens, pricing, and whether the 1M beta is available.

| Model | Context Window | Max Output | Input $/MTok | Output $/MTok |
|---|---|---|---|---|
| Claude Opus 4.6 | 200K / 1M (beta) | 128K | $5 | $25 |
| Claude Sonnet 4.6 | 200K / 1M (beta) | 64K | $3 | $15 |
| Claude Haiku 4.5 | 200K | 64K | $1 | $5 |
| Claude Sonnet 4.5 | 200K / 1M (beta) | 64K | $3 | $15 |
| Claude Opus 4.5 | 200K | 64K | $5 | $25 |
| Claude Sonnet 4 | 200K / 1M (beta) | 64K | $3 | $15 |

The 1M token context window requires the context-1m-2025-08-07 beta header and is restricted to organizations in usage tier 4 or with custom rate limits. Requests exceeding 200K input tokens are charged at premium rates: 2x input and 1.5x output pricing.
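The premium arithmetic can be sketched as a small calculator. The 200K threshold and the 2x/1.5x multipliers come from the pricing above; the function itself is illustrative:

```python
def request_cost(input_tokens, output_tokens, input_rate, output_rate,
                 long_context_threshold=200_000):
    """Estimate the dollar cost of one request, applying the long-context
    premium (2x input, 1.5x output) when input exceeds 200K tokens.
    Rates are in $/MTok."""
    if input_tokens > long_context_threshold:
        input_rate *= 2.0
        output_rate *= 1.5
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Sonnet 4.6 ($3/$15 per MTok): an 800K-input, 10K-output request
# 800K * $6/MTok + 10K * $22.50/MTok = $4.80 + $0.225 = $5.025
cost = request_cost(800_000, 10_000, 3.0, 15.0)
```

A 100K-input request under the same rates stays at standard pricing, since the premium only triggers above the 200K threshold.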

200K tokens in practice

200K tokens is roughly 150,000 words, 500 pages of text, or 680,000 Unicode characters. For code, that translates to approximately 50,000-70,000 lines depending on language density. Opus 4.6's 128K max output is double the previous 64K limit, enabling longer responses and larger thinking budgets.

Context Window vs Effective Context

A context window is how many tokens you can send. Effective context is how many of those tokens the model uses reliably. They are not the same number.

Chroma published the most comprehensive study of this gap in July 2025, testing 18 models across controlled experiments. The headline finding: performance degrades at every context length increment, not just near the limit. A model with a 1M token window still exhibits measurable degradation at 50K tokens.

The Lost-in-the-Middle Problem

Research from Stanford and others established the U-shaped attention curve. LLMs attend most strongly to tokens at the beginning and end of the context. Information in the middle gets deprioritized. This pattern holds across all transformer-based architectures.
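One mitigation this research suggests is placing the most important material at the edges of the prompt, where attention is strongest. A minimal sketch, assuming documents arrive ranked by importance; the helper and its alternating strategy are illustrative, not a documented API:

```python
def order_for_attention(docs_by_priority):
    """Arrange documents so the highest-priority ones sit at the start
    and end of the context, pushing low-priority material toward the
    middle, where the U-shaped attention curve deprioritizes it.

    docs_by_priority: list ordered from most to least important."""
    front, back = [], []
    for i, doc in enumerate(docs_by_priority):
        (front if i % 2 == 0 else back).append(doc)
    # Reverse the back half so the second-most-important doc lands last.
    return front + back[::-1]

# Ranked docs A (most important) through E (least important)
# come out as [A, C, E, D, B]: A opens, B closes, E sits mid-context.
ordered = order_for_attention(["A", "B", "C", "D", "E"])
```

This kind of positional curation costs nothing at the API level; it only changes the order in which you concatenate context.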

Chroma's study added a counterintuitive finding: all 18 models tested performed better on shuffled, incoherent contexts than on logically structured ones. The attention mechanism behaves differently with structural patterns, suggesting that coherent documents may actually make retrieval harder, not easier.

50-65%: Typical effective context (NVIDIA RULER)
76%: Opus 4.6 MRCR v2 (8 needles, 1M tokens)
18.5%: Sonnet 4.5 on same MRCR v2 test

NVIDIA's RULER benchmark puts effective context at 50-65% of advertised capacity for most models. A model advertising 200K tokens typically becomes unreliable around 130K. Claude Opus 4.6 is a notable exception: its 76% score on MRCR v2 (finding 8 needles across 1M tokens) represents what Anthropic calls "a qualitative shift" in usable context, up from 18.5% for its predecessor.

Context rot is not a bug, it is physics

Quadratic attention scaling means each added token increases the number of pairwise relationships the model must track. Semantically similar distractors interfere with retrieval. And positional encoding degrades at distances not seen during training. No architecture change eliminates these effects. The question is which models degrade slowest.

How Claude Compares to GPT-4.1, Gemini, and Others

Claude's 200K default context window is the smallest among frontier models. GPT-4.1 ships with 1M. Gemini 2.5 Pro has 1M with 2M coming. Llama 4 Scout claims 10M. On raw numbers, Claude loses.

On effective utilization, the picture is different. Chroma's research found that Claude models "decay the slowest overall," while GPT models were "more erratic with random mistakes and outright refusals," and Gemini "starts to mess up earlier with wild variations."

| Model | Context Window | Max Output | Effective Context Notes |
|---|---|---|---|
| Claude Opus 4.6 | 200K / 1M beta | 128K | 76% MRCR v2 at 1M; slowest decay in Chroma study |
| Claude Sonnet 4.6 | 200K / 1M beta | 64K | <5% accuracy degradation across full 200K range |
| GPT-4.1 | 1M | 32K | 100% needle accuracy at 900K; erratic on complex tasks |
| GPT-5.4 | 128K | 16K | Previous generation, smaller window |
| Gemini 2.5 Pro | 1M (2M soon) | 64K | 99.7% recall at 1M; early degradation on reasoning |
| Gemini 3 Pro | 1M | 64K | 26.3% at 1M on own eval card |
| DeepSeek V3.1 | 128K | 16K | Strong at shorter contexts, untested at length |
| Llama 4 Scout | 10M | varies | Largest window; limited independent benchmarks |
| Mistral Large 3 | 256K | varies | Mid-range window, competitive pricing |

GPT-4.1 claims 100% needle accuracy at 900K tokens, but needle-in-a-haystack tests measure retrieval of a single planted fact, not reasoning over distributed information. Chroma's study found GPT models had the highest hallucination rates when distractors were present and a 2.55% task refusal rate. Gemini 2.5 Pro reports 99.7% recall at 1M tokens on Google's own benchmark, but Gemini 3 Pro dropped to 26.3% at 1M on its own model evaluation card.

The numbers reveal a pattern: models perform well on the benchmarks their makers design. Independent evaluations like Chroma's paint a less optimistic picture for every model.

Prompt Caching: 90% Cost Reduction

Prompt caching is Anthropic's mechanism for reusing processed context across API calls. Instead of reprocessing the same system prompt, documents, or conversation history on every request, the API reads from cache at a fraction of the cost.

90%: Max cost reduction
85%: Max latency reduction
0.1x: Cache read vs. standard input price
2 calls: Break-even for 5-min cache

The economics are straightforward. A cache write costs 1.25x base input price for a 5-minute TTL, or 2x for a 1-hour TTL. Cache reads cost 0.1x base input price. For Sonnet 4.6 at $3/MTok input, that means a cache read costs $0.30/MTok, 10x cheaper than reprocessing. The break-even is two API calls for 5-minute caching.

| Operation | Multiplier | Sonnet 4.6 Cost | Opus 4.6 Cost |
|---|---|---|---|
| Standard input | 1x | $3/MTok | $5/MTok |
| 5-min cache write | 1.25x | $3.75/MTok | $6.25/MTok |
| 1-hour cache write | 2x | $6/MTok | $10/MTok |
| Cache read (hit) | 0.1x | $0.30/MTok | $0.50/MTok |

For a concrete example: a 100K token system prompt with documents drops from 11.5 seconds to 2.4 seconds with caching. An application making 100 API calls per hour with a shared 50K token system prompt pays about $1.50/hour in cache reads instead of $15/hour of reprocessing on Sonnet 4.6, a saving of roughly $13.50/hour in input costs.
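The break-even arithmetic can be sketched directly. The multipliers come from the table above; the function names are illustrative:

```python
def caching_cost(calls, cached_tokens, base_rate,
                 write_mult=1.25, read_mult=0.10):
    """Dollar cost of processing a shared prefix across N calls:
    one 5-min cache write, then cache reads. base_rate is $/MTok input."""
    write = cached_tokens * base_rate * write_mult / 1_000_000
    reads = (calls - 1) * cached_tokens * base_rate * read_mult / 1_000_000
    return write + reads

def uncached_cost(calls, tokens, base_rate):
    """Cost of reprocessing the same prefix on every call."""
    return calls * tokens * base_rate / 1_000_000

# 50K-token prefix on Sonnet 4.6 ($3/MTok):
# 1 call:  1.25x cached vs 1x uncached -> caching loses
# 2 calls: 1.35x cached vs 2x uncached -> caching wins
```

The crossover at exactly two calls holds for any base rate, since the multipliers (1.25x + 0.1x versus 1x + 1x) are what determine it.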

Prompt caching is context engineering

Caching does not just reduce cost. It changes how you architect context. When repeated context is nearly free, you can include more reference material, longer system prompts, and richer examples without the cost penalty. The effective context budget shifts from "what can I afford to send" to "what should the model see."

Extended Thinking and Context

Extended thinking lets Claude reason through problems before responding. The thinking tokens are billed as output but count toward the context window during the current turn. The critical detail: previous thinking blocks are automatically stripped from context in subsequent turns.

This means a multi-turn conversation where Claude thinks for 20K tokens per turn does not accumulate 100K tokens of thinking across 5 turns. Each turn's thinking is discarded before the next, so reasoning does not eat your context window over time.

Thinking Tokens Are Ephemeral

Previous thinking blocks are stripped automatically. A 10-turn conversation with 20K thinking tokens per turn uses the same context as one without thinking, plus the current turn's allocation.
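The accounting can be sketched as follows. This is a simplified model that ignores tool results and system prompt overhead; the function name is illustrative:

```python
def context_used(turn_tokens, thinking_per_turn, turns):
    """Tokens occupying the context at turn `turns` (1-indexed),
    assuming prior thinking blocks are stripped automatically.
    Only conversation tokens accumulate; thinking counts once."""
    conversation = sum(turn_tokens[:turns])
    return conversation + thinking_per_turn  # current turn's thinking only

# 5 turns of 4K conversation tokens each, 20K thinking per turn:
# the context holds 20K conversation + 20K current thinking = 40K,
# not 20K + 100K of accumulated thinking.
usage = context_used([4000] * 5, 20_000, 5)
```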

Context Awareness

Sonnet 4.6 and Haiku 4.5 receive live token budget updates after each tool call. The model knows exactly how much context remains and paces itself accordingly.

Opus 4.6 supports 128K max output tokens, double the previous 64K limit. This gives extended thinking more room to reason through complex problems without hitting output limits. The thinking budget is set via the budget_tokens parameter, and with adaptive thinking, Claude dynamically decides how much thinking each request needs.

How Much Code Fits in 200K Tokens

Token-to-code ratios vary by language. Python and JavaScript average about 2.5-3.5 tokens per line of code. Verbose languages like Java run higher. Comments and docstrings add overhead. Here are practical estimates:

| Context Size | Lines of Code (est.) | Equivalent Project |
|---|---|---|
| 32K tokens | ~10,000-13,000 lines | Single large module or small library |
| 128K tokens | ~40,000-50,000 lines | Medium application or framework component |
| 200K tokens | ~50,000-70,000 lines | Full medium-sized codebase |
| 1M tokens | ~250,000-350,000 lines | Large monorepo or enterprise application |

A typical React project with 200 files averaging 150 lines (30,000 total lines) fits comfortably in 200K tokens. A 100,000+ line enterprise codebase does not. But fitting everything into context is usually the wrong approach.
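These estimates can be turned into a rough feasibility check. A sketch only: the tokens-per-line ratio and the reserve fraction are assumptions drawn from the ranges above, not measured values:

```python
def fits_in_context(total_lines, context_tokens=200_000,
                    tokens_per_line=3.0, reserve=0.25):
    """Rough check of whether a codebase fits in the context window.
    tokens_per_line is ~2.5-3.5 for Python/JS, higher for verbose
    languages. `reserve` holds back a fraction of the window for the
    system prompt, output, and thinking."""
    budget = context_tokens * (1 - reserve)
    return total_lines * tokens_per_line <= budget

# 30,000-line React project: 90K tokens vs a 150K budget -> fits
# 100,000-line codebase: 300K tokens -> does not fit at 200K
small_ok = fits_in_context(30_000)
large_ok = fits_in_context(100_000)
```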

Cognition (makers of Devin) measured that their AI coding agent spent 60% of its time searching for code before making changes. The bottleneck was not context window size. It was finding the right code to put in context.

Context quality over context quantity

Putting 200K tokens of code into context when you only need 5K tokens of relevant code produces worse results than sending just the 5K. Context rot means more input tokens degrade output quality. The goal is not to fill the window. It is to fill it with the right content.

Context Engineering > Context Size

The context window arms race, from 4K to 200K to 1M to 10M, misses the point. Chroma's research demonstrated that performance degrades at every length increment across all 18 models tested. Bigger windows give you capacity, not quality.

Context engineering is the practice of curating what goes into the window: selecting relevant information, removing noise, structuring content for attention patterns, and managing token budgets across multi-turn interactions. Anthropic's own documentation states that Claude "achieves state-of-the-art results on long-context retrieval benchmarks, but these gains depend on what's in context, not just how much fits."

Practical strategies

Semantic Search Before Context Loading

Find the relevant code first, then load it into context. Tools like WarpGrep use RL-trained search to find relevant code in 3.8 steps (0.73 F1) vs. 12.4 steps for baseline approaches, reducing both latency and unnecessary context consumption.

Prompt Caching for Stable Context

Cache system prompts, documentation, and reference code. Pay the write cost once, then read at 10% of input price. Keeps your effective context budget high without the cost of reprocessing.
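A sketch of how a cached system prompt is marked in a Messages API request, using the cache_control block Anthropic documents for prompt caching. The model id and prompt contents are placeholders, and the request is built as a plain dict here rather than sent; in a real client you would pass it to `anthropic.Anthropic().messages.create(**params)`:

```python
SYSTEM_PROMPT = "You are a code review assistant."
REFERENCE_DOCS = "...large, stable reference material..."

params = {
    "model": "claude-sonnet-4-6",  # placeholder id; check current model names
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": SYSTEM_PROMPT},
        {
            "type": "text",
            "text": REFERENCE_DOCS,
            # Marks the prefix up to and including this block for
            # caching (default 5-minute TTL).
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [{"role": "user", "content": "Review this diff: ..."}],
}
```

Putting the stable material (system prompt, docs) before the cache marker and the per-request content after it is what makes the cached prefix reusable across calls.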

Efficient Code Edits

Code editing is one of the highest-token operations in agent workflows. Fast Apply generates edits at 10,500 tok/s with 98% accuracy, reducing the output token cost of applying changes to files.

Subagent Isolation

Delegate search, testing, and analysis to subagents with isolated context windows. The main agent receives only the summary, keeping its context clean. Up to 10 concurrent subagents yield an effective 2M tokens of combined context across their isolated windows.

The models with the biggest context windows are not always the best choice. A 200K window filled with precisely relevant code outperforms a 1M window filled with an entire repository. The tooling that selects what goes into context determines outcomes more than the window itself.

Frequently Asked Questions

What is Claude's context window size?

All current Claude models (Opus 4.6, Sonnet 4.6, Haiku 4.5) have a 200,000 token context window. Opus 4.6, Sonnet 4.6, Sonnet 4.5, and Sonnet 4 also support 1M tokens in beta, available to organizations in usage tier 4. The 200K window holds roughly 150,000 words or 500 pages of text.

How does Claude's context window compare to GPT-4.1 and Gemini?

GPT-4.1 has a 1M token window by default. Gemini 2.5 Pro has 1M with 2M coming. Claude's default is 200K, with 1M in beta. Raw size favors competitors, but Chroma's research found Claude models decay the slowest with increasing context length, while GPT models showed more erratic behavior and Gemini degraded earlier on complex tasks.

What is prompt caching and how much does it save?

Prompt caching lets you mark context for reuse across API calls. Cache reads cost 10% of standard input price, saving up to 90% on costs and 85% on latency. A 100K token prompt drops from 11.5s to 2.4s with caching. The break-even is two API calls within the cache's 5-minute lifetime; every read after that compounds the savings.

Does Claude's performance degrade with longer contexts?

Yes. All LLMs exhibit context rot. The lost-in-the-middle effect means information in the center of context gets deprioritized. Claude Opus 4.6 made significant progress, scoring 76% on MRCR v2 vs. 18.5% for its predecessor, but degradation still occurs at every length increment. The practical response is context engineering: sending relevant content rather than maximizing volume.

How much code fits in Claude's 200K token context window?

Approximately 50,000-70,000 lines of code. A medium-sized project (200 files, 150 lines each) fits comfortably. A large enterprise codebase (100K+ lines) does not. For large codebases, semantic search tools that find relevant code before context loading produce better results than trying to fit everything in.

What is the pricing for Claude's 1M token context window?

Requests exceeding 200K input tokens are charged at premium rates: 2x input and 1.5x output. For Opus 4.6, that means $10/MTok input and $37.50/MTok output. For Sonnet, $6/MTok input and $22.50/MTok output. Requests under 200K use standard pricing even with the 1M beta enabled. Prompt caching multipliers stack on top of long-context rates.

Related Reading

Better Context, Not More Context

WarpGrep finds relevant code in 3.8 steps (0.73 F1) so your agent fills its context window with signal, not noise. Fast Apply generates code edits at 10,500 tok/s. Both work with Claude, GPT, Gemini, or any LLM.