Claude's context window is 200,000 tokens. Opus 4.6 and Sonnet 4.6 support 1 million tokens in beta. Those are the specs. The more useful question: how much of that context does the model actually use well? Research from Chroma, NVIDIA, and Anthropic's own benchmarks tells a more nuanced story about context window size vs. effective context quality.
Every Claude Model's Context Window
Every current Claude model has a 200K token context window. The differences are in max output tokens, pricing, and whether the 1M beta is available.
| Model | Context Window | Max Output | Input $/MTok | Output $/MTok |
|---|---|---|---|---|
| Claude Opus 4.6 | 200K / 1M (beta) | 128K | $5 | $25 |
| Claude Sonnet 4.6 | 200K / 1M (beta) | 64K | $3 | $15 |
| Claude Haiku 4.5 | 200K | 64K | $1 | $5 |
| Claude Sonnet 4.5 | 200K / 1M (beta) | 64K | $3 | $15 |
| Claude Opus 4.5 | 200K | 64K | $5 | $25 |
| Claude Sonnet 4 | 200K / 1M (beta) | 64K | $3 | $15 |
The 1M token context window requires the context-1m-2025-08-07 beta header and is restricted to organizations in usage tier 4 or with custom rate limits. Requests exceeding 200K input tokens are charged at premium rates: 2x input and 1.5x output pricing.
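Enabling the beta comes down to adding one HTTP header. A minimal sketch, where only the header names and the `context-1m-2025-08-07` value come from Anthropic's documentation; everything else is illustrative:

```python
def build_headers(api_key: str, long_context: bool = False) -> dict:
    """Build request headers for the Anthropic Messages API.

    The anthropic-beta header opts the request into the 1M token
    context window (usage tier 4 or custom rate limits required).
    """
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    if long_context:
        headers["anthropic-beta"] = "context-1m-2025-08-07"
    return headers
```

With the official Python SDK, the equivalent is passing `betas=["context-1m-2025-08-07"]` to `client.beta.messages.create(...)`.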
200K tokens in practice
200K tokens is roughly 150,000 words, 500 pages of text, or 680,000 Unicode characters. For code, that translates to approximately 50,000-70,000 lines depending on language density. Opus 4.6's 128K max output is double the previous 64K limit, enabling longer responses and larger thinking budgets.
Context Window vs Effective Context
A context window is how many tokens you can send. Effective context is how many of those tokens the model uses reliably. They are not the same number.
Chroma published the most comprehensive study of this gap in July 2025, testing 18 models across controlled experiments. The headline finding: performance degrades at every context length increment, not just near the limit. A model with a 1M token window still exhibits measurable degradation at 50K tokens.
The Lost-in-the-Middle Problem
Research from Stanford and others established the U-shaped attention curve. LLMs attend most strongly to tokens at the beginning and end of the context. Information in the middle gets deprioritized. This pattern holds across all transformer-based architectures.
Chroma's study added a counterintuitive finding: all 18 models tested performed better on shuffled, incoherent contexts than on logically structured ones. The attention mechanism behaves differently with structural patterns, suggesting that coherent documents may actually make retrieval harder, not easier.
NVIDIA's RULER benchmark puts effective context at 50-65% of advertised capacity for most models. A model advertising 200K tokens typically becomes unreliable around 130K. Claude Opus 4.6 is a notable exception: its 76% score on MRCR v2 (finding 8 needles across 1M tokens) represents what Anthropic calls "a qualitative shift" in usable context, up from 18.5% for its predecessor.
Context rot is not a bug, it is physics
Quadratic attention scaling means the number of pairwise relationships the model must track grows with the square of the token count. Semantically similar distractors interfere with retrieval. And positional encoding degrades at distances not seen during training. No architecture change eliminates these effects. The question is which models degrade slowest.
How Claude Compares to GPT-4.1, Gemini, and Others
Claude's 200K default context window is the smallest among frontier models. GPT-4.1 ships with 1M. Gemini 2.5 Pro has 1M with 2M coming. Llama 4 Scout claims 10M. On raw numbers, Claude loses.
On effective utilization, the picture is different. Chroma's research found that Claude models "decay the slowest overall," while GPT models were "more erratic with random mistakes and outright refusals," and Gemini "starts to mess up earlier with wild variations."
| Model | Context Window | Max Output | Effective Context Notes |
|---|---|---|---|
| Claude Opus 4.6 | 200K / 1M beta | 128K | 76% MRCR v2 at 1M; slowest decay in Chroma study |
| Claude Sonnet 4.6 | 200K / 1M beta | 64K | <5% accuracy degradation across full 200K range |
| GPT-4.1 | 1M | 32K | 100% needle accuracy at 900K; erratic on complex tasks |
| GPT-4o | 128K | 16K | Previous generation, smaller window |
| Gemini 2.5 Pro | 1M (2M soon) | 64K | 99.7% recall at 1M; early degradation on reasoning |
| Gemini 3 Pro | 1M | 64K | 26.3% at 1M on own eval card |
| DeepSeek V3.1 | 128K | 16K | Strong at shorter contexts, untested at length |
| Llama 4 Scout | 10M | varies | Largest window; limited independent benchmarks |
| Mistral Large 3 | 256K | varies | Mid-range window, competitive pricing |
GPT-4.1 claims 100% needle accuracy at 900K tokens, but needle-in-a-haystack tests measure retrieval of a single planted fact, not reasoning over distributed information. Chroma's study found GPT models had the highest hallucination rates when distractors were present and a 2.55% task refusal rate. Gemini 2.5 Pro reports 99.7% recall at 1M tokens on Google's own benchmark, but Gemini 3 Pro dropped to 26.3% at 1M on its own model evaluation card.
The numbers reveal a pattern: models perform well on the benchmarks their makers design. Independent evaluations like Chroma's paint a less optimistic picture for every model.
Prompt Caching: 90% Cost Reduction
Prompt caching is Anthropic's mechanism for reusing processed context across API calls. Instead of reprocessing the same system prompt, documents, or conversation history on every request, the API reads from cache at a fraction of the cost.
The economics are straightforward. A cache write costs 1.25x base input price for a 5-minute TTL, or 2x for a 1-hour TTL. Cache reads cost 0.1x base input price. For Sonnet 4.6 at $3/MTok input, that means a cache read costs $0.30/MTok, 10x cheaper than reprocessing. The break-even is two API calls for 5-minute caching.
| Operation | Multiplier | Sonnet 4.6 Cost | Opus 4.6 Cost |
|---|---|---|---|
| Standard input | 1x | $3/MTok | $5/MTok |
| 5-min cache write | 1.25x | $3.75/MTok | $6.25/MTok |
| 1-hour cache write | 2x | $6/MTok | $10/MTok |
| Cache read (hit) | 0.1x | $0.30/MTok | $0.50/MTok |
For a concrete example: a 100K token system prompt with documents drops from 11.5 seconds to 2.4 seconds with caching. An application making 100 API calls per hour with a shared 50K token system prompt saves roughly $13.30/hour on Sonnet 4.6 versus reprocessing.
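The arithmetic follows directly from the multiplier table. A rough sketch, using the published per-MTok rates; the 100-call scenario is illustrative:

```python
def caching_cost(calls: int, prompt_mtok: float, base_price: float,
                 write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Cost of `calls` requests sharing one cached prompt:
    one cache write, then cache reads for the rest."""
    write = prompt_mtok * base_price * write_mult
    reads = (calls - 1) * prompt_mtok * base_price * read_mult
    return write + reads

def uncached_cost(calls: int, prompt_mtok: float, base_price: float) -> float:
    """Cost of reprocessing the full prompt on every call."""
    return calls * prompt_mtok * base_price

# Sonnet 4.6: 100 calls/hour sharing a 50K token (0.05 MTok) prompt
saved = uncached_cost(100, 0.05, 3.0) - caching_cost(100, 0.05, 3.0)
# uncached $15.00 vs cached ~$1.67, saving ~$13.33/hour

# Break-even check: caching loses on a single call, wins from two calls on
assert caching_cost(1, 0.05, 3.0) > uncached_cost(1, 0.05, 3.0)
assert caching_cost(2, 0.05, 3.0) < uncached_cost(2, 0.05, 3.0)
```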
Prompt caching is context engineering
Caching does not just reduce cost. It changes how you architect context. When repeated context is nearly free, you can include more reference material, longer system prompts, and richer examples without the cost penalty. The effective context budget shifts from "what can I afford to send" to "what should the model see."
Extended Thinking and Context
Extended thinking lets Claude reason through problems before responding. The thinking tokens are billed as output but count toward the context window during the current turn. The critical detail: previous thinking blocks are automatically stripped from context in subsequent turns.
This means a multi-turn conversation where Claude thinks for 20K tokens per turn does not accumulate 100K tokens of thinking across 5 turns. Each turn's thinking is discarded before the next, so reasoning does not eat your context window over time.
Thinking Tokens Are Ephemeral
Previous thinking blocks are stripped automatically. A 10-turn conversation with 20K thinking tokens per turn uses the same context as one without thinking, plus the current turn's allocation.
Context Awareness
Sonnet 4.6 and Haiku 4.5 receive live token budget updates after each tool call. The model knows exactly how much context remains and paces itself accordingly.
Opus 4.6 supports 128K max output tokens, double the previous 64K limit. This gives extended thinking more room to reason through complex problems without hitting output limits. The thinking budget is set via the budget_tokens parameter, and with adaptive thinking, Claude dynamically decides how much thinking each request needs.
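As a sketch, a request with extended thinking enabled looks like this. The `thinking` block shape and `budget_tokens` parameter follow Anthropic's API; the model id and budget value are placeholder examples:

```python
def thinking_request(prompt: str, budget_tokens: int = 16000) -> dict:
    """Messages API request body with extended thinking enabled.

    budget_tokens caps how many tokens Claude may spend reasoning.
    Thinking tokens bill as output and count toward context for the
    current turn only; max_tokens must exceed the thinking budget so
    there is room left for the visible response.
    """
    return {
        "model": "claude-opus-4-6",  # illustrative model id
        "max_tokens": budget_tokens + 8000,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }
```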
How Much Code Fits in 200K Tokens
Token-to-code ratios vary by language. Python and JavaScript average about 2.5-3.5 tokens per line of code. Verbose languages like Java run higher. Comments and docstrings add overhead. Here are practical estimates:
| Context Size | Lines of Code (est.) | Equivalent Project |
|---|---|---|
| 32K tokens | ~10,000-13,000 lines | Single large module or small library |
| 128K tokens | ~40,000-50,000 lines | Medium application or framework component |
| 200K tokens | ~50,000-70,000 lines | Full medium-sized codebase |
| 1M tokens | ~250,000-350,000 lines | Large monorepo or enterprise application |
A typical React project with 200 files averaging 150 lines (30,000 total lines) fits comfortably in 200K tokens. A 100,000+ line enterprise codebase does not. But fitting everything into context is usually the wrong approach.
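A back-of-the-envelope check, using the rough per-line token ratios above (assumed averages, not measurements):

```python
# Rough tokens-per-line averages; real ratios vary with style and comments
TOKENS_PER_LINE = {"python": 3.0, "javascript": 3.0, "java": 4.5}

def estimated_tokens(lines: int, language: str = "python") -> int:
    """Estimate how many context tokens a codebase consumes."""
    return int(lines * TOKENS_PER_LINE.get(language, 3.5))

def fits(lines: int, window: int = 200_000, language: str = "python") -> bool:
    """Does the estimate fit in the given context window?"""
    return estimated_tokens(lines, language) <= window

# 200 files x 150 lines (~90K tokens) fits; a 100K-line codebase does not
assert fits(200 * 150)
assert not fits(100_000)
```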
Cognition (makers of Devin) measured that their AI coding agent spent 60% of its time searching for code before making changes. The bottleneck was not context window size. It was finding the right code to put in context.
Context quality over context quantity
Putting 200K tokens of code into context when you only need 5K tokens of relevant code produces worse results than sending just the 5K. Context rot means more input tokens degrade output quality. The goal is not to fill the window. It is to fill it with the right content.
Context Engineering > Context Size
The context window arms race, from 4K to 200K to 1M to 10M, misses the point. Chroma's research demonstrated that performance degrades at every length increment across all 18 models tested. Bigger windows give you capacity, not quality.
Context engineering is the practice of curating what goes into the window: selecting relevant information, removing noise, structuring content for attention patterns, and managing token budgets across multi-turn interactions. Anthropic's own documentation states that Claude "achieves state-of-the-art results on long-context retrieval benchmarks, but these gains depend on what's in context, not just how much fits."
Practical strategies
Semantic Search Before Context Loading
Find the relevant code first, then load it into context. Tools like WarpGrep use RL-trained search to find relevant code in 3.8 steps (0.73 F1) vs. 12.4 steps for baseline approaches, reducing both latency and unnecessary context consumption.
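WarpGrep's RL-trained search is not reproducible in a few lines, but the shape of the technique, ranking candidates for relevance and stopping at a token budget, can be sketched with a naive keyword scorer. Everything here is illustrative:

```python
def select_context(query: str, files: dict, budget_tokens: int = 5000) -> list:
    """Rank files by keyword overlap with the query, then load the
    top matches until a token budget is reached. A stand-in for a
    real semantic search tool."""
    q = set(query.lower().split())
    ranked = sorted(files.items(),
                    key=lambda kv: -len(q & set(kv[1].lower().split())))
    picked, used = [], 0
    for path, text in ranked:
        cost = len(text.split()) * 4 // 3  # rough tokens ~= words * 4/3
        if used + cost > budget_tokens:
            break
        picked.append(path)
        used += cost
    return picked
```

The budget cutoff is the point: the agent loads only what scores highest, not the whole repository.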
Prompt Caching for Stable Context
Cache system prompts, documentation, and reference code. Pay the write cost once, then read at 10% of input price. Keeps your effective context budget high without the cost of reprocessing.
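In the Messages API, cacheable context is marked with a `cache_control` breakpoint. The block shape below follows Anthropic's documented format; the model id and text are placeholders:

```python
def cached_system_request(system_text: str, user_text: str) -> dict:
    """Request body with the system prompt marked for caching.

    Everything up to the cache_control breakpoint is written to
    cache once, then read at 0.1x input price on subsequent calls.
    """
    return {
        "model": "claude-sonnet-4-6",  # illustrative model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_text,
                "cache_control": {"type": "ephemeral"},  # 5-minute TTL
            }
        ],
        "messages": [{"role": "user", "content": user_text}],
    }
```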
Efficient Code Edits
Code editing is one of the highest-token operations in agent workflows. Fast Apply generates edits at 10,500 tok/s with 98% accuracy, reducing the output token cost of applying changes to files.
Subagent Isolation
Delegate search, testing, and analysis to subagents with isolated context windows. The main agent receives only the summary, keeping its context clean. Up to 10 concurrent subagents give you an effective 2M tokens across isolated windows.
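A minimal sketch of the fan-out pattern, with `run_subagent` as a stand-in for an API call made in its own isolated context:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task: str) -> str:
    # Placeholder: a real subagent would call the API with its own
    # context window and return only a short summary of its findings.
    return f"summary({task})"

def delegate(tasks: list, max_workers: int = 10) -> list:
    """Fan tasks out to up to 10 concurrent subagents. The orchestrator
    only ever sees the summaries, so the raw search/test/analysis
    context never enters the main agent's window."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_subagent, tasks))
```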
The models with the biggest context windows are not always the best choice. A 200K window filled with precisely relevant code outperforms a 1M window filled with an entire repository. The tooling that selects what goes into context determines outcomes more than the window itself.
Frequently Asked Questions
What is Claude's context window size?
All current Claude models (Opus 4.6, Sonnet 4.6, Haiku 4.5) have a 200,000 token context window. Opus 4.6, Sonnet 4.6, Sonnet 4.5, and Sonnet 4 also support 1M tokens in beta, available to organizations in usage tier 4. The 200K window holds roughly 150,000 words or 500 pages of text.
How does Claude's context window compare to GPT-4.1 and Gemini?
GPT-4.1 has a 1M token window by default. Gemini 2.5 Pro has 1M with 2M coming. Claude's default is 200K, with 1M in beta. Raw size favors competitors, but Chroma's research found Claude models decay the slowest with increasing context length, while GPT models showed more erratic behavior and Gemini degraded earlier on complex tasks.
What is prompt caching and how much does it save?
Prompt caching lets you mark context for reuse across API calls. Cache reads cost 10% of standard input price, saving up to 90% on costs and 85% on latency. A 100K token prompt drops from 11.5s to 2.4s with caching. The break-even is two API calls within the cache's 5-minute TTL, so any frequently reused context pays for itself almost immediately.
Does Claude's performance degrade with longer contexts?
Yes. All LLMs exhibit context rot. The lost-in-the-middle effect means information in the center of context gets deprioritized. Claude Opus 4.6 made significant progress, scoring 76% on MRCR v2 vs. 18.5% for its predecessor, but degradation still occurs at every length increment. The practical response is context engineering: sending relevant content rather than maximizing volume.
How much code fits in Claude's 200K token context window?
Approximately 50,000-70,000 lines of code. A medium-sized project (200 files, 150 lines each) fits comfortably. A large enterprise codebase (100K+ lines) does not. For large codebases, semantic search tools that find relevant code before context loading produce better results than trying to fit everything in.
What is the pricing for Claude's 1M token context window?
Requests exceeding 200K input tokens are charged at premium rates: 2x input and 1.5x output. For Opus 4.6, that means $10/MTok input and $37.50/MTok output. For Sonnet, $6/MTok input and $22.50/MTok output. Requests under 200K use standard pricing even with the 1M beta enabled. Prompt caching multipliers stack on top of long-context rates.
Related Reading
Better Context, Not More Context
WarpGrep finds relevant code in 3.8 steps (0.73 F1) so your agent fills its context window with signal, not noise. Fast Apply generates code edits at 10,500 tok/s. Both work with Claude, GPT, Gemini, or any LLM.