What Is an LLM Context Window? The Developer's Guide (2026)

An LLM context window is the total number of tokens a model can process in a single request. This guide covers how they work, why they degrade, what the real limits are, and how to manage them in production.

February 28, 2026

An LLM context window is the maximum number of tokens a model can process in one request. It is the model's working memory: your prompt, system instructions, conversation history, tool outputs, and the model's own response all share the same budget. Context windows have grown 20,000x since 2018, from 512 tokens to 10 million. But bigger does not mean better, and the number on a model card is not the number you actually get.

20,000x: context window growth since 2018
10M: largest context window (Llama 4 Scout)
30%+: accuracy drop in the middle of the context
0.75: words per token (approximate)

What Is a Context Window

A context window defines how much text an LLM can see at once. Every piece of information the model processes in a single request must fit within this budget: your prompt, the system instructions that define the model's behavior, any conversation history from previous turns, files or documents you include, tool call results, and the model's generated response.

The window is measured in tokens, not words or characters. When you send a request to an LLM API, the total token count of input plus output cannot exceed the context window. If it does, the request fails or the input gets truncated.
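In code, that constraint is a single comparison. A minimal sketch (the figures mirror a 200K-window model reserving 64K for output; they are illustrative, not any vendor's exact limits):

```python
def fits_in_context(input_tokens: int, reserved_output_tokens: int,
                    context_window: int) -> bool:
    """Input and reserved output draw from the same token budget."""
    return input_tokens + reserved_output_tokens <= context_window

# A 200K window reserving 64K for output leaves 136K for input.
print(fits_in_context(130_000, 64_000, 200_000))  # True: request fits
print(fits_in_context(150_000, 64_000, 200_000))  # False: would be rejected
```

Production code should run this check before every request, since a failed call still costs latency even when it costs no tokens.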

A critical point: the context window is not persistent memory. Every API call starts from scratch. The model does not "remember" previous requests. If you want continuity across turns in a chat, you must re-send the entire conversation history with each request. This is why long sessions accumulate tokens fast and why context compression becomes essential for production systems.

Context window vs memory

A 200K-token context window does not mean the model has 200K tokens of memory. It means the model can process 200K tokens right now, in this single request. Next request, it starts empty again. Conversation history, system prompts, and tool outputs must be re-sent every time.
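A sketch of what that re-sending looks like in application code; `call_model` is a hypothetical stand-in for any chat-completions API:

```python
history: list[dict] = []  # the application, not the model, owns this state

def call_model(messages: list[dict]) -> str:
    # Hypothetical stand-in for a real chat API call.
    return f"reply to: {messages[-1]['content']}"

def send_turn(system_prompt: str, user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    # Every request re-sends the system prompt plus the FULL history.
    messages = [{"role": "system", "content": system_prompt}] + history
    reply = call_model(messages)
    history.append({"role": "assistant", "content": reply})
    return reply
```

Note that `messages` grows every turn: by turn N the request carries all N-1 previous exchanges, which is exactly why token usage compounds in long sessions.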

How Tokenization Works

LLMs never see raw text. Before processing, text is converted into tokens by a tokenizer. Most modern models use Byte-Pair Encoding (BPE), an algorithm that starts with individual characters and iteratively merges the most frequent adjacent pairs until reaching a target vocabulary size, typically 30K to 100K tokens.

The result: common words like "the" become a single token, while rare or compound words get split into multiple tokens. "tokenization" might become ["token", "ization"]. Code tokenizes less efficiently than prose because special characters ({ } => //), indentation, and camelCase variable names produce more tokens per character.
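To make the merge process concrete, here is a toy BPE inference loop over a contrived merge table. Real tokenizers learn tens of thousands of merges from a corpus; these ten are hand-picked so the example stays readable:

```python
def bpe_tokenize(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply learned merges in order, as BPE inference does."""
    tokens = list(word)  # start from individual characters
    for a, b in merges:
        i, out = 0, []
        while i < len(tokens):
            # Merge each adjacent (a, b) pair into a single token.
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Toy merge table (hypothetical; real tables are learned, not hand-written).
merges = [("t", "o"), ("to", "k"), ("tok", "e"), ("toke", "n"),
          ("i", "z"), ("a", "t"), ("iz", "at"), ("izat", "i"),
          ("izati", "o"), ("izatio", "n")]

print(bpe_tokenize("tokenization", merges))  # → ['token', 'ization']
```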

Context Window   Words        Pages of Text    Lines of Code
────────────────────────────────────────────────────────────
4K tokens        ~3,000       ~10 pages        ~2,000-3,000
32K tokens       ~24,000      ~80 pages        ~16,000-22,000
128K tokens      ~96,000      ~300 pages       ~50,000-70,000
1M tokens        ~750,000     ~2,500 pages     ~400,000-550,000
10M tokens       ~7,500,000   ~25,000 pages    ~4-5.5 million

Different models use different tokenizers, so the same text produces different token counts depending on the model. OpenAI's tiktoken, Anthropic's tokenizer, and Google's SentencePiece each split text differently. Always count tokens with the target model's tokenizer, not a generic estimate.

Context Window Sizes in 2026

Context windows have grown 20,000x in eight years. GPT-1 launched with 512 tokens in 2018. By early 2023, most models operated between 4K and 8K; GPT-3.5 pushed to 16K. Then the race accelerated: 128K (GPT-4 Turbo), 200K (Claude 3), 1M (Gemini 1.5 Pro), and now 10M (Llama 4 Scout).

Year         Model            Context Window   Increase
───────────────────────────────────────────────────────
2018         GPT-1            512 tokens       Baseline
2020         GPT-3            2,048 tokens     4x
2023 (Mar)   GPT-4            8,192 tokens     16x
2023 (Nov)   GPT-4 Turbo      128K tokens      250x
2024 (Feb)   Gemini 1.5 Pro   1M tokens        1,950x
2024 (Mar)   Claude 3         200K tokens      390x
2025 (Apr)   Llama 4 Scout    10M tokens       19,500x

For a full breakdown of every model's context window, pricing, and max output tokens, see our LLM Context Window Comparison table.

Budget Tier

DeepSeek V3 (128K), GPT-4.1 Mini (1M), Gemini 2.5 Flash (1M). Best token-per-dollar ratio for high-volume workloads.

Premium Tier

Claude Opus 4.6 (200K, 1M beta), GPT-5.2 (400K), Gemini 2.5 Pro (1M). Strongest reasoning, highest per-token cost.

Maximum Context

Llama 4 Scout (10M), Grok 4 (2M). Largest windows available, but effective context is far smaller than advertised.

Input vs Output: The Shared Budget

The context window is a shared budget. Input tokens (what you send) and output tokens (what the model generates) both count against the same limit. Most models set separate caps for each.

Model              Total Context    Max Output   Effective Input Limit
──────────────────────────────────────────────────────────────────────
GPT-5.2            400K             128K         272K
Claude Opus 4.6    200K (1M beta)   64K          136K (936K beta)
Gemini 2.5 Flash   1M               8K           992K
GPT-4.1            1M               32K          968K
DeepSeek R1        128K             64K          64K

This distinction matters for different workloads. A model with 1M context but 8K max output (Gemini 2.5 Flash) can ingest an entire codebase but generates short responses. A model with 400K context and 128K max output (GPT-5.2) generates longer responses, which matters for multi-file code edits or long-form content.
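The effective input limit is just the window minus the output cap; a one-line helper makes the arithmetic explicit (figures match the rows above):

```python
def effective_input_limit(context_window: int, max_output: int) -> int:
    """Tokens left for input once the full output allowance is reserved."""
    return context_window - max_output

print(effective_input_limit(400_000, 128_000))  # 272000 (GPT-5.2)
print(effective_input_limit(1_000_000, 8_000))  # 992000 (Gemini 2.5 Flash)
```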

Why max output matters for coding agents

A coding agent doing a multi-file refactor might need to output 500+ lines of code across several files in one turn. If the model's max output is 8K tokens (~200-300 lines), it must split the work across multiple turns, each carrying the full conversation history. More turns means more accumulated context means more context rot.

The Lost-in-the-Middle Problem

Having a large context window means nothing if the model cannot use it uniformly. Liu et al.'s research (Stanford, published in TACL 2024) demonstrated that LLM performance follows a U-shaped curve across the context. Models attend strongly to the beginning and end of the input but drop 30%+ in accuracy on information placed in the middle.

Position               Accuracy   What Happens
──────────────────────────────────────────────────────────────────────────────
Start (Position 1)     ~75%       Primacy bias: strong attention to early tokens
Middle (Position 10)   ~55%       Blind spot: model loses track of middle content
End (Position 20)      ~72%       Recency bias: strong attention to late tokens

The root cause is architectural. Rotary Position Embedding (RoPE), used by most modern LLMs, applies a long-term decay that weakens attention to distant tokens; combined with the model's recency bias, this leaves middle positions under-attended. This is not a bug that will be patched. It is a structural property of the attention mechanism.

For practical use, this means: if your most important information lands in the middle third of a long prompt, the model is significantly less likely to use it correctly. This is especially damaging for coding agents, where the relevant file is often discovered mid-search and sits in the middle of accumulated context. Read more in our deep dive on the lost-in-the-middle effect.
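One practical mitigation is to order retrieved chunks so the highest-relevance material lands at the edges of the prompt. This is a simple heuristic sketch, not a technique from the paper; the relevance scores are assumed to come from your retriever:

```python
def order_for_attention(chunks: list[str], scores: list[float]) -> list[str]:
    """Place top-scoring chunks at the start and end of the prompt,
    pushing the least relevant material into the middle blind spot."""
    ranked = [c for _, c in sorted(zip(scores, chunks), reverse=True)]
    front, back = ranked[0::2], ranked[1::2]
    return front + back[::-1]  # best first, second-best last, worst mid

print(order_for_attention(["a", "b", "c", "d", "e"], [5, 4, 3, 2, 1]))
# → ['a', 'c', 'e', 'd', 'b']: "a" and "b" sit at the well-attended edges
```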

Context Window Overhead

The advertised context window is not all yours. Several sources of overhead consume tokens before your actual content:

Where your context budget actually goes

Total context window:                    200,000 tokens
─────────────────────────────────────────────────────
System prompt (behavior, tools, format):  -3,000 tokens
Conversation history (15 turns):         -25,000 tokens
Tool call results (file reads, searches): -12,000 tokens
Reserve for model output:                -64,000 tokens
─────────────────────────────────────────────────────
Available for your actual question:       96,000 tokens

System prompts define the model's behavior, available tools, output formatting, and constraints. In production, these run 2,000-5,000 tokens. They repeat with every API call in a multi-turn conversation.

Conversation history grows linearly with each turn. By turn 15 of a chat, you could have 25,000-40,000 tokens of history before the user asks anything new.

Tool outputs from coding agents are the biggest consumer. Reading files, running searches, and executing commands inject thousands of tokens per action. A single file read might add 500-2,000 tokens. Ten file reads across a debugging session adds 5,000-20,000 tokens that stay in the window permanently.
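Production systems therefore cap history before each call. A naive sliding-window trim, using a chars/4 token approximation (a stated assumption; real code should count with the model's tokenizer), looks like this:

```python
def approx_tokens(turn: dict) -> int:
    # Rough chars/4 heuristic; replace with the real tokenizer in production.
    return max(1, len(turn["content"]) // 4)

def trim_history(turns: list[dict], budget: int) -> list[dict]:
    """Keep the most recent turns that fit within the token budget."""
    kept, used = [], 0
    for turn in reversed(turns):  # walk backwards from the newest turn
        cost = approx_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return kept[::-1]  # restore chronological order
```

Dropping the oldest turns first is the bluntest strategy; the compression and isolation approaches discussed later preserve more signal per token.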

2-5K: typical system prompt (tokens)
25-40K: history after 15 turns
60%+: agent time spent retrieving context
10x: token variance between agents on the same task

Why Bigger Windows Don't Solve the Problem

The intuitive response to context limitations is: make the window bigger. The research says this does not work.

Performance Degrades at Every Length

Chroma's research tested 18 frontier models and found that every model degrades as context grows. Not just near the limit. At every increment. A model with 1M tokens of capacity still shows context rot at 50K.

RULER Benchmark Reality Check

The RULER benchmark tests retrieval accuracy at increasing context lengths. The gap between advertised context and effective context is often enormous:

Model            Claimed Context   Score @ 4K   Score @ 128K   Drop
───────────────────────────────────────────────────────────────────
Gemini 1.5 Pro   1M                96.7         94.4           -2.3 pts
GPT-4-1106       128K              96.6         81.2           -15.4 pts
Llama 3.1-70B    128K              96.5         66.6           -29.9 pts
Mixtral-8x22B    64K               95.6         31.7           -63.9 pts

Mixtral-8x22B, despite advertising a 64K window, produces near-random results at 128K. You are not using the same model at every context length. The model at 128K is measurably worse than the model at 4K. For more data, see our full context window comparison.

Compression Beats Expansion

CompLLM research demonstrated that 2x compressed context surpasses uncompressed performance on long sequences. The mechanism is simple: removing noise improves signal-to-noise ratio. The retrieve-then-solve approach improved Mistral from 35.5% to 66.7% accuracy by selecting relevant context instead of sending everything. Less input, better output.
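A minimal retrieve-then-solve sketch; word-overlap scoring here is a stand-in for a real embedding retriever, and the prompt format is illustrative:

```python
def retrieve_then_solve(question: str, chunks: list[str], k: int = 2) -> str:
    """Score chunks by keyword overlap with the question and prompt the
    model with only the top-k instead of sending everything."""
    q_words = set(question.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    context = "\n\n".join(scored[:k])
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The point is the selection step, not the scoring function: whatever retriever you use, sending the top-k relevant chunks instead of the full corpus is what lifts the signal-to-noise ratio.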

Context Windows for Coding Agents

Coding agents stress context windows harder than any other use case. A typical agentic coding session accumulates context like sediment:

Context accumulation in a coding agent session

Turn 1:  Read issue, system prompt               →   3,500 tokens
Turn 2:  Grep for relevant files, read 4 matches →  +8,000 tokens
Turn 3:  Need more context, read 3 related files →  +6,000 tokens
Turn 4:  Backtrack, read test files for patterns →  +5,000 tokens
Turn 5:  Found the right file, but now carrying  →  22,500 tokens
         ↑ 80% of this is irrelevant search debris

Cognition measured this directly: agents spend over 60% of their first turn retrieving context, not reasoning or editing. An OpenReview study found that some agents consume 10x more tokens than others on equivalent tasks. The variance was driven by search efficiency, not coding ability.

The longer a coding session runs, the worse it gets. Research on long-running agents shows that every agent's success rate decreases after 35 minutes, and doubling task duration quadruples the failure rate. The cause is accumulated context noise.

For more on how the Claude Code context window handles this, and how context engineering applies to agentic workflows, see our agentic context engineering guide.

Managing Context in Production

The solution to context window limitations is not bigger windows. It is better context management. Three strategies work:

Context Compression

Reduce token count by 50-70% before sending to the model. Morph Compact keeps every surviving sentence verbatim at 98% accuracy and 3,300+ tok/s. Less noise means better output.

Context Isolation

Delegate search to subagents running in their own context windows. The main model never sees exploration dead-ends. Anthropic's multi-agent system improved performance by 90% using this approach.

Selective Retrieval

Send only relevant code to the model, not entire files. WarpGrep returns precise file and line ranges, reducing context rot by 70% while speeding up task completion by 40%.
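The idea can be sketched with plain regex search over a file: return small line ranges around matches rather than whole files. This is a generic grep-style illustration, not WarpGrep's actual API:

```python
import re

def match_line_ranges(source: str, pattern: str, margin: int = 2):
    """Return 1-indexed (start, end) line ranges around each match,
    so the model sees a few relevant lines, not the whole file."""
    lines = source.splitlines()
    ranges = []
    for i, line in enumerate(lines, start=1):
        if re.search(pattern, line):
            ranges.append((max(1, i - margin), min(len(lines), i + margin)))
    return ranges
```

A 2,000-token file read becomes a 50-token snippet when only the matched range is injected into context.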

Hidden Pricing Cliff at 200K Tokens

Beyond quality, context management has a direct cost impact. Both Anthropic and Google charge 2x on input tokens when any part of a request exceeds 200K tokens. This surcharge applies to every token in the request, not just the overage. Crossing 200K by one token doubles your input cost for the entire request.

OpenAI charges no surcharge at any length. For workloads that consistently exceed 200K tokens, this is a meaningful cost difference. At 1 billion tokens per month, the gap between DeepSeek V3 ($140) and Claude Opus 4.6 ($5,000+) is 35x before surcharges. With surcharges on long-context requests, Opus can reach $10,000. See our full pricing breakdown.
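The cliff is easy to model. The $10/Mtok base rate below is a hypothetical figure; the 200K threshold and 2x multiplier follow the surcharge behavior described above:

```python
def input_cost(input_tokens: int, rate_per_mtok: float,
               threshold: int = 200_000, multiplier: float = 2.0) -> float:
    """Once a request crosses the threshold, the multiplied rate
    applies to EVERY input token, not just the overage."""
    rate = rate_per_mtok * (multiplier if input_tokens > threshold else 1.0)
    return input_tokens / 1_000_000 * rate

print(input_cost(200_000, 10.0))  # 2.0  -> $2.00 for the request
print(input_cost(200_001, 10.0))  # ~4.0 -> one extra token doubles the bill
```

This is why compacting a request from 210K down to 190K tokens saves more than the 20K-token difference suggests: it also drops the whole request back to the base rate.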

Morph Compact for Context Reduction

Morph Compact reduces context by 50-70% while preserving every surviving sentence word-for-word. No paraphrasing, no summarization, no hallucination risk. At 3,300+ tokens per second with 98% verbatim accuracy, it pays for itself by cutting the token cost on every subsequent request in a session. For teams spending $3,000-5,000 per billion tokens on premium models, compacting context first can cut that to $1,000-2,500 while improving output quality.

50-70%: context reduction (Compact)
98%: verbatim accuracy
3,300+: tokens per second
70%: context rot reduction (WarpGrep)

Frequently Asked Questions

What is an LLM context window?

An LLM context window is the maximum number of tokens a model can process in a single request. It functions as working memory: your prompt, system instructions, conversation history, and the model's response all share the same budget. As of 2026, context windows range from 128K tokens (DeepSeek, Mistral) to 10 million (Llama 4 Scout).

How many words fit in a context window?

One token is roughly 0.75 English words, or about 4 characters. A 128K token context window holds about 96,000 words (300 pages of text) or 50,000-70,000 lines of code. A 1 million token window holds roughly 750,000 words. Code tokenizes less efficiently than prose because of special characters and indentation.

What is the difference between context window and max output tokens?

The context window is the total token budget shared between input and output. Max output tokens is the ceiling on how much the model can generate. GPT-5.2 has a 400K context window with 128K max output, meaning input is capped at 272K tokens. A model with large context but small max output can read a lot but writes short responses per turn.

Why do LLMs perform worse with longer context?

Three factors cause degradation. The lost-in-the-middle effect means models attend poorly to information in the middle of the input (30%+ accuracy drop). Attention dilution means the number of pairwise token relationships grows quadratically with length, spreading attention thinner over each one. And semantically similar distractors interfere with retrieval. Chroma tested 18 frontier models and every one degraded.

What is the lost-in-the-middle problem?

Research by Liu et al. (Stanford, TACL 2024) showed that LLM accuracy follows a U-shaped curve. Models attend strongly to tokens at the beginning and end but drop 30%+ on information in the middle of the context. For coding agents, this means relevant code found mid-search may sit in the model's blind spot.

Is a bigger context window always better?

No. Performance degrades at every context length increment, not just near the limit. RULER benchmarks show GPT-4-1106 drops 15 points from 4K to 128K. CompLLM research showed 2x compressed context surpasses uncompressed performance on long sequences because removing noise improves signal quality.

How do tokens work in LLMs?

Tokens are sub-word units created by a tokenizer, typically using Byte-Pair Encoding (BPE). BPE starts with individual characters and merges the most frequent adjacent pairs until reaching a vocabulary of 30K-100K tokens. Different models use different tokenizers, so the same text produces different token counts. Always count tokens with the target model's specific tokenizer.

How can I reduce context window usage without losing quality?

Morph Compact reduces context by 50-70% while keeping every surviving sentence verbatim (98% accuracy, 3,300+ tok/s). Research shows compressed context can improve output quality by reducing noise. Subagent architectures that isolate search into separate context windows reduce context rot by 70%.

Use Less Context. Get Better Results.

Morph Compact reduces context by 50-70% while keeping every surviving sentence verbatim. Cut your token costs, stay under pricing surcharge thresholds, and improve output quality at the same time.