LLM Token Limits: Every Model's Context Window, Compared (Feb 2026)

Complete comparison of LLM token limits for GPT-5, Claude Opus 4.6, Gemini 3 Pro, Llama 4, and more. Context window sizes, max output tokens, pricing per million tokens, and the hidden surcharges that kick in at 200K tokens.

February 27, 2026 · 2 min read

Every LLM has a token limit. That limit determines how much text, code, and conversation history the model can process in a single request. This page is the definitive reference for every major model's context window, max output tokens, pricing, and the hidden surcharges most comparison tables leave out.

  • 10M: largest context window (Llama 4 Scout)
  • 2-8x: more tokens for CJK vs English
  • 2x: surcharge above 200K tokens (Anthropic/Google)
  • ~30%: accuracy drop in the middle of long contexts

Every LLM's Token Limit (February 2026)

The table below covers every major LLM available through API as of February 2026. Context window is the total input capacity. Max output is the longest response the model can generate in a single call. Pricing is per million tokens.

| Model | Context Window | Max Output | Input $/M | Output $/M |
|---|---|---|---|---|
| GPT-5.2 (OpenAI) | 400K | 128K | $1.75 | $14.00 |
| GPT-5 (OpenAI) | 400K | 128K | $1.25 | $10.00 |
| GPT-5 nano (OpenAI) | 400K | 128K | $0.05 | $0.40 |
| o3 (OpenAI) | 200K | 100K | $0.40 | $1.60 |
| Claude Opus 4.6 (Anthropic) | 200K (1M beta) | 64K | $5.00 | $25.00 |
| Claude Sonnet 4.6 (Anthropic) | 200K (1M beta) | 64K | $3.00 | $15.00 |
| Gemini 2.5 Pro (Google) | 1M | 64K | $1.25 | $10.00 |
| Gemini 3 Pro (Google) | 2M | - | - | - |
| Grok 3 (xAI) | 131K | - | $3.00 | $15.00 |
| Llama 4 Scout (Meta) | 10M | - | Free | - |
| Llama 4 Maverick (Meta) | 1M | - | Free | - |
| DeepSeek R1 (DeepSeek) | 128K | 64K | $0.55 | $2.19 |
| Codestral (Mistral) | 256K | - | $0.30 | $0.90 |

Context window != usable context

A model advertising 200K tokens does not mean it performs well at 200K tokens. Research consistently shows performance degradation well before the stated limit. Models claiming 200K context degrade noticeably around 130K tokens. The stated window is a ceiling, not a performance guarantee.

A few things stand out. OpenAI has the most consistent max output across models (128K). Anthropic and Google offer the largest context windows from commercial providers, but both apply significant pricing surcharges above 200K tokens. Meta's Llama 4 models have the largest raw context windows (1M-10M) but are primarily for self-hosted deployments. DeepSeek R1 and Codestral offer the lowest per-token pricing for API access.

Context Window vs. Max Output: Why Both Matter

The context window is the total number of tokens the model can process in a single request, including both your input and its output. The max output limit caps how long the model's response can be. These are separate constraints, and both can block you.

GPT-5 has a 400K context window but a 128K max output. If you send 380K tokens of input, the model can only generate a 20K token response (400K - 380K). If you need a long output, you must leave room in the context window for it. Claude Opus 4.6 has a 200K context window with 64K max output, so sending more than 136K input tokens means the model's response gets cut short even if there's technically room in the window.
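
The arithmetic above can be sketched as a small helper. The window and output figures come from the comparison table; the function name is illustrative:

```python
def available_output(context_window: int, max_output: int, input_tokens: int) -> int:
    """Tokens the model can actually generate for a given input size."""
    remaining = context_window - input_tokens
    return max(0, min(max_output, remaining))

# GPT-5: 400K window, 128K max output
print(available_output(400_000, 128_000, 380_000))  # 20000: the window is the constraint
print(available_output(400_000, 128_000, 100_000))  # 128000: max output is the constraint

# Claude Opus 4.6: 200K window, 64K max output
print(available_output(200_000, 64_000, 150_000))   # 50000: response cut short of 64K
```

Whichever limit binds first determines your real output budget, which is why both numbers belong in capacity planning.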

For coding agents and chat applications, the max output limit rarely matters since responses are typically under 4K tokens. But for code generation, document writing, and batch processing tasks, max output becomes the binding constraint. Plan your input budget accordingly.

Open-Weight Models: Context at Scale

Meta's Llama 4 Scout supports 10M tokens, the largest context window of any model in this comparison. But there's a catch: you need the infrastructure to run it. Open-weight models with huge context windows require significant GPU memory. The 10M token window is a theoretical maximum that depends on your hardware provisioning.

For self-hosted deployments, the practical context limit is determined by your available GPU memory, not the model's architecture. A model that supports 10M tokens on paper might only handle 500K on your specific hardware configuration. Always benchmark with your actual deployment before committing to a context window size.
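
A rough way to see why GPU memory, not architecture, sets the ceiling: KV-cache size grows linearly with context length. The sketch below uses the standard two-tensors-per-layer-per-token formula with a hypothetical 70B-class configuration (80 layers, 8 KV heads under GQA, head dim 128, fp16); real model configs vary:

```python
def kv_cache_gib(seq_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, dtype_bytes: int = 2) -> float:
    """Rough KV-cache size: two tensors (K and V) per layer, per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes
    return total_bytes / (1024 ** 3)

# Hypothetical 70B-class config: 80 layers, 8 KV heads (GQA), head dim 128, fp16
print(f"{kv_cache_gib(500_000, 80, 8, 128):.1f} GiB")     # 152.6 GiB at 500K tokens
print(f"{kv_cache_gib(10_000_000, 80, 8, 128):.0f} GiB")  # ~3052 GiB at the full 10M
```

Even with aggressive grouped-query attention, a 10M-token cache for a large model spans multiple GPU nodes, which is why the paper limit and the practical limit diverge.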

How Tokenization Works

Tokens are not words. A token is a chunk of text that the model processes as a single unit. All major LLMs use some variant of Byte Pair Encoding (BPE), an algorithm that builds a vocabulary of common character sequences from a training corpus. Common words become single tokens. Rare words get split into multiple tokens.

| Provider | Tokenizer | Vocab Size | Notes |
|---|---|---|---|
| OpenAI (GPT-5) | o200k_base | 200K | Most efficient for English; successor to cl100k_base |
| Anthropic (Claude) | Proprietary BPE | - | Optimized for code and multilingual text |
| Meta (Llama 4) | SentencePiece BPE | 128K | Open-source tokenizer with broad language coverage |
| Google (Gemini) | SentencePiece | 256K | Large vocab for multilingual efficiency |
| Mistral | SentencePiece BPE | 32K | Compact vocabulary, fast tokenization |

Token-to-Text Ratios

The relationship between tokens and human-readable text varies by language and content type. These ratios determine how much actual content fits within a given token limit.

  • ~4 characters per token (English)
  • ~0.7 words per token (English)
  • 1.5-2x more tokens for code vs prose
  • 2-8x more tokens for CJK languages

In practice, this means GPT-5's 400K token window holds roughly 280K English words or about 560 pages of text. But the same 400K tokens holds significantly less code, and dramatically less CJK text. If your application handles multilingual input, the effective context window is much smaller than the headline number suggests.

Token counting example (Python)

import tiktoken

# GPT-5 uses o200k_base
enc = tiktoken.get_encoding("o200k_base")

english = "The quick brown fox jumps over the lazy dog"
code = "function handleAuth(req: Request): Promise<Response> {"
chinese = "快速的棕色狐狸跳过了懒狗"

print(f"English: {len(enc.encode(english))} tokens")  # ~9 tokens
print(f"Code:    {len(enc.encode(code))} tokens")      # ~12 tokens
print(f"Chinese: {len(enc.encode(chinese))} tokens")   # ~11 tokens

# Same semantic content, very different token counts
# English: ~1.3 tokens/word
# Code:    ~1.7 tokens/word
# Chinese: ~2.2 tokens/character

GPT-5's tokenizer is more efficient

GPT-5's o200k_base tokenizer has twice the vocabulary of GPT-4's cl100k_base. Larger vocabularies mean more common sequences get single-token representations, reducing total token count for the same text. If you're counting tokens with the old tokenizer, your estimates will be too high.

How BPE Tokenization Works

Byte Pair Encoding starts with individual bytes and iteratively merges the most frequent adjacent pairs. After training on a large corpus, the tokenizer has a fixed vocabulary of subword units. Common English words like "the" and "function" become single tokens. Rare words get split: "tokenization" might become "token" + "ization", and a rare proper noun might be split into individual characters.

This is why token counts are unpredictable without actually running the tokenizer. A 10-word sentence might be 10 tokens if every word is common, or 25 tokens if it contains technical jargon, URLs, or code. Whitespace and punctuation also consume tokens. A JSON object with many brackets, colons, and quotes uses more tokens than the equivalent plain text.

Practical Implications for Token Budgeting

When building applications against token-limited APIs, rough estimates break down in edge cases. Here are the patterns that consume more tokens than expected:

  • URLs and file paths: A URL like https://api.example.com/v2/users/12345 can consume 15+ tokens due to slashes, dots, and numbers being separate tokens.
  • JSON and structured data: Brackets, quotes, colons, and commas each consume tokens. A compact JSON object uses roughly 2x the tokens of equivalent plain text.
  • Base64 and encoded strings: Encoded binary data tokenizes very poorly. A base64 image embedded in context can consume 10-50x more tokens than describing what the image contains.
  • Stack traces and error logs: Repetitive paths and line numbers inflate token counts. A single Java stack trace can easily consume 500+ tokens.
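
A crude way to see these patterns without a real tokenizer: splitting on word and punctuation boundaries approximates how URLs and JSON fragment into many small units. The counts below are a proxy, not BPE token counts, but the inflation pattern matches:

```python
import re

def rough_segments(text: str) -> int:
    """Crude proxy for token count: split on word and punctuation boundaries.
    Real BPE counts differ, but punctuation-heavy text inflates both."""
    return len(re.findall(r"\w+|[^\w\s]", text))

prose = "The user with ID 12345 was created in version 2 of the API"
url   = "https://api.example.com/v2/users/12345"
data  = '{"user": {"id": 12345, "version": 2, "source": "api"}}'

print(rough_segments(prose))  # every word is one segment
print(rough_segments(url))    # slashes, dots, and digits each add segments
print(rough_segments(data))   # braces, quotes, and colons inflate the count
```

The JSON encoding of the same facts produces roughly twice the segments of the plain-English sentence, despite carrying less text.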

What Happens When You Hit the Limit

No major LLM provider silently truncates your input. If your request exceeds the token limit, you get an error. But the error handling and available workarounds differ by provider.

Provider Error Behavior

| Provider | Error Type | Details Provided | Automatic Truncation? |
|---|---|---|---|
| OpenAI | HTTP 400 | Exact token count in error message | No |
| Anthropic | Validation error | Token details in response | No |
| Google | 400 Bad Request | Token count exceeded message | Optional (countTokens API) |

Typical OpenAI token limit error

{
  "error": {
    "message": "This model's maximum context length is 400000 tokens. However, your messages resulted in 412847 tokens. Please reduce the length of the messages.",
    "type": "invalid_request_error",
    "code": "context_length_exceeded"
  }
}
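
A minimal sketch for routing this error programmatically, assuming the body shape shown above; the function name and action labels are illustrative:

```python
import json

def classify_limit_error(response_body: str) -> str:
    """Map a provider error body to a next action (labels are illustrative)."""
    err = json.loads(response_body).get("error", {})
    if err.get("code") == "context_length_exceeded":
        return "reduce_input"     # truncate, compress, or split, then retry
    if err.get("type") == "invalid_request_error":
        return "inspect_request"  # malformed for some other reason
    return "unknown"

body = ('{"error": {"message": "maximum context length exceeded", '
        '"type": "invalid_request_error", "code": "context_length_exceeded"}}')
print(classify_limit_error(body))  # reduce_input
```

Checking the machine-readable `code` field rather than parsing the message text keeps the handler stable if the wording changes.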

Truncation Strategies

When your input exceeds the limit, you need to cut it down. Three common strategies, each with different tradeoffs:

| Strategy | How It Works | Best For | Risk |
|---|---|---|---|
| Stop at limit | Drop everything past the token ceiling | Simple batch processing | Loses recent context (often most relevant) |
| Truncate middle | Keep start + end, remove the middle | Conversations with an important system prompt and recent messages | Loses middle context (citations, details) |
| Rolling window | Drop oldest messages, keep most recent | Chat applications, ongoing sessions | Loses early context (setup, instructions) |

All three strategies are lossy. You throw away information and hope the model doesn't need it. Context compression is the alternative: reduce token count while preserving the information content.
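
The rolling-window strategy can be sketched in a few lines; `count_tokens` stands in for your real tokenizer (the toy example below counts characters), and pinning the system prompt is the usual mitigation for losing early instructions:

```python
def rolling_window(messages: list, budget: int, count_tokens) -> list:
    """Drop the oldest non-system messages until the total fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    total = lambda msgs: sum(count_tokens(m["content"]) for m in msgs)
    while rest and total(system + rest) > budget:
        rest.pop(0)  # oldest turn goes first
    return system + rest

# Toy counter (1 token per character) for illustration; use tiktoken in practice
msgs = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "x" * 50},
    {"role": "user", "content": "y" * 30},
]
trimmed = rolling_window(msgs, budget=45, count_tokens=len)
print([m["role"] for m in trimmed])  # ['system', 'user']: the oldest user turn was dropped
```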

Output Token Limits

A separate but related limit: max output tokens. Even if your input fits within the context window, the model's response has its own ceiling. GPT-5 caps output at 128K tokens. Claude caps at 64K. If the model's response exceeds this limit, it gets truncated mid-sentence with a finish_reason: "length" flag in the response.

Detecting output truncation

# Check if the response was cut short due to output token limit
response = client.chat.completions.create(
    model="gpt-5",
    messages=[...],
    max_tokens=4096  # Set explicit output limit
)

if response.choices[0].finish_reason == "length":
    # Response was truncated — need to continue or reduce scope
    print("Output hit token limit, response is incomplete")
elif response.choices[0].finish_reason == "stop":
    # Response completed naturally
    print("Full response received")

For most chat and coding applications, output limits are not the bottleneck since responses are typically 1-4K tokens. But for code generation, document drafting, and data transformation tasks, you can hit the output ceiling before the context window matters. Set max_tokens explicitly to control output length and catch truncation early.
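
One common pattern when the output ceiling does bind: detect the length stop and ask the model to continue. The sketch below is provider-agnostic; `generate` is a stand-in for your SDK call and is faked here for illustration:

```python
def generate_full(generate, prompt: str, max_rounds: int = 5):
    """Accumulate output while the model keeps stopping on the length limit.
    `generate(prompt, so_far)` must return (text, finish_reason)."""
    parts, reason, rounds = [], "length", 0
    while reason == "length" and rounds < max_rounds:
        text, reason = generate(prompt, "".join(parts))
        parts.append(text)
        rounds += 1
    return "".join(parts), reason

# Fake backend that needs two calls to finish, standing in for a real SDK call
calls = iter([("Hello, ", "length"), ("world.", "stop")])
text, reason = generate_full(lambda prompt, so_far: next(calls), "say hello")
print(text, reason)  # Hello, world. stop
```

Cap the rounds: each continuation resends the accumulated output as input, so an unbounded loop can quietly multiply token costs.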

The Hidden Cost: Long-Context Pricing Tiers

Most LLM pricing pages show a single per-token rate. What they bury in the footnotes: Anthropic and Google both apply steep surcharges when your request crosses 200K tokens. And the surcharge applies to the entire request, not just the tokens above the threshold.

  • 2x: Anthropic input surcharge over 200K
  • 1.5x: Anthropic output surcharge over 200K
  • 2x: Google surcharge over 200K

| Provider | Standard Input | Over 200K Input | Standard Output | Over 200K Output |
|---|---|---|---|---|
| Anthropic (Sonnet 4.6) | $3.00 | $6.00 (2x) | $15.00 | $22.50 (1.5x) |
| Anthropic (Opus 4.6) | $5.00 | $10.00 (2x) | $25.00 | $37.50 (1.5x) |
| Google (Gemini 2.5 Pro) | $1.25 | $2.50 (2x) | $10.00 | $20.00 (2x) |
| OpenAI (GPT-5) | $1.25 | $1.25 (no surcharge) | $10.00 | $10.00 (no surcharge) |

The surcharge is all-or-nothing

If your Anthropic request is 201K tokens, the 2x input rate applies to all 201K tokens, not just the 1K over the threshold. A request at 199K tokens costs $0.60 (Sonnet). A request at 201K tokens costs $1.21. That's a 2x jump for 2K extra tokens. Stay under 200K.
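
The tier logic is simple to encode. A sketch using Sonnet 4.6's input rate from the table above; the function name and defaults are illustrative:

```python
def input_cost_usd(tokens: int, base_rate: float,
                   surcharge: float = 2.0, threshold: int = 200_000) -> float:
    """All-or-nothing tier: crossing the threshold reprices every token.
    Rates are dollars per million tokens."""
    rate = base_rate * surcharge if tokens > threshold else base_rate
    return tokens * rate / 1_000_000

# Sonnet 4.6 input at $3.00/M
print(f"${input_cost_usd(199_000, 3.00):.2f}")  # $0.60, standard rate
print(f"${input_cost_usd(201_000, 3.00):.2f}")  # $1.21, every token repriced at 2x
```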

The Math on Compression ROI

Consider a coding agent session using Claude Sonnet 4.6 that accumulates 250K input tokens. Without compression, the entire request hits the 2x tier: 250K x $6.00/M = $1.50 per request. With Morph Compact reducing tokens by 50%, the input drops to 125K tokens at the standard rate: 125K x $3.00/M = $0.38. That's a 75% cost reduction, not just from fewer tokens but from avoiding the surcharge tier entirely.

Cost comparison: with and without compression

# Claude Sonnet 4.6 coding agent session, 250K input tokens

# WITHOUT compression: the entire request hits the 2x surcharge tier (>200K)
input_cost  = 250_000 * 6.00 / 1_000_000    # $1.50 at the surcharged $6.00/M rate
output_cost = 4_000 * 22.50 / 1_000_000     # $0.09 at the surcharged $22.50/M rate
total_without = input_cost + output_cost    # $1.59 per request

# WITH Morph Compact (50% reduction): 125K tokens stays under the 200K threshold
input_cost  = 125_000 * 3.00 / 1_000_000    # $0.38 at the standard $3.00/M rate
output_cost = 4_000 * 15.00 / 1_000_000     # $0.06 at the standard $15.00/M rate
total_with = input_cost + output_cost       # ~$0.44 per request

# Savings: ~$1.15 per request (72% reduction)
# Over 100 agent sessions/day: ~$115/day, ~$3,450/month

The Lost-in-the-Middle Problem

Even if you stay within the token limit, long contexts degrade model performance. The "lost-in-the-middle" phenomenon, documented across every major model family, shows that LLMs attend well to the beginning and end of their input but lose accuracy for information positioned in the center.

  • 30%+ accuracy drop for mid-positioned content
  • ~130K effective limit for 200K context models
  • U-shaped attention curve across context length

This means a model with a 200K context window does not give you 200K tokens of reliable working memory. Research consistently shows degradation starting well before the stated limit. A 200K model starts showing measurable quality loss around 130K tokens. The tokens in the middle of a long prompt are the most likely to be missed or misinterpreted.

For coding agents, this is especially problematic. An agent accumulates context over many turns: file reads, grep results, tool outputs, error traces, and prior conversation. The critical piece of information from ten turns ago might be sitting in the exact middle of the context window, right where the model is least likely to attend to it.

Context rot compounds over turns

As context rot research shows, model performance degrades as input length increases, even when the window is not full. Every irrelevant token makes the model worse at attending to the tokens that matter. The solution is not a bigger context window. It's keeping the context clean.

This is why context compression matters even when you have space left in the window. Compression is not just about fitting more in. It's about removing noise so the model can focus on signal. A 100K context with high information density outperforms a 200K context diluted with irrelevant tool outputs.

Practical Impact on Agent Architectures

The lost-in-the-middle effect has direct architectural consequences. If your agent reads 20 files into context across a multi-step task, the files read in the middle of the session are the ones most likely to be forgotten. This creates a pattern where agents succeed on the first and last steps of a task but fail on intermediate steps that depend on mid-context information.

Several mitigation strategies exist beyond compression. Placing critical information at the beginning or end of the prompt helps. Re-inserting important context at decision points forces the model to attend to it. And reducing total context length shifts all content closer to the attention-favored positions at the edges.

Mitigating lost-in-the-middle in agent loops

// Anti-pattern: dump everything into context and hope for the best
// const messages = [systemPrompt, ...allToolOutputs, userQuery];

// Better: compact tool outputs and re-insert critical context
const compactedOutputs = await Promise.all(
  toolOutputs.map(output =>
    output.tokens > 500 ? morph.compact(output.content) : output.content
  )
);

// Place the current task description near the end (attention-favored position)
const messages = [
  systemPrompt,
  ...compactedOutputs,
  { role: "user", content: `Current task: ${taskDescription}\n\nRelevant context re-stated: ${criticalContext}` }
];

Strategies for Working Within Token Limits

Six practical approaches for staying within token limits without losing the information your application needs.

1. Chunking

Split large documents into smaller chunks that fit within the context window. Process each chunk independently, then combine results. Works well for summarization and extraction tasks where each chunk is self-contained. Falls apart when the answer depends on information spread across multiple chunks.
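
A minimal paragraph-level chunker, assuming `count_tokens` is your model's tokenizer; paragraphs larger than the budget would need further splitting:

```python
def chunk_by_tokens(text: str, max_tokens: int, count_tokens) -> list[str]:
    """Greedy paragraph-level chunking against a token budget."""
    chunks, current = [], []
    for para in text.split("\n\n"):
        candidate = current + [para]
        if current and count_tokens("\n\n".join(candidate)) > max_tokens:
            chunks.append("\n\n".join(current))
            current = [para]
        else:
            current = candidate
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Toy counter (1 token per character) for illustration
print(chunk_by_tokens("aaaa\n\nbbbb\n\ncccc", 10, len))  # ['aaaa\n\nbbbb', 'cccc']
```

Splitting on paragraph boundaries rather than raw token offsets keeps each chunk self-contained, which is exactly what the combine step depends on.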

2. RAG (Retrieval-Augmented Generation)

Instead of stuffing the entire knowledge base into the context window, embed documents into a vector store and retrieve only the relevant chunks at query time. This keeps token usage proportional to the query, not the corpus size. The quality ceiling depends on the retrieval step: if the right chunks are not retrieved, the model cannot use them.

3. Sliding Window / Rolling Context

For chat applications, drop the oldest messages when the conversation exceeds the token limit. The model always sees the most recent context. The tradeoff: early instructions and context are lost. Partially mitigated by keeping the system prompt pinned and summarizing dropped messages.

4. Summarization

Use the LLM itself to summarize prior context into a shorter form. Anthropic's Claude Code and OpenAI's Codex both use this approach for long coding sessions. The risk: summarization rewrites the original text, and rewriting introduces the possibility of altered code, mangled file paths, or hallucinated details.

5. Verbatim Compaction

Delete low-signal tokens while keeping every surviving sentence word-for-word identical to the original. Morph Compact achieves 50-70% token reduction at 3,300+ tokens per second with 98% verbatim accuracy. Unlike summarization, nothing is rewritten, so there is zero hallucination risk in the compacted output. The tradeoff: lower compression ratio than summarization (50-70% vs 70-90%).

6. Hybrid Approaches

Combine strategies for different parts of the input. Use RAG for the knowledge base. Compact tool outputs inline. Summarize only the oldest turns of conversation history where exact details matter less. The most effective production systems layer multiple strategies rather than relying on a single approach.

Hybrid approach: RAG + inline compaction

// 1. RAG: retrieve only relevant documents (not the whole corpus)
const relevantDocs = await vectorStore.query(userQuery, { topK: 5 });

// 2. Compact long tool outputs inline
const toolResults = await Promise.all(
  pendingTools.map(async (tool) => {
    const result = await executeTool(tool);
    if (estimateTokens(result.output) > 500) {
      return await morph.compact(result.output);  // verbatim compaction
    }
    return result.output;
  })
);

// 3. Summarize oldest conversation turns (exact details less critical)
const recentTurns = conversation.slice(-10);  // keep recent turns verbatim
const oldSummary = await summarize(conversation.slice(0, -10));

// 4. Assemble context within token budget
const messages = [
  { role: "system", content: systemPrompt },
  { role: "assistant", content: oldSummary },       // summarized old context
  ...recentTurns,                                    // exact recent context
  { role: "user", content: relevantDocs.join("\n") }, // RAG results
  ...toolResults.map(r => ({ role: "tool", content: r }))
];

| Strategy | Token Reduction | Information Loss | Hallucination Risk |
|---|---|---|---|
| Chunking | High (per-chunk) | Context between chunks | None |
| RAG | Very high | Depends on retrieval quality | None |
| Sliding window | Variable | Oldest context dropped | None |
| Summarization | 70-90% | Details rewritten | Medium |
| Verbatim compaction | 50-70% | Low-signal tokens removed | Zero |
| Hybrid | Highest | Minimized | Depends on mix |

Token Counting Tools

You need to know your token count before sending a request, not after you get a 400 error. These tools let you count tokens locally for accurate estimation.

Python

Python token counting libraries

# tiktoken — OpenAI's official tokenizer (fastest)
pip install tiktoken
import tiktoken
enc = tiktoken.encoding_for_model("gpt-5")
tokens = enc.encode("Your text here")
print(f"{len(tokens)} tokens")

# token-counter — multi-provider support
pip install token-counter
from token_counter import TokenCounter
counter = TokenCounter(model="claude-sonnet-4-6")
count = counter.count("Your text here")

# HuggingFace transformers — Llama, Mistral, open models
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout")
tokens = tokenizer.encode("Your text here")
print(f"{len(tokens)} tokens")

JavaScript / TypeScript

JavaScript token counting libraries

// js-tiktoken — tiktoken port for JS (works in browser + Node)
import { encodingForModel } from "js-tiktoken";
const enc = encodingForModel("gpt-5");
const tokens = enc.encode("Your text here");
console.log(`${tokens.length} tokens`);

// @xenova/transformers — Llama, Mistral, open models in JS
import { AutoTokenizer } from "@xenova/transformers";
const tokenizer = await AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout");
const { input_ids } = await tokenizer("Your text here");
console.log(`${input_ids.length} tokens`);

Online Tools

For quick checks without code: OpenAI's Tokenizer visualizes token boundaries for GPT models. It shows exactly where the tokenizer splits your text, which is useful for understanding why some inputs use more tokens than expected.

Pre-request Token Counting in Production

In production systems, count tokens before sending the API request. This lets you truncate, compress, or split the request proactively rather than handling errors reactively. Most SDKs provide token counting methods.

Pre-request token budget management

import tiktoken

MODEL = "gpt-5"
MAX_CONTEXT = 400_000
MAX_OUTPUT = 4_096  # reserve for response
INPUT_BUDGET = MAX_CONTEXT - MAX_OUTPUT

enc = tiktoken.encoding_for_model(MODEL)

def count_messages_tokens(messages):
    """Count tokens across all messages including overhead."""
    total = 0
    for msg in messages:
        total += 4  # message framing overhead
        total += len(enc.encode(msg["content"]))
    total += 2  # assistant reply priming
    return total

# Check before sending
token_count = count_messages_tokens(messages)
if token_count > INPUT_BUDGET:
    # Compact instead of truncating
    messages = compress_context(messages, target=INPUT_BUDGET)

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    max_tokens=MAX_OUTPUT
)

Count tokens for the right model

Token counts vary between models because they use different tokenizers. Text that's 1,000 tokens on GPT-5 (o200k_base) might be 1,200 tokens on Claude or 900 tokens on Gemini. Always count using the tokenizer for the model you're actually calling.

Frequently Asked Questions

What is the largest LLM context window available in 2026?

Llama 4 Scout from Meta has the largest context window at 10 million tokens. Gemini 3 Pro from Google supports 2 million tokens. GPT-5 and GPT-5.2 from OpenAI support 400K tokens. Larger context windows do not automatically mean better performance. Models degrade well before their stated limits, and providers like Anthropic and Google apply 2x pricing surcharges above 200K tokens.

How many tokens is 1,000 words?

In English, 1,000 words is roughly 1,300 to 1,500 tokens using modern BPE tokenizers. One token averages about 4 characters or 0.7 words. Code tokenizes less efficiently at 1.5 to 2.0 tokens per word. CJK languages consume 2 to 8 times more tokens than English for equivalent content.

What happens when you exceed an LLM's token limit?

OpenAI returns an HTTP 400 error with the exact token count. Anthropic returns a validation error with token details. Neither provider silently truncates your input. You need to reduce the input length before retrying, either by truncating, using a rolling window, or compressing the context.

Why do LLMs perform worse with longer contexts?

Models exhibit the "lost-in-the-middle" problem: 30%+ accuracy drop for information positioned in the middle of long contexts. The model attends well to the beginning and end but struggles with content buried in the center. Additionally, attention quality degrades as input length increases. Models claiming 200K context windows often degrade noticeably around 130K tokens.

Do any LLM providers charge more for long contexts?

Yes. Anthropic charges 2x input and 1.5x output above 200K tokens, applied to all tokens in the request, not just the overflow. Google applies a similar 2x surcharge above 200K with the same all-or-nothing pricing. OpenAI does not have surcharge tiers. Staying under 200K is a significant cost optimization, and Morph Compact can keep you there with 50-70% token reduction.

How can I reduce token usage without losing information?

Morph Compact reduces token count by 50-70% through verbatim compaction: it deletes low-signal tokens while keeping every surviving sentence word-for-word identical to the original. Unlike summarization, there is zero hallucination risk. It runs at 3,300+ tokens per second with 98% verbatim accuracy. Other strategies include chunking, RAG, and rolling windows, but these discard information rather than compressing it. See the LLM cost optimization guide for detailed cost strategies.

Stay Under the Token Limit Without Losing Context

Morph Compact reduces token count by 50-70% through verbatim compaction. No summarization, no hallucination risk. Every surviving sentence is word-for-word identical to the original. Stay under Anthropic and Google's 200K surcharge threshold.