TL;DR
Opus 4.6 is Anthropic's flagship model, released February 5, 2026. Model ID: claude-opus-4-6. It scores 80.8% SWE-bench Verified, 68.8% ARC-AGI-2, and 91.3% GPQA Diamond. 200K default context (1M in beta), 128K max output, adaptive thinking that scales reasoning depth per task. It powers Claude Code and introduced Agent Teams for multi-agent parallel workflows.
What it is
Anthropic's most intelligent model. Optimized for coding, agentic tasks, and expert-level reasoning. Default backbone of Claude Code. $5/$25 per million tokens input/output.
Why it matters
ARC-AGI-2 jumped from 37.6% to 68.8%, nearly doubling abstract reasoning capability. Agent Teams let multiple Opus instances coordinate on a single codebase in parallel. This is the model you escalate to when Sonnet isn't enough.
Model ID
claude-opus-4-6 in the API. Available on Anthropic's API, Amazon Bedrock, Google Vertex AI, and Microsoft Azure Foundry.
Key Specs
| Specification | Value | Notes |
|---|---|---|
| Model ID | claude-opus-4-6 | API identifier |
| Release date | February 5, 2026 | |
| Context window | 200K tokens (1M beta) | Beta via context-1m-2025-08-07 header |
| Max output tokens | 128,000 | Double Opus 4.5's 64K |
| Thinking mode | Adaptive thinking | Dynamic reasoning depth per task |
| Input pricing | $5 / 1M tokens | 67% cheaper than Opus 4.1 ($15) |
| Output pricing | $25 / 1M tokens | 67% cheaper than Opus 4.1 ($75) |
| Fast mode | $30 / $150 per 1M tokens | 6x premium for faster output |
| Batch API | 50% savings | For non-latency-sensitive workloads |
| Prompt caching | Up to 90% input savings | Cache repeated prefixes |
| Agent Teams | Supported | Multi-agent parallel coordination |
| Context compaction | Supported (beta) | Auto-summarize to extend conversations |
Benchmarks
As of March 2026, Opus 4.6 leads every frontier model on coding and abstract reasoning benchmarks and stays within a point or two on the rest. The standout result is ARC-AGI-2, where it nearly doubled the previous generation's score, suggesting a fundamental improvement in abstract reasoning rather than incremental tuning.
| Benchmark | Opus 4.6 | Sonnet 4.6 | GPT-5.2 | Gemini 3 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 80.8% | 79.6% | 80.0% | 76.2% |
| GPQA Diamond | 91.3% | 74.1% | 93.2% | 91.9% |
| ARC-AGI-2 | 68.8% | N/A | 54.2% | 45.1% |
| Context window | 200K (1M beta) | 200K (1M beta) | 128K | 1M |
| Max output | 128K | 64K | 64K | 65K |
| Input / 1M tokens | $5 | $3 | $15 | $3.50 |
SWE-bench Verified: 80.8%
SWE-bench Verified tests real GitHub issue resolution: read a production codebase, understand the bug, write a patch, pass the test suite. Opus 4.6's 80.8% essentially matches Opus 4.5's 80.9%. Anthropic prioritized reasoning depth over incremental coding gains in this release.
ARC-AGI-2: 68.8%
This is the number that matters most. ARC-AGI-2 measures novel pattern recognition, the kind of abstract reasoning that previous models struggled with. Opus 4.5 scored 37.6%. Opus 4.6 scores 68.8%. A 31-point jump. GPT-5.2 Pro scores 54.2%. Gemini 3 Pro (Deep Thinking) scores 45.1%. Nothing else is close.
GPQA Diamond: 91.3%
PhD-level scientific reasoning across physics, chemistry, and biology. Opus 4.6 scores 91.3%, up from 87.0% on Opus 4.5. GPT-5.2 Pro leads this benchmark at 93.2%, with Gemini 3 Pro at 91.9%. The gap between Opus and Sonnet (91.3% vs 74.1%) is 17 points, the largest capability split between the two models.
Other Benchmarks
Opus 4.6 leads Terminal-Bench 2.0 (agentic coding evaluation) and Humanity's Last Exam (complex multidisciplinary reasoning) among all frontier models. On GDPval-AA, which measures economically valuable knowledge work in finance, legal, and other domains, Opus 4.6 outperforms GPT-5.2 by 144 Elo points.
What Changed from Opus 4.5
| Dimension | Opus 4.5 | Opus 4.6 |
|---|---|---|
| ARC-AGI-2 | 37.6% | 68.8% (+31.2 points) |
| GPQA Diamond | 87.0% | 91.3% (+4.3 points) |
| SWE-bench Verified | 80.9% | 80.8% (essentially flat) |
| Max output tokens | 64K | 128K (doubled) |
| Thinking mode | Extended thinking (fixed) | Adaptive thinking (dynamic) |
| Agent support | Single agent | Agent Teams (multi-agent parallel) |
| Context compaction | Not available | Available (beta) |
| Fast mode | Not available | Available ($30/$150 per 1M) |
| Pricing (input/output) | $15 / $75 | $5 / $25 (67% cheaper) |
The pattern is clear: Anthropic held coding performance flat while dramatically improving reasoning (ARC-AGI-2 doubled), cutting price by 67%, doubling output length, and adding the platform features (adaptive thinking, compaction, Agent Teams) that make Opus viable for long-running agentic workflows.
Adaptive Thinking
Previous Claude models used extended thinking: a fixed reasoning phase before every response, regardless of task complexity. A simple "rename this variable" got the same thinking budget as "refactor this authentication system." Wasteful.
Opus 4.6 replaces this with adaptive thinking. The model decides, per request, how much reasoning the task requires. At the default high effort level, it almost always thinks. At medium or low, it skips reasoning for simple tasks entirely.
Enabling adaptive thinking
```
// Recommended: let the model decide
thinking: { type: "adaptive" }

// Or control effort explicitly
thinking: { type: "adaptive", effort: "medium" }

// Effort levels: "high" (default), "medium", "low"
// "high" = almost always thinks
// "low" = skips thinking for simple tasks
```

The practical impact: agents running Opus 4.6 on mixed workloads (some hard, some trivial) use fewer thinking tokens overall, reducing cost without sacrificing quality on hard problems. Anthropic reported that at medium effort, Opus 4.6 still matches or exceeds Opus 4.5's quality on most benchmarks.
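To make the fragments above concrete, here is a minimal sketch of building a full request body that opts into adaptive thinking. The `thinking` field and effort levels come from the snippet above; `buildRequest` itself is a hypothetical helper, not SDK code.

```typescript
// Hypothetical helper: assemble a Messages API request body that
// opts into adaptive thinking at a chosen effort level.
type Effort = "high" | "medium" | "low";

function buildRequest(prompt: string, effort: Effort = "high") {
  return {
    model: "claude-opus-4-6",
    max_tokens: 4096,
    thinking: { type: "adaptive", effort },
    messages: [{ role: "user", content: prompt }],
  };
}

// A trivial rename can run at low effort; a refactor gets the default high.
const simple = buildRequest("Rename this variable", "low");
const hard = buildRequest("Refactor this authentication system");
```

In practice you would pass such a body to your API client of choice; the point is that effort is a per-request knob, not a global setting.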
Agent Teams
Agent Teams launched alongside Opus 4.6 in Claude Code. Instead of a single agent working through tasks sequentially, a lead Claude Code session spawns multiple independent teammate instances. Each teammate has its own full context window and can read files, write code, run tests, and report back.
Parallel execution
Multiple teammates work on different files or features simultaneously. No shared context window bottleneck. Each agent gets full 200K (or 1M beta) context.
Task coordination
The lead assigns tasks, teammates execute independently, and results converge through a shared task list. Teammates can challenge each other's findings.
Real-world scale
16 Claude agents wrote a 100K-line C compiler in Rust that compiles the Linux kernel 6.9, passing 99% of GCC torture tests. API cost: approximately $20K.
Agent Teams change the economics of AI-assisted development. Instead of one agent bottlenecked on context and sequential reasoning, you get parallel workstreams that each operate at full Opus-level capability. The constraint shifts from model intelligence to task decomposition quality.
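The lead/teammate shape described above can be illustrated as plain async fan-out. This is an analogy only: `runTeammate` and `lead` are stand-ins for real Claude Code sessions, and the actual feature is driven from Claude Code, not from this code.

```typescript
// Illustration only: lead/teammate coordination as async fan-out.
type Task = { id: number; file: string };
type Result = { id: number; status: "done" | "failed" };

async function runTeammate(task: Task): Promise<Result> {
  // A real teammate would read files, write code, and run tests here,
  // inside its own full context window.
  return { id: task.id, status: "done" };
}

async function lead(tasks: Task[]): Promise<Result[]> {
  // Teammates execute independently and in parallel; results converge
  // into a single list the lead synthesizes.
  return Promise.all(tasks.map(runTeammate));
}
```

The design point the analogy captures: the lead's job is decomposition and synthesis, and throughput scales with how cleanly tasks can be split across non-overlapping files.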
Context and Output
200K Standard, 1M Beta
Opus 4.6 has a 200K token context window by default. The 1M token context window is available in beta by adding the context-1m-2025-08-07 header to API requests. When using the 1M window, input tokens beyond 200K cost $10/M (vs $5/M standard) and output tokens cost $37.50/M (vs $25/M).
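The tiered input rates above can be sketched as a small cost function, reading the tiering as marginal (first 200K input tokens at $5/M, tokens beyond 200K at $10/M). Verify the exact tiering against the official pricing docs before relying on it.

```typescript
// Sketch: marginal-tier input cost under the 1M beta window,
// per the rates stated above ($5/M up to 200K, $10/M beyond).
function inputCostUSD(inputTokens: number): number {
  const standard = Math.min(inputTokens, 200_000);
  const extended = Math.max(inputTokens - 200_000, 0);
  return (standard * 5 + extended * 10) / 1_000_000;
}

inputCostUSD(150_000); // 0.75 — entirely in the standard tier
inputCostUSD(500_000); // 4.00 — 200K at $5/M plus 300K at $10/M
```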
128K Output Tokens
Opus 4.6 doubles the max output from 64K to 128K tokens. This is significant for code generation tasks where the model needs to produce complete files, multi-file changes, or detailed implementation plans in a single response.
Context Compaction
A new server-side feature. When a conversation approaches the context window limit, the API automatically summarizes earlier parts of the conversation instead of truncating. This enables effectively infinite conversations without custom truncation logic. Available in beta.
Enabling 1M context and compaction
```
// 1M context window (beta)
headers: {
  "anthropic-beta": "context-1m-2025-08-07"
}

// Context compaction (beta)
// Automatic: API summarizes older messages
// when approaching the window limit.
// No client-side code needed.
```

Pricing
Opus 4.6 costs 67% less than the Opus 4.1 generation ($15/$75) while outperforming it on every benchmark. Multiple pricing tiers let you trade latency for cost.
| Tier | Input / 1M tokens | Output / 1M tokens | Use case |
|---|---|---|---|
| Standard | $5.00 | $25.00 | Default API usage |
| Extended context (>200K) | $10.00 | $37.50 | Requests using 1M beta window |
| Fast mode | $30.00 | $150.00 | Latency-sensitive production |
| Batch API | $2.50 | $12.50 | Async, non-urgent workloads |
| Prompt caching | $0.50 (cached) | $25.00 | Repeated prefixes (up to 90% savings) |
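Prompt caching works by marking a repeated prefix as cacheable so later requests read it at the cached input rate. A minimal sketch of a request body using `cache_control` (a real Messages API field); `LONG_STYLE_GUIDE` is a stand-in for an actual multi-kilobyte prefix.

```typescript
// Stand-in for a long, repeated system prompt worth caching.
const LONG_STYLE_GUIDE = "…imagine several thousand tokens of style rules…";

// Sketch: mark the repeated prefix as cacheable so subsequent
// requests read it at the cached rate instead of full price.
const body = {
  model: "claude-opus-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: LONG_STYLE_GUIDE,
      cache_control: { type: "ephemeral" }, // cache this prefix
    },
  ],
  messages: [{ role: "user", content: "Review this diff." }],
};
```

Caching pays off when the same large prefix (style guides, schemas, tool definitions) recurs across many requests.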
Subscription Access
Opus 4.6 is available through Claude Pro ($20/month) with usage limits, and Claude Max ($100-200/month) with higher or unlimited usage. API access is pay-per-token with no subscription required.
Opus 4.6 vs Sonnet 4.6
Sonnet 4.6 costs 40% less than Opus 4.6 per token ($3/$15 vs $5/$25). The question is when the capability gap justifies the premium.
| Dimension | Opus 4.6 | Sonnet 4.6 |
|---|---|---|
| SWE-bench Verified | 80.8% | 79.6% (1.2 points lower) |
| GPQA Diamond | 91.3% | 74.1% (17 points lower) |
| ARC-AGI-2 | 68.8% | N/A |
| Max output | 128K tokens | 64K tokens |
| Input cost | $5 / 1M tokens | $3 / 1M tokens |
| Output cost | $25 / 1M tokens | $15 / 1M tokens |
| Speed | Slower | Faster (better for latency-sensitive) |
| Agent Teams | Lead or teammate | Teammate only |
The hybrid approach
The standard 2026 pattern: route 80-90% of requests to Sonnet 4.6 (fast, cheap), and escalate to Opus 4.6 only for tasks requiring deep reasoning, multi-agent coordination, or long output generation. For pure coding tasks, Sonnet trails Opus by only 1.2 points on SWE-bench. For scientific reasoning (GPQA Diamond), the 17-point gap makes Opus the only serious option.
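The routing pattern above can be sketched as a simple model picker. The Sonnet model ID and the signal fields here are assumptions for illustration, not an official routing API.

```typescript
// Hypothetical router for the hybrid pattern: default to Sonnet,
// escalate to Opus on the signals called out above.
type Model = "claude-sonnet-4-6" | "claude-opus-4-6";

interface TaskSignals {
  needsDeepReasoning: boolean;  // e.g. scientific or legal analysis
  multiAgentLead: boolean;      // Agent Teams lead must be Opus
  expectedOutputTokens: number; // Sonnet caps at 64K output
}

function pickModel(s: TaskSignals): Model {
  if (s.needsDeepReasoning || s.multiAgentLead || s.expectedOutputTokens > 64_000) {
    return "claude-opus-4-6";
  }
  return "claude-sonnet-4-6";
}
```

A router like this keeps the 80-90% of routine traffic on the cheaper model while guaranteeing the escalation cases land on Opus.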
Best Use Cases for Opus 4.6
Multi-agent coding workflows
Agent Teams with parallel teammates. Large refactors, multi-file changes, full-project scaffolding. The lead/teammate architecture requires Opus as the lead agent.
Complex debugging
Tracing bugs across large codebases where the issue spans multiple files and requires reasoning about system-level interactions. Opus's deeper reasoning handles cascading failure modes better.
Expert-level reasoning
Scientific analysis, legal document review, financial modeling. The 17-point GPQA Diamond gap over Sonnet means Opus handles expert-domain tasks at a different level.
Long output generation
128K output tokens for complete file generation, detailed implementation plans, or comprehensive code reviews. Sonnet caps at 64K.
When Sonnet 4.6 Is Enough
Standard code completion, simple refactors, test generation, documentation, code review for small changes, and most day-to-day development tasks. The 1.2-point SWE-bench gap is not worth roughly 1.7x the token cost for routine work.
Frequently Asked Questions
What is Opus 4.6?
Opus 4.6 (claude-opus-4-6) is Anthropic's most capable AI model, released February 5, 2026. It is designed for coding, agentic workflows, and complex reasoning. It powers Claude Code and is the first Claude model to support adaptive thinking and Agent Teams.
What are Opus 4.6's benchmark scores?
80.8% SWE-bench Verified, 68.8% ARC-AGI-2, 91.3% GPQA Diamond. It leads Terminal-Bench 2.0 and Humanity's Last Exam. On GDPval-AA (economically valuable knowledge work), it outperforms GPT-5.2 by 144 Elo points.
How much does Opus 4.6 cost?
$5 per million input tokens, $25 per million output tokens at standard rates. Fast mode is $30/$150. Batch API saves 50%. Prompt caching saves up to 90% on input. Extended context beyond 200K costs $10/$37.50.
What is adaptive thinking?
A new reasoning mode where the model dynamically decides how much to think per request. At high effort (default), it almost always thinks. At low, it skips reasoning for simple tasks. Set via thinking: { type: "adaptive" } in the API.
What are Agent Teams?
A Claude Code feature where a lead session spawns independent teammate sessions that work in parallel. Each teammate has its own full context window. The lead coordinates task assignment and result synthesis. Designed for large codebases and complex multi-file changes.
What is the context window?
200K tokens by default. 1M tokens available in beta via the context-1m-2025-08-07 header. Context compaction (also beta) automatically summarizes older conversation parts to enable effectively infinite sessions.
Is Opus 4.6 better than GPT-5.2 for coding?
On SWE-bench Verified, Opus 4.6 (80.8%) edges GPT-5.2 (80.0%). On ARC-AGI-2, the gap is larger: 68.8% vs 54.2%. On GPQA Diamond, GPT-5.2 Pro leads at 93.2% vs 91.3%. For coding specifically, both are within statistical noise of each other. The decision usually comes down to ecosystem (Claude Code vs Codex) and pricing.
Should I use Opus 4.6 or Sonnet 4.6?
Use Sonnet 4.6 for 80-90% of tasks. It scores within 1.2 points on coding benchmarks at 40% lower cost. Escalate to Opus for complex reasoning (17-point GPQA gap), multi-agent coordination (Agent Teams), or when you need 128K output tokens.
Related Articles
Build Faster with Opus 4.6 and WarpGrep
WarpGrep is an agentic code search MCP server. Connect it to Claude Code running Opus 4.6, and your agent finds the right code in sub-6 seconds instead of burning context on grep loops. 8 parallel tool calls per turn, 4 turns, done.
Sources
- Anthropic: Introducing Claude Opus 4.6 (February 5, 2026)
- Claude API Docs: What's new in Claude 4.6
- Claude API Docs: Models overview
- Claude API Docs: Pricing
- Claude API Docs: Adaptive thinking
- Vellum: Claude Opus 4.6 vs 4.5 Benchmarks (Explained)
- TechCrunch: Anthropic releases Opus 4.6 with new "agent teams"
- Microsoft Azure Blog: Claude Opus 4.6 on Azure Foundry