Opus 4.6: Anthropic's Flagship Coding Model, Explained (2026)

Claude Opus 4.6 scores 80.8% SWE-bench Verified, 68.8% ARC-AGI-2, and 91.3% GPQA Diamond. 200K context (1M beta), 128K output, adaptive thinking, and Agent Teams. Full specs, benchmarks, pricing, and when to use it over Sonnet 4.6.

March 4, 2026 · 1 min read

TL;DR

Opus 4.6 is Anthropic's flagship model, released February 5, 2026. Model ID: claude-opus-4-6. It scores 80.8% SWE-bench Verified, 68.8% ARC-AGI-2, and 91.3% GPQA Diamond. 200K default context (1M in beta), 128K max output, adaptive thinking that scales reasoning depth per task. It powers Claude Code and introduced Agent Teams for multi-agent parallel workflows.

What it is

Anthropic's most intelligent model. Optimized for coding, agentic tasks, and expert-level reasoning. Default backbone of Claude Code. $5/$25 per million tokens input/output.

Why it matters

ARC-AGI-2 jumped from 37.6% to 68.8%, nearly doubling abstract reasoning capability. Agent Teams let multiple Opus instances coordinate on a single codebase in parallel. This is the model you escalate to when Sonnet isn't enough.

Model ID

claude-opus-4-6 in the API. Available on Anthropic's API, Amazon Bedrock, Google Vertex AI, and Microsoft Azure Foundry.

Key Specs

200K context window (1M beta)
128K max output tokens
$5 / $25 input / output per 1M tokens
80.8% SWE-bench Verified
68.8% ARC-AGI-2
91.3% GPQA Diamond
| Specification | Value | Notes |
|---|---|---|
| Model ID | claude-opus-4-6 | API identifier |
| Release date | February 5, 2026 | |
| Context window | 200K tokens (1M beta) | Beta via context-1m-2025-08-07 header |
| Max output tokens | 128,000 | Double Opus 4.5's 64K |
| Thinking mode | Adaptive thinking | Dynamic reasoning depth per task |
| Input pricing | $5 / 1M tokens | 67% cheaper than Opus 4.1 ($15) |
| Output pricing | $25 / 1M tokens | 67% cheaper than Opus 4.1 ($75) |
| Fast mode | $30 / $150 per 1M tokens | 6x premium for faster output |
| Batch API | 50% savings | For non-latency-sensitive workloads |
| Prompt caching | Up to 90% input savings | Cache repeated prefixes |
| Agent Teams | Supported | Multi-agent parallel coordination |
| Context compaction | Supported (beta) | Auto-summarize to extend conversations |

Benchmarks

Opus 4.6 leads or matches every frontier model on coding and reasoning benchmarks as of March 2026. The standout result is ARC-AGI-2, where it nearly doubled the previous generation's score, suggesting a fundamental improvement in abstract reasoning rather than incremental tuning.

| Benchmark | Opus 4.6 | Sonnet 4.6 | GPT-5.2 | Gemini 3 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 80.8% | 79.6% | 80.0% | 76.2% |
| GPQA Diamond | 91.3% | 74.1% | 93.2% | 91.9% |
| ARC-AGI-2 | 68.8% | N/A | 54.2% | 45.1% |
| Context window | 200K (1M beta) | 200K (1M beta) | 128K | 1M |
| Max output | 128K | 64K | 64K | 65K |
| Input / 1M tokens | $5 | $3 | $15 | $3.50 |

SWE-bench Verified: 80.8%

SWE-bench Verified tests real GitHub issue resolution: read a production codebase, understand the bug, write a patch, pass the test suite. Opus 4.6's 80.8% essentially matches Opus 4.5's 80.9%. Anthropic prioritized reasoning depth over incremental coding gains in this release.

ARC-AGI-2: 68.8%

This is the number that matters most. ARC-AGI-2 measures novel pattern recognition, the kind of abstract reasoning that previous models struggled with. Opus 4.5 scored 37.6%. Opus 4.6 scores 68.8%. A 31-point jump. GPT-5.2 Pro scores 54.2%. Gemini 3 Pro (Deep Thinking) scores 45.1%. Nothing else is close.

GPQA Diamond: 91.3%

PhD-level scientific reasoning across physics, chemistry, and biology. Opus 4.6 scores 91.3%, up from 87.0% on Opus 4.5. GPT-5.2 Pro leads this benchmark at 93.2%, with Gemini 3 Pro at 91.9%. The gap between Opus and Sonnet (91.3% vs 74.1%) is 17 points, the largest capability split between the two models.

Other Benchmarks

Opus 4.6 leads Terminal-Bench 2.0 (agentic coding evaluation) and Humanity's Last Exam (complex multidisciplinary reasoning) among all frontier models. On GDPval-AA, which measures economically valuable knowledge work in finance, legal, and other domains, Opus 4.6 outperforms GPT-5.2 by 144 Elo points.

What Changed from Opus 4.5

| Dimension | Opus 4.5 | Opus 4.6 |
|---|---|---|
| ARC-AGI-2 | 37.6% | 68.8% (+31.2 points) |
| GPQA Diamond | 87.0% | 91.3% (+4.3 points) |
| SWE-bench Verified | 80.9% | 80.8% (essentially flat) |
| Max output tokens | 64K | 128K (doubled) |
| Thinking mode | Extended thinking (fixed) | Adaptive thinking (dynamic) |
| Agent support | Single agent | Agent Teams (multi-agent parallel) |
| Context compaction | Not available | Available (beta) |
| Fast mode | Not available | Available ($30/$150 per 1M) |
| Pricing (input/output) | $15 / $75 | $5 / $25 (67% cheaper) |

The pattern is clear: Anthropic held coding performance flat while dramatically improving reasoning (ARC-AGI-2 doubled), cutting price by 67%, doubling output length, and adding the platform features (adaptive thinking, compaction, Agent Teams) that make Opus viable for long-running agentic workflows.

Adaptive Thinking

Previous Claude models used extended thinking: a fixed reasoning phase before every response, regardless of task complexity. A simple "rename this variable" got the same thinking budget as "refactor this authentication system." Wasteful.

Opus 4.6 replaces this with adaptive thinking. The model decides, per request, how much reasoning the task requires. At the default high effort level, it almost always thinks. At medium or low, it skips reasoning for simple tasks entirely.

Enabling adaptive thinking

// Recommended: let the model decide
thinking: { type: "adaptive" }

// Or control effort explicitly
thinking: { type: "adaptive", effort: "medium" }

// Effort levels: "high" (default), "medium", "low"
// "high" = almost always thinks
// "low"  = skips thinking for simple tasks

The practical impact: agents running Opus 4.6 on mixed workloads (some hard, some trivial) use fewer thinking tokens overall, reducing cost without sacrificing quality on hard problems. Anthropic reported that at medium effort, Opus 4.6 still matches or exceeds Opus 4.5's quality on most benchmarks.
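As a concrete sketch, here is how a request payload with adaptive thinking might be assembled. The `thinking` field shape and the model ID follow the article; the surrounding request structure and the `buildRequest` helper are illustrative assumptions, not official SDK code.

```typescript
// Illustrative only: builds a request body with adaptive thinking.
// The `thinking` shape and model ID come from the article; the rest
// (helper name, field layout) is an assumption for demonstration.
type Effort = "high" | "medium" | "low";

interface ThinkingConfig {
  type: "adaptive";
  effort?: Effort;
}

function buildRequest(prompt: string, effort?: Effort) {
  // Omitting `effort` leaves the model at its default ("high").
  const thinking: ThinkingConfig =
    effort === undefined ? { type: "adaptive" } : { type: "adaptive", effort };
  return {
    model: "claude-opus-4-6",
    max_tokens: 4096,
    thinking,
    messages: [{ role: "user" as const, content: prompt }],
  };
}

const simple = buildRequest("Rename this variable", "low");
const hard = buildRequest("Refactor this authentication system");
```

A router in front of the API could set `effort` per request class: `"low"` for trivial edits, default `"high"` for anything that touches system design.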

Agent Teams

Agent Teams launched alongside Opus 4.6 in Claude Code. Instead of a single agent working through tasks sequentially, a lead Claude Code session spawns multiple independent teammate instances. Each teammate has its own full context window and can read files, write code, run tests, and report back.

Parallel execution

Multiple teammates work on different files or features simultaneously. No shared context window bottleneck. Each agent gets full 200K (or 1M beta) context.

Task coordination

The lead assigns tasks, teammates execute independently, and results converge through a shared task list. Teammates can challenge each other's findings.

Real-world scale

16 Claude agents wrote a 100K-line C compiler in Rust that compiles the Linux kernel 6.9, passing 99% of GCC torture tests. API cost: approximately $20K.

Agent Teams change the economics of AI-assisted development. Instead of one agent bottlenecked on context and sequential reasoning, you get parallel workstreams that each operate at full Opus-level capability. The constraint shifts from model intelligence to task decomposition quality.

Context and Output

200K Standard, 1M Beta

Opus 4.6 has a 200K token context window by default. The 1M token context window is available in beta by adding the context-1m-2025-08-07 header to API requests. When using the 1M window, input tokens beyond 200K cost $10/M (vs $5/M standard) and output tokens cost $37.50/M (vs $25/M).

128K Output Tokens

Opus 4.6 doubles the max output from 64K to 128K tokens. This is significant for code generation tasks where the model needs to produce complete files, multi-file changes, or detailed implementation plans in a single response.

Context Compaction

A new server-side feature. When a conversation approaches the context window limit, the API automatically summarizes earlier parts of the conversation instead of truncating. This enables effectively infinite conversations without custom truncation logic. Available in beta.

Enabling 1M context and compaction

// 1M context window (beta)
headers: {
  "anthropic-beta": "context-1m-2025-08-07"
}

// Context compaction (beta)
// Automatic: API summarizes older messages
// when approaching the window limit.
// No client-side code needed.

Pricing

Opus 4.6 costs 67% less than the Opus 4.1 generation ($15/$75) while outperforming it on every benchmark. Multiple pricing tiers let you trade latency for cost.

| Tier | Input / 1M tokens | Output / 1M tokens | Use case |
|---|---|---|---|
| Standard | $5.00 | $25.00 | Default API usage |
| Extended context (>200K) | $10.00 | $37.50 | Requests using 1M beta window |
| Fast mode | $30.00 | $150.00 | Latency-sensitive production |
| Batch API | $2.50 | $12.50 | Async, non-urgent workloads |
| Prompt caching | $0.50 (cached) | $25.00 | Repeated prefixes (up to 90% savings) |
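To make the tiers concrete, here is a small cost estimator built from the rates above. The per-million-token figures are the article's; the function itself is just illustrative arithmetic, and it simplifies by applying one tier's rate uniformly to all tokens in a request (ignoring prompt caching and the extended-context split).

```typescript
// Cost estimator using the per-million-token rates from the table above.
// Simplification: each tier's rate applies uniformly to the whole request.
const RATES = {
  standard: { input: 5.0, output: 25.0 },
  extendedContext: { input: 10.0, output: 37.5 },
  fast: { input: 30.0, output: 150.0 },
  batch: { input: 2.5, output: 12.5 },
} as const;

type Tier = keyof typeof RATES;

function estimateCostUSD(tier: Tier, inputTokens: number, outputTokens: number): number {
  const r = RATES[tier];
  return (inputTokens * r.input + outputTokens * r.output) / 1_000_000;
}

// Example: 100K input + 8K output.
const standardCost = estimateCostUSD("standard", 100_000, 8_000); // 0.50 + 0.20 = $0.70
const batchCost = estimateCostUSD("batch", 100_000, 8_000);       // $0.35, half of standard
```

The same 100K/8K request costs $4.20 in fast mode, which is why fast mode only makes sense for genuinely latency-sensitive paths.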

Subscription Access

Opus 4.6 is available through Claude Pro ($20/month) with usage limits, and Claude Max ($100-200/month) with higher or unlimited usage. API access is pay-per-token with no subscription required.

Opus 4.6 vs Sonnet 4.6

Sonnet 4.6 costs 40% less than Opus 4.6 ($3/$15 versus $5/$25 per million tokens). The question is when the capability gap justifies the premium.

| Dimension | Opus 4.6 | Sonnet 4.6 |
|---|---|---|
| SWE-bench Verified | 80.8% | 79.6% (1.2 points lower) |
| GPQA Diamond | 91.3% | 74.1% (17 points lower) |
| ARC-AGI-2 | 68.8% | N/A |
| Max output | 128K tokens | 64K tokens |
| Input cost | $5 / 1M tokens | $3 / 1M tokens |
| Output cost | $25 / 1M tokens | $15 / 1M tokens |
| Speed | Slower | Faster (better for latency-sensitive) |
| Agent Teams | Lead or teammate | Teammate only |

The hybrid approach

The standard 2026 pattern: route 80-90% of requests to Sonnet 4.6 (fast, cheap), and escalate to Opus 4.6 only for tasks requiring deep reasoning, multi-agent coordination, or long output generation. For pure coding tasks, Sonnet trails Opus by only 1.2 points on SWE-bench. For scientific reasoning (GPQA Diamond), the 17-point gap makes Opus the only serious option.
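A minimal routing sketch of that hybrid pattern might look like the following. The task fields, thresholds, and the Sonnet model ID (`claude-sonnet-4-6`) are assumptions for illustration, not an official API.

```typescript
// Hypothetical router for the hybrid Sonnet/Opus pattern described above.
// Task fields, thresholds, and the Sonnet model ID are assumptions.
interface Task {
  prompt: string;
  deepReasoning?: boolean;       // scientific / expert-domain analysis
  expectedOutputTokens?: number; // long generations need Opus's 128K cap
  multiAgentLead?: boolean;      // Agent Teams needs Opus as the lead
}

function pickModel(task: Task): string {
  if (task.multiAgentLead) return "claude-opus-4-6";
  if (task.deepReasoning) return "claude-opus-4-6";
  if ((task.expectedOutputTokens ?? 0) > 64_000) return "claude-opus-4-6";
  // Default path: Sonnet handles the 80-90% of routine requests.
  return "claude-sonnet-4-6";
}

const routine = pickModel({ prompt: "Add a unit test for parseDate" });
const science = pickModel({ prompt: "Model this reaction kinetics problem", deepReasoning: true });
```

In practice the hard part is the classification itself; teams often let a cheap first-pass model (or simple heuristics on prompt length and keywords) set these flags.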

Best Use Cases for Opus 4.6

Multi-agent coding workflows

Agent Teams with parallel teammates. Large refactors, multi-file changes, full-project scaffolding. The lead/teammate architecture requires Opus as the lead agent.

Complex debugging

Tracing bugs across large codebases where the issue spans multiple files and requires reasoning about system-level interactions. Opus's deeper reasoning handles cascading failure modes better.

Expert-level reasoning

Scientific analysis, legal document review, financial modeling. The 17-point GPQA Diamond gap over Sonnet means Opus handles expert-domain tasks at a different level.

Long output generation

128K output tokens for complete file generation, detailed implementation plans, or comprehensive code reviews. Sonnet caps at 64K.

When Sonnet 4.6 Is Enough

Standard code completion, simple refactors, test generation, documentation, code review for small changes, and most day-to-day development tasks. The 1.2-point SWE-bench gap is not worth the roughly 1.7x cost for routine work.

Frequently Asked Questions

What is Opus 4.6?

Opus 4.6 (claude-opus-4-6) is Anthropic's most capable AI model, released February 5, 2026. It is designed for coding, agentic workflows, and complex reasoning. It powers Claude Code and is the first Claude model to support adaptive thinking and Agent Teams.

What are Opus 4.6's benchmark scores?

80.8% SWE-bench Verified, 68.8% ARC-AGI-2, 91.3% GPQA Diamond. It leads Terminal-Bench 2.0 and Humanity's Last Exam. On GDPval-AA (economically valuable knowledge work), it outperforms GPT-5.2 by 144 Elo points.

How much does Opus 4.6 cost?

$5 per million input tokens, $25 per million output tokens at standard rates. Fast mode is $30/$150. Batch API saves 50%. Prompt caching saves up to 90% on input. Extended context beyond 200K costs $10/$37.50.

What is adaptive thinking?

A new reasoning mode where the model dynamically decides how much to think per request. At high effort (default), it almost always thinks. At low, it skips reasoning for simple tasks. Set via thinking: { type: "adaptive" } in the API.

What are Agent Teams?

A Claude Code feature where a lead session spawns independent teammate sessions that work in parallel. Each teammate has its own full context window. The lead coordinates task assignment and result synthesis. Designed for large codebases and complex multi-file changes.

What is the context window?

200K tokens by default. 1M tokens available in beta via the context-1m-2025-08-07 header. Context compaction (also beta) automatically summarizes older conversation parts to enable effectively infinite sessions.

Is Opus 4.6 better than GPT-5.2 for coding?

On SWE-bench Verified, Opus 4.6 (80.8%) edges GPT-5.2 (80.0%). On ARC-AGI-2, the gap is larger: 68.8% vs 54.2%. On GPQA Diamond, GPT-5.2 Pro leads at 93.2% vs 91.3%. For coding specifically, both are within statistical noise of each other. The decision usually comes down to ecosystem (Claude Code vs Codex) and pricing.

Should I use Opus 4.6 or Sonnet 4.6?

Use Sonnet 4.6 for 80-90% of tasks. It scores within 1.2 points on coding benchmarks at roughly 40% lower cost. Escalate to Opus for complex reasoning (17-point GPQA gap), multi-agent coordination (Agent Teams), or when you need 128K output tokens.

Related Articles

Build Faster with Opus 4.6 and WarpGrep

WarpGrep is an agentic code search MCP server. Connect it to Claude Code running Opus 4.6, and your agent finds the right code in sub-6 seconds instead of burning context on grep loops. 8 parallel tool calls per turn, 4 turns, done.
