Every published benchmark score for Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5 in one place. Comparison tables against GPT-5.3-Codex, Gemini 3.1 Pro, DeepSeek, and the rest of the frontier.
Claude Model Overview
Anthropic ships three Claude tiers. Opus is the flagship for hard tasks. Sonnet is the workhorse that closes within a few points of Opus on most benchmarks at 1/5 the cost. Haiku is the speed tier for high-volume and latency-sensitive work.
| Model | Released | Context | Input Price | Output Price |
|---|---|---|---|---|
| Claude Opus 4.6 | Feb 2026 | 1M tokens | $15/M | $75/M |
| Claude Sonnet 4.6 | Feb 2026 | 1M tokens | $3/M | $15/M |
| Claude Haiku 4.5 | Oct 2025 | 200K tokens | $0.80/M | $4/M |
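The pricing above translates to per-request cost with simple arithmetic. A minimal sketch (list prices from the table; the token counts are illustrative):

```python
# Per-request cost from the published per-million-token prices above.
PRICES = {  # (input $/M tokens, output $/M tokens)
    "opus-4.6":   (15.00, 75.00),
    "sonnet-4.6": ( 3.00, 15.00),
    "haiku-4.5":  ( 0.80,  4.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at list prices."""
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# A typical agentic turn: 40K tokens in, 2K tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 40_000, 2_000):.4f}")
```

At that request shape, the Opus-to-Haiku spread is roughly 19x, which is why tier choice dominates cost at volume.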
SWE-bench Verified Scores
SWE-bench Verified contains 500 human-validated Python tasks from real GitHub repositories (Django, Matplotlib, Scikit-learn, etc.). Each task is a real bug report or feature request paired with tests. Models are scored on the percentage of tasks they resolve correctly.
Contamination caveat
OpenAI confirmed that every frontier model shows training data contamination on SWE-bench Verified. They stopped reporting Verified scores and recommend SWE-bench Pro instead. The scores below are still useful for relative comparison, but treat the absolute numbers with skepticism.
| Rank | Model | Score | Provider |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 80.9% | Anthropic |
| 2 | Claude Opus 4.6 | 80.8% | Anthropic |
| 3 | Gemini 3.1 Pro | 80.6% | Google |
| 4 | MiniMax M2.5 | 80.2% | MiniMax |
| 5 | GPT-5.2 | 80.0% | OpenAI |
| 6 | Claude Sonnet 4.6 | 79.6% | Anthropic |
| 7 | Sonar Foundation Agent | 79.2% | Sonar |
| 8 | Gemini 3 Flash | 78.0% | Google |
| 9 | GLM-5 | 77.8% | Zhipu AI |
| 10 | Claude Sonnet 4.5 | 77.2% | Anthropic |
| 11 | Kimi K2.5 | 76.8% | Moonshot |
| 12 | Gemini 3 Pro | 76.2% | Google |
| 13 | GPT-5.1 | 74.9% | OpenAI |
| 14 | Grok 4 | 73.5% | xAI |
| 15 | Claude Haiku 4.5 | 73.3% | Anthropic |
| 16 | DeepSeek V3.2 | 73.0% | DeepSeek |
| 17 | Claude Sonnet 4 | 72.7% | Anthropic |
Scores are self-reported by model providers. Scaffold and harness differences affect results. Source: aggregated from swebench.com and provider announcements.
Six of the top 17 entries are Claude models. Opus 4.5 and 4.6 trade the #1 and #2 spots with a 0.1-point margin. Sonnet 4.6 at #6 (79.6%) costs $3/$15 per million tokens, undercutting every model ranked above it except Gemini 3.1 Pro. Haiku 4.5 at #15 (73.3%) beats GPT-5.1 and most open-weight models while costing $0.80/$4.00.
SWE-bench Pro Scores
SWE-bench Pro is Scale AI's harder benchmark: 1,865 multi-language tasks requiring an average of 107 lines across 4.1 files. It resists contamination through GPL licensing and proprietary codebases.
SEAL Leaderboard (Standardized Scaffolding)
The SEAL leaderboard uses Scale AI's unified scaffolding with a 250-turn limit. This isolates model capability by holding the agent framework constant.
| Rank | Model | Score | CI |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 45.9% | ±3.60 |
| 2 | Claude Sonnet 4.5 | 43.6% | ±3.60 |
| 3 | Gemini 3 Pro | 43.3% | ±3.60 |
| 4 | Claude Sonnet 4 | 42.7% | ±3.59 |
| 5 | GPT-5 (High) | 41.8% | ±3.49 |
| 6 | GPT-5.2 Codex | 41.0% | ±3.57 |
| 7 | Claude Haiku 4.5 | 39.5% | ±3.55 |
| 8 | Qwen3 Coder 480B | 38.7% | ±3.55 |
Claude takes four of the top seven slots. Opus 4.5 leads at 45.9%, a 2.3-point gap over the next model. Haiku 4.5 at 39.5% is a statistical tie with GPT-5.2 Codex (the 41.0% score falls within the overlapping confidence intervals) at a fraction of the per-token price.
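The CI column makes "statistical tie" checkable: two SEAL scores are distinguishable only when the gap exceeds the combined intervals. A rough sketch using the table's numbers (a conservative overlap check, not a formal significance test):

```python
def intervals_overlap(score_a: float, ci_a: float,
                      score_b: float, ci_b: float) -> bool:
    """True if the two scores' confidence intervals overlap,
    i.e. the gap is not statistically meaningful."""
    return abs(score_a - score_b) <= ci_a + ci_b

# Haiku 4.5 (39.5 ±3.55) vs GPT-5.2 Codex (41.0 ±3.57): a tie.
print(intervals_overlap(39.5, 3.55, 41.0, 3.57))  # True

# Opus 4.5 (45.9 ±3.60) vs Qwen3 Coder 480B (38.7 ±3.55): a real gap.
print(intervals_overlap(45.9, 3.60, 38.7, 3.55))  # False
```

By this check, most adjacent rows in the SEAL table are ties; only gaps of roughly 7+ points clear the intervals.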
Agent Systems (Custom Scaffolding)
Agent systems bring their own frameworks, including specialized context retrieval, longer turn limits, and tool access. These scores are not directly comparable to SEAL scores.
| Agent | Base Model | Score |
|---|---|---|
| GPT-5.3-Codex (CLI) | GPT-5.3-Codex | 57.0% |
| Claude Code | Opus 4.5 | 55.4% |
| Auggie | Opus 4.5 | 51.8% |
| Cursor | Opus 4.5 | 50.2% |
The gap between Auggie (51.8%) and Claude Code (55.4%), both running Opus 4.5, shows that scaffolding matters on SWE-bench Pro. The model matters, but the agent framework matters almost as much.
WarpGrep impact
In Morph internal benchmarks, adding WarpGrep v2 as a search subagent lifted Opus 4.6 from 55.4% to 57.5% on SWE-bench Pro while making it 15.6% cheaper and 28% faster. WarpGrep is an RL-trained search model that runs in its own context window and issues up to 8 parallel tool calls per turn.
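The fan-out pattern described above (a search subagent issuing up to 8 parallel tool calls per turn) can be sketched as follows. Everything here is illustrative: `grep_repo` is a hypothetical stand-in for a search tool call, not the Morph or WarpGrep API.

```python
# Illustrative only: the parallel fan-out pattern a search subagent uses.
# grep_repo is a hypothetical stand-in, not the actual Morph/WarpGrep API.
from concurrent.futures import ThreadPoolExecutor

def grep_repo(query: str) -> list[str]:
    """Stand-in for one search tool call; returns matching file paths."""
    # A real subagent would query a code-search index here.
    return [f"src/{query}.py"]

def parallel_search(queries: list[str], max_parallel: int = 8) -> list[str]:
    """Issue up to `max_parallel` tool calls in one turn, merge the results."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = pool.map(grep_repo, queries[:max_parallel])
    return sorted({path for hits in results for path in hits})

print(parallel_search(["auth", "billing", "cache"]))
```

Running the searches in a separate context window means the main agent only sees the merged result list, not the raw tool output, which is where the token savings come from.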
Terminal-Bench 2.0 Scores
Terminal-Bench evaluates AI models on complex terminal tasks: system administration, environment management, debugging, and multi-step operations in a live terminal. Unlike SWE-bench, which focuses on code changes, Terminal-Bench tests whether a model can operate a computer through a command line.
| Rank | Model | Score |
|---|---|---|
| 1 | GPT-5.3-Codex | 77.3% |
| 2 | Gemini 3.1 Pro | 77.3% |
| 3 | Claude Opus 4.6 | 65.4% |
This is one benchmark where Claude trails. GPT-5.3-Codex and Gemini 3.1 Pro both score 77.3%, 12 points above Opus 4.6. Terminal-Bench rewards fast iteration and tool-use efficiency. The models that perform best here tend to issue shorter, more targeted commands rather than planning long sequences.
HumanEval & MBPP Scores
HumanEval (164 problems) and MBPP (974 problems) are the original code generation benchmarks. They test isolated function-level coding: given a docstring, generate a correct implementation. Every frontier model now scores 90%+ on HumanEval, which limits its ability to differentiate.
| Model | HumanEval | Notes |
|---|---|---|
| DeepSeek R1 | 96.1% | Highest reported |
| Claude Opus 4.6 | 95.0% | |
| GPT-5.2 | 95.0% | |
| Codestral 25.01 | 86.6% | Open-weight |
HumanEval is largely saturated at the frontier. The 1-point gap between DeepSeek R1 (96.1%) and Claude Opus 4.6 (95.0%) is not meaningful in practice. For distinguishing between top models, SWE-bench and Terminal-Bench are more informative because they test multi-file, real-world tasks rather than isolated functions.
Beyond HumanEval
EvalPlus extends HumanEval with 80x more test cases per problem, reducing false positives from solutions that pass narrow tests but fail on edge cases. HumanEval Pro tests self-invoking code generation, where even o1-mini drops from 96.2% to 76.2%. BigCodeBench adds realistic library usage and API calls. If you need to evaluate models on code generation specifically, these are better tools than raw HumanEval.
Reasoning & Knowledge Benchmarks
Coding is not just pattern matching. Hard bugs require reasoning about state, constraints, and interactions across systems. These benchmarks measure the reasoning capability that underpins coding performance.
| Benchmark | Score | What It Measures |
|---|---|---|
| GPQA Diamond | 91.3% | Graduate-level science (physics, chemistry, biology) validated by domain experts |
| ARC-AGI-2 | 68.8% | Abstract reasoning and novel pattern recognition |
| MMMU Pro | 77.3% | Multimodal reasoning across academic disciplines (with tools) |
| BigLaw Bench | 90.2% | Complex legal reasoning and contract analysis |
| Humanity's Last Exam | Top score | Expert-level questions across all domains |
| MRCR v2 | 76.0% | Long-context retrieval (vs 18.5% for Opus 4.5) |
Sonnet 4.6 Reasoning
| Benchmark | Score | Notes |
|---|---|---|
| ARC-AGI-2 | 58.3% | 4.3x improvement over Sonnet 4.5 (13.6%) |
| GPQA Diamond | 74.1% | Opus 4.6 leads at 91.3% |
| SWE-bench Verified | 79.6% | Within 1.2 points of Opus 4.6 |
| OSWorld-Verified | 72.5% | Nearly matches Opus 4.6 on computer use |
| MCP-Atlas (tool use) | 61.3% | Ahead of Opus 4.6 (60.3%) |
| Finance Agent | 63.3% | Office productivity leader at 1633 Elo |
Sonnet 4.6's ARC-AGI-2 score (58.3%) is the standout: a 4.3x leap over Sonnet 4.5 (13.6%), the largest single-generation gain on this benchmark. It also edges ahead of Opus 4.6 on MCP-Atlas, a tool-use benchmark, suggesting Sonnet 4.6 has been specifically tuned for agentic tool calling.
Claude vs GPT-5 vs Gemini: Head-to-Head
No single model wins every benchmark. Claude leads on SWE-bench Verified and GPQA. GPT-5.3-Codex leads on Terminal-Bench and SWE-bench Pro (with custom scaffolding). Gemini 3.1 Pro matches GPT-5.3 on Terminal-Bench and ranks just behind Claude on the SEAL leaderboard.
| Benchmark | Claude Opus 4.6 | GPT-5.3-Codex | Gemini 3.1 Pro |
|---|---|---|---|
| SWE-bench Verified | 80.8% | 80.0%* | 80.6% |
| SWE-bench Pro (SEAL) | 45.9%** | 41.8%*** | 43.3% |
| SWE-bench Pro (agent) | 57.5%**** | 57.0% | N/A |
| Terminal-Bench 2.0 | 65.4% | 77.3% | 77.3% |
| HumanEval | 95.0% | 95.0% | N/A |
| GPQA Diamond | 91.3% | N/A | N/A |
| ARC-AGI-2 | 68.8% | N/A | N/A |
| OSWorld-Verified | 72.7% | 64.7% | N/A |
*GPT-5.2 score (5.3 not separately reported on Verified). **Opus 4.5 score on SEAL. ***GPT-5 High on SEAL. ****With WarpGrep v2, Morph internal.
The takeaway: at the frontier, model choice depends on task shape. For multi-file bug fixing and code review, Claude Opus 4.6 has a slight edge. For terminal-heavy development with fast iteration, GPT-5.3-Codex dominates. For cost-sensitive production workloads, Gemini 3.1 Pro competes on most benchmarks at lower cost.
Cost-Adjusted Comparison
| Model | Input | Output | SWE-bench Verified |
|---|---|---|---|
| Claude Opus 4.6 | $15 | $75 | 80.8% |
| Claude Sonnet 4.6 | $3 | $15 | 79.6% |
| GPT-5.3-Codex | $10 | $40 | ~80.0% |
| Gemini 3.1 Pro | $2.50 | $15 | 80.6% |
| Claude Haiku 4.5 | $0.80 | $4 | 73.3% |
| DeepSeek V3.2 | $0.55 | $2.19 | 73.0% |
Sonnet 4.6 and Gemini 3.1 Pro are the most cost-effective frontier options. Both score within 1.2 points of Opus 4.6 on SWE-bench Verified at roughly 1/5 the token cost. For budget-constrained applications, Haiku 4.5 and DeepSeek V3.2 offer 73%+ SWE-bench performance at less than $5/M output tokens.
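One way to make "cost-effective" concrete is dollars of output price per SWE-bench Verified point: a crude value metric that ignores input pricing and per-task token efficiency, but useful for a first cut. A sketch over the table above:

```python
# Crude value metric: output price ($/M tokens) per SWE-bench Verified point.
# Numbers from the cost-adjusted table above; lower is better.
models = {
    "Claude Opus 4.6":   (75.00, 80.8),
    "Claude Sonnet 4.6": (15.00, 79.6),
    "GPT-5.3-Codex":     (40.00, 80.0),
    "Gemini 3.1 Pro":    (15.00, 80.6),
    "Claude Haiku 4.5":  ( 4.00, 73.3),
    "DeepSeek V3.2":     ( 2.19, 73.0),
}

for name, (out_price, score) in sorted(models.items(),
                                       key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name}: ${out_price / score:.3f} per point")
```

By this measure DeepSeek V3.2 and Haiku 4.5 lead by a wide margin, with Gemini 3.1 Pro and Sonnet 4.6 nearly tied in the middle and Opus 4.6 paying a large premium for its last point of accuracy.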
Historical Progression: Claude SWE-bench Scores
Claude's SWE-bench Verified scores have more than doubled in two years.
| Model | Release | SWE-bench Verified | Jump |
|---|---|---|---|
| Claude 3 Opus | Mar 2024 | 33.4% | Baseline |
| Claude 3.5 Sonnet | Jun 2024 | 49.0% | +15.6 |
| Claude 3.7 Sonnet | Feb 2025 | 62.3% | +13.3 |
| Claude Sonnet 4.5 | Oct 2025 | 77.2% | +14.9 |
| Claude Opus 4.5 | Oct 2025 | 80.9% | N/A |
| Claude Sonnet 4.6 | Feb 2026 | 79.6% | +2.4 vs Sonnet 4.5 |
| Claude Opus 4.6 | Feb 2026 | 80.8% | -0.1 vs Opus 4.5 |
The trajectory: 33.4% to 80.8% in two years. The most dramatic gains came between Claude 3 Opus and Sonnet 4.5, where each generation added 13-16 points. The 4.5-to-4.6 jump is smaller (-0.1 to +2.4 points on Verified) because Anthropic focused the 4.6 generation on other capabilities: 1M context, long-context retrieval (MRCR v2 jumped from 18.5% to 76%), and agentic tool use.
SWE-bench Verified may also be approaching a ceiling. Contamination, test flaws, and benchmark saturation all compress the gap between frontier models. On SWE-bench Pro, which resists these issues, the spread is much wider (45.9% at the top of the full leaderboard versus 15.6% at the bottom), leaving more room for differentiation.
Which Benchmark Matters Most for Real Coding?
Benchmarks measure different things. Choosing the wrong one gives you a misleading picture of model capability.
| Benchmark | What It Tests | When to Use It |
|---|---|---|
| SWE-bench Pro | Multi-file, multi-language bug fixing in real repos | Evaluating agents for production software engineering |
| SWE-bench Verified | Single-file Python bug fixing | Quick directional comparison (caveat: contaminated) |
| Terminal-Bench 2.0 | Terminal commands, sysadmin, env management | Evaluating DevOps and infrastructure agents |
| HumanEval / MBPP | Isolated function generation | Baseline sanity check (saturated at frontier) |
| GPQA Diamond | Graduate-level science reasoning | Evaluating reasoning depth for hard debugging |
| ARC-AGI-2 | Novel pattern recognition | Measuring generalization to unseen problem types |
For coding agents that will operate on real codebases, SWE-bench Pro is the best single benchmark. It tests the full pipeline: understanding a codebase, locating relevant files, reasoning about the fix, and implementing it across multiple files. The SEAL leaderboard's standardized scaffolding makes scores directly comparable.
Terminal-Bench matters if your use case involves infrastructure, deployment, or debugging in a terminal. HumanEval is a sanity check, not a differentiator. GPQA and ARC-AGI-2 matter if you need the model to reason about novel problems rather than apply known patterns.
No single number captures coding ability. A model that scores 80% on SWE-bench Verified but 15% on SWE-bench Pro (like some smaller models) is memorizing solutions, not solving problems. Look at the full picture.
Frequently Asked Questions
What is Claude Opus 4.6's SWE-bench Verified score?
Claude Opus 4.6 scores 80.8% on SWE-bench Verified. Opus 4.5 is marginally higher at 80.9%. Both were tested using a simple scaffold with bash and file editing tools.
What is Claude Opus 4.6's SWE-bench Pro score?
On the SEAL leaderboard, Claude Opus 4.5 leads at 45.9% with standardized scaffolding. When paired with WarpGrep v2, Opus 4.6 reaches 57.5% (Morph internal benchmark).
How does Claude compare to GPT-5 on coding benchmarks?
On SWE-bench Verified, Claude Opus 4.6 (80.8%) and GPT-5.3-Codex (~80.0%) are within 1 point. On SWE-bench Pro with custom scaffolding, GPT-5.3-Codex leads at 57%. On Terminal-Bench 2.0, GPT-5.3-Codex scores 77.3% vs. Opus 4.6's 65.4%. On GPQA Diamond, Opus 4.6 leads at 91.3%. Different benchmarks favor different models.
What is Claude Sonnet 4.6's SWE-bench score?
Sonnet 4.6 scores 79.6% on SWE-bench Verified, within 1.2 points of Opus 4.6 at 1/5 the cost ($3/$15 per million tokens vs. $15/$75).
What is Claude Haiku 4.5's SWE-bench score?
Haiku 4.5 scores 73.3% on SWE-bench Verified and 39.5% on SWE-bench Pro (SEAL). At $0.80/$4.00 per million tokens, it reaches 90% of Sonnet 4.5's performance in agentic coding evaluations.
Is SWE-bench Verified still a reliable benchmark?
Partially. OpenAI confirmed that every frontier model shows training data contamination on the dataset, and 59.4% of the hardest unsolved tasks have flawed tests. It still differentiates between weaker models and runs quickly, but SWE-bench Pro is a better measure of production readiness.
What does Terminal-Bench measure?
Terminal-Bench 2.0 evaluates AI models on complex terminal and system administration tasks: using a live terminal for environment management, debugging, and multi-step operations. GPT-5.3-Codex and Gemini 3.1 Pro lead at 77.3%, with Claude Opus 4.6 at 65.4%.
Which Claude model should I use for coding?
Opus 4.6 for maximum accuracy on hard, multi-file tasks. Sonnet 4.6 for everyday coding (79.6% SWE-bench at 1/5 Opus cost). Haiku 4.5 for high-volume tasks like code review or test generation (73.3% SWE-bench at 1/15 Opus cost). In multi-agent setups, pairing a cheaper model with a specialized search subagent like WarpGrep often outperforms using a more expensive model alone.
What is Claude Opus 4.6's GPQA Diamond score?
91.3%. GPQA Diamond contains graduate-level science questions validated by domain experts. This is one of the highest scores on this benchmark across all models.
Build with Claude + WarpGrep
WarpGrep v2 lifted every model it was paired with by 2+ points on SWE-bench Pro. It runs in its own context window, issues 8 parallel tool calls per turn, and makes your coding agent 15.6% cheaper and 28% faster. Available through the Morph API.