Claude Benchmarks (2026): Every Score for Opus, Sonnet & Haiku

March 5, 2026 · 1 min read

Every published benchmark score for Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5 in one place. Comparison tables against GPT-5.3-Codex, Gemini 3.1 Pro, DeepSeek, and the rest of the frontier.

Scores verified March 5, 2026

Claude Model Overview

Anthropic ships three Claude tiers. Opus is the flagship for hard tasks. Sonnet is the workhorse that closes within a few points of Opus on most benchmarks at 1/5 the cost. Haiku is the speed tier for high-volume and latency-sensitive work.

- Opus 4.6 SWE-bench Verified: 80.8%
- Sonnet 4.6 SWE-bench Verified: 79.6%
- Haiku 4.5 SWE-bench Verified: 73.3%
- Opus 4.6 GPQA Diamond: 91.3%

| Model | Released | Context | Input Price | Output Price |
|---|---|---|---|---|
| Claude Opus 4.6 | Feb 2026 | 1M tokens | $15/M | $75/M |
| Claude Sonnet 4.6 | Feb 2026 | 1M tokens | $3/M | $15/M |
| Claude Haiku 4.5 | Oct 2025 | 200K tokens | $0.80/M | $4/M |
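
To make the price gap concrete, here is a minimal cost sketch using the list prices above. The 20K-input / 2K-output request size is an illustrative assumption, not a measured agent trace.

```python
# Per-request cost at the list prices above.
# Token counts are illustrative assumptions, not measured traces.
PRICES = {  # (input $/M tokens, output $/M tokens)
    "opus-4.6": (15.00, 75.00),
    "sonnet-4.6": (3.00, 15.00),
    "haiku-4.5": (0.80, 4.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at list prices."""
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# A coding request with 20K input tokens and 2K output tokens:
for model in PRICES:
    print(f"{model}: ${request_cost(model, 20_000, 2_000):.3f}")
# opus-4.6: $0.450 | sonnet-4.6: $0.090 (exactly 1/5) | haiku-4.5: $0.024
```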

SWE-bench Verified Scores

SWE-bench Verified contains 500 human-validated Python tasks from real GitHub repositories (Django, Matplotlib, Scikit-learn, etc.). Each task is a real bug report or feature request paired with tests. Models are scored on the percentage of tasks they resolve correctly.
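
The scoring loop is simple in outline. A minimal sketch, assuming each task provides a repo checkout, the model's candidate patch, and the list of tests that must flip from failing to passing (the official harness also re-runs previously passing tests, which this sketch omits):

```python
import subprocess

def is_resolved(repo_dir: str, model_patch: str, fail_to_pass: list[str]) -> bool:
    """Apply the model's patch, then re-run the tests that failed
    before the fix; the task is resolved only if they all pass now."""
    applied = subprocess.run(
        ["git", "apply", "-"], input=model_patch, text=True, cwd=repo_dir
    )
    if applied.returncode != 0:
        return False  # patch does not apply cleanly: counts as unresolved
    tests = subprocess.run(["python", "-m", "pytest", *fail_to_pass], cwd=repo_dir)
    return tests.returncode == 0

# Reported score = resolved tasks / 500, as a percentage.
```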

Contamination caveat

OpenAI confirmed that every frontier model shows training data contamination on SWE-bench Verified. They stopped reporting Verified scores and recommend SWE-bench Pro instead. The scores below are still useful for relative comparison, but treat the absolute numbers with skepticism.

| Rank | Model | Score | Provider |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 80.9% | Anthropic |
| 2 | Claude Opus 4.6 | 80.8% | Anthropic |
| 3 | Gemini 3.1 Pro | 80.6% | Google |
| 4 | MiniMax M2.5 | 80.2% | MiniMax |
| 5 | GPT-5.2 | 80.0% | OpenAI |
| 6 | Claude Sonnet 4.6 | 79.6% | Anthropic |
| 7 | Sonar Foundation Agent | 79.2% | Sonar |
| 8 | Gemini 3 Flash | 78.0% | Google |
| 9 | GLM-5 | 77.8% | Zhipu AI |
| 10 | Claude Sonnet 4.5 | 77.2% | Anthropic |
| 11 | Kimi K2.5 | 76.8% | Moonshot |
| 12 | Gemini 3 Pro | 76.2% | Google |
| 13 | GPT-5.1 | 74.9% | OpenAI |
| 14 | Grok 4 | 73.5% | xAI |
| 15 | Claude Haiku 4.5 | 73.3% | Anthropic |
| 16 | DeepSeek V3.2 | 73.0% | DeepSeek |
| 17 | Claude Sonnet 4 | 72.7% | Anthropic |

Scores are self-reported by model providers. Scaffold and harness differences affect results. Source: aggregated from swebench.com and provider announcements.

Six of the top 17 entries are Claude models. Opus 4.5 and 4.6 trade the #1 and #2 spots with a 0.1-point margin. Sonnet 4.6 at #6 (79.6%) costs $3/$15 per million tokens, a fraction of the Opus-class pricing above it. Haiku 4.5 at #15 (73.3%) lands within 1.6 points of GPT-5.1 and ahead of DeepSeek V3.2 while costing $0.80/$4.00.

SWE-bench Pro Scores

SWE-bench Pro is Scale AI's harder benchmark: 1,865 multi-language tasks whose reference fixes average 107 changed lines across 4.1 files. It resists contamination through GPL licensing and proprietary codebases.

SEAL Leaderboard (Standardized Scaffolding)

The SEAL leaderboard uses Scale AI's unified scaffolding with a 250-turn limit. This isolates model capability by holding the agent framework constant.

| Rank | Model | Score | CI |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 45.9% | ±3.60 |
| 2 | Claude Sonnet 4.5 | 43.6% | ±3.60 |
| 3 | Gemini 3 Pro | 43.3% | ±3.60 |
| 4 | Claude Sonnet 4 | 42.7% | ±3.59 |
| 5 | GPT-5 (High) | 41.8% | ±3.49 |
| 6 | GPT-5.2 Codex | 41.0% | ±3.57 |
| 7 | Claude Haiku 4.5 | 39.5% | ±3.55 |
| 8 | Qwen3 Coder 480B | 38.7% | ±3.55 |

Claude takes four of the top seven slots. Opus 4.5 leads at 45.9%, a 2.3-point gap over the next model. Haiku 4.5 at 39.5% is statistically tied with GPT-5.2 Codex at 41.0% (the 1.5-point gap sits inside the overlapping confidence intervals) while costing a fraction as much per token.
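
The CI column is consistent with a standard normal-approximation interval for a pass rate. A quick check, assuming these are 95% intervals and that the public SEAL split has roughly 731 tasks (both assumptions, not stated in the table):

```python
import math

def ci_half_width(score_pct: float, n_tasks: int, z: float = 1.96) -> float:
    """Half-width, in percentage points, of a 95% normal-approximation
    confidence interval for a pass rate measured on n_tasks tasks."""
    p = score_pct / 100
    return z * math.sqrt(p * (1 - p) / n_tasks) * 100

# Opus 4.5 at 45.9% on an assumed 731-task split:
print(f"±{ci_half_width(45.9, 731):.2f}")  # ±3.61, matching the table's ±3.60
```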

Agent Systems (Custom Scaffolding)

Agent systems bring their own frameworks, including specialized context retrieval, longer turn limits, and tool access. These scores are not directly comparable to SEAL scores.

| Agent | Base Model | Score |
|---|---|---|
| GPT-5.3-Codex (CLI) | GPT-5.3-Codex | 57.0% |
| Claude Code | Opus 4.5 | 55.4% |
| Auggie | Opus 4.5 | 51.8% |
| Cursor | Opus 4.5 | 50.2% |

The gap between Auggie (51.8%) and Claude Code (55.4%), both running Opus 4.5, shows that scaffolding matters on SWE-bench Pro. The model matters, but the agent framework matters almost as much.

WarpGrep impact

In Morph internal benchmarks, adding WarpGrep v2 as a search subagent lifted Opus 4.6 from 55.4% to 57.5% on SWE-bench Pro while making it 15.6% cheaper and 28% faster. WarpGrep is an RL-trained search model that runs in its own context window and issues up to 8 parallel tool calls per turn.
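
The general pattern is easy to sketch. The code below is a hypothetical illustration of a search subagent fanning out up to 8 parallel tool calls in its own context; the names (`search_model`, `run_tool`) are placeholders, not the Morph API.

```python
import asyncio

MAX_PARALLEL_CALLS = 8  # the per-turn fan-out described above

async def search_turn(search_model, run_tool, query: str) -> str:
    """One subagent turn: the search model proposes tool calls, they run
    concurrently, and only a distilled summary returns to the main agent."""
    calls = await search_model.propose_calls(query)  # up to 8 per turn
    sem = asyncio.Semaphore(MAX_PARALLEL_CALLS)

    async def bounded(call):
        async with sem:
            return await run_tool(call)

    results = await asyncio.gather(*(bounded(c) for c in calls[:MAX_PARALLEL_CALLS]))
    # Raw tool output stays in the subagent's context window; the main
    # coding agent only sees the summary, which is what saves tokens.
    return await search_model.summarize(query, results)
```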

Terminal-Bench 2.0 Scores

Terminal-Bench evaluates AI models on complex terminal tasks: system administration, environment management, debugging, and multi-step operations in a live terminal. Unlike SWE-bench, which focuses on code changes, Terminal-Bench tests whether a model can operate a computer through a command line.

| Rank | Model | Score |
|---|---|---|
| 1 | GPT-5.3-Codex | 77.3% |
| 2 | Gemini 3.1 Pro | 77.3% |
| 3 | Claude Opus 4.6 | 65.4% |

This is one benchmark where Claude trails. GPT-5.3-Codex and Gemini 3.1 Pro both score 77.3%, 12 points above Opus 4.6. Terminal-Bench rewards fast iteration and tool-use efficiency. The models that perform best here tend to issue shorter, more targeted commands rather than planning long sequences.

HumanEval & MBPP Scores

HumanEval (164 problems) and MBPP (974 problems) are the original code generation benchmarks. They test isolated function-level coding: given a docstring, generate a correct implementation. Every frontier model now scores 90%+ on HumanEval, which limits its ability to differentiate.
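
For a sense of the task shape, here is the first HumanEval problem (lightly reformatted): the model sees the signature and docstring and must generate a body, which is then checked against hidden unit tests.

```python
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check if in given list of numbers, are any two numbers closer to
    each other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # One correct completion; HumanEval scores pass/fail on the tests,
    # not on style or efficiency (this is O(n^2)).
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False
```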

| Model | HumanEval | Notes |
|---|---|---|
| DeepSeek R1 | 96.1% | Highest reported |
| Claude Opus 4.6 | 95.0% | |
| GPT-5.2 | 95.0% | |
| Codestral 25.01 | 86.6% | Open-weight |

HumanEval is largely saturated at the frontier. The 1-point gap between DeepSeek R1 (96.1%) and Claude Opus 4.6 (95.0%) is not meaningful in practice. For distinguishing between top models, SWE-bench and Terminal-Bench are more informative because they test multi-file, real-world tasks rather than isolated functions.

Beyond HumanEval

EvalPlus extends HumanEval with 80x more test cases per problem, reducing false positives from solutions that pass narrow tests but fail on edge cases. HumanEval Pro tests self-invoking code generation, where even o1-mini drops from 96.2% to 76.2%. BigCodeBench adds realistic library usage and API calls. If you need to evaluate models on code generation specifically, these are better tools than raw HumanEval.
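
These suites report pass@k. For reference, the standard unbiased estimator from the original HumanEval paper (Chen et al., 2021), given n generations per problem of which c pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so a hit is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples per problem, 120 of them passing:
print(round(pass_at_k(200, 120, 1), 3))   # 0.6 (reduces to c/n when k=1)
print(round(pass_at_k(200, 120, 10), 4))  # 0.9999
```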

Reasoning & Knowledge Benchmarks

Coding is not just pattern matching. Hard bugs require reasoning about state, constraints, and interactions across systems. These benchmarks measure the reasoning capability that underpins coding performance.

| Benchmark | Opus 4.6 Score | What It Measures |
|---|---|---|
| GPQA Diamond | 91.3% | Graduate-level science (physics, chemistry, biology) validated by domain experts |
| ARC-AGI-2 | 68.8% | Abstract reasoning and novel pattern recognition |
| MMMU Pro | 77.3% | Multimodal reasoning across academic disciplines (with tools) |
| BigLaw Bench | 90.2% | Complex legal reasoning and contract analysis |
| Humanity's Last Exam | Top score | Expert-level questions across all domains |
| MRCR v2 | 76.0% | Long-context retrieval (vs 18.5% for Opus 4.5) |

Sonnet 4.6 Reasoning

| Benchmark | Score | Notes |
|---|---|---|
| ARC-AGI-2 | 58.3% | 4.3x improvement over Sonnet 4.5 (13.6%) |
| GPQA Diamond | 74.1% | Opus 4.6 leads at 91.3% |
| SWE-bench Verified | 79.6% | Within 1.2 points of Opus 4.6 |
| OSWorld-Verified | 72.5% | Nearly matches Opus 4.6 on computer use |
| MCP-Atlas (tool use) | 61.3% | Ahead of Opus 4.6 (60.3%) |
| Finance Agent | 63.3% | Office productivity leader at 1633 Elo |

Sonnet 4.6's ARC-AGI-2 score (58.3%) is the standout: a 4.3x leap over Sonnet 4.5 (13.6%), the largest single-generation gain on this benchmark. It also edges ahead of Opus 4.6 on MCP-Atlas, a tool-use benchmark, suggesting Sonnet 4.6 has been specifically tuned for agentic tool calling.

Claude vs GPT-5 vs Gemini: Head-to-Head

No single model wins every benchmark. Claude leads on SWE-bench Verified and GPQA. GPT-5.3-Codex leads on Terminal-Bench and SWE-bench Pro (with custom scaffolding). Gemini 3.1 Pro matches GPT-5.3 on Terminal-Bench and sits just behind the Claude models on the SEAL leaderboard.

| Benchmark | Claude Opus 4.6 | GPT-5.3-Codex | Gemini 3.1 Pro |
|---|---|---|---|
| SWE-bench Verified | 80.8% | 80.0%* | 80.6% |
| SWE-bench Pro (SEAL) | 45.9%** | 41.8%*** | 43.3% |
| SWE-bench Pro (agent) | 57.5%**** | 57.0% | N/A |
| Terminal-Bench 2.0 | 65.4% | 77.3% | 77.3% |
| HumanEval | 95.0% | 95.0% | N/A |
| GPQA Diamond | 91.3% | N/A | N/A |
| ARC-AGI-2 | 68.8% | N/A | N/A |
| OSWorld-Verified | 72.7% | 64.7% | N/A |

*GPT-5.2 score (5.3 not separately reported on Verified). **Opus 4.5 score on SEAL. ***GPT-5 High on SEAL. ****With WarpGrep v2, Morph internal.

The takeaway: at the frontier, model choice depends on task shape. For multi-file bug fixing and code review, Claude Opus 4.6 has a slight edge. For terminal-heavy development with fast iteration, GPT-5.3-Codex dominates. For cost-sensitive production workloads, Gemini 3.1 Pro competes on most benchmarks at lower cost.

Cost-Adjusted Comparison

| Model | Input ($/M) | Output ($/M) | SWE-bench Verified |
|---|---|---|---|
| Claude Opus 4.6 | $15 | $75 | 80.8% |
| Claude Sonnet 4.6 | $3 | $15 | 79.6% |
| GPT-5.3-Codex | $10 | $40 | ~80.0% |
| Gemini 3.1 Pro | $2.50 | $15 | 80.6% |
| Claude Haiku 4.5 | $0.80 | $4 | 73.3% |
| DeepSeek V3.2 | $0.55 | $2.19 | 73.0% |

Sonnet 4.6 and Gemini 3.1 Pro are the most cost-effective frontier options. Both score within 1.2 points of Opus 4.6 on SWE-bench Verified at roughly 1/5 the token cost. For budget-constrained applications, Haiku 4.5 and DeepSeek V3.2 offer 73%+ SWE-bench performance at less than $5/M output tokens.
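
One way to fold price and accuracy together: divide an assumed per-attempt cost by the resolve rate to get an expected cost per resolved task. The 60K-input / 5K-output budget per attempt below is a placeholder assumption, not a measured agent trace.

```python
# Expected cost per *resolved* task = cost per attempt / resolve rate.
MODELS = {  # (input $/M, output $/M, SWE-bench Verified resolve rate)
    "Claude Opus 4.6":   (15.00, 75.00, 0.808),
    "Claude Sonnet 4.6": (3.00,  15.00, 0.796),
    "Gemini 3.1 Pro":    (2.50,  15.00, 0.806),
    "Claude Haiku 4.5":  (0.80,   4.00, 0.733),
}
IN_TOK, OUT_TOK = 60_000, 5_000  # assumed tokens per attempt

for name, (inp, out, rate) in MODELS.items():
    attempt = IN_TOK / 1e6 * inp + OUT_TOK / 1e6 * out
    print(f"{name}: ${attempt / rate:.2f} per resolved task")
# Opus ~$1.58, Sonnet ~$0.32, Gemini ~$0.28, Haiku ~$0.09
```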

Historical Progression: Claude SWE-bench Scores

Claude's SWE-bench Verified scores have more than doubled in two years.

| Model | Release | SWE-bench Verified | Jump |
|---|---|---|---|
| Claude 3 Opus | Mar 2024 | 33.4% | Baseline |
| Claude 3.5 Sonnet | Jun 2024 | 49.0% | +15.6 |
| Claude 3.7 Sonnet | Feb 2025 | 62.3% | +13.3 |
| Claude Sonnet 4.5 | Oct 2025 | 77.2% | +14.9 |
| Claude Opus 4.5 | Oct 2025 | 80.9% | N/A |
| Claude Sonnet 4.6 | Feb 2026 | 79.6% | +2.4 vs Sonnet 4.5 |
| Claude Opus 4.6 | Feb 2026 | 80.8% | -0.1 vs Opus 4.5 |

The trajectory: 33.4% to 80.8% in two years. The most dramatic gains came between Claude 3 Opus and Sonnet 4.5, where each generation added 13-16 points. The 4.5-to-4.6 jump is smaller (+2.4 points for Sonnet, -0.1 for Opus) because Anthropic focused the 4.6 generation on other capabilities: 1M context, long-context retrieval (MRCR v2 jumped from 18.5% to 76%), and agentic tool use.

SWE-bench Verified may also be approaching a ceiling. Contamination, test flaws, and benchmark saturation all compress the gap between frontier models. On SWE-bench Pro, which resists these issues, the spread is much wider (45.9% vs. 15.6% between top and bottom), leaving more room for differentiation.

Which Benchmark Matters Most for Real Coding?

Benchmarks measure different things. Choosing the wrong one gives you a misleading picture of model capability.

| Benchmark | What It Tests | When to Use It |
|---|---|---|
| SWE-bench Pro | Multi-file, multi-language bug fixing in real repos | Evaluating agents for production software engineering |
| SWE-bench Verified | Single-file Python bug fixing | Quick directional comparison (caveat: contaminated) |
| Terminal-Bench 2.0 | Terminal commands, sysadmin, env management | Evaluating DevOps and infrastructure agents |
| HumanEval / MBPP | Isolated function generation | Baseline sanity check (saturated at frontier) |
| GPQA Diamond | Graduate-level science reasoning | Evaluating reasoning depth for hard debugging |
| ARC-AGI-2 | Novel pattern recognition | Measuring generalization to unseen problem types |

For coding agents that will operate on real codebases, SWE-bench Pro is the best single benchmark. It tests the full pipeline: understanding a codebase, locating relevant files, reasoning about the fix, and implementing it across multiple files. The SEAL leaderboard's standardized scaffolding makes scores directly comparable.

Terminal-Bench matters if your use case involves infrastructure, deployment, or debugging in a terminal. HumanEval is a sanity check, not a differentiator. GPQA and ARC-AGI-2 matter if you need the model to reason about novel problems rather than apply known patterns.

No single number captures coding ability. A model that scores 80% on SWE-bench Verified but 15% on SWE-bench Pro (like some smaller models) is memorizing solutions, not solving problems. Look at the full picture.

Frequently Asked Questions

What is Claude Opus 4.6's SWE-bench Verified score?

Claude Opus 4.6 scores 80.8% on SWE-bench Verified. Opus 4.5 is marginally higher at 80.9%. Both were tested using a simple scaffold with bash and file editing tools.

What is Claude Opus 4.6's SWE-bench Pro score?

On the SEAL leaderboard, Claude Opus 4.5 leads at 45.9% with standardized scaffolding. When paired with WarpGrep v2, Opus 4.6 reaches 57.5% (Morph internal benchmark).

How does Claude compare to GPT-5 on coding benchmarks?

On SWE-bench Verified, Claude Opus 4.6 (80.8%) and GPT-5.3-Codex (~80.0%) are within 1 point. On SWE-bench Pro with custom scaffolding, GPT-5.3-Codex leads at 57%. On Terminal-Bench 2.0, GPT-5.3-Codex scores 77.3% vs. Opus 4.6's 65.4%. On GPQA Diamond, Opus 4.6 leads at 91.3%. Different benchmarks favor different models.

What is Claude Sonnet 4.6's SWE-bench score?

Sonnet 4.6 scores 79.6% on SWE-bench Verified, within 1.2 points of Opus 4.6 at 1/5 the cost ($3/$15 per million tokens vs. $15/$75).

What is Claude Haiku 4.5's SWE-bench score?

Haiku 4.5 scores 73.3% on SWE-bench Verified and 39.5% on SWE-bench Pro (SEAL). At $0.80/$4.00 per million tokens, it reaches 90% of Sonnet 4.5's performance in agentic coding evaluations.

Is SWE-bench Verified still a reliable benchmark?

Partially. OpenAI confirmed that every frontier model shows training data contamination on the dataset, and 59.4% of the hardest unsolved tasks have flawed tests. It still differentiates between weaker models and runs quickly, but SWE-bench Pro is a better measure of production readiness.

What does Terminal-Bench measure?

Terminal-Bench 2.0 evaluates AI models on complex terminal and system administration tasks: using a live terminal for environment management, debugging, and multi-step operations. GPT-5.3-Codex and Gemini 3.1 Pro lead at 77.3%, with Claude Opus 4.6 at 65.4%.

Which Claude model should I use for coding?

Opus 4.6 for maximum accuracy on hard, multi-file tasks. Sonnet 4.6 for everyday coding (79.6% SWE-bench at 1/5 Opus cost). Haiku 4.5 for high-volume tasks like code review or test generation (73.3% SWE-bench at 1/15 Opus cost). In multi-agent setups, pairing a cheaper model with a specialized search subagent like WarpGrep often outperforms using a more expensive model alone.

What is Claude Opus 4.6's GPQA Diamond score?

91.3%. GPQA Diamond contains graduate-level science questions validated by domain experts. This is one of the highest scores on this benchmark across all models.

Build with Claude + WarpGrep

WarpGrep v2 lifted every model it was paired with by 2+ points on SWE-bench Pro. It runs in its own context window, issues up to 8 parallel tool calls per turn, and makes your coding agent 15.6% cheaper and 28% faster. Available through the Morph API.