Every published benchmark score for Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5 in one place. Comparison tables against GPT-5.3-Codex, Gemini 3.1 Pro, DeepSeek, and the rest of the frontier.
Claude Model Overview
Anthropic ships three Claude tiers. Opus is the flagship for hard tasks. Sonnet is the workhorse that closes within a few points of Opus on most benchmarks at 1/5 the cost. Haiku is the speed tier for high-volume and latency-sensitive work.
| Model | Released | Context | Input Price | Output Price |
|---|---|---|---|---|
| Claude Opus 4.6 | Feb 2026 | 1M tokens | $15/M | $75/M |
| Claude Sonnet 4.6 | Feb 2026 | 1M tokens | $3/M | $15/M |
| Claude Haiku 4.5 | Oct 2025 | 200K tokens | $0.80/M | $4/M |
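The pricing above translates to per-request cost with simple arithmetic. A minimal sketch (list prices from the table; the token counts are illustrative):

```python
# Per-request cost from the published per-million-token prices above.
PRICES = {  # (input $/M tokens, output $/M tokens)
    "opus-4.6":   (15.00, 75.00),
    "sonnet-4.6": ( 3.00, 15.00),
    "haiku-4.5":  ( 0.80,  4.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at list prices."""
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# A typical agentic turn: 40K tokens in, 2K tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 40_000, 2_000):.4f}")
```

At that request shape, the Opus-to-Haiku spread is roughly 19x, which is why tier choice dominates cost at volume.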
SWE-bench Verified Scores
SWE-bench Verified contains 500 human-validated Python tasks from real GitHub repositories (Django, Matplotlib, Scikit-learn, etc.). Each task is a real bug report or feature request paired with tests. Models are scored on the percentage of tasks they resolve correctly.
Contamination caveat
OpenAI confirmed that every frontier model shows training data contamination on SWE-bench Verified. They stopped reporting Verified scores and recommend SWE-bench Pro instead. The scores below are still useful for relative comparison, but treat the absolute numbers with skepticism.
| Rank | Model | Score | Provider |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 80.9% | Anthropic |
| 2 | Claude Opus 4.6 | 80.8% | Anthropic |
| 3 | Gemini 3.1 Pro | 80.6% | Google |
| 4 | MiniMax M2.5 | 80.2% | MiniMax |
| 5 | GPT-5.2 | 80.0% | OpenAI |
| 6 | Claude Sonnet 4.6 | 79.6% | Anthropic |
| 7 | Sonar Foundation Agent | 79.2% | Sonar |
| 8 | Gemini 3 Flash | 78.0% | Google |
| 9 | GLM-5 | 77.8% | Zhipu AI |
| 10 | Claude Sonnet 4.5 | 77.2% | Anthropic |
| 11 | Kimi K2.5 | 76.8% | Moonshot |
| 12 | Gemini 3 Pro | 76.2% | Google |
| 13 | GPT-5.1 | 74.9% | OpenAI |
| 14 | Grok 4 | 73.5% | xAI |
| 15 | Claude Haiku 4.5 | 73.3% | Anthropic |
| 16 | DeepSeek V3.2 | 73.0% | DeepSeek |
| 17 | Claude Sonnet 4 | 72.7% | Anthropic |
Scores are self-reported by model providers. Scaffold and harness differences affect results. Source: aggregated from swebench.com and provider announcements.
Six of the top 17 entries are Claude models. Opus 4.5 and 4.6 trade the #1 and #2 spots with a 0.1-point margin. Sonnet 4.6 at #6 (79.6%) costs $3/$15 per million tokens, undercutting every model ranked above it except Gemini 3.1 Pro. Haiku 4.5 at #15 (73.3%) beats GPT-5.1 and most open-weight models while costing $0.80/$4.00.
SWE-bench Pro Scores
SWE-bench Pro is Scale AI's harder benchmark: 1,865 multi-language tasks requiring an average of 107 lines across 4.1 files. It resists contamination through GPL licensing and proprietary codebases.
SEAL Leaderboard (Standardized Scaffolding)
The SEAL leaderboard uses Scale AI's unified scaffolding with a 250-turn limit. This isolates model capability by holding the agent framework constant.
| Rank | Model | Score | CI |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 45.9% | ±3.60 |
| 2 | Claude Sonnet 4.5 | 43.6% | ±3.60 |
| 3 | Gemini 3 Pro | 43.3% | ±3.60 |
| 4 | Claude Sonnet 4 | 42.7% | ±3.59 |
| 5 | GPT-5 (High) | 41.8% | ±3.49 |
| 6 | GPT-5.2 Codex | 41.0% | ±3.57 |
| 7 | Claude Haiku 4.5 | 39.5% | ±3.55 |
| 8 | Qwen3 Coder 480B | 38.7% | ±3.55 |
Claude takes four of the top seven slots. Opus 4.5 leads at 45.9%, a 2.3-point gap over the next model. Haiku 4.5 at 39.5% is a statistical tie with GPT-5.2 Codex (the 41.0% score falls within the overlapping confidence intervals) at a fraction of the per-token price.
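The CI column makes "statistical tie" checkable: two SEAL scores are distinguishable only when the gap exceeds the combined intervals. A rough sketch using the table's numbers (a conservative overlap check, not a formal significance test):

```python
def intervals_overlap(score_a: float, ci_a: float,
                      score_b: float, ci_b: float) -> bool:
    """True if the two scores' confidence intervals overlap,
    i.e. the gap is not statistically meaningful."""
    return abs(score_a - score_b) <= ci_a + ci_b

# Haiku 4.5 (39.5 ±3.55) vs GPT-5.2 Codex (41.0 ±3.57): a tie.
print(intervals_overlap(39.5, 3.55, 41.0, 3.57))  # True

# Opus 4.5 (45.9 ±3.60) vs Qwen3 Coder 480B (38.7 ±3.55): a real gap.
print(intervals_overlap(45.9, 3.60, 38.7, 3.55))  # False
```

By this check, most adjacent rows in the SEAL table are ties; only gaps of roughly 7+ points clear the intervals.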
Agent Systems (Custom Scaffolding)
Agent systems bring their own frameworks, including specialized context retrieval, longer turn limits, and tool access. These scores are not directly comparable to SEAL scores.
| Agent | Base Model | Score |
|---|---|---|
| GPT-5.3-Codex (CLI) | GPT-5.3-Codex | 57.0% |
| Claude Code | Opus 4.5 | 55.4% |
| Auggie | Opus 4.5 | 51.8% |
| Cursor | Opus 4.5 | 50.2% |
The gap between Auggie (51.8%) and Claude Code (55.4%), both running Opus 4.5, shows that scaffolding matters on SWE-bench Pro. The model matters, but the agent framework matters almost as much.
WarpGrep impact
In Morph internal benchmarks, adding WarpGrep v2 as a search subagent lifted Opus 4.6 from 55.4% to 57.5% on SWE-bench Pro while making it 15.6% cheaper and 28% faster. WarpGrep is an RL-trained search model that runs in its own context window and issues up to 8 parallel tool calls per turn.
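The fan-out pattern described above (a search subagent issuing up to 8 parallel tool calls per turn) can be sketched as follows. Everything here is illustrative: `grep_repo` is a hypothetical stand-in for a search tool call, not the Morph or WarpGrep API.

```python
# Illustrative only: the parallel fan-out pattern a search subagent uses.
# grep_repo is a hypothetical stand-in, not the actual Morph/WarpGrep API.
from concurrent.futures import ThreadPoolExecutor

def grep_repo(query: str) -> list[str]:
    """Stand-in for one search tool call; returns matching file paths."""
    # A real subagent would query a code-search index here.
    return [f"src/{query}.py"]

def parallel_search(queries: list[str], max_parallel: int = 8) -> list[str]:
    """Issue up to `max_parallel` tool calls in one turn, merge the results."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = pool.map(grep_repo, queries[:max_parallel])
    return sorted({path for hits in results for path in hits})

print(parallel_search(["auth", "billing", "cache"]))
```

Running the searches in a separate context window means the main agent only sees the merged result list, not the raw tool output, which is where the token savings come from.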
Terminal-Bench 2.0 Scores
Terminal-Bench evaluates AI models on complex terminal tasks: system administration, environment management, debugging, and multi-step operations in a live terminal. Unlike SWE-bench, which focuses on code changes, Terminal-Bench tests whether a model can operate a computer through a command line.
| Rank | Model | Score |
|---|---|---|
| 1 | GPT-5.3-Codex | 77.3% |
| 2 | Gemini 3.1 Pro | 77.3% |
| 3 | Claude Opus 4.6 | 65.4% |
This is one benchmark where Claude trails. GPT-5.3-Codex and Gemini 3.1 Pro both score 77.3%, 12 points above Opus 4.6. Terminal-Bench rewards fast iteration and tool-use efficiency. The models that perform best here tend to issue shorter, more targeted commands rather than planning long sequences.
HumanEval & MBPP Scores
HumanEval (164 problems) and MBPP (974 problems) are the original code generation benchmarks. They test isolated function-level coding: given a docstring, generate a correct implementation. Every frontier model now scores 90%+ on HumanEval, which limits its ability to differentiate.
| Model | HumanEval | Notes |
|---|---|---|
| DeepSeek R1 | 96.1% | Highest reported |
| Claude Opus 4.6 | 95.0% | |
| GPT-5.2 | 95.0% | |
| Codestral 25.01 | 86.6% | Open-weight |
HumanEval is largely saturated at the frontier. The 1-point gap between DeepSeek R1 (96.1%) and Claude Opus 4.6 (95.0%) is not meaningful in practice. For distinguishing between top models, SWE-bench and Terminal-Bench are more informative because they test multi-file, real-world tasks rather than isolated functions.
Beyond HumanEval
EvalPlus extends HumanEval with 80x more test cases per problem, reducing false positives from solutions that pass narrow tests but fail on edge cases. HumanEval Pro tests self-invoking code generation, where even o1-mini drops from 96.2% to 76.2%. BigCodeBench adds realistic library usage and API calls. If you need to evaluate models on code generation specifically, these are better tools than raw HumanEval.
Reasoning & Knowledge Benchmarks
Coding is not just pattern matching. Hard bugs require reasoning about state, constraints, and interactions across systems. These benchmarks measure the reasoning capability that underpins coding performance.
| Benchmark | Score | What It Measures |
|---|---|---|
| GPQA Diamond | 91.3% | Graduate-level science (physics, chemistry, biology) validated by domain experts |
| ARC-AGI-2 | 68.8% | Abstract reasoning and novel pattern recognition |
| MMMU Pro | 77.3% | Multimodal reasoning across academic disciplines (with tools) |
| BigLaw Bench | 90.2% | Complex legal reasoning and contract analysis |
| Humanity's Last Exam | Top score | Expert-level questions across all domains |
| MRCR v2 | 76.0% | Long-context retrieval (vs 18.5% for Opus 4.5) |
Sonnet 4.6 Reasoning
| Benchmark | Score | Notes |
|---|---|---|
| ARC-AGI-2 | 58.3% | 4.3x improvement over Sonnet 4.5 (13.6%) |
| GPQA Diamond | 74.1% | Opus 4.6 leads at 91.3% |
| SWE-bench Verified | 79.6% | Within 1.2 points of Opus 4.6 |
| OSWorld-Verified | 72.5% | Nearly matches Opus 4.6 on computer use |
| MCP-Atlas (tool use) | 61.3% | Ahead of Opus 4.6 (60.3%) |
| Finance Agent | 63.3% | Office productivity leader at 1633 Elo |
Sonnet 4.6's ARC-AGI-2 score (58.3%) is the standout: a 4.3x leap over Sonnet 4.5 (13.6%), the largest single-generation gain on this benchmark. It also edges ahead of Opus 4.6 on MCP-Atlas, a tool-use benchmark, suggesting Sonnet 4.6 has been specifically tuned for agentic tool calling.
Claude vs GPT-5 vs Gemini: Head-to-Head
No single model wins every benchmark. Claude leads on SWE-bench Verified and GPQA. GPT-5.3-Codex leads on Terminal-Bench and SWE-bench Pro (with custom scaffolding). Gemini 3.1 Pro matches GPT-5.3 on Terminal-Bench and ranks just behind Claude on the SEAL leaderboard.
| Benchmark | Claude Opus 4.6 | GPT-5.3-Codex | Gemini 3.1 Pro |
|---|---|---|---|
| SWE-bench Verified | 80.8% | 80.0%* | 80.6% |
| SWE-bench Pro (SEAL) | 45.9%** | 41.8%*** | 43.3% |
| SWE-bench Pro (agent) | 57.5%**** | 57.0% | N/A |
| Terminal-Bench 2.0 | 65.4% | 77.3% | 77.3% |
| HumanEval | 95.0% | 95.0% | N/A |
| GPQA Diamond | 91.3% | N/A | N/A |
| ARC-AGI-2 | 68.8% | N/A | N/A |
| OSWorld-Verified | 72.7% | 64.7% | N/A |
*GPT-5.2 score (5.3 not separately reported on Verified). **Opus 4.5 score on SEAL. ***GPT-5 High on SEAL. ****With WarpGrep v2, Morph internal.
The takeaway: at the frontier, model choice depends on task shape. For multi-file bug fixing and code review, Claude Opus 4.6 has a slight edge. For terminal-heavy development with fast iteration, GPT-5.3-Codex dominates. For cost-sensitive production workloads, Gemini 3.1 Pro competes on most benchmarks at lower cost.
Cost-Adjusted Comparison
| Model | Input | Output | SWE-bench Verified |
|---|---|---|---|
| Claude Opus 4.6 | $15 | $75 | 80.8% |
| Claude Sonnet 4.6 | $3 | $15 | 79.6% |
| GPT-5.3-Codex | $10 | $40 | ~80.0% |
| Gemini 3.1 Pro | $2.50 | $15 | 80.6% |
| Claude Haiku 4.5 | $0.80 | $4 | 73.3% |
| DeepSeek V3.2 | $0.55 | $2.19 | 73.0% |
Sonnet 4.6 and Gemini 3.1 Pro are the most cost-effective frontier options. Both score within 1.2 points of Opus 4.6 on SWE-bench Verified at roughly 1/5 the token cost. For budget-constrained applications, Haiku 4.5 and DeepSeek V3.2 offer 73%+ SWE-bench performance at less than $5/M output tokens.
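One way to make "cost-effective" concrete is dollars of output price per SWE-bench Verified point: a crude value metric that ignores input pricing and per-task token efficiency, but useful for a first cut. A sketch over the table above:

```python
# Crude value metric: output price ($/M tokens) per SWE-bench Verified point.
# Numbers from the cost-adjusted table above; lower is better.
models = {
    "Claude Opus 4.6":   (75.00, 80.8),
    "Claude Sonnet 4.6": (15.00, 79.6),
    "GPT-5.3-Codex":     (40.00, 80.0),
    "Gemini 3.1 Pro":    (15.00, 80.6),
    "Claude Haiku 4.5":  ( 4.00, 73.3),
    "DeepSeek V3.2":     ( 2.19, 73.0),
}

for name, (out_price, score) in sorted(models.items(),
                                       key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name}: ${out_price / score:.3f} per point")
```

By this measure DeepSeek V3.2 and Haiku 4.5 lead by a wide margin, with Gemini 3.1 Pro and Sonnet 4.6 nearly tied in the middle and Opus 4.6 paying a large premium for its last point of accuracy.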
Historical Progression: Claude SWE-bench Scores
Claude's SWE-bench Verified scores have more than doubled in two years.
| Model | Release | SWE-bench Verified | Jump |
|---|---|---|---|
| Claude 3 Opus | Mar 2024 | 33.4% | Baseline |
| Claude 3.5 Sonnet | Jun 2024 | 49.0% | +15.6 |
| Claude 3.7 Sonnet | Feb 2025 | 62.3% | +13.3 |
| Claude Sonnet 4.5 | Oct 2025 | 77.2% | +14.9 |
| Claude Opus 4.5 | Oct 2025 | 80.9% | N/A |
| Claude Sonnet 4.6 | Feb 2026 | 79.6% | +2.4 vs Sonnet 4.5 |
| Claude Opus 4.6 | Feb 2026 | 80.8% | -0.1 vs Opus 4.5 |
The trajectory: 33.4% to 80.8% in two years. The most dramatic gains came between Claude 3 Opus and Sonnet 4.5, where each generation added 13-16 points. The 4.5-to-4.6 jump is smaller (-0.1 to +2.4 points on Verified) because Anthropic focused the 4.6 generation on other capabilities: 1M context, long-context retrieval (MRCR v2 jumped from 18.5% to 76%), and agentic tool use.
SWE-bench Verified may also be approaching a ceiling. Contamination, test flaws, and benchmark saturation all compress the gap between frontier models. On SWE-bench Pro, which resists these issues, the spread is much wider (45.9% at the top of the full leaderboard versus 15.6% at the bottom), leaving more room for differentiation.
Which Benchmark Matters Most for Real Coding?
Benchmarks measure different things. Choosing the wrong one gives you a misleading picture of model capability.
| Benchmark | What It Tests | When to Use It |
|---|---|---|
| SWE-bench Pro | Multi-file, multi-language bug fixing in real repos | Evaluating agents for production software engineering |
| SWE-bench Verified | Single-file Python bug fixing | Quick directional comparison (caveat: contaminated) |
| Terminal-Bench 2.0 | Terminal commands, sysadmin, env management | Evaluating DevOps and infrastructure agents |
| HumanEval / MBPP | Isolated function generation | Baseline sanity check (saturated at frontier) |
| GPQA Diamond | Graduate-level science reasoning | Evaluating reasoning depth for hard debugging |
| ARC-AGI-2 | Novel pattern recognition | Measuring generalization to unseen problem types |
For coding agents that will operate on real codebases, SWE-bench Pro is the best single benchmark. It tests the full pipeline: understanding a codebase, locating relevant files, reasoning about the fix, and implementing it across multiple files. The SEAL leaderboard's standardized scaffolding makes scores directly comparable.
Terminal-Bench matters if your use case involves infrastructure, deployment, or debugging in a terminal. HumanEval is a sanity check, not a differentiator. GPQA and ARC-AGI-2 matter if you need the model to reason about novel problems rather than apply known patterns.
No single number captures coding ability. A model that scores 80% on SWE-bench Verified but 15% on SWE-bench Pro (like some smaller models) is memorizing solutions, not solving problems. Look at the full picture.
Frequently Asked Questions
What is Claude Opus 4.6's SWE-bench Verified score?
Claude Opus 4.6 scores 80.8% on SWE-bench Verified. Opus 4.5 is marginally higher at 80.9%. Both were tested using a simple scaffold with bash and file editing tools.
What is Claude Opus 4.6's SWE-bench Pro score?
On the SEAL leaderboard, Claude Opus 4.5 leads at 45.9% with standardized scaffolding. When paired with WarpGrep v2, Opus 4.6 reaches 57.5% (Morph internal benchmark).
How does Claude compare to GPT-5 on coding benchmarks?
On SWE-bench Verified, Claude Opus 4.6 (80.8%) and GPT-5.3-Codex (~80.0%) are within 1 point. On SWE-bench Pro with custom scaffolding, GPT-5.3-Codex leads at 57%. On Terminal-Bench 2.0, GPT-5.3-Codex scores 77.3% vs. Opus 4.6's 65.4%. On GPQA Diamond, Opus 4.6 leads at 91.3%. Different benchmarks favor different models.
What is Claude Sonnet 4.6's SWE-bench score?
Sonnet 4.6 scores 79.6% on SWE-bench Verified, within 1.2 points of Opus 4.6 at 1/5 the cost ($3/$15 per million tokens vs. $15/$75).
What is Claude Haiku 4.5's SWE-bench score?
Haiku 4.5 scores 73.3% on SWE-bench Verified and 39.5% on SWE-bench Pro (SEAL). At $0.80/$4.00 per million tokens, it reaches 90% of Sonnet 4.5's performance in agentic coding evaluations.
Is SWE-bench Verified still a reliable benchmark?
Partially. OpenAI confirmed that every frontier model shows training data contamination on the dataset, and 59.4% of the hardest unsolved tasks have flawed tests. It still differentiates between weaker models and runs quickly, but SWE-bench Pro is a better measure of production readiness.
What does Terminal-Bench measure?
Terminal-Bench 2.0 evaluates AI models on complex terminal and system administration tasks: using a live terminal for environment management, debugging, and multi-step operations. GPT-5.3-Codex and Gemini 3.1 Pro lead at 77.3%, with Claude Opus 4.6 at 65.4%.
Which Claude model should I use for coding?
Opus 4.6 for maximum accuracy on hard, multi-file tasks. Sonnet 4.6 for everyday coding (79.6% SWE-bench at 1/5 Opus cost). Haiku 4.5 for high-volume tasks like code review or test generation (73.3% SWE-bench at 1/15 Opus cost). In multi-agent setups, pairing a cheaper model with a specialized search subagent like WarpGrep often outperforms using a more expensive model alone.
What is Claude Opus 4.6's GPQA Diamond score?
91.3%. GPQA Diamond contains graduate-level science questions validated by domain experts. This is one of the highest scores on this benchmark across all models.
Build with Claude + WarpGrep
WarpGrep v2 lifted every model it was paired with by 2+ points on SWE-bench Pro. It runs in its own context window, issues 8 parallel tool calls per turn, and makes your coding agent 15.6% cheaper and 28% faster. Available through the Morph API.