Six benchmarks, six different leaderboards. This guide covers what each AI coding benchmark actually tests, the current scores for every frontier model, and which evaluation matters for your use case.
Leaderboard Summary: Models Across Benchmarks
No model wins every benchmark. The rankings shift depending on what you measure. This table gives a cross-benchmark view of the five most-cited frontier models.
| Model/Agent | SWE-Bench Verified | SWE-Bench Pro (SEAL) | Terminal-Bench | Aider Polyglot |
|---|---|---|---|---|
| Claude Opus 4.6 | ~80.8% | ~45.9%* | ~65.4% | ~85% |
| GPT-5.3 Codex | ~77.3%** | ~41.8%*** | ~77.3% | ~80% |
| Gemini 3.1 Pro | ~80.6% | ~43.3% | ~77.3% | ~78% |
| Qwen 3.5 | ~72% | ~38.7% | N/A | ~75% |
| DeepSeek V3.2 | ~73% | ~33% | N/A | ~72% |
*Opus 4.5 score on SEAL. **GPT-5.2 score (5.3 not separately reported on Verified). ***GPT-5 High on SEAL. Aider Polyglot scores are approximate from aider.chat leaderboard.
Claude Opus 4.6 sits at the top of SWE-Bench Verified, within a tenth of a point of Opus 4.5, but trails GPT-5.3 Codex and Gemini 3.1 Pro on Terminal-Bench by roughly 12 points. Gemini 3.1 Pro is competitive across all four benchmarks. DeepSeek V3.2 and Qwen 3.5 score lower across the board but cost a fraction as much per token.
SWE-Bench Verified
What it tests
Real GitHub issues from popular Python repos (Django, Matplotlib, Scikit-learn, etc.). ~500 human-validated tasks from the original 2,294. Each task is a real bug report paired with a test that must pass.
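The scoring rule is simple to state: a patch resolves a task only if the tests that reproduced the bug now pass (FAIL_TO_PASS) and the tests that already passed still do (PASS_TO_PASS). A minimal sketch of that rule, with illustrative names rather than the official harness:

```python
# Sketch of the SWE-Bench resolution check (toy names, not the real harness).
# A task is "resolved" only if the bug-reproducing tests now pass AND the
# previously-passing tests were not broken by the patch.

def is_resolved(run_test, fail_to_pass, pass_to_pass):
    """run_test(test_id) -> bool, executed against the patched repo."""
    return all(run_test(t) for t in fail_to_pass) and \
           all(run_test(t) for t in pass_to_pass)

# Toy example: the patch fixes the reported bug without breaking the suite.
results_after_patch = {"test_bug_repro": True, "test_existing_api": True}
print(is_resolved(results_after_patch.get,
                  fail_to_pass=["test_bug_repro"],
                  pass_to_pass=["test_existing_api"]))  # → True
```

A patch that makes the new test pass but breaks an existing one scores zero, which is why partial fixes don't help on this benchmark.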
Who runs it
Princeton NLP Group, originally by Carlos E. Jimenez et al. The most-cited benchmark for AI coding agents since its introduction in 2023.
Strengths and limits
Real-world tasks from real repos. But Python-only, single-repo, and contaminated. OpenAI confirmed every frontier model shows training data leakage.
| Rank | Model | Score | Provider |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 80.9% | Anthropic |
| 2 | Claude Opus 4.6 | 80.8% | Anthropic |
| 3 | Gemini 3.1 Pro | 80.6% | Google |
| 4 | MiniMax M2.5 | 80.2% | MiniMax |
| 5 | GPT-5.2 | 80.0% | OpenAI |
| 6 | Claude Sonnet 4.6 | 79.6% | Anthropic |
| 7 | Gemini 3 Flash | 78.0% | Google |
| 8 | Claude Sonnet 4.5 | 77.2% | Anthropic |
| 9 | Kimi K2.5 | 76.8% | Moonshot |
| 10 | Claude Haiku 4.5 | 73.3% | Anthropic |
| 11 | DeepSeek V3.2 | 73.0% | DeepSeek |
Contamination caveat
OpenAI confirmed that every frontier model shows training data contamination on SWE-Bench Verified, and 59.4% of the hardest unsolved tasks had flawed tests. OpenAI stopped reporting Verified scores and recommends SWE-Bench Pro instead. The scores above are useful for relative comparison but not as absolute measures of capability.
The top 6 models are separated by 1.3 points. At this level of compression, the ranking is as much about scaffold tuning and prompt engineering as about raw model capability. For a deeper look at SWE-Bench Verified methodology and history, see our SWE-Bench explainer.
SWE-Bench Pro
What it tests
Multi-file changes across larger codebases. 1,865 tasks requiring an average of 107 lines across 4.1 files. Multi-language: Python, JavaScript, TypeScript, Go, Rust, and more.
Who runs it
Scale AI, via the SEAL (Safety, Evaluations, and Alignment Lab) platform. The SEAL leaderboard uses standardized scaffolding with a 250-turn limit.
Why it matters
Resists contamination through GPL-licensed and proprietary codebases. Pass rates run 30-40 points lower than on Verified, which means the spread between models is wider and more informative.
SEAL Leaderboard (Standardized Scaffolding)
The SEAL leaderboard isolates model capability by holding the agent framework constant. All models get the same tools, same turn limit, same prompts.
| Rank | Model | Score | CI |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 45.9% | ±3.60 |
| 2 | Claude Sonnet 4.5 | 43.6% | ±3.60 |
| 3 | Gemini 3 Pro | 43.3% | ±3.60 |
| 4 | Claude Sonnet 4 | 42.7% | ±3.59 |
| 5 | GPT-5 (High) | 41.8% | ±3.49 |
| 6 | GPT-5.2 Codex | 41.0% | ±3.57 |
| 7 | Claude Haiku 4.5 | 39.5% | ±3.55 |
| 8 | Qwen3 Coder 480B | 38.7% | ±3.55 |
Agent Systems (Custom Scaffolding)
When agents bring their own frameworks, context retrieval, and tool access, scores jump significantly. This measures the combined capability of model + agent system.
| Agent | Base Model | Score |
|---|---|---|
| GPT-5.3-Codex (CLI) | GPT-5.3-Codex | 57.0% |
| Claude Code | Opus 4.5 | 55.4% |
| Auggie (Augment Code) | Opus 4.5 | 51.8% |
| Cursor | Opus 4.5 | 50.2% |
The gap between the SEAL score (45.9%) and the best agent system (57.0%) is 11 points. Scaffolding matters. Augment Code's context engine, which handles 400K+ file codebases, adds ~6 points over the bare SEAL scaffold. For more detail on SWE-Bench Pro methodology, see our SWE-Bench Pro deep dive.
Subagent impact
In Morph internal benchmarks, adding WarpGrep v2 as a search subagent lifted Opus 4.6 from 55.4% to 57.5% on SWE-Bench Pro while making it 15.6% cheaper and 28% faster. The model matters, but the agent framework matters almost as much.
Terminal-Bench
What it tests
CLI-specific agent workflows: file editing, git operations, test running, multi-step debugging, environment setup. Tests whether agents can operate a computer through a terminal.
Who runs it
Published by the Terminal-Bench team, with contributions from Paul Gauthier, the author of Aider. Tests run in Docker containers with real terminal access.
Why it matters
SWE-Bench tests code changes. Terminal-Bench tests the full development workflow: read code, run tests, interpret output, fix issues, commit. This is closer to how developers actually use coding agents.
| Rank | Model/Agent | Score |
|---|---|---|
| 1 | GPT-5.3 Codex | ~77.3% |
| 2 | Gemini 3.1 Pro | ~77.3% |
| 3 | Claude Code (Opus 4.6) | ~72% |
| 4 | Aider (Opus 4.6) | ~67% |
| 5 | Codex CLI | ~65% |
Terminal-Bench rewards fast iteration and targeted command execution. The models that score highest issue short, precise terminal commands rather than planning long sequences. GPT-5.3 Codex and Gemini 3.1 Pro are tied at the top. Claude Code trails by ~5 points, though the gap has narrowed from earlier versions.
This is the benchmark where Claude has the most room to improve. On SWE-Bench Verified, the gap between Claude and the field is less than 1 point. On Terminal-Bench, it's 5+.
Aider Polyglot Benchmark
What it tests
Code generation across 225 Exercism problems in six languages: C++, Go, Java, JavaScript, Python, and Rust. Tests the model, not the agent.
Who runs it
Paul Gauthier, the creator of Aider. Published at aider.chat/docs/leaderboards. Uses Aider's edit format to measure how well models follow structured instructions.
Key distinction
This benchmark tests raw model capability across languages, not agent orchestration. A model that scores well here generates correct code but may still fail at multi-file agent tasks.
| Rank | Model | Score | Edit Format |
|---|---|---|---|
| 1 | Claude Opus 4.6 | ~85% | diff |
| 2 | Claude Sonnet 4.6 | ~82% | diff |
| 3 | GPT-5.3 | ~80% | diff |
| 4 | Gemini 3.1 Pro | ~78% | diff |
| 5 | Qwen 3.5 | ~75% | diff |
| 6 | DeepSeek V3.2 | ~72% | diff |
Claude Opus 4.6 leads the Aider Polyglot benchmark, consistent with its strong showing on SWE-Bench Verified. The edit format matters: models tested with Aider's diff format score higher than with whole-file replacement, because diff-based editing requires fewer tokens and reduces the chance of introducing unrelated changes.
Because Aider Polyglot tests model capability in isolation, it's useful for answering a specific question: which model generates the most correct code, independent of agent framework? If you're building your own coding agent and choosing a base model, this benchmark is more relevant than SWE-Bench Pro.
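The diff edit format can be sketched as a search/replace edit: the model emits the exact lines to find and their replacement, and the harness substitutes only that region. This is a simplified illustration in the spirit of Aider's format, not its actual parser:

```python
import re

# Simplified search/replace edit, loosely modeled on Aider's diff edit
# format (the real parser lives in the aider repo and handles more cases).

EDIT = """\
<<<<<<< SEARCH
def greet(name):
    return "Hi " + name
=======
def greet(name: str) -> str:
    return f"Hi {name}"
>>>>>>> REPLACE
"""

def apply_edit(source: str, edit: str) -> str:
    m = re.search(r"<<<<<<< SEARCH\n(.*?)=======\n(.*?)>>>>>>> REPLACE",
                  edit, re.DOTALL)
    if m is None:
        raise ValueError("malformed edit block")
    search, replace = m.group(1), m.group(2)
    if search not in source:
        raise ValueError("search block not found; edit rejected")
    return source.replace(search, replace, 1)

original = 'def greet(name):\n    return "Hi " + name\n'
patched = apply_edit(original, EDIT)
print(patched)
```

Rejecting an edit whose search block doesn't match is the key design choice: it forces the model to reproduce the current file contents exactly, which is precisely the instruction-following skill the benchmark measures.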
LiveCodeBench
What it tests
Competitive programming problems published after model training cutoffs. Problems sourced from LeetCode, Codeforces, and AtCoder. Harder than HumanEval by a wide margin.
Contamination resistance
The key advantage: because problems are sourced after training data cutoffs, models cannot have memorized solutions. This gives a cleaner signal of actual reasoning ability.
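The filtering rule itself is trivial, which is part of the appeal. A sketch with invented problem IDs and dates:

```python
from datetime import date

# Core idea behind LiveCodeBench: only score a model on problems published
# after its training cutoff, so memorized solutions can't inflate scores.
# (Problems and dates below are made up for illustration.)

problems = [
    {"id": "p1", "released": date(2025, 1, 10)},
    {"id": "p2", "released": date(2025, 8, 2)},
    {"id": "p3", "released": date(2026, 1, 15)},
]

def eval_set(problems, model_cutoff):
    """Return the IDs this model may be fairly evaluated on."""
    return [p["id"] for p in problems if p["released"] > model_cutoff]

print(eval_set(problems, model_cutoff=date(2025, 6, 1)))  # → ['p2', 'p3']
```

Because the eligible set rolls forward with each model's cutoff, scores across models with different cutoffs are computed on different problem subsets, which is worth remembering when comparing them.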
Limitations
Competitive programming is a narrow domain. Many problems reward algorithmic tricks rather than software engineering skills. Does not test multi-file editing, debugging, or real-world codebase navigation.
LiveCodeBench is the best benchmark for measuring genuine problem-solving on unseen tasks. The contamination problem that plagues HumanEval and (to a lesser extent) SWE-Bench Verified is largely solved here by using problems that did not exist when models were trained.
Scores are meaningfully lower than on HumanEval. Where frontier models hit 95%+ on HumanEval, LiveCodeBench pass rates for the hardest problems drop to 40-60%. This spread makes it easier to differentiate between models.
The trade-off: competitive programming problems don't map cleanly to real software engineering. A model that solves dynamic programming problems efficiently may still struggle with understanding a 50,000-line Django codebase. Use LiveCodeBench alongside SWE-Bench Pro, not as a replacement.
HumanEval & MBPP
HumanEval (164 problems)
OpenAI's original code generation benchmark. Given a function docstring, generate a correct implementation. Created in 2021. Every frontier model now scores 90%+.
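HumanEval scores are typically reported as pass@k: the probability that at least one of k sampled completions passes the tests. The unbiased estimator from the original HumanEval paper (Chen et al., 2021), given n samples of which c passed, is pass@k = 1 - C(n-c, k)/C(n, k):

```python
from math import comb

# Unbiased pass@k estimator from the HumanEval paper:
#   pass@k = 1 - C(n - c, k) / C(n, k)
# where n samples were drawn per problem and c of them passed the tests.

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # too few failures to fill k draws: a pass is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))   # 5/20 samples passed → 0.25
print(pass_at_k(n=20, c=5, k=10))  # more draws, higher chance of a hit
```

Per-problem estimates are averaged across the 164 problems to get the headline number; pass@1 is what launch posts usually quote.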
MBPP (974 problems)
Google's Mostly Basic Python Programs. Similar to HumanEval but with 6x more problems. Also largely saturated at the frontier.
| Model | HumanEval | Notes |
|---|---|---|
| DeepSeek R1 | 96.1% | Highest reported |
| Claude Opus 4.6 | ~95.0% | |
| GPT-5.2 | ~95.0% | |
| Gemini 3.1 Pro | ~94% | |
| Codestral 25.01 | 86.6% | Open-weight |
HumanEval is saturated. The 1-point gap between the top models is noise, not signal. These benchmarks still appear in every model launch blog post because they were the original standard, but they no longer differentiate frontier models.
Beyond HumanEval
EvalPlus extends HumanEval with 80x more test cases per problem, catching false positives from solutions that pass narrow tests but fail on edge cases. HumanEval Pro tests self-invoking code generation, where even top models drop 20+ points. BigCodeBench adds realistic library usage. If you need to evaluate models on raw code generation, these are better tools than HumanEval in 2026.
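The false positive EvalPlus targets is easy to reproduce with a toy example (the problem and tests below are invented, not drawn from the benchmark): a solution passes a narrow happy-path check while being wrong on an edge case the extended suite covers.

```python
# Toy illustration of the false positive EvalPlus is designed to catch.

def median(xs):
    # Buggy: correct for odd-length lists, wrong for even-length ones.
    return sorted(xs)[len(xs) // 2]

# Narrow HumanEval-style check: passes.
assert median([3, 1, 2]) == 2

# EvalPlus-style extended check: the even-length edge case fails.
try:
    assert median([1, 2, 3, 4]) == 2.5
    print("passed extended tests")
except AssertionError:
    print("failed extended tests")  # this branch runs
```

With only the narrow test, this solution counts as a pass; with the extended suite it correctly counts as a failure, which is why EvalPlus scores run lower than vanilla HumanEval scores for the same model.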
What Benchmarks Don't Tell You
Benchmarks answer a narrow question under controlled conditions. Several things that matter in production are not measured by any major benchmark.
Cost and latency
No benchmark measures cost per task or time to completion. A model that scores 80% but costs $2 per task may be worse than one scoring 75% at $0.20. Agent-level benchmarks should report $/task alongside accuracy.
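One crude way to put accuracy and cost on a single axis is resolved tasks per dollar. Using the hypothetical figures from the paragraph above:

```python
# Toy comparison using the hypothetical numbers in the text: a lower-scoring
# model can be far more cost-efficient per resolved task.

models = {
    "model_a": {"score": 0.80, "cost_per_task": 2.00},  # hypothetical
    "model_b": {"score": 0.75, "cost_per_task": 0.20},  # hypothetical
}

for name, m in models.items():
    tasks_per_dollar = m["score"] / m["cost_per_task"]
    print(f"{name}: {tasks_per_dollar:.2f} resolved tasks per dollar")
# model_a: 0.40, model_b: 3.75 — the "weaker" model is ~9x more efficient.
```

A metric like this is too blunt for production decisions (task value varies, and failed attempts still cost money), but it makes the accuracy-only blind spot concrete.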
Contamination and gaming
Models train on public code. SWE-Bench Verified tasks come from popular open-source repos that are likely in every training set. Even Pro has contamination vectors. High scores may reflect memorization, not generalization.
Multi-agent orchestration
No benchmark tests multi-agent workflows: coordinator agents dispatching subagents, parallel execution across files, or hierarchical task decomposition. Anthropic's internal data shows 90% improvement from multi-agent setups.
Real workflow integration
Production coding involves PR reviews, CI feedback loops, reading documentation, communicating with teammates, and handling ambiguous requirements. No benchmark captures this end-to-end flow.
The benchmark gap
Cognition (Devin's team) measured that their agent spends 60% of its time on search and context retrieval, not code generation. No coding benchmark isolates or measures search efficiency. This is why two agents using the same model can score 10+ points apart on SWE-Bench Pro: the difference is in how they find and retrieve context, not in how they generate code.
The biggest blind spot: benchmarks test single-task completion. Real engineering productivity comes from an agent that handles 50 tasks across a day, each with different context. Context rot accumulates over long sessions, degrading performance in ways that single-task benchmarks never expose.
Which Benchmark Should You Trust?
| Benchmark | Best For | Watch Out For |
|---|---|---|
| SWE-Bench Pro | Evaluating agents for production SE work | Scores vary 10+ pts by scaffold |
| SWE-Bench Verified | Quick directional model comparison | Contaminated; saturated at top |
| Terminal-Bench | Evaluating DevOps and CLI agents | Limited coverage; newer benchmark |
| Aider Polyglot | Choosing a base model for your agent | Tests model only, not agent system |
| LiveCodeBench | Contamination-free reasoning test | Competitive programming != SE |
| HumanEval / MBPP | Baseline sanity check | Saturated; does not differentiate frontier |
If you can only look at one benchmark, make it SWE-Bench Pro on the SEAL leaderboard. It tests the most realistic tasks, resists contamination, and uses standardized scaffolding for fair comparison. Use the SEAL scores for model comparison and agent system scores for evaluating complete products.
If you're choosing a base model for a custom agent, pair SWE-Bench Pro (SEAL) with Aider Polyglot. The first tells you how the model performs in a standardized agent framework. The second tells you how well it generates correct code in isolation.
If you need to evaluate terminal and DevOps workflows specifically, Terminal-Bench is the only game in town. And if you need contamination-free evaluation of problem-solving ability, LiveCodeBench is the cleanest signal available.
Frequently Asked Questions
What is the most important AI coding benchmark in 2026?
SWE-Bench Pro is the best single benchmark for production coding agents. It tests multi-file, multi-language changes across 1,865 tasks averaging 107 lines across 4.1 files. The SEAL leaderboard standardizes scaffolding for fair comparison. Unlike SWE-Bench Verified, it resists contamination through GPL licensing and proprietary code.
Which AI model scores highest on coding benchmarks?
It depends on the benchmark. Claude Opus 4.5 and Opus 4.6 lead SWE-Bench Verified at ~80.9% and ~80.8%. GPT-5.3 Codex and Gemini 3.1 Pro share the Terminal-Bench lead at ~77.3%. On SWE-Bench Pro (SEAL), Claude Opus 4.5 leads at ~45.9%. No single model wins every evaluation.
Is SWE-Bench Verified contaminated?
Yes. OpenAI confirmed that every frontier model shows training data leakage on SWE-Bench Verified, and 59.4% of the hardest unsolved tasks had flawed tests. OpenAI stopped reporting Verified scores. The scores are still directionally useful but should not be the sole basis for model selection.
What does the Aider Polyglot benchmark measure?
The Aider Polyglot benchmark tests raw model code generation across 225 Exercism problems in C++, Go, Java, JavaScript, Python, and Rust. Published at aider.chat. It measures the model in isolation, not the agent framework, making it useful for comparing base model capability.
How do AI coding benchmarks differ from each other?
Each benchmark tests a different slice of coding ability. SWE-Bench Verified: single-repo Python bug fixing (~500 tasks). SWE-Bench Pro: multi-file, multi-language engineering (~1,865 tasks). Terminal-Bench: CLI and terminal workflows. Aider Polyglot: raw model capability across languages. LiveCodeBench: post-training-cutoff competitive programming. HumanEval: isolated function generation (saturated). For a deeper comparison, see our guides on SWE-Bench and AI coding agents.
Build Faster Coding Agents with WarpGrep
WarpGrep v2 lifted every model it was paired with by 2+ points on SWE-Bench Pro. It runs in its own context window, issues 8 parallel tool calls per turn, and makes your coding agent 15.6% cheaper and 28% faster. Available through the Morph API.