SWE-Bench Pro Leaderboard (2026): Why 46% Beats 81%

Claude Opus 4.5 scores 80.9% on SWE-Bench Verified and 45.9% on SWE-Bench Pro. Same model, half the score. The difference: Verified's 500 Python-only tasks are contaminated. Pro's 1,865 multi-language tasks are not.

Below are the latest rankings from Scale AI's SEAL leaderboard (standardized scaffolding), agent systems with custom scaffolding, and SWE-Bench Verified. The SEAL leaderboard is the controlled comparison. Agent system scores show what happens when scaffolding is optimized.

Leaderboard data verified March 1, 2026

SWE-Bench Pro: SEAL Leaderboard (Top 10)

Standardized scaffolding, 250-turn limit, 731 public tasks

1Opus 4.5

45.9%

2Sonnet 4.5

43.6%

3Gemini 3 Pro

43.3%

4Sonnet 4

42.7%

5GPT-5 (High)

41.8%

6GPT-5.2 Codex

41%

7Haiku 4.5

39.5%

8Qwen3 480B

38.7%

9MiniMax 2.1

36.8%

10Gemini 3 Flash

34.6%

Source: Scale AI SEAL Leaderboard. All models uncapped cost, 250-turn limit.

1,865

Tasks across 41 repositories

20+

Models ranked

Languages (Py, Go, TS, JS)

~59%

Top agent score (public set)

SEAL Leaderboard (Scale AI) →SWE-bench Official →OpenAI on Retiring Verified →

SEAL Leaderboard: SWE-Bench Pro (Standardized Scaffolding)

Scale AI runs every model through identical tooling with a 250-turn limit. This isolates raw model capability from scaffolding quality. SEAL stands for Scale's Evaluation and Assessment Lab. Scores below are from the public set (731 tasks).

The top 6 models are separated by 4.9 percentage points. Confidence intervals overlap for most adjacent pairs, meaning ranks 2 through 6 are statistically close. The gap widens below rank 10, where models drop below 30%.

Rank	Model	Score	CI
1	Claude Opus 4.5	45.9%	±3.60
2	Claude Sonnet 4.5	43.6%	±3.60
3	Gemini 3 Pro	43.3%	±3.60
4	Claude Sonnet 4	42.7%	±3.59
5	GPT-5 (High)	41.8%	±3.49
6	GPT-5.2 Codex	41.0%	±3.57
7	Claude Haiku 4.5	39.5%	±3.55
8	Qwen3 Coder 480B	38.7%	±3.55
9	MiniMax 2.1	36.8%	±3.55
10	Gemini 3 Flash	34.6%	±3.55
11	GPT-5.2	29.9%	±2.15
12	Kimi K2 Instruct	27.7%	±3.25
13	Qwen3 235B	21.4%	±2.25
14	GPT-OSS 120B	16.2%	±2.67
15	DeepSeek V3p2	15.6%	±2.63
16	Gemma 3 27B	11.4%	±2.15
17	Llama 3.1 405B	11.2%	±2.15
18	GLM-4.6	9.7%	±2.15
19	Llama 4 Maverick 17B	5.2%	±1.24
20	Codestral (2405)	1.5%	±1.51

Source: Scale AI SEAL Leaderboard. All models use uncapped cost with 250-turn limit unless noted. CI = 95% confidence interval.

Agent Systems Leaderboard (Custom Scaffolding)

Agent systems bring their own scaffolding: the framework wrapping the model, including tool access, context management, and turn limits. These scores are not directly comparable to SEAL scores because the scaffolding varies. All scores are on the SWE-Bench Pro public set (731 tasks).

The scaffolding gap is the most underappreciated finding in this data. Three different agent systems ran the same model (Opus 4.5), and their scores ranged from 50.2% to 55.4%. That spread comes entirely from how the agent manages context and tool calls.

Agent	Base Model	Score	Source
GPT-5.3-Codex (CLI)	GPT-5.3-Codex	57.0%	OpenAI
Auggie	Opus 4.5	51.8%	Augment Code
Cursor	Opus 4.5	50.2%	Cursor
Claude Code	Opus 4.5	55.4%	Anthropic

Scores are self-reported by each team. Opus 4.5 scores 45.9% on SEAL but 50.2-55.4% with custom scaffolding, a 4-10 point lift from better context retrieval alone.

WarpGrep Impact on SWE-Bench Pro (Morph Internal)

Self-reported data

The scores below are from Morph's internal benchmark runs, not from the SEAL leaderboard or independent third parties. They show the effect of adding WarpGrep v2 as a search subagent to existing coding agents.

SWE-Bench Pro: With vs Without WarpGrep v2

Morph internal benchmarks, public set (731 tasks)

With WarpGrep v2

Without WarpGrep

1Codex 5.3

59.1%

2MiniMax 2.5

57.6%

3Opus 4.6

57.5%

WarpGrep v2 adds 2.1-2.2 points to every model tested.

WarpGrep v2 is an RL-trained search subagent that runs in its own context window. It issues up to 8 parallel tool calls per turn and returns only the relevant file spans. The main coding model never sees files WarpGrep rejected, so its context stays clean.

With Opus 4.6, adding WarpGrep v2 cuts cost by 15.6% and time by 28%. The expensive model spends fewer tokens on search and more on code generation. Read how subagents make coding agents faster for the full breakdown.

SWE-Bench Verified Leaderboard (2026)

SWE-Bench Verified is a human-validated subset of 500 Python-only tasks from the original SWE-Bench. It remains widely cited, but OpenAI has stopped reporting Verified scores after finding that every frontier model showed training data contamination on the dataset.

Rank	Model	Score
1	Claude Opus 4.5	80.9%
2	Claude Opus 4.6	80.8%
3	MiniMax M2.5 (open-weight)	80.2%
4	GPT-5.2	80.0%
5	Gemini 3 Flash	78.0%
6	GLM-5	77.8%
7	Claude Sonnet 4.5	77.2%
8	Kimi K2.5	76.8%
9	Gemini 3 Pro	76.2%
10	GPT-5.1	74.9%
11	Grok 4	73.5%
12	Claude Haiku 4.5	73.3%
13	DeepSeek V3.2	73.0%
14	Claude Sonnet 4	72.7%
15	Qwen3 Coder Next	70.6%

Verified scores are self-reported by model providers. Scaffold and harness differences affect results. Source: aggregated from swebench.com and provider announcements.

SWE-Bench Variants Comparison

SWE-Bench has expanded into multiple benchmark variants. Each targets different aspects of software engineering evaluation.

Variant	Tasks	Languages	Top Score	Status
Original (Full)	2,294	Python	~65%	Active
Lite	300	Python	~55%	Active
Verified	500	Python	80.9%	Contaminated
Pro	1,865	Py, Go, TS, JS	~59%	Active (recommended)
Multilingual	300	9 languages	~45%	Active
Live	1,565+	Multiple	~40%	Monthly updates

Top scores are approximate and represent the best-known agent system result for each variant. "Contaminated" means OpenAI confirmed that frontier models have been trained on the test data.

SWE-Bench Pro vs SWE-Bench Verified

SWE-Bench Verified was the previous gold standard: a human-validated subset of 500 tasks from the original SWE-Bench. Pro was designed to fix its limitations.

Dimension	SWE-Bench Verified	SWE-Bench Pro
Tasks	500	1,865
Repositories	12 (all Python)	41 (Python, Go, TS, JS)
Avg lines changed	11 (median: 4)	107.4
Avg files changed	~1	4.1
Top score (Mar 2026)	80.9% (Claude Opus 4.5)	~59% (agent systems)
Contamination resistance	Low: all public repos	High: GPL + proprietary code
Task clarity	Ambiguous issues removed	Ambiguous issues clarified with human context

The difference in task complexity is stark. 161 of SWE-Bench Verified's 500 tasks require only 1-2 lines of change. Every SWE-Bench Pro task requires at least 10 lines. Over 100 tasks require more than 100 lines. These are tasks that would take a professional engineer hours to days.

Contamination confirmed

OpenAI's audit found that every frontier model tested (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) could reproduce verbatim gold patches or problem statement specifics for certain SWE-Bench Verified tasks. They also found that 59.4% of the hardest unsolved problems had flawed test cases. OpenAI has stopped reporting Verified scores and recommends SWE-Bench Pro instead.

How SWE-Bench Pro Works

SWE-Bench Pro contains 1,865 tasks across 41 actively maintained repositories spanning Python, Go, TypeScript, and JavaScript. The tasks come from real commit histories: consecutive commits where one resolves a bug or adds a feature, paired with tests that demonstrate the fix.

Three Subsets

Public Set (731 tasks)

Tasks from 11 GPL-licensed repositories, openly available on HuggingFace. This is the primary evaluation target for all leaderboard submissions.

Commercial Set (276 tasks)

Tasks from 18 proprietary startup codebases, acquired through partnerships with Scale AI. Not publicly accessible, providing additional contamination resistance.

Held-Out Set (858 tasks)

Tasks from 12 repositories reserved for overfitting detection. Scale AI can release these to verify whether improvements on the public set generalize to unseen code.

Three-Stage Human Augmentation

Each task goes through a rigorous annotation process:

Problem statement creation: original commit messages and issue discussions are synthesized into clear, structured descriptions
Requirements definition: annotators create specification lists grounded in unit tests and gold patches (the reference solution code), detailing expected behavior without prescribing implementation
Interface specification: class and function signatures are documented to prevent false negatives from naming mismatches

Evaluation methodology

Evaluation uses containerized, language-specific environments. Each task must pass fail2pass tests (tests that should fail before the fix and pass after, verifying the issue is resolved) and pass2pass tests (existing tests that must continue to pass, ensuring the fix does not break other functionality). Gold patches are validated across 3 test runs before inclusion in the benchmark.

Why Scores Are So Much Lower Than SWE-Bench Verified

The drop from 80.9% (Verified) to 57-59% (Pro) reflects four factors that compound on each other.

Multi-File Modifications

SWE-Bench Verified is largely a single-file benchmark. Most fixes touch one file with a few lines changed. SWE-Bench Pro tasks require coordinating changes across an average of 4.1 files. The agent needs to understand how a change in one file affects behavior in three others.

Longer Time Horizons

These are tasks that would take a professional engineer hours to days. The agent must maintain coherent plans across many steps, managing context and state throughout.

Codebase Complexity

Pro repositories are production systems: business applications, B2B services, developer tools. They have complex build systems, cross-cutting concerns, and domain-specific conventions that an agent must navigate.

Contamination Resistance

Models cannot rely on having seen the code before. The GPL licensing and proprietary repos mean agents must genuinely reason about unfamiliar codebases, not recall solutions from training data.

Failure mode analysis

Scale AI's analysis of agent trajectories reveals where models break down: semantic understanding failures (35.9% of Opus 4.1 failures), context overflow (35.6% of Sonnet 4 failures), and tool-use inefficiency (42% of smaller model failures). Context overflow is the dominant failure mode for the strongest models, which aligns with research showing coding agents spend 60%+ of their time searching for context.

Frequently Asked Questions

What is SWE-Bench Pro?

SWE-Bench Pro is a software engineering benchmark by Scale AI that evaluates AI coding agents on 1,865 long-horizon tasks from 41 real repositories across Python, Go, TypeScript, and JavaScript. Tasks require an average of 107 lines of changes across 4.1 files.

What is Claude Opus 4.5's SWE-Bench Pro score?

Claude Opus 4.5 scores 45.9% on the SEAL leaderboard with standardized scaffolding, the highest of any model. On SWE-Bench Verified (a different, contaminated benchmark), Opus 4.5 leads at 80.9%. When paired with WarpGrep v2 as a search subagent, Opus 4.6 reaches 57.5% on Pro (Morph internal benchmark).

What is GPT-5.3-Codex's SWE-Bench Pro score?

GPT-5.3-Codex scores 57% on SWE-Bench Pro according to OpenAI's published results. On the SEAL leaderboard with standardized scaffolding, GPT-5 (High) scores 41.8% and GPT-5.2 Codex scores 41.0%. The gap shows the impact of OpenAI's custom agent scaffolding vs. Scale AI's standardized environment.

What is the difference between SEAL scores and agent system scores?

SEAL scores use Scale AI's unified scaffolding (SEAL = Scale's Evaluation and Assessment Lab) with a 250-turn limit, providing a controlled comparison across models. Agent system scores use custom scaffolding (Auggie, Cursor, Claude Code, WarpGrep configurations) with specialized context retrieval and other optimizations. Agent systems consistently score 5-15 points higher than the same base model on the SEAL leaderboard.

How often is SWE-Bench Pro updated?

Scale AI adds new model evaluations to the SEAL leaderboard as they become available. The benchmark has a held-out set of 858 tasks that can be released to detect overfitting. Agent system scores are reported by individual teams and may update independently.

How does SWE-Bench Pro differ from SWE-Bench Verified?

SWE-Bench Verified has 500 Python-only tasks with small fixes (median 4 lines). SWE-Bench Pro has 1,865 multi-language tasks requiring substantial, multi-file modifications (average 107 lines across 4.1 files). Pro uses GPL licensing and proprietary codebases to resist data contamination. OpenAI has stopped reporting Verified scores due to confirmed contamination.

Is SWE-Bench Verified still useful?

SWE-Bench Verified still differentiates between weaker models and runs faster. But OpenAI's audit found that all frontier models are contaminated on it, and 59.4% of hard tasks have flawed tests. OpenAI has stopped reporting Verified scores. SWE-Bench Pro is a better measure of production readiness.

WarpGrep v2: Search Subagent for SWE-Bench Pro

WarpGrep v2 is the RL-trained search subagent that lifted every model it was paired with by 2+ points on SWE-Bench Pro. It runs in its own context window, issues 8 parallel tool calls per turn, and makes your coding agent 15.6% cheaper and 28% faster.

Try WarpGrep

Read the WarpGrep v2 Post

Fast Apply

WarpGrep

Compact

Model Router

DeepSeek

MiniMax

Qwen

Blog

Startup Credits

Students

Contact Us

About

Careers