SWE-Bench Explained: Benchmarks, Verified, Pro, and the 2026 Leaderboard

SWE-bench is the standard benchmark for AI coding agents. 500 verified tasks, 1,865 Pro tasks, and scores from 80.9% (Verified) to 59% (Pro). Here is how each variant works and what the scores actually mean.

March 4, 2026 · 1 min read

SWE-bench is the most widely cited benchmark for AI coding agents. It measures whether a model can resolve real GitHub issues by generating working patches. This guide covers the full SWE-bench family, the 2026 leaderboard, and the other benchmarks that matter.

Updated March 4, 2026

- 2,294 original SWE-bench tasks
- 500 Verified tasks (human-validated)
- 1,865 Pro tasks (4 languages)
- 80.9% top Verified score (Claude Opus 4.5)

What is SWE-Bench?

SWE-bench was created by Carlos Jimenez, John Yang, and colleagues at Princeton Language and Intelligence in October 2023. The original paper asked a straightforward question: can language models resolve real-world GitHub issues?

The evaluation setup is simple. The model receives a code repository and an issue description. It must produce a patch (a set of file modifications) that resolves the issue. The patch is then tested against developer-written unit tests: fail-to-pass tests that should fail before the fix and pass after, plus pass-to-pass tests that must continue passing.
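The fail-to-pass / pass-to-pass check can be sketched in a few lines of Python. This is a simplification of the idea, not the official SWE-bench harness:

```python
# Minimal sketch of SWE-bench's resolution criterion (not the official harness).
# A task counts as "resolved" only if every fail-to-pass test now passes AND
# every pass-to-pass test still passes after the patch is applied.

def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """test_results maps test IDs to True (passed) / False (failed),
    collected from a test run on the patched repository."""
    return (all(test_results.get(t, False) for t in fail_to_pass)
            and all(test_results.get(t, False) for t in pass_to_pass))

# Example: the patch fixes the issue but breaks an existing test -> not resolved
results = {"test_fix_issue": True, "test_existing_behavior": False}
print(is_resolved(results, ["test_fix_issue"], ["test_existing_behavior"]))  # False
```

Note that a missing test result counts as a failure: a patch that prevents the suite from running at all cannot resolve the task.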

Evaluation environment

Models run inside barebones Linux Docker containers with no network access. All git history after the issue's creation date is stripped to prevent models from looking up the human solution. Dependencies are installed per the SWE-bench specification.
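A container launch along these lines might look like the sketch below. The flags mirror the constraints just described (no network, repo snapshot mounted in), but the image name, paths, and test script are hypothetical, not the official SWE-bench images:

```python
# Illustrative container launch for an isolated evaluation run.
# Image name, mount point, and run_tests.sh are hypothetical examples.

def docker_run_args(image: str, repo_dir: str, base_commit: str) -> list[str]:
    return [
        "docker", "run", "--rm",
        "--network", "none",              # no network access during evaluation
        "-v", f"{repo_dir}:/workspace",   # repo checked out at the base commit
        "-e", f"BASE_COMMIT={base_commit}",
        image,
        "bash", "-c", "cd /workspace && ./run_tests.sh",
    ]

args = docker_run_args("swebench-env:py310", "/tmp/django", "abc123")
print("--network" in args and "none" in args)  # True
```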

The original SWE-bench contains 2,294 tasks from 12 Python open-source repositories (Django, scikit-learn, Flask, Sphinx, and others). When it launched, the best systems solved around 4% of tasks. By late 2024, that number had climbed past 50%, prompting two important follow-ups: SWE-bench Verified and SWE-bench Pro.

SWE-Bench Lite

SWE-bench Lite is a 300-task subset of the original, selected for faster evaluation runs. It remains Python-only and shares the same contamination concerns as the full set. It is useful for quick iteration during agent development but should not be used as a primary evaluation metric.

SWE-Bench Verified: The "Gold Standard" (and Its Problems)

SWE-bench Verified was released by OpenAI and Princeton NLP as a curated subset of 500 tasks. Human annotators verified each task against three criteria: the issue description is unambiguous, the tests are reliable (no flaky tests), and the expected behavior is clearly defined.

The curation process filters out problematic tasks from the original set, such as issues requiring network access, tasks with ambiguous specifications, or repos with broken dependency configurations. Epoch AI runs independent evaluations on 484 of the 500 samples (16 excluded due to infrastructure issues).

Verified Leaderboard (March 2026)

| Rank | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.5 | 80.9% |
| 2 | Claude Opus 4.6 | 80.8% |
| 3 | MiniMax M2.5 (open-weight) | 80.2% |
| 4 | GPT-5.2 | 80.0% |
| 5 | Gemini 3 Flash | 78.0% |
| 6 | GLM-5 | 77.8% |
| 7 | Claude Sonnet 4.5 | 77.2% |
| 8 | Kimi K2.5 | 76.8% |
| 9 | Gemini 3 Pro | 76.2% |
| 10 | GPT-5.1 | 74.9% |

Scores are self-reported by model providers with varying scaffolding. Source: aggregated from swebench.com and provider announcements.

Contamination confirmed

OpenAI's audit found that every frontier model tested (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) could reproduce verbatim gold patches for certain Verified tasks. They also found that 59.4% of the hardest unsolved problems had flawed test cases. OpenAI stopped reporting Verified scores and recommends SWE-bench Pro instead.

SWE-Bench Pro: The Harder Successor

SWE-bench Pro was built by Scale AI to fix the limitations of Verified. It has 1,865 tasks across 41 repositories in Python, Go, TypeScript, and JavaScript. Every task requires at least 10 lines of changes. The average is 107 lines across 4.1 files.

Public Set (731 tasks)

Tasks from 11 GPL-licensed repositories. This is the primary evaluation target for leaderboard submissions. All scores below reference this set.

Commercial Set (276 tasks)

Tasks from 18 proprietary startup codebases acquired through Scale AI partnerships. Not publicly accessible, providing additional contamination resistance.

Held-Out Set (858 tasks)

Reserved for overfitting detection. Scale AI can release these to verify whether improvements on the public set generalize to unseen code.

Three-Stage Human Augmentation

Each Pro task goes through expert annotation:

  1. Problem statement creation: original commit messages and issue discussions are synthesized into structured descriptions
  2. Requirements definition: specification lists grounded in unit tests and gold patches, detailing expected behavior without prescribing implementation
  3. Interface specification: class and function signatures documented to prevent false negatives from naming mismatches
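One way to picture the artifact these three stages produce is a simple record type. The field names below are illustrative, not Scale AI's actual schema:

```python
# Illustrative data model for a Pro task after three-stage annotation.
# Field names are my own invention, not Scale AI's schema.
from dataclasses import dataclass, field

@dataclass
class ProTask:
    problem_statement: str                                 # stage 1: synthesized description
    requirements: list[str] = field(default_factory=list)  # stage 2: expected behavior
    interface: list[str] = field(default_factory=list)     # stage 3: signatures to implement

task = ProTask(
    problem_statement="Parser drops trailing comments in config files.",
    requirements=["Trailing comments must survive a parse/dump round-trip."],
    interface=["def parse(text: str) -> Config", "def dump(cfg: Config) -> str"],
)
```

The stage-3 interface list is what prevents a correct patch from failing evaluation merely because the model picked different function names than the gold patch.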

For a detailed breakdown of Pro scores, SEAL rankings, and agent system results, see the SWE-Bench Pro leaderboard page.

2026 Leaderboard: Verified vs Pro vs Agent Systems

There are three ways to read the leaderboard, and conflating them leads to confusion. SEAL scores use Scale AI's standardized scaffolding with a 250-turn limit. Agent system scores use custom scaffolding (Claude Code, Codex CLI, Auggie, Cursor). Verified scores use provider-specific setups. These numbers are not directly comparable.

| Rank | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.5 | 45.9% |
| 2 | Claude Sonnet 4.5 | 43.6% |
| 3 | Gemini 3 Pro | 43.3% |
| 4 | Claude Sonnet 4 | 42.7% |
| 5 | GPT-5 (High) | 41.8% |
| 6 | GPT-5.2 Codex | 41.0% |
| 7 | Claude Haiku 4.5 | 39.5% |
| 8 | Qwen3 Coder 480B | 38.7% |
| 9 | MiniMax 2.1 | 36.8% |
| 10 | Gemini 3 Flash | 34.6% |

Source: Scale AI SEAL Leaderboard. SEAL = Scale's Evaluation and Assessment Lab. 250-turn limit, standardized scaffolding.

| Agent | Base Model | Score |
|---|---|---|
| GPT-5.3-Codex (CLI) | GPT-5.3-Codex | 57.0% |
| Claude Code | Opus 4.5 | 55.4% |
| Auggie | Opus 4.5 | 51.8% |
| Cursor | Opus 4.5 | 50.2% |

The gap between SEAL scores and agent system scores is instructive. Auggie, Cursor, and Claude Code all use the same base model (Opus 4.5) yet score between 50.2% and 55.4%. The SEAL score for that model is 45.9%. The difference is scaffolding: how the agent retrieves context, manages its context window, and orchestrates tool calls.

Scaffolding matters more than model choice

On SWE-bench Pro, switching from standardized scaffolding to a good agent system lifts the same model by 5-15 points. Context retrieval is the bottleneck, not raw model capability. This aligns with research showing coding agents spend 60%+ of their time searching rather than writing code.

SWE-Bench Variants Compared

| Variant | Tasks | Languages | Top Score | Status |
|---|---|---|---|---|
| Original (Full) | 2,294 | Python | ~65% | Active |
| Lite | 300 | Python | ~55% | Active (fast eval) |
| Verified | 500 | Python | 80.9% | Contaminated |
| Pro | 1,865 | Py, Go, TS, JS | ~59% | Recommended |
| Multilingual | 300 | 9 languages | ~45% | Active |
| Live | 1,565+ | Multiple | ~40% | Monthly updates |

| Dimension | Verified | Pro |
|---|---|---|
| Tasks | 500 | 1,865 |
| Repositories | 12 (Python-only) | 41 (Python, Go, TS, JS) |
| Avg lines changed | 11 (median: 4) | 107.4 |
| Avg files changed | ~1 | 4.1 |
| Contamination resistance | Low (all public repos) | High (GPL + proprietary) |
| Human annotation | Validation only | 3-stage augmentation |

161 of Verified's 500 tasks require only 1-2 lines of change. Every Pro task requires at least 10. The complexity difference explains why the same model scores 80%+ on Verified and under 46% on Pro with standardized scaffolding.

Other Coding Benchmarks Worth Knowing

SWE-bench measures repository-level bug fixing. Other benchmarks target different skills. A complete picture of coding agent capability requires looking at multiple evaluations.

| Benchmark | What It Tests | Tasks | Top Score | Saturated? |
|---|---|---|---|---|
| SWE-bench Pro | Repo-level bug fixing | 1,865 | ~59% | No |
| Terminal-Bench 2.0 | Terminal/CLI agent tasks | 89 | ~63% | No |
| LiveCodeBench | Competitive programming | 1,000+ | 91.7% | Partially |
| HumanEval | Single-function Python | 164 | 99.0% | Yes |
| BigCodeBench | Complex function calls | 1,140 | ~35% | No |
| SWE-bench Verified | Repo-level (Python only) | 500 | 80.9% | Contaminated |

Terminal-Bench 2.0

Terminal-Bench 2.0 evaluates agents on 89 realistic terminal tasks: file manipulation, system administration, data processing, and debugging through command-line interfaces. Released in January 2026. Codex CLI with GPT-5.2 leads at 63%, followed by Terminus 2 with Claude Opus 4.5 at 58%. On the hard subset, top scores cluster around 53%.

Terminal-Bench tests a different axis than SWE-bench. An agent might excel at generating code patches but struggle with multi-step terminal workflows. The two benchmarks are complementary.

HumanEval and Its Successors

HumanEval, created by OpenAI in 2021, evaluates models on 164 Python function-writing tasks with hidden test cases. It was the first widely adopted coding benchmark. It is now saturated: Kimi K2.5 scores 99.0%, and most frontier models exceed 95%. A model scoring 95% on HumanEval tells you almost nothing about its ability to work in real codebases.
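To make the contrast with repository-level work concrete, a HumanEval task looks like this: a single function stub with a docstring, checked by hidden unit tests. The example below is modeled on HumanEval's well-known first problem (`has_close_elements`); the test inputs are my own:

```python
# A HumanEval-style task: implement one function from its docstring,
# then hidden unit tests verify the behavior. Modeled on HumanEval/0.

def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer than threshold."""
    return any(abs(a - b) < threshold
               for i, a in enumerate(numbers)
               for b in numbers[i + 1:])

# Hidden-test-style checks
assert has_close_elements([1.0, 2.0, 3.9], 0.3) is False
assert has_close_elements([1.0, 2.8, 3.0], 0.3) is True
```

Nothing here involves a codebase, an issue tracker, or existing tests to preserve, which is exactly why saturation on this format says so little about software engineering.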

BigCodeBench was designed as a successor. It tests complex function calls with diverse library usage. Human performance on BigCodeBench is 97%. The best AI models score around 35%. LiveCodeBench uses fresh competitive programming problems to avoid contamination; Gemini 3 Pro Preview leads at 91.7%.

SWE-bench Live and Multilingual

SWE-bench Live adds new tasks monthly from recent GitHub activity, making contamination through training data nearly impossible. SWE-bench Multilingual extends to 9 languages with 300 tasks. Both are active community efforts with growing adoption.

Criticisms and Limitations

Benchmarks are proxies for real-world capability, and every proxy has failure modes. Four structural problems affect coding benchmarks broadly:

Data Contamination

Models trained on public code inevitably see benchmark tasks in their training data. OpenAI's audit confirmed that GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash can all reproduce verbatim patches for some SWE-bench Verified tasks. There are no industry standards for contamination detection, and each lab uses different thresholds. SWE-bench Pro mitigates this with proprietary codebases and a held-out set, but contamination remains an arms race.

Benchmark Gaming

Goodhart's Law in action: when a benchmark becomes a target, teams optimize for it. The original SWE-bench was Python-only, so teams tuned for Python-specific patterns. Agent scaffolding is often designed around benchmark-specific workflows rather than general-purpose problem solving. A model that reaches 59% on SWE-bench Pro with custom scaffolding may underperform on unfamiliar repositories with different structures.

Saturation

HumanEval went from 13% (Codex, 2021) to 99% (Kimi K2.5, 2026) in under five years. SWE-bench Verified climbed from 4% to 80.9% in under three years. When top models cluster within a few percentage points of each other, the benchmark stops differentiating. SWE-bench Pro still has headroom (top is ~59%), but the history of other benchmarks suggests that headroom shrinks faster than expected.

Gap Between Benchmarks and Production

High benchmark scores can mask quality problems in real deployments. A model might solve 80% of Verified tasks but produce code with higher bug rates, miss edge cases, or fail on codebases structurally different from the benchmark repos. Benchmarks test whether the final patch passes tests. They do not measure code quality, maintainability, or whether the agent's approach would be acceptable in a code review.

What Benchmarks Tell You (and What They Don't)

| Benchmarks Tell You | Benchmarks Don't Tell You |
|---|---|
| Relative model ranking on specific tasks | How the model performs on your codebase |
| Whether a model can generate working patches | Whether the patches are maintainable |
| How much scaffolding matters (SEAL vs agent scores) | Which scaffolding works best for your use case |
| Where models fail (context overflow, semantic errors) | How often those failures occur in your workflow |
| Approximate ceiling of capability for a model family | Cost-per-task for your specific workload |

The most useful signal from SWE-bench is not the absolute score. It is the gap between SEAL scores and agent system scores. When Opus 4.5 jumps from 45.9% (SEAL) to 51.8% (Auggie), that 5.9-point lift comes entirely from better context retrieval and agent design. For teams building coding agents, this means investing in scaffolding (how you retrieve context, manage the context window, and orchestrate tool calls) yields higher returns than switching to a marginally better base model.
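The lift is easy to compute directly from the two Pro tables, using the scores quoted in this article:

```python
# Scaffolding lift for Opus 4.5 on SWE-bench Pro, using scores quoted above.
seal_score = 45.9  # standardized SEAL scaffolding
agent_scores = {"Auggie": 51.8, "Cursor": 50.2, "Claude Code": 55.4}

for agent, score in sorted(agent_scores.items(), key=lambda kv: -kv[1]):
    print(f"{agent}: +{score - seal_score:.1f} points over SEAL")
# Claude Code: +9.5 points over SEAL
# Auggie: +5.9 points over SEAL
# Cursor: +4.3 points over SEAL
```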

For more on how benchmarks apply to specific models, see Claude Benchmarks, Codex 5.3 vs Opus 4.6, and the full SWE-Bench Pro leaderboard.

Frequently Asked Questions

What is SWE-bench?

SWE-bench is a benchmark created by Princeton NLP researchers in October 2023 that evaluates AI coding agents on real GitHub issues. The model receives a codebase and an issue description, then must generate a patch that resolves the problem. The original dataset contains 2,294 Python tasks from 12 open-source repositories.

What is SWE-bench Verified?

A human-validated subset of 500 tasks. Annotators verified that each task has clear requirements and reliable tests. It was the gold standard for coding agent evaluation until OpenAI confirmed contamination across all frontier models. Top score: Claude Opus 4.5 at 80.9%.

What is SWE-bench Pro?

Scale AI's harder benchmark with 1,865 tasks across Python, Go, TypeScript, and JavaScript. Tasks require 107 lines changed across 4.1 files on average. It uses GPL and proprietary codebases to resist contamination. Top agent system score: ~59%. See the full Pro leaderboard.

Who has the highest SWE-bench score in 2026?

On Verified: Claude Opus 4.5 at 80.9%. On Pro (SEAL, standardized): Claude Opus 4.5 at 45.9%. On Pro (custom agent): GPT-5.3-Codex at 57.0%. These numbers are not directly comparable because evaluation setups differ.

Is HumanEval still useful?

Mostly no. Top models score 99%. It tests single-function Python generation, which does not reflect real software engineering. BigCodeBench and LiveCodeBench are more challenging successors.

What is Terminal-Bench 2.0?

A benchmark with 89 tasks for terminal-based agent work: file manipulation, system administration, debugging through CLIs. Frontier models score under 65%. It tests different capabilities from SWE-bench.

Are AI coding benchmarks reliable?

They have known limitations: contamination, gaming, and saturation. SWE-bench Pro addresses contamination with proprietary code and held-out sets. No benchmark fully predicts real-world performance. Use multiple benchmarks and weight the SEAL-vs-agent-system gap more than absolute scores.

What does SWE-bench tell me about choosing a coding agent?

Scaffolding matters more than model choice. The same model (Opus 4.5) scores 45.9% with standardized scaffolding and 51.8% with Auggie's scaffolding. If you are building or choosing a coding agent, invest in context retrieval and agent design. SWE-bench Pro is the best available proxy for production readiness, but test on your own codebase too.

Better Context Retrieval = Higher Benchmark Scores

WarpGrep v2 lifts every model it is paired with by 2-4 points on SWE-Bench Pro. It runs in its own context window, issues 8 parallel tool calls per turn, and makes coding agents 15.6% cheaper and 28% faster. The bottleneck is search, not generation.