Six benchmarks, six different leaderboards. This guide covers what each AI coding benchmark actually tests, the current scores for every frontier model, and which evaluation matters for your use case.
Leaderboard Summary: Models Across Benchmarks
No model wins every benchmark. The rankings shift depending on what you measure. This table gives a cross-benchmark view of the five most-cited frontier models.
| Model/Agent | SWE-Bench Verified | SWE-Bench Pro (SEAL) | Terminal-Bench | Aider Polyglot |
|---|---|---|---|---|
| Claude Opus 4.6 | ~80.8% | ~45.9%* | ~65.4% | ~85% |
| GPT-5.3 Codex | ~77.3%** | ~41.8%*** | ~77.3% | ~80% |
| Gemini 3.1 Pro | ~80.6% | ~43.3% | ~77.3% | ~78% |
| Qwen 3.5 | ~72% | ~38.7% | N/A | ~75% |
| DeepSeek V3.2 | ~73% | ~33% | N/A | ~72% |
*Opus 4.5 score on SEAL. **GPT-5.2 score (5.3 not separately reported on Verified). ***GPT-5 High on SEAL. Aider Polyglot scores are approximate from aider.chat leaderboard.
Claude Opus 4.6 sits at the top of SWE-Bench Verified, within a tenth of a point of Opus 4.5, but trails GPT-5.3 Codex and Gemini 3.1 Pro on Terminal-Bench by roughly 12 points. Gemini 3.1 Pro is competitive across all four benchmarks. DeepSeek V3.2 and Qwen 3.5 score lower across the board but cost a fraction as much per token.
SWE-Bench Verified
What it tests
Real GitHub issues from popular Python repos (Django, Matplotlib, Scikit-learn, etc.). ~500 human-validated tasks from the original 2,294. Each task is a real bug report paired with a test that must pass.
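The scoring rule is simple to state: a patch resolves a task only if the tests that reproduced the bug now pass (FAIL_TO_PASS) and the tests that already passed still do (PASS_TO_PASS). A minimal sketch of that rule, with illustrative names rather than the official harness:

```python
# Sketch of the SWE-Bench resolution check (toy names, not the real harness).
# A task is "resolved" only if the bug-reproducing tests now pass AND the
# previously-passing tests were not broken by the patch.

def is_resolved(run_test, fail_to_pass, pass_to_pass):
    """run_test(test_id) -> bool, executed against the patched repo."""
    return all(run_test(t) for t in fail_to_pass) and \
           all(run_test(t) for t in pass_to_pass)

# Toy example: the patch fixes the reported bug without breaking the suite.
results_after_patch = {"test_bug_repro": True, "test_existing_api": True}
print(is_resolved(results_after_patch.get,
                  fail_to_pass=["test_bug_repro"],
                  pass_to_pass=["test_existing_api"]))  # → True
```

A patch that makes the new test pass but breaks an existing one scores zero, which is why partial fixes don't help on this benchmark.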
Who runs it
Princeton NLP Group, originally by Carlos E. Jimenez et al. The most-cited benchmark for AI coding agents since its introduction in 2023.
Strengths and limits
Real-world tasks from real repos. But Python-only, single-repo, and contaminated. OpenAI confirmed every frontier model shows training data leakage.
| Rank | Model | Score | Provider |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 80.9% | Anthropic |
| 2 | Claude Opus 4.6 | 80.8% | Anthropic |
| 3 | Gemini 3.1 Pro | 80.6% | Google |
| 4 | MiniMax M2.5 | 80.2% | MiniMax |
| 5 | GPT-5.2 | 80.0% | OpenAI |
| 6 | Claude Sonnet 4.6 | 79.6% | Anthropic |
| 7 | Gemini 3 Flash | 78.0% | Google |
| 8 | Claude Sonnet 4.5 | 77.2% | Anthropic |
| 9 | Kimi K2.5 | 76.8% | Moonshot |
| 10 | Claude Haiku 4.5 | 73.3% | Anthropic |
| 11 | DeepSeek V3.2 | 73.0% | DeepSeek |
Contamination caveat
OpenAI confirmed that every frontier model shows training data contamination on SWE-Bench Verified, and 59.4% of the hardest unsolved tasks had flawed tests. OpenAI stopped reporting Verified scores and recommends SWE-Bench Pro instead. The scores above are useful for relative comparison but not as absolute measures of capability.
The top 6 models are separated by 1.3 points. At this level of compression, the ranking is as much about scaffold tuning and prompt engineering as about raw model capability. For a deeper look at SWE-Bench Verified methodology and history, see our SWE-Bench explainer.
SWE-Bench Pro
What it tests
Multi-file changes across larger codebases. 1,865 tasks requiring an average of 107 lines across 4.1 files. Multi-language: Python, JavaScript, TypeScript, Go, Rust, and more.
Who runs it
Scale AI, via the SEAL (Safety, Evaluations, and Alignment Lab) platform. The SEAL leaderboard uses standardized scaffolding with a 250-turn limit.
Why it matters
Resists contamination through GPL-licensed and proprietary codebases. Pass rates run 30-40 points lower than on Verified, which means the spread between models is wider and more informative.
SEAL Leaderboard (Standardized Scaffolding)
The SEAL leaderboard isolates model capability by holding the agent framework constant. All models get the same tools, same turn limit, same prompts.
| Rank | Model | Score | CI |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 45.9% | ±3.60 |
| 2 | Claude Sonnet 4.5 | 43.6% | ±3.60 |
| 3 | Gemini 3 Pro | 43.3% | ±3.60 |
| 4 | Claude Sonnet 4 | 42.7% | ±3.59 |
| 5 | GPT-5 (High) | 41.8% | ±3.49 |
| 6 | GPT-5.2 Codex | 41.0% | ±3.57 |
| 7 | Claude Haiku 4.5 | 39.5% | ±3.55 |
| 8 | Qwen3 Coder 480B | 38.7% | ±3.55 |
Agent Systems (Custom Scaffolding)
When agents bring their own frameworks, context retrieval, and tool access, scores jump significantly. This measures the combined capability of model + agent system.
| Agent | Base Model | Score |
|---|---|---|
| GPT-5.3-Codex (CLI) | GPT-5.3-Codex | 57.0% |
| Claude Code | Opus 4.5 | 55.4% |
| Auggie (Augment Code) | Opus 4.5 | 51.8% |
| Cursor | Opus 4.5 | 50.2% |
The gap between the SEAL score (45.9%) and the best agent system (57.0%) is 11 points. Scaffolding matters. Augment Code's context engine, which handles 400K+ file codebases, adds ~6 points over the bare SEAL scaffold. For more detail on SWE-Bench Pro methodology, see our SWE-Bench Pro deep dive.
Subagent impact
In Morph internal benchmarks, adding WarpGrep v2 as a search subagent lifted Opus 4.6 from 55.4% to 57.5% on SWE-Bench Pro while making it 15.6% cheaper and 28% faster. The model matters, but the agent framework matters almost as much.
Terminal-Bench
What it tests
CLI-specific agent workflows: file editing, git operations, test running, multi-step debugging, environment setup. Tests whether agents can operate a computer through a terminal.
Who runs it
Published by the Terminal-Bench team, with contributions from Paul Gauthier, the author of Aider. Tests run in Docker containers with real terminal access.
Why it matters
SWE-Bench tests code changes. Terminal-Bench tests the full development workflow: read code, run tests, interpret output, fix issues, commit. This is closer to how developers actually use coding agents.
| Rank | Model/Agent | Score |
|---|---|---|
| 1 | GPT-5.3 Codex | ~77.3% |
| 2 | Gemini 3.1 Pro | ~77.3% |
| 3 | Claude Code (Opus 4.6) | ~72% |
| 4 | Aider (Opus 4.6) | ~67% |
| 5 | Codex CLI | ~65% |
Terminal-Bench rewards fast iteration and targeted command execution. The models that score highest issue short, precise terminal commands rather than planning long sequences. GPT-5.3 Codex and Gemini 3.1 Pro are tied at the top. Claude Code trails by ~5 points, though the gap has narrowed from earlier versions.
This is the benchmark where Claude has the most room to improve. On SWE-Bench Verified, the gap between Claude and the field is less than 1 point. On Terminal-Bench, it's 5+.
Aider Polyglot Benchmark
What it tests
Code generation across 225 Exercism problems in six languages: C++, Go, Java, JavaScript, Python, and Rust. Tests the model, not the agent.
Who runs it
Paul Gauthier, the creator of Aider. Published at aider.chat/docs/leaderboards. Uses Aider's edit format to measure how well models follow structured instructions.
Key distinction
This benchmark tests raw model capability across languages, not agent orchestration. A model that scores well here generates correct code but may still fail at multi-file agent tasks.
| Rank | Model | Score | Edit Format |
|---|---|---|---|
| 1 | Claude Opus 4.6 | ~85% | diff |
| 2 | Claude Sonnet 4.6 | ~82% | diff |
| 3 | GPT-5.3 | ~80% | diff |
| 4 | Gemini 3.1 Pro | ~78% | diff |
| 5 | Qwen 3.5 | ~75% | diff |
| 6 | DeepSeek V3.2 | ~72% | diff |
Claude Opus 4.6 leads the Aider Polyglot benchmark, consistent with its strong showing on SWE-Bench Verified. The edit format matters: models tested with Aider's diff format score higher than with whole-file replacement, because diff-based editing requires fewer tokens and reduces the chance of introducing unrelated changes.
Because Aider Polyglot tests model capability in isolation, it's useful for answering a specific question: which model generates the most correct code, independent of agent framework? If you're building your own coding agent and choosing a base model, this benchmark is more relevant than SWE-Bench Pro.
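The diff edit format can be sketched as a search/replace edit: the model emits the exact lines to find and their replacement, and the harness substitutes only that region. This is a simplified illustration in the spirit of Aider's format, not its actual parser:

```python
import re

# Simplified search/replace edit, loosely modeled on Aider's diff edit
# format (the real parser lives in the aider repo and handles more cases).

EDIT = """\
<<<<<<< SEARCH
def greet(name):
    return "Hi " + name
=======
def greet(name: str) -> str:
    return f"Hi {name}"
>>>>>>> REPLACE
"""

def apply_edit(source: str, edit: str) -> str:
    m = re.search(r"<<<<<<< SEARCH\n(.*?)=======\n(.*?)>>>>>>> REPLACE",
                  edit, re.DOTALL)
    if m is None:
        raise ValueError("malformed edit block")
    search, replace = m.group(1), m.group(2)
    if search not in source:
        raise ValueError("search block not found; edit rejected")
    return source.replace(search, replace, 1)

original = 'def greet(name):\n    return "Hi " + name\n'
patched = apply_edit(original, EDIT)
print(patched)
```

Rejecting an edit whose search block doesn't match is the key design choice: it forces the model to reproduce the current file contents exactly, which is precisely the instruction-following skill the benchmark measures.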
LiveCodeBench
What it tests
Competitive programming problems published after model training cutoffs. Problems sourced from LeetCode, Codeforces, and AtCoder. Harder than HumanEval by a wide margin.
Contamination resistance
The key advantage: because problems are sourced after training data cutoffs, models cannot have memorized solutions. This gives a cleaner signal of actual reasoning ability.
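The filtering rule itself is trivial, which is part of the appeal. A sketch with invented problem IDs and dates:

```python
from datetime import date

# Core idea behind LiveCodeBench: only score a model on problems published
# after its training cutoff, so memorized solutions can't inflate scores.
# (Problems and dates below are made up for illustration.)

problems = [
    {"id": "p1", "released": date(2025, 1, 10)},
    {"id": "p2", "released": date(2025, 8, 2)},
    {"id": "p3", "released": date(2026, 1, 15)},
]

def eval_set(problems, model_cutoff):
    """Return the IDs this model may be fairly evaluated on."""
    return [p["id"] for p in problems if p["released"] > model_cutoff]

print(eval_set(problems, model_cutoff=date(2025, 6, 1)))  # → ['p2', 'p3']
```

Because the eligible set rolls forward with each model's cutoff, scores across models with different cutoffs are computed on different problem subsets, which is worth remembering when comparing them.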
Limitations
Competitive programming is a narrow domain. Many problems reward algorithmic tricks rather than software engineering skills. Does not test multi-file editing, debugging, or real-world codebase navigation.
LiveCodeBench is the best benchmark for measuring genuine problem-solving on unseen tasks. The contamination problem that plagues HumanEval and (to a lesser extent) SWE-Bench Verified is largely solved here by using problems that did not exist when models were trained.
Scores are meaningfully lower than on HumanEval. Where frontier models hit 95%+ on HumanEval, LiveCodeBench pass rates for the hardest problems drop to 40-60%. This spread makes it easier to differentiate between models.
The trade-off: competitive programming problems don't map cleanly to real software engineering. A model that solves dynamic programming problems efficiently may still struggle with understanding a 50,000-line Django codebase. Use LiveCodeBench alongside SWE-Bench Pro, not as a replacement.
HumanEval & MBPP
HumanEval (164 problems)
OpenAI's original code generation benchmark. Given a function docstring, generate a correct implementation. Created in 2021. Every frontier model now scores 90%+.
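HumanEval scores are typically reported as pass@k: the probability that at least one of k sampled completions passes the tests. The unbiased estimator from the original HumanEval paper (Chen et al., 2021), given n samples of which c passed, is pass@k = 1 - C(n-c, k)/C(n, k):

```python
from math import comb

# Unbiased pass@k estimator from the HumanEval paper:
#   pass@k = 1 - C(n - c, k) / C(n, k)
# where n samples were drawn per problem and c of them passed the tests.

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # too few failures to fill k draws: a pass is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))   # 5/20 samples passed → 0.25
print(pass_at_k(n=20, c=5, k=10))  # more draws, higher chance of a hit
```

Per-problem estimates are averaged across the 164 problems to get the headline number; pass@1 is what launch posts usually quote.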
MBPP (974 problems)
Google's Mostly Basic Python Programs. Similar to HumanEval but with 6x more problems. Also largely saturated at the frontier.
| Model | HumanEval | Notes |
|---|---|---|
| DeepSeek R1 | 96.1% | Highest reported |
| Claude Opus 4.6 | ~95.0% | |
| GPT-5.2 | ~95.0% | |
| Gemini 3.1 Pro | ~94% | |
| Codestral 25.01 | 86.6% | Open-weight |
HumanEval is saturated. The 1-point gap between the top models is noise, not signal. These benchmarks still appear in every model launch blog post because they were the original standard, but they no longer differentiate frontier models.
Beyond HumanEval
EvalPlus extends HumanEval with 80x more test cases per problem, catching false positives from solutions that pass narrow tests but fail on edge cases. HumanEval Pro tests self-invoking code generation, where even top models drop 20+ points. BigCodeBench adds realistic library usage. If you need to evaluate models on raw code generation, these are better tools than HumanEval in 2026.
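The false positive EvalPlus targets is easy to reproduce with a toy example (the problem and tests below are invented, not drawn from the benchmark): a solution passes a narrow happy-path check while being wrong on an edge case the extended suite covers.

```python
# Toy illustration of the false positive EvalPlus is designed to catch.

def median(xs):
    # Buggy: correct for odd-length lists, wrong for even-length ones.
    return sorted(xs)[len(xs) // 2]

# Narrow HumanEval-style check: passes.
assert median([3, 1, 2]) == 2

# EvalPlus-style extended check: the even-length edge case fails.
try:
    assert median([1, 2, 3, 4]) == 2.5
    print("passed extended tests")
except AssertionError:
    print("failed extended tests")  # this branch runs
```

With only the narrow test, this solution counts as a pass; with the extended suite it correctly counts as a failure, which is why EvalPlus scores run lower than vanilla HumanEval scores for the same model.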
What Benchmarks Don't Tell You
Benchmarks answer a narrow question under controlled conditions. Several things that matter in production are not measured by any major benchmark.
Cost and latency
No benchmark measures cost per task or time to completion. A model that scores 80% but costs $2 per task may be worse than one scoring 75% at $0.20. Agent-level benchmarks should report $/task alongside accuracy.
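One crude way to put accuracy and cost on a single axis is resolved tasks per dollar. Using the hypothetical figures from the paragraph above:

```python
# Toy comparison using the hypothetical numbers in the text: a lower-scoring
# model can be far more cost-efficient per resolved task.

models = {
    "model_a": {"score": 0.80, "cost_per_task": 2.00},  # hypothetical
    "model_b": {"score": 0.75, "cost_per_task": 0.20},  # hypothetical
}

for name, m in models.items():
    tasks_per_dollar = m["score"] / m["cost_per_task"]
    print(f"{name}: {tasks_per_dollar:.2f} resolved tasks per dollar")
# model_a: 0.40, model_b: 3.75 — the "weaker" model is ~9x more efficient.
```

A metric like this is too blunt for production decisions (task value varies, and failed attempts still cost money), but it makes the accuracy-only blind spot concrete.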
Contamination and gaming
Models train on public code. SWE-Bench Verified tasks come from popular open-source repos that are likely in every training set. Even Pro has contamination vectors. High scores may reflect memorization, not generalization.
Multi-agent orchestration
No benchmark tests multi-agent workflows: coordinator agents dispatching subagents, parallel execution across files, or hierarchical task decomposition. Anthropic's internal data shows 90% improvement from multi-agent setups.
Real workflow integration
Production coding involves PR reviews, CI feedback loops, reading documentation, communicating with teammates, and handling ambiguous requirements. No benchmark captures this end-to-end flow.
The benchmark gap
Cognition (Devin's team) measured that their agent spends 60% of its time on search and context retrieval, not code generation. No coding benchmark isolates or measures search efficiency. This is why two agents using the same model can score 10+ points apart on SWE-Bench Pro: the difference is in how they find and retrieve context, not in how they generate code.
The biggest blind spot: benchmarks test single-task completion. Real engineering productivity comes from an agent that handles 50 tasks across a day, each with different context. Context rot accumulates over long sessions, degrading performance in ways that single-task benchmarks never expose.
Which Benchmark Should You Trust?
| Benchmark | Best For | Watch Out For |
|---|---|---|
| SWE-Bench Pro | Evaluating agents for production SE work | Scores vary 10+ pts by scaffold |
| SWE-Bench Verified | Quick directional model comparison | Contaminated; saturated at top |
| Terminal-Bench | Evaluating DevOps and CLI agents | Limited coverage; newer benchmark |
| Aider Polyglot | Choosing a base model for your agent | Tests model only, not agent system |
| LiveCodeBench | Contamination-free reasoning test | Competitive programming != SE |
| HumanEval / MBPP | Baseline sanity check | Saturated; does not differentiate frontier |
If you can only look at one benchmark, make it SWE-Bench Pro on the SEAL leaderboard. It tests the most realistic tasks, resists contamination, and uses standardized scaffolding for fair comparison. Use the SEAL scores for model comparison and agent system scores for evaluating complete products.
If you're choosing a base model for a custom agent, pair SWE-Bench Pro (SEAL) with Aider Polyglot. The first tells you how the model performs in a standardized agent framework. The second tells you how well it generates correct code in isolation.
If you need to evaluate terminal and DevOps workflows specifically, Terminal-Bench is the only game in town. And if you need contamination-free evaluation of problem-solving ability, LiveCodeBench is the cleanest signal available.
Frequently Asked Questions
What is the most important AI coding benchmark in 2026?
SWE-Bench Pro is the best single benchmark for production coding agents. It tests multi-file, multi-language changes across 1,865 tasks averaging 107 lines across 4.1 files. The SEAL leaderboard standardizes scaffolding for fair comparison. Unlike SWE-Bench Verified, it resists contamination through GPL licensing and proprietary code.
Which AI model scores highest on coding benchmarks?
It depends on the benchmark. Claude Opus 4.5 and Opus 4.6 lead SWE-Bench Verified at ~80.9% and ~80.8%. GPT-5.3 Codex and Gemini 3.1 Pro share the Terminal-Bench lead at ~77.3%. On SWE-Bench Pro (SEAL), Claude Opus 4.5 leads at ~45.9%. No single model wins every evaluation.
Is SWE-Bench Verified contaminated?
Yes. OpenAI confirmed that every frontier model shows training data leakage on SWE-Bench Verified, and 59.4% of the hardest unsolved tasks had flawed tests. OpenAI stopped reporting Verified scores. The scores are still directionally useful but should not be the sole basis for model selection.
What does the Aider Polyglot benchmark measure?
The Aider Polyglot benchmark tests raw model code generation across 225 Exercism problems in C++, Go, Java, JavaScript, Python, and Rust. Published at aider.chat. It measures the model in isolation, not the agent framework, making it useful for comparing base model capability.
How do AI coding benchmarks differ from each other?
Each benchmark tests a different slice of coding ability. SWE-Bench Verified: single-repo Python bug fixing (~500 tasks). SWE-Bench Pro: multi-file, multi-language engineering (~1,865 tasks). Terminal-Bench: CLI and terminal workflows. Aider Polyglot: raw model capability across languages. LiveCodeBench: post-training-cutoff competitive programming. HumanEval: isolated function generation (saturated). For a deeper comparison, see our guides on SWE-Bench and AI coding agents.
Build Faster Coding Agents with WarpGrep
WarpGrep v2 lifted every model it was paired with by 2+ points on SWE-Bench Pro. It runs in its own context window, issues 8 parallel tool calls per turn, and makes your coding agent 15.6% cheaper and 28% faster. Available through the Morph API.