GLM-5 is Zhipu AI's 744-billion-parameter open-source model. Released February 11, 2026. 77.8% SWE-bench Verified. 200K context. MIT license. This is the technical breakdown: what the architecture actually does, what the benchmarks actually measure, and why compute constraints are limiting who can use it right now.
The 84x Search Spike
In January 2026, monthly search volume for "glm 5" was roughly 480. In February, it was 40,500: an 84-fold increase in one month. Search data like that only happens when a model ships and word spreads fast. GLM-5 released February 11, 2026, and the developer community noticed quickly.
What drove the interest: 77.8% on SWE-bench Verified from a model with open weights and an MIT license. That combination does not exist elsewhere. Frontier-level coding performance with full commercial permission and downloadable weights is rare. The search spike reflects developers evaluating whether this is real.
Architecture: 744B with 40B Active
GLM-5 is a Mixture of Experts model. The total parameter count is 744 billion, but only 40 billion activate for any given token during inference. A router network decides which expert modules to engage based on the input. Most of the 744B sits dormant for any single forward pass.
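The routing step can be illustrated with a toy top-k gate. This is a generic MoE sketch, not Zhipu's published router (the internals are not documented beyond the glm_moe_dsa architecture tag); the expert count, gating function, and value of k are all placeholder choices:

```python
import math

def top_k_route(router_logits, k=2):
    # Indices of the k largest logits, then softmax over just those k,
    # so the selected experts' mixture weights sum to 1.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

# Eight toy "experts": each is a scalar function standing in for an expert FFN.
experts = [lambda x, m=m: m * x for m in range(1, 9)]

def moe_forward(x, router_logits, k=2):
    # Only the routed experts execute; the rest stay dormant for this token.
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

routed = top_k_route([0.1, 2.0, 0.3, 1.5, -1.0, 0.0, 0.2, 0.4], k=2)
# For these logits, experts 1 and 3 carry all of the mixture weight.
```

The same principle at GLM-5's scale: the router selects a small subset of expert blocks per token, so per-token FLOPs track the 40B active parameters, not the 744B total.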
The predecessor GLM-4.7 was 355B total with 32B active, trained on 23T tokens. GLM-5 scales the total capacity significantly while keeping active compute close to GLM-4.7. Pre-training data grew from 23T to 28.5T tokens. The architecture file identifies the model type as glm_moe_dsa, confirming both the MoE design and the DSA integration.
Compared to dense models: Claude Opus 4.6 is believed to be in the 200-300B parameter range based on inference speed and pricing. GPT-5.3 architecture is not public. GLM-5's 40B active parameters during inference puts its actual compute cost closer to a 40B dense model than a 744B one, which explains the $1.00/M input pricing.
Mixture of Experts (MoE)
744B total parameters, 40B active per token. Expert routing selects which parameter subsets engage for each input. Predecessor GLM-4.7 was 355B/32B. Inference cost stays close to 40B dense compute.
28.5 Trillion Training Tokens
Pre-trained on 28.5T tokens, up from 23T for GLM-4.7. Covers Chinese and English with heavy emphasis on code, reasoning, and agent interaction data.
Quantization Options
Released in BF16, FP8, and F32. The FP8 variant on Hugging Face (zai-org/GLM-5-FP8) is the practical choice for self-hosted deployment, reducing memory requirements while maintaining benchmark parity.
MIT License
Fully permissive commercial license. Use in production, modify, redistribute, build products on top. No usage restrictions. This is unusual for a frontier-class model.
DeepSeek Sparse Attention (DSA)
Standard transformer attention computes interactions between every pair of tokens in a sequence. For a sequence of length N, that is N² operations. At 200K tokens, that is 40 billion attention computations per layer per head. Dense attention at 200K context is extraordinarily expensive.
DeepSeek Sparse Attention, originally developed by DeepSeek and adopted by Zhipu for GLM-5, replaces full quadratic attention with sparse patterns. Instead of attending to all N tokens, each position attends to a structured subset: local window neighbors, strided distant tokens, and a small set of globally attended positions. The key token pairs that carry semantic weight get computed. Most of the N² interactions, which are near-zero anyway, get skipped.
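The pattern family can be sketched as a boolean attention mask. This is an illustrative sparse layout, not the actual DSA kernel; the window, stride, and global-token counts below are made-up parameters:

```python
def sparse_mask(n, window=4, stride=8, n_global=2):
    # Causal attention pairs kept under a toy sparse pattern combining
    # a local window, strided distant tokens, and a few global positions.
    allowed = set()
    for q in range(n):
        for kpos in range(q + 1):          # causal: keys at or before the query
            local = q - kpos <= window     # nearby neighbors
            strided = kpos % stride == 0   # strided distant tokens
            is_global = kpos < n_global    # globally attended positions
            if local or strided or is_global:
                allowed.add((q, kpos))
    return allowed

n = 64
kept = len(sparse_mask(n))
dense = n * (n + 1) // 2  # all causal pairs under dense attention
# kept / dense shrinks as n grows, so the savings compound with context length.
```

At 200K tokens the dense pair count is what makes full attention prohibitive; a structured mask like this reduces the computed pairs to a small, roughly linear-in-N subset.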
The effect for GLM-5: long-context inference is substantially cheaper than it would be with dense attention. The 200K context window becomes practical to use, not just theoretically supported. DSA also reduces training compute, which lets the team run more post-training experiments at 744B scale.
Why DSA is an architectural import, not original research
DeepSeek Sparse Attention was developed and published by DeepSeek. Zhipu AI integrated it into GLM-5 rather than designing a new sparse attention mechanism. This is normal in open model development, much as rotary position embeddings (RoPE) spread across the field after the RoFormer paper introduced them. The technical documentation identifies the model architecture as glm_moe_dsa, making the import explicit.
Async RL Training with SLIME
Standard reinforcement learning from human feedback (RLHF) operates in three serial phases: generate a batch of completions, score them with a reward model, update the policy weights. At 744B parameters, each phase takes significant wall-clock time and they wait on each other. GPU utilization during scoring is low because the generation step is not running. GPU utilization during generation is low because weight updates are not happening.
SLIME is Zhipu's custom asynchronous RL infrastructure that decouples these phases. Generation and training run in parallel on different parts of the cluster. While one batch of completions is being scored, the next batch is already generating. Policy updates happen asynchronously and get picked up by the generation workers on the next cycle. The result is higher GPU utilization across the training cluster and faster experiment iteration.
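The decoupling idea can be sketched as a bounded queue between a generation thread and a training thread. This is a generic producer-consumer illustration, not SLIME's actual implementation; reward scoring, staleness control, and weight broadcast are all omitted:

```python
import queue
import threading

rollouts = queue.Queue(maxsize=4)  # bounded buffer between the two phases
N_BATCHES = 8

def generator():
    # Generation workers keep producing completions without waiting
    # for the trainer to finish the previous batch.
    for step in range(N_BATCHES):
        completion = f"batch-{step}"   # stand-in for sampled completions
        rollouts.put(completion)       # blocks only if the trainer falls far behind
    rollouts.put(None)                 # sentinel: generation finished

trained = []

def trainer():
    # The trainer drains the queue as batches arrive, overlapping
    # with generation instead of alternating with it.
    while True:
        batch = rollouts.get()
        if batch is None:
            break
        trained.append(batch)          # stand-in for score + policy update

t_gen = threading.Thread(target=generator)
t_train = threading.Thread(target=trainer)
t_gen.start(); t_train.start()
t_gen.join(); t_train.join()
```

The payoff is utilization: neither side of the cluster idles while the other works, which is the property the text attributes to SLIME.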
The practical impact: Zhipu could run more post-training experiments at full 744B scale. The gap between GLM-4.7 (73.8% SWE-bench) and GLM-5 (77.8%) reflects both the larger pre-trained model and more effective post-training.
Benchmark Results
The three benchmarks that matter most for coding and agentic use cases: SWE-bench Verified, BrowseComp, and Terminal-Bench 2.0.
SWE-bench Verified: 77.8%
SWE-bench Verified presents real GitHub issues from open-source Python repositories. The model receives the issue description and the repository codebase. It must produce a code patch that, when applied, makes the failing tests pass. No human assistance during the task. The Verified variant is a 500-instance human-filtered subset that removes ambiguous or under-specified issues from the original dataset.
77.8% means GLM-5 autonomously resolved 389 out of 500 real software issues. The predecessor scored 73.8%. The improvement comes from both the larger model capacity and more targeted post-training on software engineering tasks.
BrowseComp: 62.0% (75.9% with context management)
BrowseComp tests multi-step web research. Tasks require navigating real websites, synthesizing information across multiple pages, and answering questions where the answer cannot come from model memory alone. The challenge is that relevant information is often buried and requires following links and reading page content.
GLM-5 scores 62.0% baseline and 75.9% with context management enabled. The Chinese-language variant (BrowseComp-ZH) scores 72.7%. For comparison, GLM-4.5 reportedly scored 54.0% on BrowseComp. The jump reflects better agent-loop behavior and improved tool use.
Terminal-Bench 2.0: 56.2%–60.7%
Terminal-Bench evaluates autonomous task completion in a real Linux terminal. The agent gets a goal and a shell. It must navigate the filesystem, run commands, install dependencies, debug failures, and complete multi-step system tasks without a GUI. This is closer to what coding agents face in CI/CD environments and server management tasks than isolated code generation.
GLM-5 scores 56.2% to 60.7% depending on configuration, up from GLM-4.7's 41.0%. That improvement of 15 to 20 points is among the largest single-generation jumps reported on this benchmark.
| Benchmark | GLM-5 | GLM-4.7 | Claude Opus 4.5 | GPT-5.2 |
|---|---|---|---|---|
| SWE-bench Verified | 77.8% | 73.8% | 76.8% | ~80% (est.) |
| BrowseComp | 62.0% | 52.0% | ~40% (est.) | N/A |
| Terminal-Bench 2.0 | 56.2–60.7% | 41.0% | N/A | N/A |
| GPQA Diamond | 86.0% | ~78% | ~83% | N/A |
| AIME 2026 I | 92.7% | N/A | N/A | N/A |
| Context window | 200K | 128K | 200K | 128K |
| Max output | 128K | 32K | 64K (est.) | 16K (est.) |
| License | MIT | MIT | Commercial API | Commercial API |
A note on comparisons
Estimates for Claude Opus 4.5 and GPT-5.2 reflect publicly available benchmark data. GLM-5 comparisons in the official model card show results against GLM-4.7, DeepSeek-V3.2, Kimi K2.5, Claude Opus 4.5, Gemini 3 Pro, and GPT-5.2. Exact competitor scores on some benchmarks are not published by the respective companies and some cells above are estimates based on available data.
Context and Output Limits
The context window is 202,752 tokens for reasoning tasks with tool use. The API exposes this as a 200K input limit. Maximum output is 128K tokens.
The 128K output limit is notable. Most models limit output to 8K or 16K tokens. Claude Opus 4.6 outputs up to 32K (though it often stops earlier). GPT-5.3's output limit is around 16K. GLM-5 at 128K can generate complete large codebases, extensive documentation, or long reasoning chains in a single response. The inference cost is real, but the capability is there.
For comparison: Gemini 3.1 Pro supports 1M context input, which is 5x GLM-5's limit. Claude 3.7 reached 200K. For most software engineering tasks, 200K is sufficient for an entire medium-sized codebase. The 1M context becomes relevant for very large monorepos or when you need to include extensive documentation alongside the code.
Where to Access GLM-5
Three access paths: managed API, third-party providers, and self-hosted weights.
Z.ai API
Official API at api.z.ai. Compatible with the OpenAI SDK via a custom base URL. $1.00/M input, $3.20/M output. Requires a GLM Coding Plan (Pro or Max) subscription. Chat interface at chat.z.ai.
Third-Party Providers
Available through 11 providers according to Artificial Analysis. Includes options with varying latency and pricing. 62 tokens/second output speed measured across providers. Check artificialanalysis.ai/models/glm-5 for current provider list.
Self-Hosted Weights
Download from zai-org/GLM-5 or zai-org/GLM-5-FP8 on Hugging Face. MIT license. Requires 8x H100 80GB minimum (tensor-parallel-size 8). Runs on vLLM, SGLang, KTransformers, and xLLM (Ascend NPU).
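A minimal launch command for the FP8 checkpoint, assuming a standard vLLM install; check the model card for the exact flags Zhipu recommends (the --max-model-len value here is an assumption based on the 200K context figure):

```shell
# Serve GLM-5 FP8 across 8 GPUs with vLLM's OpenAI-compatible server.
vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 200000
```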
Best access path today
For most developers: the Z.ai API with OpenAI SDK compatibility. Point your existing OpenAI SDK client at https://api.z.ai/api/paas/v4 and swap the model name to glm-5. No code changes beyond the base URL and API key. The third-party providers are useful if you need lower latency or are hitting Z.ai rate limits, but availability varies.
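The request shape is the standard OpenAI chat-completions format. The sketch below builds the payload with the standard library only, so it shows what any OpenAI-compatible client would send to the endpoint given above:

```python
import json

BASE_URL = "https://api.z.ai/api/paas/v4"  # Z.ai's OpenAI-compatible base URL

def build_glm5_request(prompt, max_tokens=4096):
    # Assemble the standard chat-completions request targeting glm-5.
    return {
        "url": f"{BASE_URL}/chat/completions",
        "payload": {
            "model": "glm-5",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
    }

req = build_glm5_request("Refactor this function to be iterative.")
body = json.dumps(req["payload"])  # the JSON body an SDK client would POST
```

With the official openai Python package, the equivalent is constructing the client with base_url="https://api.z.ai/api/paas/v4" and your Z.ai API key, then calling chat.completions.create(model="glm-5", ...) as usual.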
The Compute Constraint
GLM-5 at 744B parameters is one of the largest open-weight models available. This creates real deployment friction. The FP8 quantized weights still require 8 H100 80GB GPUs for local inference with vLLM or SGLang. That is on the order of $250,000 in GPU hardware at current H100 pricing, or $50+/hour in cloud compute. Self-hosting GLM-5 is not a weekend project.
At 62 tokens per second through API providers, a 128K-token output takes about 34 minutes of generation time. For batch processing or automated pipelines, this is workable. For interactive use where you are waiting on a response, that throughput turns long outputs into multi-minute waits.
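The arithmetic behind that figure:

```python
TOKENS_PER_SECOND = 62    # measured across providers, per the text above
OUTPUT_TOKENS = 128_000   # maximum output length

seconds = OUTPUT_TOKENS / TOKENS_PER_SECOND
minutes = seconds / 60
# minutes comes out just over 34: a maximum-length completion
# is a batch job, not a chat turn.
```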
Demand on Z.ai's managed API has consistently exceeded supply during peak hours since launch. Third-party providers have helped distribute load, but as of March 2026, GLM-5 is not as reliably available as Claude or GPT-5 during peak use periods. The model is real and the benchmarks are real, but treating it as production infrastructure requires planning for availability gaps.
What the compute constraint means for production use
If you are evaluating GLM-5 for a production coding agent: plan for API availability variability. Build retry logic and fallback routing to a secondary model. If you are self-hosting, budget for 8x H100 or equivalent Ascend NPU hardware. The model's performance justifies the infrastructure investment for teams doing high-volume software engineering automation, but it is not yet drop-in reliable for latency-sensitive interactive products.
Zhipu AI and the GLM Lineage
Zhipu AI (operating as Z.ai) was founded by researchers from Tsinghua University's Knowledge Engineering Group (KEG). The company is backed by Alibaba, Tencent, and Xiaomi, is valued at over $5 billion as of 2026, and has an IPO planned. The research team has published under the THUDM GitHub organization since 2020.
The GLM lineage:
GLM-130B (2022): one of the first large open bilingual models competitive with GPT-3.
ChatGLM (2023): Zhipu's first chat-focused fine-tune, which became widely used in China.
GLM-4 (2024): reached GPT-4-level performance on several benchmarks and added 128K context.
GLM-4.5 (early 2026): added integrated reasoning and agent capabilities.
GLM-5 (February 2026): scales to 744B and reaches SOTA on coding and agentic tasks.
The company claims 2.7 million developers use GLM models. The open release strategy (MIT license, downloadable weights) differs from most Chinese AI companies, which offer only API access. This has driven adoption outside China and made GLM-5 relevant to the global open-weights ecosystem.
Frequently Asked Questions
What is GLM-5?
GLM-5 is Zhipu AI's flagship language model, released February 11, 2026. It uses a Mixture of Experts architecture with 744 billion total parameters and 40 billion active per token. It scores 77.8% on SWE-bench Verified, supports 200K input context and 128K output tokens, and is released under the MIT license.
How many parameters does GLM-5 have?
GLM-5 has 744 billion total parameters with 40 billion active during inference. It is a Mixture of Experts model, so only a fraction of parameters activate for each input. The predecessor GLM-4.7 had 355 billion total with 32 billion active.
What is GLM-5's SWE-bench score?
GLM-5 scores 77.8% on SWE-bench Verified, the benchmark for autonomous code repair on real GitHub issues. This is up from GLM-4.7's 73.8%. Claude Opus 4.5 scored 76.8% on the same benchmark around the same release window.
What is BrowseComp and how does GLM-5 perform?
BrowseComp tests AI agents on multi-step web research tasks requiring navigation across multiple real pages, information synthesis, and answering questions that cannot come from model memory. GLM-5 scores 62.0% on BrowseComp-EN (75.9% with context management) and 72.7% on BrowseComp-ZH. GLM-4.5 previously scored 54.0%.
What is DeepSeek Sparse Attention in GLM-5?
DeepSeek Sparse Attention (DSA) replaces standard quadratic attention with sparse patterns, computing only important token pairs rather than all N² interactions. This reduces inference and training costs on long sequences, making the 200K context window practical. DSA was developed by DeepSeek and integrated into GLM-5's architecture.
Where can I access GLM-5?
Via Z.ai's API at api.z.ai (OpenAI-compatible SDK), through 11 third-party providers listed on artificialanalysis.ai, or via self-hosted weights at zai-org/GLM-5 on Hugging Face (MIT license, requires 8x H100 80GB minimum). Input: $1.00/M tokens. Output: $3.20/M tokens.
GLM-5 generates code. Morph applies it.
GLM-5 at 77.8% SWE-bench produces edit instructions fast. Morph's Fast Apply model applies those edits to your codebase at 10,500+ tokens per second, deterministically. The generation side is solved. The application side needs to keep up.