The Top 10 Coding LLMs, Ranked (March 2026)
Multiple models now resolve 80%+ of real GitHub issues on SWE-bench Verified. A year ago, the top score was 65%. The frontier has compressed: picking between the top three is a question of cost, speed, and workflow fit, not raw capability. On Aider's polyglot benchmark (225 exercises across C++, Go, Java, JavaScript, Python, and Rust), GPT-5 leads at 88%, followed by o3-pro at 84.9% and Gemini 2.5 Pro at 83.1%.
| Rank | Model | SWE-bench Verified | Aider Polyglot | Input / Output (per MTok) | Context |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 81.4% | 72.0%* | $5 / $25 | 200K (1M beta) |
| 2 | Gemini 3.1 Pro | 80.6% | 83.1% | $2 / $12 | 1M |
| 3 | GPT-5 (high) | ~80% | 88.0% | $2.50 / $15 | 1M |
| 4 | MiniMax M2.5 (open) | 80.2% | N/A | $0.30 / $1.20 | 1M |
| 5 | Claude Sonnet 4.6 | ~79% | ~65%* | $3 / $15 | 200K (1M beta) |
| 6 | DeepSeek V3.2 (open) | 74.2%** | 74.2% | $0.27 / $1.00 | 131K |
| 7 | Kimi K2.5 (open) | 76.8% | 59.1% | $0.35 / $1.40 | 262K |
| 8 | GPT-4.1 | ~72% | 52.4% | $2 / $8 | 1M |
| 9 | Qwen3-Coder-Next | 70%+ | N/A | $0.12 / $0.75 | 262K |
| 10 | Gemini 2.5 Flash | ~70% | 55.1% | $0.30 / $2.50 | 1M |
Reading the table
*Asterisked Aider scores are from earlier Claude models, not the 4.6 releases: Aider's leaderboard was last updated November 2025 and does not include the newest Claude 4.6 models. **DeepSeek V3.2's Aider score is from the "Reasoner" variant. SWE-bench Verified scores come from official model announcements. We included every model scoring above 70% on SWE-bench Verified or offering meaningful differentiation (price, speed, open weights). HumanEval is excluded: five models score 95%+, making it useless for ranking.
Methodology: How We Evaluated
No single benchmark captures coding ability. We weight four benchmarks and two operational metrics:
SWE-bench Verified (40% weight)
Tasks the model with resolving real GitHub issues from popular Python repositories: 500 human-validated instances. The closest proxy to actual software engineering work. Opus 4.6 leads at 81.4%.
Aider Polyglot (20% weight)
225 Exercism exercises across C++, Go, Java, JavaScript, Python, and Rust. Tests multi-language code editing, not just Python. GPT-5 (high) leads at 88.0%, with DeepSeek V3.2 at 74.2% for 1/50th the cost.
LiveCodeBench (20% weight)
Continuously sources fresh competitive programming problems from LeetCode, AtCoder, and CodeForces. No contamination risk. Gemini 3.1 Pro leads at 2,887 Elo. Tests algorithmic reasoning on problems the model has never seen.
Terminal-Bench 2.0 (20% weight)
Measures agentic terminal task completion: running commands, parsing output, iterating. Gemini 3.1 Pro scores 68.5%. Tests how models behave in real developer workflows, not isolated code generation.
Beyond benchmarks, we factor in pricing (cost per million tokens), context window (effective working memory), and output speed (tokens per second). A model that scores 2% higher but costs 10x more and runs 3x slower is not the better choice for most teams.
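To make the weighting concrete, here is a minimal sketch of how a composite score like ours can be computed. The weights match the methodology above; the normalization (scaling each benchmark to its column leader so Elo and percentages are comparable) is one reasonable choice rather than our exact formula, and the Opus figures for LiveCodeBench and Terminal-Bench below are placeholders, not published scores.

```python
# Illustrative sketch of the benchmark weighting described above.
# Weights match the methodology; some per-model numbers are placeholders.

WEIGHTS = {
    "swe_bench_verified": 0.40,
    "aider_polyglot": 0.20,
    "livecodebench": 0.20,
    "terminal_bench": 0.20,
}

MODELS = {
    "claude-opus-4.6": {"swe_bench_verified": 81.4, "aider_polyglot": 72.0,
                        "livecodebench": 2810,   # placeholder Elo
                        "terminal_bench": 65.0}, # placeholder %
    "gemini-3.1-pro":  {"swe_bench_verified": 80.6, "aider_polyglot": 83.1,
                        "livecodebench": 2887, "terminal_bench": 68.5},
}

def composite(models, weights):
    """Normalize each benchmark to the best score in that column,
    then take the weighted sum, so Elo and percentages are comparable."""
    best = {b: max(m[b] for m in models.values()) for b in weights}
    return {
        name: sum(weights[b] * scores[b] / best[b] for b in weights)
        for name, scores in models.items()
    }

for name, score in sorted(composite(MODELS, WEIGHTS).items(),
                          key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```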
Best LLM for Code Generation
Code generation is the most common use case: you describe what you want, the model writes it. For isolated function generation, benchmark saturation means most frontier models perform similarly. The differences emerge in complex, multi-step generation.
Winner: Claude Opus 4.6
Opus 4.6 scores 81.4% on SWE-bench Verified with 128K max output tokens. It excels at understanding ambiguous intent: when a prompt is underspecified, Opus infers the right approach more often than GPT or Gemini. Anthropic reports that users preferred Sonnet 4.6 over the older Opus 4.5 59% of the time, calling Sonnet "less prone to overengineering." Opus 4.6 corrects that weakness: it reads more context before modifying code and consolidates shared logic instead of duplicating it.
The tradeoff is cost. At $5/$25 per MTok, Opus is the most expensive frontier model. For high-volume code generation pipelines, this adds up. But it also supports 1M context in beta, matching Gemini's working memory.
Value Pick: Claude Sonnet 4.6
Sonnet 4.6 is preferred over Opus 4.5 by 70% of Claude Code users, at $3/$15 per MTok. It produces notably polished frontend code with better layouts and animations than previous Claude models. For most code generation tasks, the gap between Sonnet 4.6 and Opus 4.6 is undetectable. Default to Sonnet unless you are hitting limits on complex multi-file reasoning.
Budget Pick: DeepSeek V3.2
At $0.27/$1.00 per MTok, DeepSeek V3.2 scores 74.2% on Aider's polyglot benchmark. On that same benchmark, it costs $1.30 total vs $65.75 for Opus 4. For boilerplate generation, CRUD endpoints, and routine coding tasks, it delivers 90% of frontier quality at 1/50th the cost. The 131K context window limits it for large codebase work.
Best LLM for Code Review and Debugging
Code review requires understanding existing code, identifying bugs, and explaining issues clearly. This tests reading comprehension more than generation.
Winner: GPT-5.3-Codex
GPT-5.3-Codex was built for terminal-native workflows. SemiAnalysis describes its 2.93x faster inference as the result of hardware/software co-design. At ~240 tokens/second, it returns code review feedback almost instantly. The Codex CLI experience, with context compaction and long-running task support, makes GPT-5.x feel stronger in practice than isolated benchmarks suggest.
On Aider's polyglot benchmark, GPT-5 (high reasoning) scores 88.0%, 16 points above Opus 4. The cost: $29.08 for the full benchmark run vs $65.75 for Opus 4. Faster and cheaper.
Runner-up: Claude Opus 4.6
Where Opus excels over Codex in code review: explanations. It provides more detailed, contextual reasoning about why code is problematic. Anthropic specifically highlights Opus 4.6's improved root cause analysis and its ability to catch its own mistakes. For reviewing code from junior developers or unfamiliar codebases, this depth matters more than raw speed.
Best LLM for Multi-File Refactoring
Multi-file changes are the hardest coding task for LLMs. The model must understand how files relate, plan coordinated changes, and avoid breaking interfaces between components. This is where the gap between frontier and mid-tier models is widest.
Winner: Claude Opus 4.6
Opus 4.6 with WarpGrep v2 hit 57.5% on SWE-bench Pro. Without WarpGrep, Opus 4.6 scores 55.4%. The difference: WarpGrep gives the model semantic search over the codebase, so it finds the right files to edit instead of guessing.
Opus 4.6 also earned $3,050.53 more than Opus 4.5 on Vending-Bench 2, a measure of sustained agentic performance. For large refactoring jobs that span dozens of files over hours, this sustained coherence matters. METR's measurements show autonomous task horizons doubling every 4-7 months, and Opus 4.6 is the current leader on that metric.
Runner-up: Gemini 3.1 Pro
Gemini 3.1 Pro scores 80.6% on SWE-bench Verified and 68.5% on Terminal-Bench 2.0, with a native 1M context window. The context window advantage is real: for large codebases, Gemini can hold more of the project in working memory without context compaction. At $2/$12 per MTok, it is 2.5x cheaper than Opus on input and 2x cheaper on output.
API Pricing Comparison
LLM API prices dropped roughly 80% from 2025 to 2026. The range is still wide: Qwen3-Coder-Next costs $0.12/MTok input, Claude Opus 4.6 costs $5.00. That is a 42x spread.
| Model | Input (per MTok) | Output (per MTok) | Batch Discount | Cache Savings |
|---|---|---|---|---|
| Qwen3-Coder-Next | $0.12 | $0.75 | N/A | N/A |
| DeepSeek V3.2 | $0.27 | $1.00 | N/A | 90% (cache hits) |
| Gemini 2.5 Flash | $0.30 | $2.50 | 50% | Context caching |
| MiniMax M2.5 | $0.30 | $1.20 | N/A | N/A |
| Kimi K2.5 | $0.35 | $1.40 | N/A | N/A |
| Claude Haiku 4.5 | $1.00 | $5.00 | 50% | 90% |
| Gemini 3.1 Pro | $2.00 | $12.00 | 50% | Context caching |
| GPT-4.1 | $2.00 | $8.00 | 50%+ | Automatic |
| GPT-5 (high) | $2.50 | $15.00 | 50%+ | Automatic |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 50% | 90% |
| Claude Opus 4.6 | $5.00 | $25.00 | 50% | 90% |
Cost in Practice
A typical coding agent session generates 100-200K tokens of context (file reads, tool outputs, conversation history). At those volumes:
- DeepSeek V3.2: ~$0.03-0.07 per session
- Claude Sonnet 4.6: ~$0.30-0.60 per session
- Claude Opus 4.6: ~$0.50-1.25 per session
- With prompt caching: 60-90% lower for repeated context
For teams running hundreds of coding agent sessions daily, the budget models save thousands per month. For occasional complex refactoring where accuracy is critical, paying for Opus makes sense. The emerging pattern: route simple tasks to cheap models, complex tasks to expensive ones.
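For a back-of-the-envelope check on these numbers, the sketch below reproduces the per-session math from the list prices above. It assumes roughly 150K input tokens and 10K output tokens per session and models prompt caching as a flat discount on cached input; real provider billing differs in the details.

```python
# Rough per-session cost estimator using the list prices above.
# Assumes a given split of input vs. output tokens and treats the
# cache discount as a flat percentage off cached input tokens.

PRICES = {  # (input $/MTok, output $/MTok)
    "deepseek-v3.2":     (0.27, 1.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-opus-4.6":   (5.00, 25.00),
}

def session_cost(model, input_tokens, output_tokens,
                 cached_fraction=0.0, cache_discount=0.9):
    """Cost in dollars for one agent session."""
    in_price, out_price = PRICES[model]
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    return (fresh * in_price
            + cached * in_price * (1 - cache_discount)
            + output_tokens * out_price) / 1_000_000

# 150K tokens of context plus 10K tokens of generated code.
for model in PRICES:
    print(model, round(session_cost(model, 150_000, 10_000), 2))
```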
Context Window Comparison
Context window size determines how much of your codebase the model can "see" at once. For coding agents that read files, run commands, and accumulate conversation history, context exhaustion is the primary failure mode.
| Model | Context Window | Max Output | Notes |
|---|---|---|---|
| GPT-5 (high) | 1M tokens | 128K | 1.05M context, largest output window |
| GPT-4.1 | 1M tokens | 32K | Largest stable context in production |
| Gemini 3.1 Pro | 1M tokens | 64K | Native 1M, no beta flag needed |
| Gemini 2.5 Flash | 1M tokens | 64K | Budget 1M option |
| MiniMax M2.5 | 1M tokens | N/A | Open-weight 1M context |
| Kimi K2.5 | 262K tokens | N/A | Open-weight, natively multimodal |
| Claude Opus 4.6 | 200K (1M beta) | 128K | 1M requires beta header, 2x input cost above 200K |
| Claude Sonnet 4.6 | 200K (1M beta) | 64K | Same 1M beta, 2x input cost above 200K |
| DeepSeek V3.2 | 131K tokens | 64K | Thinking mode supports 64K output |
| Qwen3-Coder-Next | 262K tokens | N/A | Runs locally on 48GB Mac |
Why Context Windows Matter for Coding
A medium-sized codebase exploration (reading 20 files, running tests, parsing output) easily generates 100-150K tokens of context. Models with 128K windows hit their limit partway through. Models with 200K windows handle most sessions. Models with 1M windows handle the largest monorepos without context compaction.
But there is a caveat: performance degrades before the advertised limit. A model claiming 200K tokens typically becomes unreliable around 130K, with sudden accuracy drops. Gemini and GPT-4.1's 1M windows have been tested more extensively at scale and show more graceful degradation.
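One practical consequence: budget sessions against an effective window, not the advertised one. The sketch below applies a rule of thumb generalized from the observation above (reliability drops around 65% of the advertised limit, i.e. roughly 130K for a 200K model); the exact fraction varies by model and is a heuristic, not a measured constant.

```python
# Rough context-budget check for an agent session. The 0.65 factor is a
# heuristic generalizing "200K-class models get shaky around ~130K".

EFFECTIVE_FRACTION = 0.65  # heuristic, not a measured constant

def fits_in_context(advertised_window, file_tokens, tool_output_tokens,
                    history_tokens, safety_margin=10_000):
    budget = advertised_window * EFFECTIVE_FRACTION - safety_margin
    needed = file_tokens + tool_output_tokens + history_tokens
    return needed <= budget, needed, int(budget)

# 20 files at ~4K tokens each, 40K of test/tool output, 30K of history.
ok, needed, budget = fits_in_context(200_000, 20 * 4_000, 40_000, 30_000)
print(f"needed={needed:,} budget={budget:,} fits={ok}")
```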
Speed and Latency
For coding agents, output speed determines how fast the model iterates. A model generating 150 tok/s returns a 500-token function in 3.3 seconds. At 42 tok/s, it takes 12 seconds. Over hundreds of iterations in an agent loop, this compounds into minutes of wall-clock difference.
| Model | Output Speed | Time to First Token | Notes |
|---|---|---|---|
| Gemini 3.1 Flash-Lite | ~287 tok/s | 0.36s | Fastest reasoning model overall |
| GPT-5.3-Codex | ~240 tok/s | <1s | 2.93x faster inference (SemiAnalysis) |
| Gemini 2.5 Flash | ~150 tok/s | <0.5s | Strong speed/quality balance |
| Claude Sonnet 4.6 | ~77 tok/s | ~1.5s | Mid-range, best for most workflows |
| DeepSeek V3.2 | ~60 tok/s | ~1s | Good for the price |
| Claude Opus 4.6 | ~42 tok/s | ~2s | Slowest frontier, deepest reasoning |
| DeepSeek R1 | ~41 tok/s | ~1.5s | Reasoning mode slows output |
Speed matters most in agentic loops where the model makes dozens or hundreds of calls. For single-shot code generation, the difference between 42 and 150 tok/s is a few seconds. For a 50-turn agent session, it is 5-10 minutes of wall-clock time. SemiAnalysis attributes GPT-5.3-Codex's 2.93x speed advantage to hardware/software co-design, not just model architecture.
The practical implication: when you need to iterate rapidly (debugging, exploring approaches, running test loops), fast models like GPT-5.3-Codex or Gemini 2.5 Flash save real hours. When you need to get a complex multi-file change right on the first attempt, Opus 4.6's deeper reasoning is worth the latency.
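The arithmetic behind that claim is simple. The sketch below estimates wall-clock time per agent loop as time-to-first-token plus output tokens divided by throughput, using ~500 output tokens per turn as an illustrative assumption.

```python
# Back-of-the-envelope wall-clock estimate for an agent loop:
# per call, time ≈ time-to-first-token + output_tokens / tokens_per_second.

def loop_time(turns, tokens_per_turn, tok_per_s, ttft_s):
    per_call = ttft_s + tokens_per_turn / tok_per_s
    return turns * per_call

# 50-turn session, ~500 output tokens per turn (illustrative numbers).
fast = loop_time(50, 500, 240, 1.0)   # GPT-5.3-Codex-class speed
slow = loop_time(50, 500, 42, 2.0)    # Opus 4.6-class speed
print(f"fast: {fast/60:.1f} min, slow: {slow/60:.1f} min")
```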
Open Weight vs Closed Source for Coding
The capability gap between open-weight and closed-source models has largely closed. MiniMax M2.5 (open) scores 80.2% on SWE-bench Verified, matching closed-source frontier models. DeepSeek V3.2 scores 74.2% on Aider's polyglot benchmark, 2 points above Opus 4, at 1/50th the cost. The deployment tradeoffs remain significant.
Open-Weight Advantages
Self-host for data privacy. No API rate limits. Fine-tune on proprietary code. No vendor lock-in. DeepSeek V3.2 scores 74.2% on Aider polyglot for $1.30 total. Opus 4 costs $65.75 for 72%. Similar accuracy, 50x cheaper.
Closed-Source Advantages
No infrastructure overhead. Automatic scaling. Better tool use and instruction following (for now). Claude and GPT have more mature agent ecosystems. Prompt caching and batch APIs reduce costs without ops burden.
Best Open-Weight Models for Coding
| Model | Parameters | Aider Polyglot (or noted benchmark) | License | Minimum Hardware |
|---|---|---|---|---|
| MiniMax M2.5 | MoE | N/A (80.2% SWE-bench) | Open weight | API or cloud GPU |
| DeepSeek V3.2 | 685B (37B active) | 74.2% ($1.30 total) | MIT | Multi-GPU or API |
| Kimi K2.5 | 1T MoE (32B active) | 59.1% ($1.24 total) | Open weight | Multi-GPU cluster |
| DeepSeek R1 | 671B MoE | 71.4% ($4.80 total) | MIT | Multi-GPU or API |
| Qwen3-Coder-Next | 80B MoE (3B active) | SWE-rebench #1 | Apache 2.0 | 48GB Mac (Q4) |
Qwen3-Coder-Next deserves special attention. It hit #1 on SWE-rebench Pass@5 at 64.6%, beating every closed model, with only 3B active parameters. It runs on a 48GB Mac at Q4 quantization. For developers who want a local coding model, this is the current best option.
How Subagents Improve Any Base LLM
The biggest finding from our production testing: the agent scaffold matters more than the model. Cognition measured that coding agents spend 60% of their time searching for code, not writing it. Anthropic reported a 90% improvement from multi-agent architectures over single-agent approaches.
Two bottlenecks limit every coding LLM. Search accuracy: the model needs to find the right code to modify. Apply speed: the model needs to merge edits into existing files without corruption. Specialized subagents handle both better than the base LLM alone.
WarpGrep: Semantic Code Search
An MCP server that indexes your codebase and finds code by meaning, not keywords. It runs up to 8 parallel tool calls per turn across 4 turns, completing a search in under 6 seconds. On SWE-bench Pro, Opus 4.6 + WarpGrep scores 57.5% vs 55.4% without it, a +2.1 point improvement.
Fast Apply: 10,500 tok/s Code Merging
A 7B model trained on code merging. It takes an original file and an edit snippet and returns the merged result at 10,500 tokens/second with 98% accuracy. For comparison, Claude's search-replace editing hits 86% accuracy at 50-100 tok/s and Cursor's apply model hits 85% at 1,000 tok/s.
These subagents work with any base LLM. Pair WarpGrep with Gemini 3.1 Pro for cost-efficient large-codebase work. Pair Fast Apply with Claude Sonnet 4.6 for high-throughput code editing. The base model handles reasoning; the subagents handle the mechanical work that models are overqualified for.
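If you want to wire this up yourself, the shape of the loop looks something like the sketch below. The `semantic_search` and `fast_apply` callables are hypothetical placeholders for whatever search and apply services you connect (for example, an MCP tool); they are not WarpGrep's or Fast Apply's actual APIs.

```python
# Hypothetical sketch of the division of labor described above: the base
# LLM plans and reasons, while specialized subagents handle code search
# and edit application. `semantic_search` and `fast_apply` are placeholder
# callables, not real APIs.

from typing import Callable

def run_edit_task(task: str,
                  plan_with_llm: Callable[[str, list[str]], list[dict]],
                  semantic_search: Callable[[str], list[str]],
                  fast_apply: Callable[[str, str], str]) -> dict[str, str]:
    """Plan with the base model, locate files via the search subagent,
    and merge each proposed edit via the apply subagent."""
    relevant_files = semantic_search(task)        # search subagent
    edits = plan_with_llm(task, relevant_files)   # base model reasoning
    merged = {}
    for edit in edits:
        original = open(edit["path"]).read()
        merged[edit["path"]] = fast_apply(original, edit["snippet"])
    return merged
```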
Frequently Asked Questions
What is the best LLM for coding in 2026?
Claude Opus 4.6 leads SWE-bench Verified at 81.4%. GPT-5 leads Aider's polyglot benchmark at 88% across six languages. Gemini 3.1 Pro scores 80.6% on SWE-bench with a native 1M context window at $2/$12 per MTok. The best choice depends on your priorities: Opus for complex multi-file reasoning, GPT-5 for speed and breadth across languages, Gemini for cost efficiency with large context. For most teams, Claude Sonnet 4.6 at $3/$15 per MTok offers 95% of Opus performance at 60% of the cost.
Is Claude or GPT better for coding?
Claude Opus 4.6 excels at complex reasoning, multi-file refactoring, and understanding ambiguous developer intent. GPT-5 leads Aider's polyglot benchmark (88% vs ~72% for Opus 4) and runs 2.5x faster. Most productive developers use both: Claude for depth-first problems, GPT for speed-first iteration. The agent scaffold and tooling around the model matters more than the model weights.
What is the best open-source LLM for coding?
MiniMax M2.5 leads open-weight models on SWE-bench Verified at 80.2%, matching closed-source frontier models. DeepSeek V3.2 scores 74.2% on Aider's polyglot benchmark at $0.27/$1.00 per MTok, costing $1.30 for the full run vs $65.75 for Opus 4. For local deployment, Qwen3-Coder-Next (80B MoE, 3B active) fits on a 48GB Mac and hit #1 on SWE-rebench Pass@5 at 64.6%.
How much does it cost to use LLMs for coding?
Costs range from $0.12/MTok input (Qwen3-Coder-Next via API) to $5/MTok (Claude Opus 4.6). A typical coding session generates 50-200K tokens of context. DeepSeek V3.2 costs roughly $0.03-0.07 per session. Claude Sonnet 4.6 costs about $0.30-0.60 per session. Prompt caching reduces costs up to 90% for repeated context.
Does context window size matter for coding LLMs?
Yes. A typical codebase exploration generates 50-200K tokens. Models with 128K windows hit their limit partway through. Models with 200K handle most sessions. Models with 1M (Gemini, GPT-5, GPT-4.1) handle the largest monorepos. But performance degrades before the advertised limit: a 200K model typically becomes unreliable around 130K.
What benchmarks matter for coding LLMs?
SWE-bench Verified tests real-world software engineering by resolving actual GitHub issues. Aider's polyglot benchmark tests code editing across six languages with 225 Exercism exercises. LiveCodeBench evaluates competitive programming with fresh problems (no contamination risk). Terminal-Bench 2.0 measures agentic terminal tasks. HumanEval is saturated (top models score 95%+) and no longer differentiates frontier models.
Related Reading
Make Any Coding LLM Better with Subagents
WarpGrep gives your coding agent semantic search over your codebase. Fast Apply merges edits at 10,500 tok/s with 98% accuracy. Both work with Claude, GPT, Gemini, or any model via API.