February 2026 delivered three frontier open-source coding models in a single month. The gap between open-weight and proprietary models has compressed to single-digit percentages. This is the complete comparison, with real benchmarks, VRAM requirements, and API pricing.
The February 2026 Wave
Three things happened in February 2026 that changed the open-source coding model landscape:
- MiniMax M2.5 hit 80.2% on SWE-bench Verified, within 0.6 points of Claude Opus 4.6 (80.8%). A 229B MoE model with only 10B active parameters.
- GLM-5 from Zhipu AI shipped a 744B model under MIT license, scoring 77.8% SWE-bench Verified and 94.2% HumanEval. Trained entirely on Huawei Ascend chips.
- Qwen3-Coder-Next proved you could hit 70.6% SWE-bench Verified with 3B active parameters from an 80B MoE, running on a single machine with 46GB memory.
Add Kimi K2.5 (76.8%, launched late January), DeepSeek V3.2 (73%, late 2025), and the original Qwen3-Coder 480B (67-70%), and the open-source tier now has six models that would have led all benchmarks just 12 months ago.
The contamination caveat
OpenAI has stopped reporting SWE-bench Verified scores after confirming that all frontier models show training data contamination on the dataset. The numbers below are still useful for relative comparison, but absolute percentages should be taken with appropriate skepticism. SWE-Bench Pro (1,865 multi-language tasks) is the better benchmark for production readiness.
Master Comparison: Open-Source Coding Models (March 2026)
| Model | Total / Active Params | SWE-bench Verified | Context | License |
|---|---|---|---|---|
| MiniMax M2.5 | 229B / 10B | 80.2% | 200K | Modified MIT |
| GLM-5 | 744B / 40B | 77.8% | 204K | MIT |
| Kimi K2.5 | 1T / 32B | 76.8% | 256K | Modified MIT |
| DeepSeek V3.2 | 685B / ~37B | 73.0% | 128K | MIT |
| Qwen3-Coder-Next | 80B / 3B | 70.6% | 256K (1M ext.) | Apache 2.0 |
| Qwen3-Coder 480B | 480B / 35B | 67-70% | 256K (1M ext.) | Apache 2.0 |
| Codestral 25.08 | 22B / 22B (dense) | ~40%* | 256K | Non-production |
| StarCoder 2 15B | 15B / 15B (dense) | ~30%* | 16K | BigCode OpenRAIL-M |
* Estimated from HumanEval/MBPP, not official SWE-bench submissions. SWE-bench Verified scores are self-reported by model providers using varying scaffolds.
| Model | HumanEval | LiveCodeBench | SWE-bench Pro (SEAL) | VRAM (FP16) |
|---|---|---|---|---|
| MiniMax M2.5 | ~92% | ~65% | 36.8% (as M2.1) | 457GB |
| GLM-5 | 94.2% | 52.0% | 9.7% (as GLM-4.6) | ~1.5TB |
| Kimi K2.5 | ~90% | 85.0% | 27.7% (K2 Instruct) | ~2TB |
| DeepSeek V3.2 | ~92% | ~70% | 15.6% | ~1.4TB |
| Qwen3-Coder-Next | ~85% | ~60% | ~35% (est.) | 46GB |
| Qwen3-Coder 480B | ~90% | ~62% | 38.7% | ~960GB |
VRAM figures are for FP16 inference. Quantized deployments cut requirements by 50-75%. SWE-Bench Pro SEAL scores use standardized scaffolding with a 250-turn limit.
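The memory figures above follow a simple rule of thumb: weight footprint ≈ total parameters × bits per parameter ÷ 8, ignoring KV cache and activation overhead. A minimal sketch (the per-parameter bit widths and the ~3.5-bit effective rate for a 3-bit GGUF are approximations, not vendor specs):

```python
def weight_footprint_gb(total_params_b: float, bits_per_param: float) -> float:
    """Rough weight-only memory footprint in GB (ignores KV cache and activations)."""
    bytes_total = total_params_b * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# FP16 (16 bits/param) roughly reproduces the table's figures:
print(round(weight_footprint_gb(229, 16)))   # MiniMax M2.5 -> ~458 GB
print(round(weight_footprint_gb(744, 16)))   # GLM-5 -> ~1488 GB (~1.5 TB)

# A ~3.5-bit effective rate (3-bit weights plus per-block scales) lands near
# the ~101GB Unsloth GGUF size reported for M2.5:
print(round(weight_footprint_gb(229, 3.5)))  # -> ~100 GB
```

This is why "quantization cuts requirements by 50-75%" holds: the footprint scales linearly with bits per parameter.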
MiniMax M2.5: Highest SWE-bench Score Among Open Weights
MiniMax M2.5 is the open-weight model closest to proprietary frontier performance. At 80.2% on SWE-bench Verified, it sits 0.6 points behind Claude Opus 4.6 (80.8%) and ahead of GPT-5.2 (80.0%).
The architecture is a 229B MoE with 10B active parameters per token. Average end-to-end runtime per SWE-bench task dropped from 31.3 minutes (M2.1) to 22.8 minutes, on par with Claude Opus 4.6's 22.9 minutes.
MiniMax M2.5 also leads Opus 4.6 by over 13 points on the Berkeley Function Calling Leaderboard (76.8% vs ~63%), making it strong for agentic workflows that require tool calling.
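Tool-calling benchmarks like BFCL measure how reliably a model emits well-formed calls against function schemas. On OpenAI-compatible hosted endpoints, a tool definition typically looks like the sketch below; the `run_tests` function and its parameters are hypothetical, not part of any model's API:

```python
# Hypothetical tool definition in the OpenAI-compatible "tools" format.
# A list of these is passed as the `tools` parameter of a chat completion request.
run_tests_tool = {
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical function exposed by your agent
        "description": "Run the project's test suite and return failing cases.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "File or directory to test."}
            },
            "required": ["path"],
        },
    },
}
```

A model that scores well on BFCL reliably produces calls whose names and arguments validate against schemas like this one, which is what "strong for agentic workflows" means in practice.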
Hardware Requirements
Full FP16 inference requires ~457GB VRAM (2x NVIDIA B200 or 4x H100). With Unsloth dynamic 3-bit GGUF quantization, size drops to ~101GB. Not a model you run on a laptop, but deployable on a single multi-GPU node.
Real-world adoption
MiniMax reports that M2.5-generated code accounts for 80% of newly committed code in their internal development. The model also excels at office productivity tasks, achieving a 59% average win rate against mainstream models on Word, PowerPoint, and Excel financial-modeling tasks.
GLM-5 (Zhipu AI): MIT License, 744B Parameters, Trained on Ascend
GLM-5 is the largest open-source coding model available. At 744B total parameters (40B active), trained on 28.5 trillion tokens, it represents Zhipu AI's fifth-generation model. The headline: it was trained entirely on Huawei Ascend chips, not NVIDIA GPUs.
On SWE-bench Verified, GLM-5 hits 77.8%, the highest score for any MIT-licensed model. HumanEval Pass@1 at 94.2% slightly edges Claude Opus 4.5 (93.8%). The model supports 204,800 tokens of context and can generate up to 128,000 tokens in a single output.
Tradeoffs
GLM-5's LiveCodeBench score (52.0%) is notably lower than its SWE-bench performance. Its predecessor GLM-4.7 actually scored higher on LiveCodeBench (84.9%), suggesting GLM-5 was optimized for different capabilities. If your workload involves competitive programming or algorithm-heavy tasks, GLM-4.7 may still be the better Zhipu model.
Hardware Requirements
Full FP16 deployment requires ~1.5TB VRAM. The recommended setup is 8x H200 (1,128GB total, sufficient for FP8 weights). Minimum viable for FP16: 8x NVIDIA B200. GLM-5 integrates DeepSeek Sparse Attention for efficient long-context inference, but this is still a model that requires serious infrastructure.
Qwen3-Coder-Next: 70.6% SWE-bench with 3B Active Parameters
Qwen3-Coder-Next is the efficiency story of 2026. An 80B MoE model that activates only 3B parameters per token (technically 3.9B), achieving 70.6% on SWE-bench Verified. That puts it ahead of DeepSeek V3 (70.2%) while using 10-20x fewer active parameters.
The architecture uses 512 experts with 10 selected per token, plus a shared expert. Hybrid attention combines Gated DeltaNet and Gated Attention for efficient context modeling. The result: a model that runs on 46GB of unified memory (combined RAM + VRAM).
Running Locally
With 2-bit quantization via Unsloth, you need about 30GB. The 30B variant (Qwen3-Coder Flash) runs on 18GB for 6+ tokens/second. Both variants support 256K context natively, extendable to 1M tokens using YaRN.
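Context extension with YaRN is usually configured through the model's rope-scaling settings rather than retraining. The sketch below shows what the commonly documented `rope_scaling` entry looks like for a 4x extension from 256K toward 1M; the exact keys vary by model and framework version, so treat this as illustrative and check the model card:

```python
# Sketch of a YaRN rope-scaling entry as it commonly appears in config.json.
# Keys and values are illustrative; consult the model card before using.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                               # scale native context by 4x
    "original_max_position_embeddings": 262144,  # 256K native context
}
extended_context = int(
    rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"]
)
print(extended_context)  # 1,048,576 tokens (~1M)
```

YaRN trades some short-context fidelity for the longer window, which is why 256K stays the default and 1M is opt-in.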
On SWE-bench Multilingual, it scores 62.8%. On SWE-bench Pro, 44.3%. On Terminal-Bench 2.0, 36.2%. On SecCodeBench, 61.2%. These are strong numbers for a model designed to run on consumer hardware.
Best for local development
If you want to run a coding model locally without a GPU cluster, Qwen3-Coder-Next is the clear pick. A MacBook Pro with 48GB unified memory handles it. The 30B Flash variant works on 18GB VRAM (RTX 4090 or M2 Pro). No other model in this tier approaches this efficiency.
Qwen3-Coder 480B: The Full-Size Agentic Coder
Qwen3-Coder 480B is the full-size sibling. A 480B MoE with 35B active parameters, released under Apache 2.0 license. It was the first model in the Qwen3-Coder family, launched January 31, 2025.
On SWE-bench Verified, it scores 67.0% with standard scaffolding and 69.6% with OpenHands at 500 turns. On SWE-Bench Pro SEAL, it hits 38.7%, placing 8th overall and 2nd among open-source models behind MiniMax M2.1.
The 256K native context (extendable to 1M) supports full repository-scale operations. This is the model to run if you need maximum agentic coding capability from an open-source model and have the infrastructure for it (~960GB VRAM for FP16).
Kimi K2.5 (Moonshot AI): Visual Coding + LiveCodeBench Leader
Kimi K2.5 is Moonshot AI's open-weight model, released January 27, 2026. It is a 1-trillion-parameter MoE with 32B active per token. The differentiator: native multimodal capability with strong visual coding support.
On SWE-bench Verified, K2.5 scores 76.8%. But the standout metric is LiveCodeBench at 85.0%, significantly ahead of every other open-source model (and most proprietary ones, since Claude Opus 4.5 scores 64.0% on the same benchmark).
K2.5 was built through continual pretraining on ~15 trillion mixed visual and text tokens. It supports image-to-code, video-to-code, and visual debugging workflows. The companion tool, Kimi Code, integrates with VS Code, Cursor, and Zed.
License and Deployment
Released under modified MIT license. Weights are on Hugging Face. Supported by vLLM, SGLang, and TensorRT-LLM. The 1T total parameter count means full deployment requires substantial infrastructure (~2TB VRAM for FP16).
DeepSeek V3.2: Reasoning Depth and IOI Gold
DeepSeek V3.2 shipped in late 2025 and remains competitive. A 685B MoE under MIT license, it scores 73% on SWE-bench Verified. The high-compute variant (V3.2-Speciale) surpasses GPT-5 on reasoning benchmarks and achieved gold-medal performance at both the 2025 International Mathematical Olympiad and International Olympiad in Informatics.
V3.2 introduced DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism that reduces training and inference cost while maintaining output quality. The 128K context window is the smallest among the top-tier models here, which can be limiting for large codebases.
On SWE-Bench Pro SEAL, DeepSeek V3.2 scores 15.6%, well below the top models. This gap between Verified and Pro scores suggests the model benefits more from data familiarity than from generalizable coding ability.
Mercury 2 (Inception Labs): The Speed Outlier
Mercury 2 is not open-source, but it deserves mention because it represents a fundamentally different approach to code generation. Instead of autoregressive decoding (one token at a time), Mercury 2 uses diffusion-based parallel refinement, producing multiple tokens simultaneously.
The headline number: 1,000 tokens per second. For comparison, Claude 4.5 Haiku Reasoning outputs ~89 tokens/sec and GPT-5 Mini ~71 tokens/sec. Mercury 2's quality benchmarks: 91.1 on AIME 2025, 73.6 on GPQA, 67.3 on LiveCodeBench, 38.4 on SciCode.
Inception raised $50M from Menlo Ventures, NVIDIA, Microsoft, Snowflake, and Databricks. The API is OpenAI-compatible, so you can swap it into existing stacks without rewrites. Not a model you self-host, but if your bottleneck is iteration speed (prompt, review, tweak cycles), Mercury 2 removes the latency constraint.
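Throughput differences of this size change the feel of an interactive loop. A quick back-of-envelope comparison using the figures quoted above (streaming time only; prefill and network latency ignored):

```python
def generation_seconds(tokens: int, tok_per_sec: float) -> float:
    """Wall-clock time to stream a completion, ignoring prefill and network latency."""
    return tokens / tok_per_sec

# Time to stream a 2,000-token patch at the quoted throughputs:
for name, tps in [
    ("Mercury 2", 1000),
    ("Claude 4.5 Haiku Reasoning", 89),
    ("GPT-5 Mini", 71),
]:
    print(f"{name}: {generation_seconds(2000, tps):.1f}s")
```

At 1,000 tok/s the patch streams in about 2 seconds versus roughly 22-28 seconds for the autoregressive speed models, which is the difference between staying in flow and context-switching away.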
Legacy Models: Codestral and StarCoder 2
Two models that dominated 2024-2025 are still available but no longer competitive at the frontier.
Codestral (Mistral AI)
Codestral 25.08 is a 22B dense model with 256K context. It scores 86.6% on HumanEval and 91.2% on MBPP. It still leads the LMSYS Copilot Arena leaderboard for fill-in-the-middle completion, and Mistral positions it for latency-sensitive production environments. If you need fast FIM completion rather than agentic coding, Codestral remains strong. But for SWE-bench-class tasks, the MoE models above surpass it.
StarCoder 2
StarCoder 2 (15B, BigCode OpenRAIL-M license) was a milestone in 2024. It still works for cost-efficient code assistance with strict data-governance requirements, and the 15B model matches CodeLlama-34B on code-generation benchmarks. But in the context of 2026 frontier performance, it is firmly a budget option, not a competitive one.
Best For: Which Model for Which Use Case
Best overall: MiniMax M2.5
80.2% SWE-bench Verified. Closest to proprietary frontier. Strong tool calling (76.8% BFCL). 22.8min avg solve time. Requires 4x H100 or 2x B200.
Best MIT license: GLM-5
77.8% SWE-bench Verified, 94.2% HumanEval. Full MIT license. 744B/40B active. Best choice if license permissiveness is a hard requirement.
Best for local dev: Qwen3-Coder-Next
70.6% SWE-bench Verified with 3B active parameters. Runs on 46GB memory. 30B Flash variant runs on 18GB. The only frontier-adjacent model for consumer hardware.
Best for visual coding: Kimi K2.5
76.8% SWE-bench Verified, 85.0% LiveCodeBench. Native multimodal: image-to-code, video-to-code, visual debugging. Kimi Code integrates with VS Code, Cursor, Zed.
Best for reasoning: DeepSeek V3.2
73% SWE-bench Verified. IOI 2025 gold medal. V3.2-Speciale surpasses GPT-5 on reasoning benchmarks. Best when correctness on algorithmic tasks matters most.
Best for agentic workflows: Qwen3-Coder 480B
38.7% SWE-Bench Pro SEAL (2nd among open-source). Apache 2.0 license. 256K-1M context. Built for repository-scale multi-file operations.
Best for speed: Mercury 2 (not open-source)
1,000 tokens/sec via diffusion LM. 5x faster than autoregressive speed models. 67.3% LiveCodeBench. OpenAI-compatible API. Best for rapid iteration loops.
Best for FIM completion: Codestral 25.08
#1 on the LMSYS Copilot Arena. 22B dense model, 256K context. 86.6% HumanEval. Built for fill-in-the-middle, not agentic coding. Fast, focused, limited.
API Pricing Comparison
All of these models are available via hosted APIs. Open-source does not mean you have to self-host. The cost advantage over proprietary models is significant: 70-95% cheaper in most cases.
| Model | Input | Output | Notes |
|---|---|---|---|
| Qwen3-Coder 480B | $0.22 | $1.00 | Cheapest frontier option |
| MiniMax M2.5 | $0.30 | $1.20 | Lightning variant: $2.40 output, 2x speed |
| GLM-5 | $0.60 | $2.20 | Free tier at api.z.ai |
| Kimi K2.5 | ~$0.40 | ~$1.50 | Via Moonshot API |
| DeepSeek V3.2 | ~$0.50 | ~$1.80 | DeepSeek API + third-party providers |
| Mercury 2 | ~$0.30 | ~$1.00 | 1,000 tok/s throughput |
Prices from provider announcements and third-party hosting platforms. Actual costs vary by provider and volume. For comparison, Claude Opus 4.6 costs $15/$75 per million tokens and GPT-5.2 costs ~$15/$60.
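Per-workload costs follow directly from the table. A sketch with illustrative token volumes (the monthly volumes below are assumptions; the prices are from the table above):

```python
def job_cost_usd(input_m: float, output_m: float,
                 in_price: float, out_price: float) -> float:
    """Cost in USD given token volumes in millions and $/M-token prices."""
    return input_m * in_price + output_m * out_price

# Hypothetical month of agentic coding: 500M input tokens, 50M output tokens.
qwen = job_cost_usd(500, 50, 0.22, 1.00)    # Qwen3-Coder 480B
opus = job_cost_usd(500, 50, 15.00, 75.00)  # Claude Opus 4.6
print(f"Qwen3-Coder: ${qwen:,.0f}  Opus 4.6: ${opus:,.0f}  "
      f"({1 - qwen/opus:.1%} cheaper)")
```

At these volumes the open-weight option runs about $160 against over $11,000 for Opus 4.6, which is where the "95%+ cheaper" claim comes from.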
How WarpGrep Fits In
Every model above shares the same bottleneck on hard coding tasks: finding the right code in large repositories. On SWE-Bench Pro, context overflow causes 35.6% of failures for top models. Coding agents spend 60%+ of their time on search.
WarpGrep v2 is an RL-trained search subagent that runs alongside any coding model. It operates in its own context window, issues up to 8 parallel tool calls per turn, and returns only relevant file spans. The main model never sees files WarpGrep rejected, keeping context clean.
| Base Model | Without WarpGrep | With WarpGrep v2 | Delta |
|---|---|---|---|
| Codex 5.3 (CLI) | 57.0% | 59.1% | +2.1 |
| MiniMax M2.5 | 55.4% | 57.6% | +2.2 |
| Opus 4.6 | 55.4% | 57.5% | +2.1 |
WarpGrep is model-agnostic. It works with every model listed on this page, proprietary or open-source. Pairing it with Opus 4.6 makes the system 15.6% cheaper and 28% faster on SWE-Bench Pro tasks, because the expensive model spends less time doing its own search.
Frequently Asked Questions
What is the best open-source coding model in 2026?
MiniMax M2.5 leads SWE-bench Verified at 80.2%, within 0.6 points of Claude Opus 4.6. GLM-5 (77.8%) is the best MIT-licensed option. Qwen3-Coder-Next (70.6%) is the best model you can run locally on consumer hardware with only 3B active parameters.
Can open-source coding models compete with Claude and GPT?
Yes. MiniMax M2.5 at 80.2% is 0.6 points behind Claude Opus 4.6 (80.8%) and ahead of GPT-5.2 (80.0%). The gap has compressed to single-digit percentages across the board. On specific tasks (tool calling, LiveCodeBench), some open-source models already lead.
What hardware do I need to run Qwen3-Coder-Next locally?
46GB of unified memory for the full 80B model. 30GB with 2-bit quantization. The 30B Flash variant runs on 18GB (RTX 4090 or MacBook Pro with M2 Pro). Set context to 32K tokens if you hit memory limits.
What is the cheapest open-source coding model API?
Qwen3-Coder 480B at $0.22/M input, $1.00/M output. That is 95%+ cheaper than Claude Opus 4.6 ($15/$75 per million tokens).
What is Qwen3-Coder-Next vs Qwen3-Coder?
Qwen3-Coder 480B (35B active) optimizes for maximum benchmark performance. Qwen3-Coder-Next (80B total, 3B active) optimizes for local development efficiency, achieving 70.6% SWE-bench Verified with 10x fewer active parameters using hybrid attention and ultra-sparse MoE.
Is GLM-5 really fully open-source?
Yes. MIT license. Weights on Hugging Face and ModelScope. Free API at api.z.ai. 744B parameters, trained on Huawei Ascend chips. The most permissive license among the frontier open-source coding models.
Which model has the largest context window?
Qwen3-Coder models support 256K natively, extendable to 1M with YaRN. Kimi K2.5 supports 256K. GLM-5 supports 204K. MiniMax M2.5 supports 200K. DeepSeek V3.2 has the smallest at 128K.
What is Mercury 2?
Mercury 2 from Inception Labs uses diffusion-based language modeling instead of autoregressive decoding. It generates 1,000 tokens per second (5x faster than speed-optimized alternatives). It scores 67.3% on LiveCodeBench. It is not open-source but signals where fast inference is heading.
WarpGrep: Search Subagent for Any Coding Model
WarpGrep v2 lifts every model it is paired with by 2-4 points on SWE-Bench Pro. It runs in its own context window, issues 8 parallel tool calls per turn, and makes your coding agent cheaper and faster. Works with open-source and proprietary models.