February 2026 delivered three frontier open-source coding models in a single month. The gap between open-weight and proprietary models has compressed to single-digit percentages. This is the complete comparison, with real benchmarks, VRAM requirements, and API pricing.
The February 2026 Wave
Three things happened in February 2026 that changed the open-source coding model landscape:
- MiniMax M2.5 hit 80.2% on SWE-bench Verified, within 0.6 points of Claude Opus 4.6 (80.8%). A 229B MoE model with only 10B active parameters.
- GLM-5 from Zhipu AI shipped a 744B model under MIT license, scoring 77.8% SWE-bench Verified and 94.2% HumanEval. Trained entirely on Huawei Ascend chips.
- Qwen3-Coder-Next proved you could hit 70.6% SWE-bench Verified with 3B active parameters from an 80B MoE, running on a single machine with 46GB memory.
Add Kimi K2.5 (76.8%, launched late January), DeepSeek V3.2 (73%, late 2025), and the original Qwen3-Coder 480B (67-70%), and the open-source tier now has six models that would have led all benchmarks just 12 months ago.
The contamination caveat
OpenAI has stopped reporting SWE-bench Verified scores after confirming that all frontier models show training data contamination on the dataset. The numbers below are still useful for relative comparison, but absolute percentages should be taken with appropriate skepticism. SWE-Bench Pro (1,865 multi-language tasks) is the better benchmark for production readiness.
Master Comparison: Open-Source Coding Models (March 2026)
| Model | Total / Active Params | SWE-bench Verified | Context | License |
|---|---|---|---|---|
| MiniMax M2.5 | 229B / 10B | 80.2% | 200K | Modified MIT |
| GLM-5 | 744B / 40B | 77.8% | 204K | MIT |
| Kimi K2.5 | 1T / 32B | 76.8% | 256K | Modified MIT |
| DeepSeek V3.2 | 685B / ~37B | 73.0% | 128K | MIT |
| Qwen3-Coder-Next | 80B / 3B | 70.6% | 256K (1M ext.) | Apache 2.0 |
| Qwen3-Coder 480B | 480B / 35B | 67-70% | 256K (1M ext.) | Apache 2.0 |
| Codestral 25.08 | 22B / 22B (dense) | ~40%* | 256K | Non-production |
| StarCoder 2 15B | 15B / 15B (dense) | ~30%* | 16K | BigCode OpenRAIL-M |
* Estimated from HumanEval/MBPP, not official SWE-bench submissions. SWE-bench Verified scores are self-reported by model providers using varying scaffolds.
| Model | HumanEval | LiveCodeBench | SWE-bench Pro (SEAL) | VRAM (FP16) |
|---|---|---|---|---|
| MiniMax M2.5 | ~92% | ~65% | 36.8% (as M2.1) | 457GB |
| GLM-5 | 94.2% | 52.0% | 9.7% (as GLM-4.6) | ~1.5TB |
| Kimi K2.5 | ~90% | 85.0% | 27.7% (K2 Instruct) | ~2TB |
| DeepSeek V3.2 | ~92% | ~70% | 15.6% | ~1.4TB |
| Qwen3-Coder-Next | ~85% | ~60% | ~35% (est.) | 46GB |
| Qwen3-Coder 480B | ~90% | ~62% | 38.7% | ~960GB |
VRAM figures are for FP16 inference. Quantized deployments cut requirements by 50-75%. SWE-Bench Pro SEAL scores use standardized scaffolding with a 250-turn limit.
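The memory figures above follow a simple rule of thumb: weight footprint ≈ total parameters × bits per parameter ÷ 8, ignoring KV cache and activation overhead. A minimal sketch (the per-parameter bit widths and the ~3.5-bit effective rate for a 3-bit GGUF are approximations, not vendor specs):

```python
def weight_footprint_gb(total_params_b: float, bits_per_param: float) -> float:
    """Rough weight-only memory footprint in GB (ignores KV cache and activations)."""
    bytes_total = total_params_b * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# FP16 (16 bits/param) roughly reproduces the table's figures:
print(round(weight_footprint_gb(229, 16)))   # MiniMax M2.5 -> ~458 GB
print(round(weight_footprint_gb(744, 16)))   # GLM-5 -> ~1488 GB (~1.5 TB)

# A ~3.5-bit effective rate (3-bit weights plus per-block scales) lands near
# the ~101GB Unsloth GGUF size reported for M2.5:
print(round(weight_footprint_gb(229, 3.5)))  # -> ~100 GB
```

This is why "quantization cuts requirements by 50-75%" holds: the footprint scales linearly with bits per parameter.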
MiniMax M2.5: Highest SWE-bench Score Among Open Weights
MiniMax M2.5 is the open-weight model closest to proprietary frontier performance. At 80.2% on SWE-bench Verified, it sits 0.6 points behind Claude Opus 4.6 (80.8%) and ahead of GPT-5.2 (80.0%).
The architecture is a 229B MoE with 10B active parameters per token. Average end-to-end runtime per SWE-bench task dropped from 31.3 minutes (M2.1) to 22.8 minutes, on par with Claude Opus 4.6's 22.9 minutes.
MiniMax M2.5 also leads Opus 4.6 by over 13 points on the Berkeley Function Calling Leaderboard (76.8% vs ~63%), making it strong for agentic workflows that require tool calling.
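Tool-calling benchmarks like BFCL measure how reliably a model emits well-formed calls against function schemas. On OpenAI-compatible hosted endpoints, a tool definition typically looks like the sketch below; the `run_tests` function and its parameters are hypothetical, not part of any model's API:

```python
# Hypothetical tool definition in the OpenAI-compatible "tools" format.
# A list of these is passed as the `tools` parameter of a chat completion request.
run_tests_tool = {
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical function exposed by your agent
        "description": "Run the project's test suite and return failing cases.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "File or directory to test."}
            },
            "required": ["path"],
        },
    },
}
```

A model that scores well on BFCL reliably produces calls whose names and arguments validate against schemas like this one, which is what "strong for agentic workflows" means in practice.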
Hardware Requirements
Full FP16 inference requires ~457GB VRAM (2x NVIDIA B200 or 4x H100). With Unsloth dynamic 3-bit GGUF quantization, size drops to ~101GB. Not a model you run on a laptop, but deployable on a single multi-GPU node.
Real-world adoption
MiniMax reports that M2.5-generated code accounts for 80% of newly committed code in their internal development. The model also excels at office productivity tasks, achieving a 59% average win rate against mainstream models on Word, PowerPoint, and Excel financial-modeling tasks.
GLM-5 (Zhipu AI): MIT License, 744B Parameters, Trained on Ascend
GLM-5 is the largest open-source coding model available. At 744B total parameters (40B active), trained on 28.5 trillion tokens, it represents Zhipu AI's fifth-generation model. The headline: it was trained entirely on Huawei Ascend chips, not NVIDIA GPUs.
On SWE-bench Verified, GLM-5 hits 77.8%, the highest score for any MIT-licensed model. HumanEval Pass@1 at 94.2% slightly edges Claude Opus 4.5 (93.8%). The model supports 204,800 tokens of context and can generate up to 128,000 tokens in a single output.
Tradeoffs
GLM-5's LiveCodeBench score (52.0%) is notably lower than its SWE-bench performance. Its predecessor GLM-4.7 actually scored higher on LiveCodeBench (84.9%), suggesting GLM-5 was optimized for different capabilities. If your workload involves competitive programming or algorithm-heavy tasks, GLM-4.7 may still be the better Zhipu model.
Hardware Requirements
Full FP16 deployment requires ~1.5TB VRAM. The recommended setup is 8x H200 (1,128GB total, sufficient for FP8 weights). Minimum viable for FP16: 8x NVIDIA B200. GLM-5 integrates DeepSeek Sparse Attention for efficient long-context inference, but this is still a model that requires serious infrastructure.
Qwen3-Coder-Next: 70.6% SWE-bench with 3B Active Parameters
Qwen3-Coder-Next is the efficiency story of 2026. An 80B MoE model that activates only 3B parameters per token (technically 3.9B), achieving 70.6% on SWE-bench Verified. That puts it ahead of DeepSeek V3 (70.2%) while using 10-20x fewer active parameters.
The architecture uses 512 experts with 10 selected per token, plus a shared expert. Hybrid attention combines Gated DeltaNet and Gated Attention for efficient context modeling. The result: a model that runs on 46GB of unified memory (combined RAM + VRAM).
Running Locally
With 2-bit quantization via Unsloth, you need about 30GB. The 30B variant (Qwen3-Coder Flash) runs on 18GB for 6+ tokens/second. Both variants support 256K context natively, extendable to 1M tokens using YaRN.
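Context extension with YaRN is usually configured through the model's rope-scaling settings rather than retraining. The sketch below shows what the commonly documented `rope_scaling` entry looks like for a 4x extension from 256K toward 1M; the exact keys vary by model and framework version, so treat this as illustrative and check the model card:

```python
# Sketch of a YaRN rope-scaling entry as it commonly appears in config.json.
# Keys and values are illustrative; consult the model card before using.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                               # scale native context by 4x
    "original_max_position_embeddings": 262144,  # 256K native context
}
extended_context = int(
    rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"]
)
print(extended_context)  # 1,048,576 tokens (~1M)
```

YaRN trades some short-context fidelity for the longer window, which is why 256K stays the default and 1M is opt-in.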
On SWE-bench Multilingual, it scores 62.8%. On SWE-bench Pro, 44.3%. On Terminal-Bench 2.0, 36.2%. On SecCodeBench, 61.2%. These are strong numbers for a model designed to run on consumer hardware.
Best for local development
If you want to run a coding model locally without a GPU cluster, Qwen3-Coder-Next is the clear pick. A MacBook Pro with 48GB unified memory handles it. The 30B Flash variant works on 18GB VRAM (RTX 4090 or M2 Pro). No other model in this tier approaches this efficiency.
Qwen3-Coder 480B: The Full-Size Agentic Coder
Qwen3-Coder 480B is the full-size sibling. A 480B MoE with 35B active parameters, released under Apache 2.0 license. It was the first model in the Qwen3-Coder family, launched January 31, 2025.
On SWE-bench Verified, it scores 67.0% with standard scaffolding and 69.6% with OpenHands at 500 turns. On SWE-Bench Pro SEAL, it hits 38.7%, placing 8th overall and 2nd among open-source models behind MiniMax M2.1.
The 256K native context (extendable to 1M) supports full repository-scale operations. This is the model to run if you need maximum agentic coding capability from an open-source model and have the infrastructure for it (~960GB VRAM for FP16).
Kimi K2.5 (Moonshot AI): Visual Coding + LiveCodeBench Leader
Kimi K2.5 is Moonshot AI's open-weight model, released January 27, 2026. It is a 1-trillion-parameter MoE with 32B active per token. The differentiator: native multimodal capability with strong visual coding support.
On SWE-bench Verified, K2.5 scores 76.8%. But the standout metric is LiveCodeBench at 85.0%, significantly ahead of every other open-source model (and most proprietary ones, since Claude Opus 4.5 scores 64.0% on the same benchmark).
K2.5 was built through continual pretraining on ~15 trillion mixed visual and text tokens. It supports image-to-code, video-to-code, and visual debugging workflows. The companion tool, Kimi Code, integrates with VS Code, Cursor, and Zed.
License and Deployment
Released under modified MIT license. Weights are on Hugging Face. Supported by vLLM, SGLang, and TensorRT-LLM. The 1T total parameter count means full deployment requires substantial infrastructure (~2TB VRAM for FP16).
DeepSeek V3.2: Reasoning Depth and IOI Gold
DeepSeek V3.2 shipped in late 2025 and remains competitive. A 685B MoE under MIT license, it scores 73% on SWE-bench Verified. The high-compute variant (V3.2-Speciale) surpasses GPT-5 on reasoning benchmarks and achieved gold-medal performance at both the 2025 International Mathematical Olympiad and International Olympiad in Informatics.
V3.2 introduced DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism that reduces training and inference cost while maintaining output quality. The 128K context window is the smallest among the top-tier models here, which can be limiting for large codebases.
On SWE-Bench Pro SEAL, DeepSeek V3.2 scores 15.6%, well below the top models. This gap between Verified and Pro scores suggests the model benefits more from data familiarity than from generalizable coding ability.
Mercury 2 (Inception Labs): The Speed Outlier
Mercury 2 is not open-source, but it deserves mention because it represents a fundamentally different approach to code generation. Instead of autoregressive decoding (one token at a time), Mercury 2 uses diffusion-based parallel refinement, producing multiple tokens simultaneously.
The headline number: 1,000 tokens per second. For comparison, Claude 4.5 Haiku Reasoning outputs ~89 tokens/sec and GPT-5 Mini ~71 tokens/sec. Mercury 2's quality benchmarks: 91.1 on AIME 2025, 73.6 on GPQA, 67.3 on LiveCodeBench, 38.4 on SciCode.
Inception raised $50M from Menlo Ventures, NVIDIA, Microsoft, Snowflake, and Databricks. The API is OpenAI-compatible, so you can swap it into existing stacks without rewrites. Not a model you self-host, but if your bottleneck is iteration speed (prompt, review, tweak cycles), Mercury 2 removes the latency constraint.
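Throughput differences of this size change the feel of an interactive loop. A quick back-of-envelope comparison using the figures quoted above (streaming time only; prefill and network latency ignored):

```python
def generation_seconds(tokens: int, tok_per_sec: float) -> float:
    """Wall-clock time to stream a completion, ignoring prefill and network latency."""
    return tokens / tok_per_sec

# Time to stream a 2,000-token patch at the quoted throughputs:
for name, tps in [
    ("Mercury 2", 1000),
    ("Claude 4.5 Haiku Reasoning", 89),
    ("GPT-5 Mini", 71),
]:
    print(f"{name}: {generation_seconds(2000, tps):.1f}s")
```

At 1,000 tok/s the patch streams in about 2 seconds versus roughly 22-28 seconds for the autoregressive speed models, which is the difference between staying in flow and context-switching away.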
Legacy Models: Codestral and StarCoder 2
Two models that dominated 2024-2025 are still available but no longer competitive at the frontier.
Codestral (Mistral AI)
Codestral 25.08 is a 22B dense model with 256K context. It scores 86.6% on HumanEval and 91.2% on MBPP. It still leads the LMSYS Copilot Arena leaderboard for fill-in-the-middle completion, and Mistral positions it for latency-sensitive production environments. If you need fast FIM completion rather than agentic coding, Codestral remains strong. But for SWE-bench-class tasks, the MoE models above surpass it.
StarCoder 2
StarCoder 2 (15B, BigCode OpenRAIL-M license) was a milestone in 2024. It still works for cost-efficient code assistance with strict data-governance requirements, and the 15B model matches CodeLlama-34B on code-generation benchmarks. But in the context of 2026 frontier performance, it is firmly a budget option, not a competitive one.
Best For: Which Model for Which Use Case
Best overall: MiniMax M2.5
80.2% SWE-bench Verified. Closest to proprietary frontier. Strong tool calling (76.8% BFCL). 22.8min avg solve time. Requires 4x H100 or 2x B200.
Best MIT license: GLM-5
77.8% SWE-bench Verified, 94.2% HumanEval. Full MIT license. 744B/40B active. Best choice if license permissiveness is a hard requirement.
Best for local dev: Qwen3-Coder-Next
70.6% SWE-bench Verified with 3B active parameters. Runs on 46GB memory. 30B Flash variant runs on 18GB. The only frontier-adjacent model for consumer hardware.
Best for visual coding: Kimi K2.5
76.8% SWE-bench Verified, 85.0% LiveCodeBench. Native multimodal: image-to-code, video-to-code, visual debugging. Kimi Code integrates with VS Code, Cursor, Zed.
Best for reasoning: DeepSeek V3.2
73% SWE-bench Verified. IOI 2025 gold medal. V3.2-Speciale surpasses GPT-5 on reasoning benchmarks. Best when correctness on algorithmic tasks matters most.
Best for agentic workflows: Qwen3-Coder 480B
38.7% SWE-Bench Pro SEAL (2nd among open-source). Apache 2.0 license. 256K-1M context. Built for repository-scale multi-file operations.
Best for speed: Mercury 2 (not open-source)
1,000 tokens/sec via diffusion LM. 5x faster than autoregressive speed models. 67.3% LiveCodeBench. OpenAI-compatible API. Best for rapid iteration loops.
Best for FIM completion: Codestral 25.08
#1 on the LMSYS Copilot Arena. 22B dense model, 256K context. 86.6% HumanEval. Built for fill-in-the-middle, not agentic coding. Fast, focused, limited.
API Pricing Comparison
All of these models are available via hosted APIs. Open-source does not mean you have to self-host. The cost advantage over proprietary models is significant: 70-95% cheaper in most cases.
| Model | Input | Output | Notes |
|---|---|---|---|
| Qwen3-Coder 480B | $0.22 | $1.00 | Cheapest frontier option |
| MiniMax M2.5 | $0.30 | $1.20 | Lightning variant: $2.40 output, 2x speed |
| GLM-5 | $0.60 | $2.20 | Free tier at api.z.ai |
| Kimi K2.5 | ~$0.40 | ~$1.50 | Via Moonshot API |
| DeepSeek V3.2 | ~$0.50 | ~$1.80 | DeepSeek API + third-party providers |
| Mercury 2 | ~$0.30 | ~$1.00 | 1,000 tok/s throughput |
Prices from provider announcements and third-party hosting platforms. Actual costs vary by provider and volume. For comparison, Claude Opus 4.6 costs $15/$75 per million tokens and GPT-5.2 costs ~$15/$60.
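Per-workload costs follow directly from the table. A sketch with illustrative token volumes (the monthly volumes below are assumptions; the prices are from the table above):

```python
def job_cost_usd(input_m: float, output_m: float,
                 in_price: float, out_price: float) -> float:
    """Cost in USD given token volumes in millions and $/M-token prices."""
    return input_m * in_price + output_m * out_price

# Hypothetical month of agentic coding: 500M input tokens, 50M output tokens.
qwen = job_cost_usd(500, 50, 0.22, 1.00)    # Qwen3-Coder 480B
opus = job_cost_usd(500, 50, 15.00, 75.00)  # Claude Opus 4.6
print(f"Qwen3-Coder: ${qwen:,.0f}  Opus 4.6: ${opus:,.0f}  "
      f"({1 - qwen/opus:.1%} cheaper)")
```

At these volumes the open-weight option runs about $160 against over $11,000 for Opus 4.6, which is where the "95%+ cheaper" claim comes from.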
How WarpGrep Fits In
Every model above shares the same bottleneck on hard coding tasks: finding the right code in large repositories. On SWE-Bench Pro, context overflow causes 35.6% of failures for top models. Coding agents spend 60%+ of their time on search.
WarpGrep v2 is an RL-trained search subagent that runs alongside any coding model. It operates in its own context window, issues up to 8 parallel tool calls per turn, and returns only relevant file spans. The main model never sees files WarpGrep rejected, keeping context clean.
| Base Model | Without WarpGrep | With WarpGrep v2 | Delta |
|---|---|---|---|
| Codex 5.3 (CLI) | 57.0% | 59.1% | +2.1 |
| MiniMax M2.5 | 55.4% | 57.6% | +2.2 |
| Opus 4.6 | 55.4% | 57.5% | +2.1 |
WarpGrep is model-agnostic. It works with every model listed on this page, proprietary or open-source. Pairing it with Opus 4.6 makes the system 15.6% cheaper and 28% faster on SWE-Bench Pro tasks, because the expensive model spends less time doing its own search.
Frequently Asked Questions
What is the best open-source coding model in 2026?
MiniMax M2.5 leads SWE-bench Verified at 80.2%, within 0.6 points of Claude Opus 4.6. GLM-5 (77.8%) is the best MIT-licensed option. Qwen3-Coder-Next (70.6%) is the best model you can run locally on consumer hardware with only 3B active parameters.
Can open-source coding models compete with Claude and GPT?
Yes. MiniMax M2.5 at 80.2% is 0.6 points behind Claude Opus 4.6 (80.8%) and ahead of GPT-5.2 (80.0%). The gap has compressed to single-digit percentages across the board. On specific tasks (tool calling, LiveCodeBench), some open-source models already lead.
What hardware do I need to run Qwen3-Coder-Next locally?
46GB of unified memory for the full 80B model. 30GB with 2-bit quantization. The 30B Flash variant runs on 18GB (RTX 4090 or MacBook Pro with M2 Pro). Set context to 32K tokens if you hit memory limits.
What is the cheapest open-source coding model API?
Qwen3-Coder 480B at $0.22/M input, $1.00/M output. That is 95%+ cheaper than Claude Opus 4.6 ($15/$75 per million tokens).
What is Qwen3-Coder-Next vs Qwen3-Coder?
Qwen3-Coder 480B (35B active) optimizes for maximum benchmark performance. Qwen3-Coder-Next (80B total, 3B active) optimizes for local development efficiency, achieving 70.6% SWE-bench Verified with 10x fewer active parameters using hybrid attention and ultra-sparse MoE.
Is GLM-5 really fully open-source?
Yes. MIT license. Weights on Hugging Face and ModelScope. Free API at api.z.ai. 744B parameters, trained on Huawei Ascend chips. The most permissive license among the frontier open-source coding models.
Which model has the largest context window?
Qwen3-Coder models support 256K natively, extendable to 1M with YaRN. Kimi K2.5 supports 256K. GLM-5 supports 204K. MiniMax M2.5 supports 200K. DeepSeek V3.2 has the smallest at 128K.
What is Mercury 2?
Mercury 2 from Inception Labs uses diffusion-based language modeling instead of autoregressive decoding. It generates 1,000 tokens per second (5x faster than speed-optimized alternatives). It scores 67.3% on LiveCodeBench. It is not open-source but signals where fast inference is heading.
WarpGrep: Search Subagent for Any Coding Model
WarpGrep v2 lifts every model it is paired with by 2-4 points on SWE-Bench Pro. It runs in its own context window, issues 8 parallel tool calls per turn, and makes your coding agent cheaper and faster. Works with open-source and proprietary models.