MiniMax M2.5 for Coding: The $1/Hour Frontier Model That Matches Opus

MiniMax M2.5 scores 80.2% on SWE-Bench Verified, within 0.6 points of Claude Opus 4.6, at 1/20th the cost. 229B MoE with 10B active params. Open-source on HuggingFace.

March 4, 2026 · 2 min read

Summary

MiniMax M2.5 at a Glance (Feb 2026)

  • What it is: 229B MoE model (10B active params), open-source on HuggingFace, trained on 10+ languages across 200K+ real-world environments
  • Why it matters: 80.2% SWE-Bench Verified at $0.15/task. Opus 4.6 scores 80.8% at $3.00/task. The accuracy gap is 0.6 points. The cost gap is 20x.
  • Two variants: M2.5 (50 tok/sec, $1.20/M output) and M2.5-Lightning (100 tok/sec, $2.40/M output). Same capability, different speed.
Key numbers: 80.2% SWE-Bench Verified (vs Opus 80.8%) · $0.15 per SWE-Bench task (vs $3.00 Opus) · 229B total params, 10B active (MoE) · 100 tok/sec native throughput (Lightning)

MiniMax shipped M2, M2.1, and M2.5 in three and a half months, from late October to mid-February 2026. Each version closed the gap to proprietary frontier models. M2.5 closed it almost entirely for coding tasks, while remaining 20x cheaper. The model weights are open, the API is cheap, and GGUF quantizations are available for local deployment.

Benchmark Breakdown

| Benchmark | MiniMax M2.5 | Claude Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| SWE-Bench Verified | 80.2% | 80.8% | 80.0% |
| Multi-SWE-Bench | 51.3% | 50.3% | N/A |
| BFCL Multi-Turn | 76.8% | 63.3% | N/A |
| BrowseComp | 76.3% | N/A | N/A |
| Cost per SWE-Bench task | ~$0.15 | ~$3.00 | ~$2.50 |
| Task completion speed | 22.8 min avg | 22.9 min avg | N/A |

Two numbers stand out. First, M2.5 leads Multi-SWE-Bench at 51.3% vs Opus's 50.3%. Multi-SWE-Bench tests multi-file, cross-language tasks, the kind of work that most closely resembles real software engineering. Second, M2.5 scores 76.8% on BFCL multi-turn function calling vs Opus's 63.3%. For agentic workflows that rely on tool use, that 13.5-point gap is significant.

Benchmark Context

MiniMax tested SWE-Bench using Claude Code as the scaffolding. This means M2.5's scores reflect performance inside a specific agent harness, not raw model capability in isolation. Different scaffolding (Aider, OpenHands, custom agents) may yield different results. The agent-native training via Forge is designed to minimize this variance, but treat all SWE-Bench scores as scaffold-dependent.

Architecture: Why MoE Makes This Possible

M2.5 has 229 billion total parameters but only activates 10 billion per token. It routes each token through 8 of 256 available experts. This is how MiniMax achieves frontier accuracy at a fraction of the compute cost: most of the model sits idle on any given inference step.

256 Experts, 8 Active

Each token activates only 8 of 256 expert networks. Only about 10B parameters are active per inference step, keeping FLOPs low while total knowledge capacity stays at 229B.

Agent-Native Training (Forge)

MiniMax's Forge framework decouples the training engine from agent scaffolding. The model optimizes for generalization across frameworks, not a single tool. Arbitrary agents can be integrated during RL training.

CISPO for MoE Stability

MiniMax uses the CISPO algorithm to keep expert routing stable during large-scale training. Without this, MoE models suffer from expert collapse where a few experts handle all tokens.

40x Training Speedup

Asynchronous scheduling and tree-structured sample merging achieved a ~40x training speedup. This is how MiniMax shipped three model versions in 3.5 months.

The MoE architecture explains why M2.5 can be both cheap and accurate. Dense models like Opus activate every parameter on every token. MoE models select a small subset of specialists per token. The trade-off is complexity in routing and training stability. MiniMax solved the training stability problem with CISPO. They solved the routing problem with 256 fine-grained experts instead of the 8-16 experts used by earlier MoE models like Mixtral.
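
To make the routing concrete, here is a minimal top-k gating sketch in PyTorch. It mirrors the mechanism described above (score all 256 experts, keep the best 8, renormalize their weights); M2.5's actual router, gating function, and load-balancing losses are not detailed in this post, so treat the specifics as illustrative.

import torch
import torch.nn.functional as F

def route_tokens(hidden_states: torch.Tensor, router_weight: torch.Tensor, top_k: int = 8):
    # hidden_states: [num_tokens, d_model]; router_weight: [d_model, num_experts] (256 for M2.5)
    logits = hidden_states @ router_weight            # score every expert for every token
    top_vals, top_idx = logits.topk(top_k, dim=-1)    # keep only the 8 best-scoring experts
    gates = F.softmax(top_vals, dim=-1)               # renormalize weights over the selected experts
    return top_idx, gates                             # only these experts run their FFN pass

Only the experts indexed by top_idx execute their feed-forward pass for that token, which is why roughly 10B of the 229B parameters are touched per step.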

Pricing: The 20x Advantage

| | MiniMax M2.5 | M2.5-Lightning | Claude Opus 4.6 |
|---|---|---|---|
| Input tokens | $0.30/M | $0.30/M | $15.00/M |
| Output tokens | $1.20/M | $2.40/M | $75.00/M |
| Throughput | 50 tok/sec | 100 tok/sec | ~50 tok/sec |
| 1 hour continuous use | $0.30 | $1.00 | ~$20 |
| SWE-Bench task cost | ~$0.15 | ~$0.30 | ~$3.00 |

At $0.30/M input and $1.20/M output, M2.5 is 50x cheaper than Opus on input tokens and 62x cheaper on output tokens. The M2.5-Lightning variant doubles throughput to 100 tok/sec for double the output cost, still 31x cheaper than Opus. For a team running 100 agentic coding tasks per day, M2.5 costs ~$15/day. The same workload on Opus costs ~$300/day. Over a month, that is $450 vs $9,000.

Self-Hosting Option

Because M2.5 weights are open-source on HuggingFace, teams with GPU infrastructure can self-host. Community GGUF quantizations from Unsloth bring the model to smaller hardware. Self-hosting eliminates per-token costs entirely, leaving only compute and memory costs. For high-volume use cases, the break-even point vs API pricing arrives quickly.
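
To see roughly where that break-even sits, here is a back-of-the-envelope sketch. The GPU price and node size are placeholder assumptions, not MiniMax figures; plug in your own numbers.

# Hypothetical self-hosting break-even vs the ~$0.15/task API figure above.
# GPU_HOUR_COST and GPUS_PER_NODE are illustrative assumptions only.
API_COST_PER_TASK = 0.15                 # from the pricing table
GPU_HOUR_COST = 2.00                     # assumed cloud price per GPU-hour
GPUS_PER_NODE = 8                        # assumed node size for serving the 229B MoE
node_cost_per_day = GPU_HOUR_COST * GPUS_PER_NODE * 24           # $384/day at these assumptions
breakeven_tasks_per_day = node_cost_per_day / API_COST_PER_TASK  # ~2,560 tasks/day
print(round(breakeven_tasks_per_day))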

Stat Comparison: M2.5 vs Opus 4.6

Beyond raw benchmarks, these models differ in how they handle different types of coding work.

🔥 MiniMax M2.5: the open-source cost destroyer

Best for: high-volume agentic tasks, cost-sensitive teams, self-hosted deployments, multi-turn tool use.

"Frontier coding at 1/20th the cost. Open weights included."

🧠 Claude Opus 4.6: the reasoning heavyweight

Best for: complex reasoning tasks, terminal-heavy workflows, large-scale tool orchestration, high-stakes production code.

"Best absolute accuracy, but the cost premium is hard to justify at scale."

Where M2.5 Wins

High-Volume Agentic Tasks

At $0.15/task, M2.5 makes it economical to run hundreds of agentic coding tasks per day. Bug triage, test generation, code review, and migration scripts become viable at scale.

Multi-Turn Tool Calling

76.8% on BFCL multi-turn vs Opus's 63.3%. For agent workflows that chain multiple tool calls across conversation turns, M2.5 is measurably more reliable.

Multi-Language Codebases

51.3% on Multi-SWE-Bench vs Opus's 50.3%. Trained on 10+ languages across 200K+ environments. For polyglot projects mixing TypeScript, Python, Rust, and Go, M2.5 handles the context switches better.

Open-Source Flexibility

Full weights on HuggingFace. Fine-tune on your codebase, self-host for zero per-token cost, or run quantized versions locally. Opus is API-only with no weight access.

The cost advantage compounds when you consider subagent architectures. Running 4 parallel subagents on Opus costs $12/task. On M2.5, the same 4-agent setup costs $0.60. Multi-agent workflows, where dedicated context windows prevent context pollution, are the direction coding agents are moving. M2.5 makes that architecture economically viable for teams that cannot justify Opus-level spend.

Where Opus 4.6 Still Wins

Terminal-Heavy Workflows

Opus 4.6 leads Terminal-Bench 2.0 at 65.4%. For DevOps automation, shell scripting, and CLI tool building, Opus produces more reliable output in terminal environments.

Large-Scale Tool Orchestration

Opus leads MCP Atlas at 62.7% for coordinating many tools simultaneously. When your agent needs to juggle 20+ tools in a single session, Opus handles the coordination better.

Complex Reasoning Chains

Opus scores 72.7% on OSWorld. For tasks that require deep reasoning across many steps, like debugging race conditions or architecting distributed systems, Opus's reasoning depth is hard to match.

Instruction Following

Opus is more reliable at following detailed specs without drifting. When your workflow requires deterministic, plan-adherent outputs, the extra cost buys consistency.

The honest assessment: Opus 4.6 is a better model. M2.5 is a better value. For 95% of coding tasks, you will not notice the 0.6-point accuracy difference. For the remaining 5% (complex terminal workflows, massive tool orchestration, deep architectural reasoning), Opus is worth the premium. The smart approach is using M2.5 for volume and Opus for the hard problems.

Using M2.5 with WarpGrep

Model quality is one variable. Search quality is the other. Coding agents spend the majority of their tokens navigating codebases, not writing code. WarpGrep is an MCP server that gives any coding agent parallel, sub-6-second codebase search. It works with M2.5 the same way it works with Opus or GPT-5.2.

M2.5 + WarpGrep via MCP

# WarpGrep works as an MCP server with any model
# Aider, Continue, OpenHands, Claude Code all support MCP

# In your MCP config:
{
  "mcpServers": {
    "warpgrep": {
      "command": "npx",
      "args": ["warpgrep-mcp"]
    }
  }
}

# M2.5 can then call WarpGrep tools:
# - search_codebase: parallel semantic + keyword search
# - find_references: cross-file dependency tracing
# - explore_architecture: structural codebase understanding

# The search happens in <6 seconds across 100K+ file repos
# M2.5's strong tool calling (76.8% BFCL) means
# reliable WarpGrep integration out of the box
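
Before handing the server to an agent, you can sanity-check it with the official MCP Python SDK. The sketch below launches the same command from the config above and lists the exposed tools; the "query" argument passed to search_codebase is an assumed name, so check the tool's declared schema for the real parameters.

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the same server the MCP config above points at
    server = StdioServerParameters(command="npx", args=["warpgrep-mcp"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # expect search_codebase, find_references, ...
            # "query" is an assumed argument name; verify against the tool schema
            result = await session.call_tool("search_codebase", {"query": "auth middleware"})
            print(result)

asyncio.run(main())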

On SWE-bench Pro, Opus 4.6 + WarpGrep v2 scores 57.5%, up from 55.4% stock. That 2.1-point improvement comes entirely from better search, not a better model. The same principle applies to M2.5: pair a strong model with strong search, and performance on real-world tasks improves more than switching to a more expensive model with worse search.

Frequently Asked Questions

How does MiniMax M2.5 compare to Claude Opus 4.6 for coding?

M2.5 scores 80.2% on SWE-Bench Verified vs Opus's 80.8%, a 0.6-point gap. M2.5 leads on Multi-SWE-Bench (51.3% vs 50.3%) and multi-turn function calling (BFCL: 76.8% vs 63.3%). Opus leads on Terminal-Bench 2.0 and MCP Atlas. The defining difference is cost: $0.15/task on M2.5 vs $3.00 on Opus. Both models complete SWE-Bench tasks in ~23 minutes on average.

Is MiniMax M2.5 open source?

Yes. Full weights are on HuggingFace under MiniMaxAI/MiniMax-M2.5. Community GGUF quantizations are available from Unsloth and others. You can self-host, fine-tune, or run quantized versions locally. The API is also available at $0.30/M input and $1.20/M output tokens.
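
A minimal way to pull the weights is huggingface_hub's snapshot_download; expect a very large download for a 229B-parameter checkpoint, and point at the relevant community repo instead if you want a GGUF build.

from huggingface_hub import snapshot_download

# Downloads the full open weights referenced above to the local HF cache
local_dir = snapshot_download("MiniMaxAI/MiniMax-M2.5")
print(local_dir)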

What does "agent-native" mean?

MiniMax built a training framework called Forge that decouples the RL training engine from agent scaffolding. Instead of training the model to work well in one specific agent system, Forge introduces an intermediary layer that supports arbitrary agent integrations during training. The result is a model that generalizes across Claude Code, Aider, Continue, OpenHands, and other agent frameworks without scaffold-specific tuning.

Should I use M2.5 or M2.5-Lightning?

Same model, different speed/cost trade-off. M2.5 runs at 50 tok/sec for $1.20/M output tokens. M2.5-Lightning runs at 100 tok/sec for $2.40/M output. Use Lightning for interactive development where latency matters. Use standard M2.5 for batch/background agentic tasks where cost matters more than speed.

Can I fine-tune M2.5 on my codebase?

Yes. Open weights mean full fine-tuning, LoRA, or QLoRA on your proprietary code. For teams with domain-specific patterns (internal frameworks, custom APIs, proprietary DSLs), fine-tuning M2.5 can close the remaining gap to Opus on your specific workload. Unsloth provides optimized fine-tuning scripts for M2.5.
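
If you are not using Unsloth's scripts, a bare-bones LoRA setup with Hugging Face PEFT looks roughly like the sketch below. It assumes the M2.5 architecture loads through transformers and that the projection layer names listed actually exist in the checkpoint; verify both against the model card, and expect multi-GPU hardware for a 229B-parameter base.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "MiniMaxAI/MiniMax-M2.5"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="auto")

# target_modules are an assumption; check the actual layer names in the checkpoint
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # adapters are a tiny fraction of the 229B total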

WarpGrep Boosts Any Model's Coding Performance

Opus 4.6 + WarpGrep v2 scores 57.5% on SWE-bench Pro, up from 55.4% stock. WarpGrep works as an MCP server with M2.5, Opus, GPT-5.2, and any agent that supports MCP. Better search = better context = better code, regardless of which model you choose.

Sources