February 2026 gave us two frontier open-weight models from Chinese AI labs on back-to-back days. Zhipu AI shipped GLM-5 on February 11. MiniMax shipped M2.5 on February 12. Both are Mixture-of-Experts architectures. Both use MIT licenses. Both claim frontier performance.
But they solve different problems. GLM-5 is a 744B-parameter reasoning heavyweight trained on Huawei Ascend chips, scoring 92.7% on AIME 2026 and 50.4 on Humanity's Last Exam. MiniMax M2.5 is a leaner 230B-parameter coding machine that hits 80.2% on SWE-bench Verified at roughly a third of GLM-5's price per token.
This guide breaks down the benchmarks, pricing, and real-world performance so you can pick the right model for your workload.
TL;DR
- Pick GLM-5 if you need top-tier reasoning, math, and science. It leads Arena Elo at 1451, scores 92.7% on AIME 2026, and has the lowest hallucination rate among open-weight models.
- Pick MiniMax M2.5 if you need fast, cheap coding agents. 80.2% SWE-bench, 51.3% Multi-SWE-bench, and $1.20/M output tokens (vs GLM-5's $3.20).
- Both are MIT-licensed, open-weight, and available on OpenRouter and multiple global providers.
Quick Comparison
| Spec | GLM-5 (Zhipu AI) | MiniMax M2.5 |
|---|---|---|
| Release Date | Feb 11, 2026 | Feb 12, 2026 |
| Total Parameters | 744B | 230B |
| Active Parameters | 40B | 10B |
| Architecture | MoE | MoE |
| Context Window | 200K tokens | 205K tokens |
| Training Data | 28.5T tokens | Not disclosed |
| License | MIT | MIT |
| Input Price (per 1M) | $1.00 | $0.30 |
| Output Price (per 1M) | $3.20 | $1.20 |
| Output Speed | ~71 tok/s | 50-100 tok/s |
| SWE-bench Verified | 77.8% | 80.2% |
| AIME 2026 | 92.7% | N/A |
| Arena Elo | 1451 | TBD |
| Multimodal | Text + Vision + Audio + Video | Text only (API) |
Architecture: Heavyweight vs Lightweight MoE
Both models use Mixture-of-Experts, but at very different scales.
GLM-5 packs 744B total parameters with 40B active per forward pass (5.4% activation ratio). It was trained entirely on Huawei Ascend chips, making it the first frontier model built without Nvidia hardware. Zhipu fed it 28.5 trillion tokens and deployed Dynamic Sparse Attention (DSA) to keep inference costs manageable at 200K context.
MiniMax M2.5 is much smaller: 230B total, 10B active (4.3% activation ratio). This makes it significantly cheaper to serve and easier to self-host. MiniMax offers two inference tiers: Standard at 50 tok/s and Lightning at 100 tok/s.
The practical difference: GLM-5 can bring more knowledge and reasoning depth per query. M2.5 can process more queries per dollar. For agent loops that make dozens of calls per task, M2.5's efficiency compounds fast.
Reasoning: GLM-5 Dominates
GLM-5 is the stronger reasoning model, and it is not close.
| Benchmark | GLM-5 | MiniMax M2.5 | For Reference |
|---|---|---|---|
| AIME 2026 I | 92.7% | N/A | Opus 4.5: 93.3% |
| GPQA-Diamond | 86.0% | N/A | Opus 4.5: 87.0% |
| Humanity's Last Exam | 50.4 (tools) | N/A | GPT-5.2: 45.5 |
| HMMT Nov 2025 | 96.9% | N/A | |
| MMLU | N/A | 85% | |
| MMLU Pro | N/A | 76.5% | |
| IFEval | 88% | N/A | |
| AA Intelligence Index | 50 (#1) | 42 (#5) | Median: 26 |
GLM-5's 50.4 on Humanity's Last Exam (with tools) beats both GPT-5.2 (45.5) and Claude Opus 4.5 (43.4). On AIME 2026, it trails Opus 4.5 by just 0.6 points. Its Artificial Analysis Intelligence Index score of 50 ranks first among all 66 tracked models.
Zhipu also claims a 35-point improvement in hallucination rate over GLM-4.5, measured on the AA-Omniscience evaluation. For tasks that need factual accuracy, like technical documentation, research assistance, or knowledge base construction, GLM-5 is the safer choice.
MiniMax M2.5 was not designed to compete here. Its MMLU score of 85% is solid but not frontier. MiniMax focused its training budget on coding and agent execution instead.
Coding: MiniMax M2.5 Leads
This is where M2.5 pulls ahead.
| Benchmark | GLM-5 | MiniMax M2.5 | For Reference |
|---|---|---|---|
| SWE-bench Verified | 77.8% | 80.2% | Opus 4.6: 80.8% |
| Multi-SWE-bench | N/A | 51.3% | Opus 4.6: 50.3% |
| SWE-bench Multilingual | 73.3% | N/A | |
| BrowseComp (w/context) | 75.9% | 76.3% | |
| BFCL Multi-Turn | N/A | 76.8% | |
| Terminal-Bench 2.0 | 56.2% | N/A | |
| HumanEval | 90% | N/A | |
| MCP Atlas | 67.8% | N/A | |
M2.5's 80.2% on SWE-bench Verified puts it neck-and-neck with Claude Opus 4.6 (80.8%) and ahead of GPT-5.2 (80.0%). On Multi-SWE-bench, which extends SWE-bench-style issue resolution to multiple programming languages, M2.5 actually leads Opus 4.6 (51.3% vs 50.3%).
More importantly, M2.5 is fast. It completes a single SWE-bench task in 22.8 minutes, 37% faster than its predecessor M2.1. MiniMax attributes this to its "Spec-writing" coding style: the model breaks down architecture before implementing, reducing trial-and-error loops that burn tokens.
GLM-5 is no slouch at coding. Its 77.8% SWE-bench and 90% HumanEval scores are strong. But it was not optimized for speed or cost-efficiency in agent loops the way M2.5 was.
Real-World Coding: Kilo Code's Head-to-Head Test
Kilo Code ran both models through three TypeScript tasks: bug hunting, legacy refactoring, and API implementation from an OpenAPI spec. The results are revealing.
Test 1: Bug Hunt (30 points)
Find and fix 8 bugs in a Node.js/Hono task API with race conditions, SQL injection, and JWT vulnerabilities. Both models scored 28.5/30 and found all 8 bugs. M2.5 wrote better documentation for each fix. GLM-5 made unnecessary changes beyond the minimal fix, like removing a 100ms delay alongside a race condition transaction fix.
Test 2: Legacy Refactoring (35 points)
Convert callback-based Express code to async/await. GLM-5 scored 34/35, using industry-standard libraries like express-validator and creating custom error classes. M2.5 scored 28/35, building a custom validation system that missed email format checks and applied error handling inconsistently. GLM-5 lost one point for changing an endpoint path, breaking API compatibility.
Test 3: API Implementation (35 points)
Build 27 endpoints from an OpenAPI spec. GLM-5 scored a perfect 35/35 with 94 test cases covering authentication, CRUD, authorization, pagination, and input validation. M2.5 scored 28/35 with a critical authorization bug (checking the wrong project ID) and only 13 test cases.
Speed vs Thoroughness
GLM-5 took 44 minutes total. M2.5 took 21 minutes. GLM-5 wrote 7x more tests but spent twice the time. For prototyping and iteration, M2.5's speed wins. For production-quality code that needs to be right on the first pass, GLM-5's thoroughness pays off.
Multimodal: GLM-5 Only
GLM-5 is natively multimodal from pre-training. Text, images, audio, and video are processed through interconnected pathways in a single inference pass. Zhipu reports state-of-the-art performance on MMMU (visual reasoning) and MathVista (mathematical visual understanding), outperforming GPT-5.2 on both.
MiniMax M2.5's API currently accepts text input only. MiniMax has separate models for image generation (Image-01) and other modalities, but M2.5 itself is text-in, text-out.
If your pipeline involves image understanding, chart analysis, or video processing alongside text, GLM-5 is the only option between these two.
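If your provider exposes GLM-5 through an OpenAI-compatible vision endpoint, a mixed text-and-image request looks roughly like the sketch below. The base URL, model slug, and vision support over this particular route are assumptions to verify against your provider's docs.

```python
# Hypothetical multimodal request to GLM-5 via an OpenAI-compatible endpoint.
# Base URL and model slug are placeholders; confirm them with your provider.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

reply = client.chat.completions.create(
    model="z-ai/glm-5",  # illustrative slug, not a confirmed listing name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(reply.choices[0].message.content)
```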
Pricing: M2.5 Wins by a Wide Margin
| Metric | GLM-5 | MiniMax M2.5 |
|---|---|---|
| Input (per 1M tokens) | $1.00 | $0.30 |
| Output (per 1M tokens) | $3.20 | $1.20 |
| Blended (3:1 ratio) | $1.55 | $0.53 |
| Cost for 1hr @ 100 tok/s | ~$3.20 | ~$1.00 |
M2.5 is 2.7x cheaper on output tokens and 3.3x cheaper on input tokens. For agent loops that generate heavy output (like SWE-bench tasks), this adds up fast. MiniMax claims $1 runs the model continuously for an hour at 100 tok/s.
For context: Claude Opus 4.6 charges $5/$25 per million tokens (input/output). GPT-5.2 is roughly $1.75/$14. Both GLM-5 and M2.5 are dramatically cheaper than Western frontier models. M2.5 in particular is among the cheapest frontier-tier coding models available anywhere.
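The blended figures in the table are just the 3:1 weighted average of the listed input and output rates. A quick sketch to reproduce them:

```python
# Reproduce the "Blended (3:1 ratio)" prices from the table above.
def blended_price(input_per_m: float, output_per_m: float, ratio: float = 3.0) -> float:
    """Average price per 1M tokens, assuming `ratio` input tokens per output token."""
    return (ratio * input_per_m + output_per_m) / (ratio + 1)

print(f"GLM-5:        ${blended_price(1.00, 3.20):.3f}/M")  # $1.550/M
print(f"MiniMax M2.5: ${blended_price(0.30, 1.20):.3f}/M")  # $0.525/M, i.e. the table's $0.53
```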
Cost Per SWE-bench Task
M2.5 completes SWE-bench tasks in ~23 minutes. At its pricing, a full SWE-bench Verified run (500 tasks) costs a fraction of what the same run costs on Claude or GPT. If you are running large-scale evaluations or operating coding agents at volume, this pricing difference is not trivial.
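For a rough sense of scale, here is a back-of-the-envelope estimate. The per-task token counts are hypothetical placeholders, not published figures, so treat the outputs as order-of-magnitude only.

```python
# Hypothetical per-task cost estimate for an agentic SWE-bench-style run.
# Token counts are placeholder assumptions, NOT published numbers.
PRICES = {  # (input $/M tokens, output $/M tokens)
    "MiniMax M2.5":    (0.30, 1.20),
    "GLM-5":           (1.00, 3.20),
    "Claude Opus 4.6": (5.00, 25.00),
}

INPUT_TOKENS_PER_TASK = 2_000_000   # assumed: repo context re-read across agent turns
OUTPUT_TOKENS_PER_TASK = 150_000    # assumed: patches, reasoning, tool calls

for model, (inp, out) in PRICES.items():
    cost = (INPUT_TOKENS_PER_TASK / 1e6) * inp + (OUTPUT_TOKENS_PER_TASK / 1e6) * out
    print(f"{model}: ~${cost:.2f} per task, ~${cost * 500:,.0f} for a 500-task run")
```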
API Availability: Both Are Globally Accessible
Despite being from Chinese labs, both models are easy to access globally.
GLM-5
- Official API: chat.z.ai (Zhipu's platform)
- OpenRouter: Available with OpenAI-compatible API
- Inference providers: DeepInfra, Fireworks, Together.ai, SiliconFlow, Novita, GMI Cloud, Parasail, Google
- Hugging Face: zai-org/GLM-5 (full weights)
MiniMax M2.5
- Official API: api.minimax.io (global) / api.minimaxi.com (China)
- OpenRouter: Available with OpenAI-compatible API
- Inference providers: DeepInfra, Together.ai, NVIDIA NIM, and others
- Hugging Face: MiniMaxAI/MiniMax-M2.5 (full weights)
Both work with any tool that supports OpenAI-compatible endpoints. If your coding agent (Aider, Cline, Claude Code via proxy, etc.) accepts a custom API base URL, you can point it at either model through OpenRouter or any provider.
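As a concrete sketch, here is the same idea with the OpenAI Python SDK pointed at OpenRouter. The model slugs are illustrative guesses at the listing names, so confirm them against your provider's model list before use.

```python
# Minimal OpenAI-compatible call through OpenRouter. The model slugs below are
# illustrative; check the provider's catalog for the exact names.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

for model in ("z-ai/glm-5", "minimax/minimax-m2.5"):  # hypothetical slugs
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write a one-line hello world in Python."}],
    )
    print(model, "->", reply.choices[0].message.content)
```

Swapping models is just a different string in the `model` field; the rest of the integration stays the same.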
Self-Hosting: M2.5 Is Far More Practical
Both models are MIT-licensed, so you can download and deploy them without restrictions.
| Requirement | GLM-5 | MiniMax M2.5 |
|---|---|---|
| Total Parameters | 744B | 230B |
| Active Parameters | 40B | 10B |
| Minimum GPUs | 8x A100 80GB | 2-4x A100 80GB (est.) |
| Recommended Framework | vLLM / SGLang | vLLM / SGLang |
| FP8 Quantization | Widely available | Widely available |
GLM-5 at 744B parameters needs serious hardware. Think 8x A100 80GB at minimum, which costs $15-25/hr on cloud providers. M2.5 at 230B total (10B active) is much more manageable and runs well on smaller GPU clusters, especially with FP8 quantization.
For teams that need to keep data on-premise or want to avoid per-token API costs at scale, M2.5 is the more practical self-hosting target.
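As a starting point, a minimal vLLM offline-inference sketch for M2.5 might look like the following. The GPU count, FP8 setting, and context length are assumptions, not an official serving recipe; check the model card for the supported configuration.

```python
# Minimal vLLM offline-inference sketch for self-hosting MiniMax M2.5.
# Parallelism, quantization, and context length below are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-M2.5",   # Hugging Face repo listed above
    tensor_parallel_size=4,           # assumed: 4x 80GB-class GPUs
    quantization="fp8",               # assumed: FP8 to fit the 230B MoE weights
    trust_remote_code=True,
    max_model_len=32768,              # keep the KV cache modest for a first smoke test
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Refactor this function to use async/await:\n..."], params)
print(outputs[0].outputs[0].text)
```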
When to Use GLM-5
- Math and science reasoning. 92.7% AIME 2026, 86.0% GPQA-Diamond, 96.9% HMMT. If your task requires multi-step mathematical or scientific reasoning, GLM-5 is the best open-weight model available.
- Factual accuracy matters. GLM-5 has the lowest hallucination rate among open-weight models, with a 35-point improvement over GLM-4.5 on the AA-Omniscience eval. Technical docs, research assistance, knowledge bases.
- Multimodal pipelines. Native vision, audio, and video understanding. If you need to reason across modalities, GLM-5 does this in a single pass.
- General intelligence tasks. Arena Elo of 1451, Intelligence Index of 50 (#1 of 66 models). For open-ended reasoning where you need the smartest available open-weight model, GLM-5 wins.
- Multilingual SWE tasks. 73.3% on SWE-bench Multilingual suggests strong coding ability across multiple programming languages, not just Python.
When to Use MiniMax M2.5
- Coding agents at scale. 80.2% SWE-bench, 51.3% Multi-SWE-bench, 37% faster task completion than its predecessor. M2.5 was built for agent loops.
- Budget-sensitive deployments. $0.30/$1.20 per million tokens (input/output) is roughly 2.7x cheaper than GLM-5 and 20x cheaper than Claude Opus on output.
- High-volume API calls. The Lightning variant runs at 100 tok/s. For chatbots, search agents, or any workload with lots of short queries, M2.5 maximizes throughput per dollar.
- Web browsing agents. 76.3% BrowseComp and 76.8% BFCL Multi-Turn show strong tool-calling and web navigation. If you are building autonomous agents that interact with external tools, M2.5 handles that well.
- Self-hosting on moderate hardware. 230B params with 10B active is feasible on 2-4 GPUs. MIT license means no restrictions.
Frequently Asked Questions
Which is better for coding, GLM-5 or MiniMax M2.5?
MiniMax M2.5 leads on coding benchmarks. It scores 80.2% on SWE-bench Verified vs GLM-5's 77.8%, and completes tasks in about half the time. It also costs roughly one-third as much per output token. For coding workloads, M2.5 is the better choice. GLM-5 produces more thorough test coverage and better architecture on complex tasks, but takes twice as long.
Which is better for math and reasoning?
GLM-5, by a wide margin. It scores 92.7% on AIME 2026, 86.0% on GPQA-Diamond, and 50.4 on Humanity's Last Exam. These scores rival Claude Opus 4.5 and beat GPT-5.2.
Are these models available outside of China?
Yes. Both are on OpenRouter, DeepInfra, Together.ai, and other global providers with OpenAI-compatible APIs. Both publish weights on Hugging Face under MIT license. No geo-restrictions.
How do they compare to Claude Opus 4.6 or GPT-5.2?
On coding (SWE-bench Verified), M2.5 at 80.2% is within 0.6 points of Claude Opus 4.6's 80.8%. On reasoning (Humanity's Last Exam), GLM-5 at 50.4 beats both Opus 4.5 (43.4) and GPT-5.2 (45.5). Neither Chinese model matches the Western frontier on every benchmark, but each leads in its specialty, and at a fraction of the cost.
Can I use both models in the same pipeline?
Absolutely. A practical setup: use M2.5 for high-volume coding agent tasks where speed and cost matter, and route complex reasoning or factual accuracy tasks to GLM-5. Both support OpenAI-compatible APIs, so swapping between them is a single config change.
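A minimal routing table makes that escalation pattern concrete. The model slugs here are the same illustrative placeholders used in the API example above.

```python
# Hypothetical routing: default to the cheaper model, escalate when needed.
MODEL_FOR_TASK = {
    "code_agent":     "minimax/minimax-m2.5",  # fast, cheap agent loops
    "deep_reasoning": "z-ai/glm-5",            # math, research, factual QA
}

def pick_model(task_type: str) -> str:
    # Fall back to M2.5 as the cheap default when the task type is unknown.
    return MODEL_FOR_TASK.get(task_type, "minimax/minimax-m2.5")
```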
Which should I pick if I can only choose one?
If your primary use case is coding agents or tool-calling, pick M2.5. If your primary use case is reasoning, research, or multimodal understanding, pick GLM-5. If you need both, use M2.5 as your default (cheaper) and escalate to GLM-5 when the task requires deeper reasoning.
Run Any Model Through Morph Fast Apply
GLM-5, MiniMax M2.5, Claude, GPT: it doesn't matter which model generates the edit. Morph applies it at 10,500+ tok/sec with 98% first-pass accuracy. One apply layer for every model.