GLM-5 vs Kimi K2.5 (2026): Benchmarks, Pricing, and Which to Pick

GLM-5 and Kimi K2.5 are the two strongest open-weight models from China. We compare benchmarks, API pricing, coding performance, and architecture to help you choose.

March 2, 2026

GLM-5 and Kimi K2.5 are the two strongest open-weight models out of China in early 2026. Both use trillion-scale MoE architectures, both carry MIT-family licenses, and both outperform GPT-5.4 on most benchmarks. But they solve different problems.

GLM-5 (Zhipu AI, released February 11, 2026) wins on reasoning consistency, agentic tool use, and long-context reliability. Kimi K2.5 (Moonshot AI, released January 27, 2026) wins on raw coding benchmarks, multimodal vision tasks, and parallel agent orchestration.

Quick Verdict

  • Pick GLM-5 if you need consistent multi-turn reasoning, agentic tool use, or text-only workflows. It has the best hallucination resistance of any open-weight model and ranks #1 on the Chatbot Arena among open models (Elo 1451).
  • Pick Kimi K2.5 if you need vision capabilities, frontend code generation from screenshots, or parallel agent orchestration. Its Agent Swarm system coordinates up to 100 sub-agents and cuts task completion time by 3-4.5x.
  • For pure coding: Kimi K2.5 leads on HumanEval (99% vs 90%) and LiveCodeBench (85% vs 52%). GLM-5 leads on SWE-bench Verified (77.8% vs 76.8%) and Terminal-Bench (56.2% vs 50.8%).
  • For cost: Kimi K2.5's input tokens are 40% cheaper (about 26% cheaper on a blended 1:1 basis), but its verbosity can erase that advantage in practice.

Head-to-Head Comparison

|  | GLM-5 | Kimi K2.5 |
| --- | --- | --- |
| Developer | Zhipu AI (Z.AI) | Moonshot AI |
| Release Date | Feb 11, 2026 | Jan 27, 2026 |
| Total Parameters | 744B | 1T (1,000B) |
| Active Parameters | 40B per token | 32B per token |
| Architecture | MoE, 256 experts, 8 active | MoE, 384 experts, 8 active |
| Context Window | 200K tokens | 262K tokens |
| Max Output | 128K tokens | 33K tokens |
| Vision Support | No | Yes (images + video) |
| License | MIT | Modified MIT |
| Open Weights | Yes (Hugging Face) | Yes (Hugging Face) |
| Training Hardware | 100K Huawei Ascend 910B | NVIDIA GPUs |
| Arena Elo (Open Weight) | 1451 (#1) | 1447 (#2) |
| API Input Price | $1.00/M tokens | $0.60/M tokens |
| API Output Price | $3.20/M tokens | $2.50/M tokens |
| Output Speed (median) | 71 tok/s | ~80 tok/s |

Benchmark Breakdown

Both models compete at the frontier of open-weight performance. The numbers below come from official technical reports, Chatbot Arena, and third-party evaluation platforms.

Reasoning and Knowledge

| Benchmark | GLM-5 | Kimi K2.5 | Winner |
| --- | --- | --- | --- |
| MMLU | 85% | 92% | Kimi K2.5 |
| MMLU-Pro | 70.4% | 87.1% | Kimi K2.5 |
| GPQA Diamond | 86.0% | 87.6% | Kimi K2.5 |
| SimpleQA | 48% | 54% | Kimi K2.5 |
| IFEval | 88% | 94% | Kimi K2.5 |
| Humanity's Last Exam | 30.5% | 31.5% | ~Tie |
| HLE (with tools) | 50.4% | 51.8% | ~Tie |

Kimi K2.5 dominates general knowledge benchmarks. The MMLU gap (92% vs 85%) is large. On harder evals like Humanity's Last Exam, the models converge. GLM-5's edge is less about raw scores and more about reliability: it achieved the industry's lowest hallucination rate (measured by the AA-Omniscience Index), meaning it refuses to answer rather than fabricate.

Math

| Benchmark | GLM-5 | Kimi K2.5 | Winner |
| --- | --- | --- | --- |
| MATH-500 | 88% | 98% | Kimi K2.5 |
| AIME 2025 | 84% | 96.1% | Kimi K2.5 |
| AIME 2026 | 92.7% | 92.5% | ~Tie |
| HMMT 2025 | 96.9% | 91.1% | GLM-5 |
| GSM8K | 97% | 99% | Kimi K2.5 |

Kimi K2.5 is the stronger math model overall. Its AIME 2025 score of 96.1% is exceptional. GLM-5 pulls ahead on HMMT 2025 (96.9% vs 91.1%), a competition-level math benchmark that rewards deep multi-step reasoning over pattern matching.

Vision and Multimodal

| Benchmark | GLM-5 | Kimi K2.5 | Winner |
| --- | --- | --- | --- |
| MMMU | N/A (text only) | 84% | Kimi K2.5 |
| MMMU Pro | N/A | 78.5% | Kimi K2.5 |
| MathVista | N/A | 84.2% | Kimi K2.5 |
| DocVQA | N/A | 88.8% | Kimi K2.5 |
| ChartQA | N/A | 77.5% | Kimi K2.5 |

This is not a contest. GLM-5 is text-only. Kimi K2.5 was built as a native multimodal model, trained from the start on 15 trillion mixed visual and text tokens. It processes images, screenshots, charts, documents, and video. If your workflow involves any visual input, Kimi K2.5 is the only option.

Coding Performance

Coding benchmarks tell two different stories depending on whether you care about isolated code generation or real-world software engineering.

| Benchmark | GLM-5 | Kimi K2.5 | Winner |
| --- | --- | --- | --- |
| HumanEval | 90% | 99% | Kimi K2.5 |
| LiveCodeBench v6 | 52% | 85% | Kimi K2.5 |
| SWE-bench Verified | 77.8% | 76.8% | GLM-5 |
| SWE-bench Multilingual | 73.3% | 73.0% | ~Tie |
| Terminal-Bench 2.0 | 56.2% | 50.8% | GLM-5 |
| CyberGym | 43.2% | 41.3% | GLM-5 |

Kimi K2.5's HumanEval score of 99% is the highest of any model tracked. Its LiveCodeBench score of 85% also leads GLM-5 by a wide margin (52%). These benchmarks test function-level code generation: given a prompt, write the correct function.

GLM-5 wins where it matters for production work. SWE-bench Verified (77.8% vs 76.8%) tests the ability to navigate real GitHub repos, understand issue descriptions, and produce correct patches across multiple files. Terminal-Bench 2.0 (56.2% vs 50.8%) tests agentic coding in terminal environments. These benchmarks reward the kind of systematic reasoning that matters when you're working on actual codebases, not competitive programming problems.

Developer Experience

Community feedback from r/LocalLLaMA and developer blogs is consistent: Kimi K2.5 excels at frontend development and visual-to-code workflows. It can take a screenshot of a UI and generate matching HTML/CSS with high fidelity. GLM-5 is stronger for backend work, debugging complex stack traces, and multi-file refactors where context tracking across long conversations matters.

One recurring complaint about Kimi K2.5: it tends to generate verbose, over-engineered code on first pass. Developers report that asking it to simplify is a common follow-up. GLM-5 produces more concise output by default but occasionally misses edge cases that K2.5 catches by being thorough.

GLM-5 Coding Strengths

SWE-bench leader among open models. Strong at multi-file reasoning, debugging stack traces, and agentic terminal workflows. Concise output. 128K max output means it can generate entire modules in a single response.

Kimi K2.5 Coding Strengths

Near-perfect HumanEval (99%). Dominant on LiveCodeBench. Best open model for frontend/UI code generation from screenshots. Agent Swarm parallelizes complex tasks across up to 100 sub-agents.

API Pricing

Both models are dramatically cheaper than proprietary alternatives. For reference, Claude Opus 4.5 costs $15/M input tokens.

|  | GLM-5 | Kimi K2.5 | GPT-5.2 (ref) |
| --- | --- | --- | --- |
| Input (per 1M tokens) | $1.00 | $0.60 | $1.25 |
| Output (per 1M tokens) | $3.20 | $2.50 | $10.00 |
| Blended cost (1:1 ratio) | $2.10 | $1.55 | $5.63 |
| Cheapest provider | DeepInfra (FP8): $0.80 input | Novita AI | OpenAI |

On paper, Kimi K2.5 is about 26% cheaper per token. But real-world cost depends on token efficiency. Multiple developers report that Kimi K2.5 consumes 2-2.5x more tokens than comparable models for the same tasks due to verbose output. One evaluation measured 89 million tokens for a benchmark run where other models used ~35 million.

If your use case involves short, targeted queries, Kimi K2.5's lower per-token price wins. For long, agentic workflows where token efficiency matters, GLM-5's more concise output can make it cheaper in practice despite the higher per-token rate.
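
To make that concrete, here is a minimal cost-per-task sketch in Python. The per-token prices come from the table above; the token counts are illustrative placeholders, not measured values.

```python
# Per-1M-token API prices from the comparison table above (USD).
PRICES = {
    "glm-5":     {"input": 1.00, "output": 3.20},
    "kimi-k2.5": {"input": 0.60, "output": 2.50},
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one task for a given model."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Illustrative example: same prompt, but the verbose model emits 2.5x the output.
print(task_cost("glm-5", 20_000, 4_000))       # ~$0.0328
print(task_cost("kimi-k2.5", 20_000, 10_000))  # ~$0.0370 -- cheaper per token, pricier per task
```

In this hypothetical, the model with the 40% cheaper input rate ends up roughly 13% more expensive per task once it emits 2.5x the output.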

  • Kimi K2.5 input: $0.60 per 1M tokens
  • GLM-5 input: $1.00 per 1M tokens
  • Blended per-token savings (Kimi): ~26%

Watch Out for Verbosity

Per-token pricing is misleading if one model uses 2x more tokens. Track total cost per task, not just cost per million tokens. Both models are available through 8-9 API providers with different price points; DeepInfra, Together.ai, and Novita AI are consistently among the cheapest for both.

Architecture Deep Dive

GLM-5: Hardware Sovereignty and Reasoning Depth

GLM-5 is a 744B parameter MoE model with 256 experts, activating 8 per token (40B active parameters). It was trained entirely on 100,000 Huawei Ascend 910B chips using the MindSpore framework. Zero NVIDIA hardware was involved. This matters for organizations in sanctioned regions or those seeking supply chain independence.
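
For intuition about what "8 of 256 experts" means, here is a toy top-k routing gate in Python using GLM-5's published expert counts. This is a generic MoE sketch, not Zhipu's actual router; the hidden size and softmax gating are illustrative assumptions.

```python
import numpy as np

NUM_EXPERTS, TOP_K, D_MODEL = 256, 8, 1024  # 256 experts / 8 active per GLM-5's spec; hidden size is illustrative

rng = np.random.default_rng(0)
W_gate = rng.standard_normal((D_MODEL, NUM_EXPERTS)) * 0.02  # toy router weights

def route(x: np.ndarray):
    """Select the top-k experts for one token and renormalize their gate weights."""
    logits = x @ W_gate                    # one router score per expert
    top = np.argsort(logits)[-TOP_K:]      # indices of the 8 highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()                # the token's FFN pass runs only these experts

experts, weights = route(rng.standard_normal(D_MODEL))
print(experts, weights.round(3))  # 8 of 256 experts fire -> ~40B of 744B params active
```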

The architecture uses Multi-head Latent Attention (MLA) and DeepSeek's Sparse Attention (DSA) mechanism for efficient long-context handling, reducing memory overhead by 33%. Its 200K context window handles long documents well, and the 128K max output capacity is among the highest of any model, useful for generating large code files or documentation in a single pass.

GLM-5's "Preserved Thinking" feature maintains reasoning state across 50+ conversation turns, a trait that benchmarks rarely capture but developers notice in extended coding sessions.

Kimi K2.5: Multimodal Foundation and Agent Swarm

Kimi K2.5 is a 1 trillion parameter MoE model with 384 experts, activating 8 per token (32B active). It was trained on 15 trillion tokens of mixed visual and text data from the start. This joint training means vision and language are not separate modules bolted together; they developed in unison.

The standout architectural feature is Agent Swarm. Trained with Parallel-Agent Reinforcement Learning (PARL), K2.5 can self-direct up to 100 sub-agents executing parallel workflows across up to 1,500 coordinated tool calls. Internal evaluations show 80% reduction in end-to-end runtime for complex tasks and 3-4.5x wall-clock time reduction versus single-agent execution.
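
Moonshot has not published Agent Swarm's internals, but the core pattern, fanning one task out to many sub-agents and gathering results concurrently, looks roughly like the asyncio sketch below. The run_subagent helper and the task split are hypothetical stand-ins for real model and tool calls.

```python
import asyncio

async def run_subagent(subtask: str) -> str:
    """Hypothetical stand-in for one sub-agent run (an LLM call plus its tool calls)."""
    await asyncio.sleep(0.1)  # placeholder for real model/tool latency
    return f"result for {subtask!r}"

async def swarm(task: str, n_agents: int = 8) -> list[str]:
    # Split the task, run sub-agents concurrently, then merge. Wall-clock time
    # approaches the slowest sub-agent instead of the sum of all of them,
    # which is where parallel orchestration gets its speedup.
    subtasks = [f"{task} [part {i}]" for i in range(n_agents)]
    return await asyncio.gather(*(run_subagent(s) for s in subtasks))

print(asyncio.run(swarm("audit the billing service")))
```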

Its 400M parameter vision encoder (MoonViT) handles images, screenshots, charts, and video. The 262K context window is 30% larger than GLM-5's, though the max output is limited to 33K tokens.

GLM-5 Unique Features

Trained on 100K Huawei Ascend chips (no NVIDIA). Industry-lowest hallucination rate. 128K max output. Preserved Thinking for multi-turn reasoning. MIT license.

Kimi K2.5 Unique Features

Native multimodal (text, image, video). Agent Swarm with 100 parallel sub-agents. 400M param MoonViT vision encoder. PARL training. 262K context window.

Agentic and Tool Use

| Benchmark | GLM-5 | Kimi K2.5 | Winner |
| --- | --- | --- | --- |
| BrowseComp | 62% | 60.6% | GLM-5 |
| BrowseComp (Swarm) | N/A | 78.4% | Kimi K2.5 |
| Tau2Bench | 89.7% | 80.2% | GLM-5 |
| Tool-Decathlon | 38% | 27.8% | GLM-5 |
| MCP-Atlas | 67.8% | 63.8% | GLM-5 |

GLM-5 is the better single-agent tool user. It leads on Tau2Bench (89.7% vs 80.2%), Tool-Decathlon (38% vs 27.8%), and MCP-Atlas (67.8% vs 63.8%). These benchmarks test a model's ability to correctly select and invoke tools, handle multi-step workflows, and recover from errors.
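
For a sense of what these benchmarks exercise, here is one tool-call round trip in the OpenAI-compatible chat format that most GLM-5 and Kimi K2.5 providers expose. The endpoint URL, model id, and get_weather tool are placeholder assumptions, not any specific provider's API.

```python
import json
from openai import OpenAI

# Placeholder endpoint and key -- substitute your provider's actual values.
client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="glm-5",  # or "kimi-k2.5", depending on the provider's model id
    messages=[{"role": "user", "content": "What's the weather in Osaka?"}],
    tools=tools,
)

# Benchmarks like Tau2Bench score whether the model picks the right tool with
# well-formed arguments across multi-step workflows like this one.
# (Assumes the model chose to call the tool rather than answer directly.)
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```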

Kimi K2.5 compensates with Agent Swarm. When allowed to spawn parallel sub-agents, its BrowseComp score jumps from 60.6% to 78.4%. For tasks that can be parallelized, K2.5's swarm approach is faster and often more effective than GLM-5's single-agent execution.

When to Use Which

| Your Use Case | Pick This | Why |
| --- | --- | --- |
| Backend engineering | GLM-5 | Leads SWE-bench, Terminal-Bench; concise output; 128K max output |
| Frontend / UI development | Kimi K2.5 | Screenshot-to-code, native vision, best LiveCodeBench score |
| Competitive programming | Kimi K2.5 | HumanEval 99%, LiveCodeBench 85%, AIME 2025 96.1% |
| Multi-file debugging | GLM-5 | Preserved Thinking across 50+ turns; strong context tracking |
| Document analysis | Kimi K2.5 | Native vision handles PDFs, charts, screenshots; 262K context |
| Agentic tool use | GLM-5 | Leads Tau2Bench, Tool-Decathlon, MCP-Atlas |
| Parallelizable tasks | Kimi K2.5 | Agent Swarm: 100 sub-agents, 1,500 tool calls, 80% runtime reduction |
| Low-hallucination applications | GLM-5 | Best-in-class refusal rate; prefers 'I don't know' over fabrication |
| Budget-constrained projects | Kimi K2.5 | 40% cheaper input tokens (but watch verbosity) |
| Sanctioned regions / no NVIDIA | GLM-5 | Trained entirely on Huawei Ascend chips; no US hardware dependency |

Frequently Asked Questions

Is GLM-5 or Kimi K2.5 better for coding?

Depends on the task. Kimi K2.5 scores higher on HumanEval (99% vs 90%) and LiveCodeBench (85% vs 52%), making it stronger for standalone code generation and competitive programming. GLM-5 edges ahead on SWE-bench Verified (77.8% vs 76.8%) and Terminal-Bench (56.2% vs 50.8%), which test real-world multi-file software engineering. For frontend and visual-to-code work, Kimi K2.5 wins by default since GLM-5 has no vision support.

Are both models truly open source?

Both are open-weight with commercial-friendly licenses. GLM-5 uses MIT; Kimi K2.5 uses Modified MIT. Weights are on Hugging Face. You can self-host, fine-tune, and deploy commercially. Neither model's training data or full training code is public, so "open weight" is more accurate than "open source" in the strict sense.

Which model is cheaper to run?

Kimi K2.5 has lower per-token pricing ($0.60/$2.50 vs $1.00/$3.20). But it tends to generate more tokens per response. Track cost per task, not cost per token. For short queries, K2.5 is cheaper. For long agentic sessions, GLM-5's conciseness can make it cheaper overall.

Can Kimi K2.5 process images and video?

Yes. It was trained from scratch on 15 trillion mixed visual and text tokens with a 400M parameter vision encoder (MoonViT). It handles images, screenshots, charts, documents, and video. GLM-5 is text-only.

Which model ranks higher on the Chatbot Arena?

GLM-5 holds Elo 1451, Kimi K2.5 holds 1447, the top two spots among open-weight models. GLM-5 was the first open-weight model to hold the #1 spot on the LMSYS Chatbot Arena. The gap is small enough that real-world preference depends on your specific tasks.

Can I run these models locally?

Both are available on Hugging Face for self-hosting. At 744B and 1T total parameters respectively, you need serious hardware for full-precision inference. Quantized versions (FP8, FP4) are available from providers like DeepInfra and Together.ai, making them practical on multi-GPU setups. For most developers, API access through one of the 8-9 available providers is more practical.
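
If you do self-host behind an OpenAI-compatible server (vLLM is a common choice), the client code is identical to the hosted case; only the base URL and model id change. Both values below are placeholders.

```python
from openai import OpenAI

# Same client as for hosted APIs; point it at your local OpenAI-compatible server.
local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = local.chat.completions.create(
    model="glm-5-fp8",  # placeholder id; use whatever name your server registers
    messages=[{"role": "user", "content": "Summarize this repo's build steps."}],
)
print(resp.choices[0].message.content)
```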

Using Open-Weight Models in Your Stack?

Morph routes requests across multiple model providers and applies fast, token-efficient code edits. If you're building with GLM-5, Kimi K2.5, or any other model, Morph's apply engine merges AI-generated code changes into your files accurately and at a fraction of the cost of re-generating full files. Try the playground to see it in action.

Ship Faster with Morph

Morph's fast apply engine works with any model output. Route GLM-5 or Kimi K2.5 through Morph to get accurate code edits at lower cost.