GLM-5 is Zhipu AI's 744-billion-parameter open-source model. Released February 11, 2026. 77.8% SWE-bench Verified. 200K context. MIT license. This is the technical breakdown: what the architecture actually does, what the benchmarks actually measure, and why compute constraints are limiting who can use it right now.
The 84x Search Spike
In January 2026, monthly search volume for "glm 5" was roughly 480. In February, it was 40,500: an 84-fold increase in one month. Search data like that only happens when a model ships and word spreads fast. GLM-5 released February 11, 2026, and the developer community noticed quickly.
What drove the interest: 77.8% on SWE-bench Verified from a model with open weights and an MIT license. That combination does not exist elsewhere. Frontier-level coding performance with full commercial permission and downloadable weights is rare. The search spike reflects developers evaluating whether this is real.
Architecture: 744B with 40B Active
GLM-5 is a Mixture of Experts model. The total parameter count is 744 billion, but only 40 billion activate for any given token during inference. A router network decides which expert modules to engage based on the input. Most of the 744B sits dormant for any single forward pass.
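The routing step can be illustrated with a toy top-k gate. This is a generic MoE sketch, not Zhipu's published router (the internals are not documented beyond the glm_moe_dsa architecture tag); the expert count, gating function, and value of k are all placeholder choices:

```python
import math

def top_k_route(router_logits, k=2):
    # Indices of the k largest logits, then softmax over just those k,
    # so the selected experts' mixture weights sum to 1.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

# Eight toy "experts": each is a scalar function standing in for an expert FFN.
experts = [lambda x, m=m: m * x for m in range(1, 9)]

def moe_forward(x, router_logits, k=2):
    # Only the routed experts execute; the rest stay dormant for this token.
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

routed = top_k_route([0.1, 2.0, 0.3, 1.5, -1.0, 0.0, 0.2, 0.4], k=2)
# For these logits, experts 1 and 3 carry all of the mixture weight.
```

The same principle at GLM-5's scale: the router selects a small subset of expert blocks per token, so per-token FLOPs track the 40B active parameters, not the 744B total.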
The predecessor GLM-4.7 was 355B total with 32B active, trained on 23T tokens. GLM-5 scales the total capacity significantly while keeping active compute close to GLM-4.7. Pre-training data grew from 23T to 28.5T tokens. The architecture file identifies the model type as glm_moe_dsa, confirming both the MoE design and the DSA integration.
Compared to dense models: Claude Opus 4.6 is believed to be in the 200-300B parameter range based on inference speed and pricing. GPT-5.3 architecture is not public. GLM-5's 40B active parameters during inference puts its actual compute cost closer to a 40B dense model than a 744B one, which explains the $1.00/M input pricing.
Mixture of Experts (MoE)
744B total parameters, 40B active per token. Expert routing selects which parameter subsets engage for each input. Predecessor GLM-4.7 was 355B/32B. Inference cost stays close to 40B dense compute.
28.5 Trillion Training Tokens
Pre-trained on 28.5T tokens, up from 23T for GLM-4.7. Covers Chinese and English with heavy emphasis on code, reasoning, and agent interaction data.
Quantization Options
Released in BF16, FP8, and F32. The FP8 variant on Hugging Face (zai-org/GLM-5-FP8) is the practical choice for self-hosted deployment, reducing memory requirements while maintaining benchmark parity.
MIT License
Fully permissive commercial license. Use in production, modify, redistribute, build products on top. No usage restrictions. This is unusual for a frontier-class model.
DeepSeek Sparse Attention (DSA)
Standard transformer attention computes interactions between every pair of tokens in a sequence. For a sequence of length N, that is N² operations. At 200K tokens, that is 40 billion attention computations per layer per head. Dense attention at 200K context is extraordinarily expensive.
DeepSeek Sparse Attention, originally developed by DeepSeek and adopted by Zhipu for GLM-5, replaces full quadratic attention with sparse patterns. Instead of attending to all N tokens, each position attends to a structured subset: local window neighbors, strided distant tokens, and a small set of globally attended positions. The key token pairs that carry semantic weight get computed. Most of the N² interactions, which are near-zero anyway, get skipped.
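The pattern family can be sketched as a boolean attention mask. This is an illustrative sparse layout, not the actual DSA kernel; the window, stride, and global-token counts below are made-up parameters:

```python
def sparse_mask(n, window=4, stride=8, n_global=2):
    # Causal attention pairs kept under a toy sparse pattern combining
    # a local window, strided distant tokens, and a few global positions.
    allowed = set()
    for q in range(n):
        for kpos in range(q + 1):          # causal: keys at or before the query
            local = q - kpos <= window     # nearby neighbors
            strided = kpos % stride == 0   # strided distant tokens
            is_global = kpos < n_global    # globally attended positions
            if local or strided or is_global:
                allowed.add((q, kpos))
    return allowed

n = 64
kept = len(sparse_mask(n))
dense = n * (n + 1) // 2  # all causal pairs under dense attention
# kept / dense shrinks as n grows, so the savings compound with context length.
```

At 200K tokens the dense pair count is what makes full attention prohibitive; a structured mask like this reduces the computed pairs to a small, roughly linear-in-N subset.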
The effect for GLM-5: long-context inference is substantially cheaper than it would be with dense attention. The 200K context window becomes practical to use, not just theoretically supported. DSA also reduces training compute, which lets the team run more post-training experiments at 744B scale.
Why DSA is an architectural import, not original research
DeepSeek Sparse Attention was developed and published by DeepSeek. Zhipu AI integrated it into GLM-5 rather than designing a new sparse attention mechanism. This is normal in open model development, much as rotary position embeddings (RoPE) spread across the field after the RoFormer paper introduced them. The technical documentation identifies the model architecture as glm_moe_dsa, making the import explicit.
Async RL Training with SLIME
Standard reinforcement learning from human feedback (RLHF) operates in three serial phases: generate a batch of completions, score them with a reward model, update the policy weights. At 744B parameters, each phase takes significant wall-clock time and they wait on each other. GPU utilization during scoring is low because the generation step is not running. GPU utilization during generation is low because weight updates are not happening.
SLIME is Zhipu's custom asynchronous RL infrastructure that decouples these phases. Generation and training run in parallel on different parts of the cluster. While one batch of completions is being scored, the next batch is already generating. Policy updates happen asynchronously and get picked up by the generation workers on the next cycle. The result is higher GPU utilization across the training cluster and faster experiment iteration.
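The decoupling idea can be sketched as a bounded queue between a generation thread and a training thread. This is a generic producer-consumer illustration, not SLIME's actual implementation; reward scoring, staleness control, and weight broadcast are all omitted:

```python
import queue
import threading

rollouts = queue.Queue(maxsize=4)  # bounded buffer between the two phases
N_BATCHES = 8

def generator():
    # Generation workers keep producing completions without waiting
    # for the trainer to finish the previous batch.
    for step in range(N_BATCHES):
        completion = f"batch-{step}"   # stand-in for sampled completions
        rollouts.put(completion)       # blocks only if the trainer falls far behind
    rollouts.put(None)                 # sentinel: generation finished

trained = []

def trainer():
    # The trainer drains the queue as batches arrive, overlapping
    # with generation instead of alternating with it.
    while True:
        batch = rollouts.get()
        if batch is None:
            break
        trained.append(batch)          # stand-in for score + policy update

t_gen = threading.Thread(target=generator)
t_train = threading.Thread(target=trainer)
t_gen.start(); t_train.start()
t_gen.join(); t_train.join()
```

The payoff is utilization: neither side of the cluster idles while the other works, which is the property the text attributes to SLIME.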
The practical impact: Zhipu could run more post-training experiments at full 744B scale. The gap between GLM-4.7 (73.8% SWE-bench) and GLM-5 (77.8%) reflects both the larger pre-trained model and more effective post-training.
Benchmark Results
The three benchmarks that matter most for coding and agentic use cases: SWE-bench Verified, BrowseComp, and Terminal-Bench 2.0.
SWE-bench Verified: 77.8%
SWE-bench Verified presents real GitHub issues from open-source Python repositories. The model receives the issue description and the repository codebase. It must produce a code patch that, when applied, makes the failing tests pass. No human assistance during the task. The Verified variant is a 500-instance human-filtered subset that removes ambiguous or under-specified issues from the original dataset.
77.8% means GLM-5 autonomously resolved 389 out of 500 real software issues. The predecessor scored 73.8%. The improvement comes from both the larger model capacity and more targeted post-training on software engineering tasks.
BrowseComp: 62.0% (75.9% with context management)
BrowseComp tests multi-step web research. Tasks require navigating real websites, synthesizing information across multiple pages, and answering questions where the answer cannot come from model memory alone. The challenge is that relevant information is often buried and requires following links and reading page content.
GLM-5 scores 62.0% baseline and 75.9% with context management enabled. The Chinese-language variant (BrowseComp-ZH) scores 72.7%. For comparison, GLM-4.5 reportedly scored 54.0% on BrowseComp. The jump reflects better agent-loop behavior and improved tool use.
Terminal-Bench 2.0: 56.2%–60.7%
Terminal-Bench evaluates autonomous task completion in a real Linux terminal. The agent gets a goal and a shell. It must navigate the filesystem, run commands, install dependencies, debug failures, and complete multi-step system tasks without a GUI. This is closer to what coding agents face in CI/CD environments and server management tasks than isolated code generation.
GLM-5 scores 56.2% to 60.7% depending on configuration, up from GLM-4.7's 41.0%. That improvement of 15 to 20 points is among the largest single-generation jumps reported on this benchmark.
| Benchmark | GLM-5 | GLM-4.7 | Claude Opus 4.5 | GPT-5.2 |
|---|---|---|---|---|
| SWE-bench Verified | 77.8% | 73.8% | 76.8% | ~80% (est.) |
| BrowseComp | 62.0% | 52.0% | ~40% (est.) | N/A |
| Terminal-Bench 2.0 | 56.2–60.7% | 41.0% | N/A | N/A |
| GPQA Diamond | 86.0% | ~78% | ~83% | N/A |
| AIME 2026 I | 92.7% | N/A | N/A | N/A |
| Context window | 200K | 128K | 200K | 128K |
| Max output | 128K | 32K | 64K (est.) | 16K (est.) |
| License | MIT | MIT | Commercial API | Commercial API |
A note on comparisons
Estimates for Claude Opus 4.5 and GPT-5.2 reflect publicly available benchmark data. GLM-5 comparisons in the official model card show results against GLM-4.7, DeepSeek-V3.2, Kimi K2.5, Claude Opus 4.5, Gemini 3 Pro, and GPT-5.2. Exact competitor scores on some benchmarks are not published by the respective companies and some cells above are estimates based on available data.
Context and Output Limits
The context window is 202,752 tokens for reasoning tasks with tool use. The API exposes this as a 200K input limit. Maximum output is 128K tokens.
The 128K output limit is notable. Most models limit output to 8K or 16K tokens. Claude Opus 4.6 outputs up to 32K (though it often stops earlier). GPT-5.3's output limit is around 16K. GLM-5 at 128K can generate complete large codebases, extensive documentation, or long reasoning chains in a single response. The inference cost is real, but the capability is there.
For comparison: Gemini 3.1 Pro supports 1M context input, which is 5x GLM-5's limit. Claude 3.7 reached 200K. For most software engineering tasks, 200K is sufficient for an entire medium-sized codebase. The 1M context becomes relevant for very large monorepos or when you need to include extensive documentation alongside the code.
Where to Access GLM-5
Three access paths: managed API, third-party providers, and self-hosted weights.
Z.ai API
Official API at api.z.ai. Compatible with the OpenAI SDK via a custom base URL. $1.00/M input, $3.20/M output. Requires a GLM Coding Plan (Pro or Max) subscription. Chat interface at chat.z.ai.
Third-Party Providers
Available through 11 providers according to Artificial Analysis. Includes options with varying latency and pricing. 62 tokens/second output speed measured across providers. Check artificialanalysis.ai/models/glm-5 for current provider list.
Self-Hosted Weights
Download from zai-org/GLM-5 or zai-org/GLM-5-FP8 on Hugging Face. MIT license. Requires 8x H100 80GB minimum (tensor-parallel-size 8). Runs on vLLM, SGLang, KTransformers, and xLLM (Ascend NPU).
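A minimal launch command for the FP8 checkpoint, assuming a standard vLLM install; check the model card for the exact flags Zhipu recommends (the --max-model-len value here is an assumption based on the 200K context figure):

```shell
# Serve GLM-5 FP8 across 8 GPUs with vLLM's OpenAI-compatible server.
vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 200000
```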
Best access path today
For most developers: the Z.ai API with OpenAI SDK compatibility. Point your existing OpenAI SDK client at https://api.z.ai/api/paas/v4 and swap the model name to glm-5. No code changes beyond the base URL and API key. The third-party providers are useful if you need lower latency or are hitting Z.ai rate limits, but availability varies.
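The request shape is the standard OpenAI chat-completions format. The sketch below builds the payload with the standard library only, so it shows what any OpenAI-compatible client would send to the endpoint given above:

```python
import json

BASE_URL = "https://api.z.ai/api/paas/v4"  # Z.ai's OpenAI-compatible base URL

def build_glm5_request(prompt, max_tokens=4096):
    # Assemble the standard chat-completions request targeting glm-5.
    return {
        "url": f"{BASE_URL}/chat/completions",
        "payload": {
            "model": "glm-5",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
    }

req = build_glm5_request("Refactor this function to be iterative.")
body = json.dumps(req["payload"])  # the JSON body an SDK client would POST
```

With the official openai Python package, the equivalent is constructing the client with base_url="https://api.z.ai/api/paas/v4" and your Z.ai API key, then calling chat.completions.create(model="glm-5", ...) as usual.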
The Compute Constraint
GLM-5 at 744B parameters is one of the largest open-weight models available. This creates real deployment friction. The FP8 quantized weights still require 8 H100 80GB GPUs for local inference with vLLM or SGLang. That is on the order of $250,000 in GPU hardware at current H100 pricing, or $50+/hour in cloud compute. Self-hosting GLM-5 is not a weekend project.
At 62 tokens per second through API providers, a 128K-token output takes about 34 minutes of generation time. For batch processing or automated pipelines, this is workable. For interactive use where you are waiting on a response, that throughput turns long outputs into multi-minute waits.
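The arithmetic behind that figure:

```python
TOKENS_PER_SECOND = 62    # measured across providers, per the text above
OUTPUT_TOKENS = 128_000   # maximum output length

seconds = OUTPUT_TOKENS / TOKENS_PER_SECOND
minutes = seconds / 60
# minutes comes out just over 34: a maximum-length completion
# is a batch job, not a chat turn.
```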
Demand on Z.ai's managed API has consistently exceeded supply during peak hours since launch. Third-party providers have helped distribute load, but as of March 2026, GLM-5 is not as reliably available as Claude or GPT-5 during peak use periods. The model is real and the benchmarks are real, but treating it as production infrastructure requires planning for availability gaps.
What the compute constraint means for production use
If you are evaluating GLM-5 for a production coding agent: plan for API availability variability. Build retry logic and fallback routing to a secondary model. If you are self-hosting, budget for 8x H100 or equivalent Ascend NPU hardware. The model's performance justifies the infrastructure investment for teams doing high-volume software engineering automation, but it is not yet drop-in reliable for latency-sensitive interactive products.
Zhipu AI and the GLM Lineage
Zhipu AI (operating as Z.ai) was founded by researchers from Tsinghua University's Knowledge Engineering Group (KEG). The company is backed by Alibaba, Tencent, and Xiaomi, is valued at over $5 billion as of 2026, and has an IPO planned. The research team has published under the THUDM GitHub organization since 2020.
The GLM lineage:
GLM-130B (2022): one of the first large open bilingual models competitive with GPT-3.
ChatGLM (2023): Zhipu's first chat-focused fine-tune, which became widely used in China.
GLM-4 (2024): reached GPT-4-level performance on several benchmarks and added 128K context.
GLM-4.5 (early 2026): added integrated reasoning and agent capabilities.
GLM-5 (February 2026): scales to 744B and reaches SOTA on coding and agentic tasks.
The company claims 2.7 million developers use GLM models. The open release strategy (MIT license, downloadable weights) differs from most Chinese AI companies, which offer only API access. This has driven adoption outside China and made GLM-5 relevant to the global open-weights ecosystem.
Frequently Asked Questions
What is GLM-5?
GLM-5 is Zhipu AI's flagship language model, released February 11, 2026. It uses a Mixture of Experts architecture with 744 billion total parameters and 40 billion active per token. It scores 77.8% on SWE-bench Verified, supports 200K input context and 128K output tokens, and is released under the MIT license.
How many parameters does GLM-5 have?
GLM-5 has 744 billion total parameters with 40 billion active during inference. It is a Mixture of Experts model, so only a fraction of parameters activate for each input. The predecessor GLM-4.7 had 355 billion total with 32 billion active.
What is GLM-5's SWE-bench score?
GLM-5 scores 77.8% on SWE-bench Verified, the benchmark for autonomous code repair on real GitHub issues. This is up from GLM-4.7's 73.8%. Claude Opus 4.5 scored 76.8% on the same benchmark around the same release window.
What is BrowseComp and how does GLM-5 perform?
BrowseComp tests AI agents on multi-step web research tasks requiring navigation across multiple real pages, information synthesis, and answering questions that cannot come from model memory. GLM-5 scores 62.0% on BrowseComp-EN (75.9% with context management) and 72.7% on BrowseComp-ZH. GLM-4.5 previously scored 54.0%.
What is DeepSeek Sparse Attention in GLM-5?
DeepSeek Sparse Attention (DSA) replaces standard quadratic attention with sparse patterns, computing only important token pairs rather than all N² interactions. This reduces inference and training costs on long sequences, making the 200K context window practical. DSA was developed by DeepSeek and integrated into GLM-5's architecture.
Where can I access GLM-5?
Via Z.ai's API at api.z.ai (OpenAI-compatible SDK), through 11 third-party providers listed on artificialanalysis.ai, or via self-hosted weights at zai-org/GLM-5 on Hugging Face (MIT license, requires 8x H100 80GB minimum). Input: $1.00/M tokens. Output: $3.20/M tokens.
GLM-5 generates code. Morph applies it.
GLM-5 at 77.8% SWE-bench produces edit instructions fast. Morph's Fast Apply model applies those edits to your codebase at 10,500+ tokens per second, deterministically. The generation side is solved. The application side needs to keep up.