GLM-4.6 and GLM-4.5-Air: Specs, Benchmarks, and API (Z.ai / Zhipu)

GLM-4.6 is Z.ai's 357B-parameter Mixture-of-Experts model with a 200K-token context window, released September 30, 2025 under the MIT license. The smaller GLM-4.5-Air is a 106B-A12B model with a 128K context. This page covers specs, verified benchmarks, MIT licensing, and the OpenAI-compatible API for both, with prices straight from Z.ai's pricing page.

June 18, 2026 · 2 min read

GLM-4.6 is Z.ai's flagship language model: a 357B-parameter Mixture-of-Experts model with a 200K-token context window, released September 30, 2025 under the MIT license. Its smaller sibling, GLM-4.5-Air, is a 106B-A12B model with a 128K context. Both are MIT-licensed, self-hostable, and OpenAI-compatible at the API. Morph serves a GLM-family model on its own fleet.

GLM-4.6 total parameters (MoE): 357B
GLM-4.6 context window: 200K
License: MIT (commercial use OK)
GLM-4.5-Air parameters: 106B-A12B

What GLM-4.6 Is

GLM-4.6 is the flagship model from Z.ai, the model arm of Zhipu AI. GLM stands for General Language Model, the family name Zhipu has used across releases. GLM-4.6 was released September 30, 2025, per Z.ai's release notes, and its weights are published on Hugging Face under the zai-org organization.

The model uses a Mixture-of-Experts (MoE) architecture with 357B total parameters, per the Z.ai model card. In an MoE model, only a subset of the parameters activates for any given token, which keeps inference cheaper than a dense model of the same total size. Z.ai has not published the active-parameters-per-token figure for GLM-4.6, so this page states only the 357B total and does not guess at an active count.
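
To make "only a subset of the parameters activates" concrete, here is a toy sketch of top-k expert routing. It is a generic illustration and not GLM-4.6's actual router: the expert count, the value of k, and the gate scores are all invented for the example.

```typescript
// Toy top-k MoE routing. Illustrative only: the gate scores, the number of
// experts, and k are made-up assumptions, not GLM-4.6 internals.
interface ExpertChoice {
  index: number;
  score: number;
}

/** Pick the k experts with the highest gate score for one token. */
function topKExperts(gateScores: number[], k: number): ExpertChoice[] {
  return gateScores
    .map((score, index) => ({ index, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

/** Softmax over the chosen experts only, so their mixing weights sum to 1. */
function routingWeights(chosen: ExpertChoice[]): number[] {
  const maxScore = Math.max(...chosen.map((c) => c.score));
  const exps = chosen.map((c) => Math.exp(c.score - maxScore));
  const total = exps.reduce((sum, v) => sum + v, 0);
  return exps.map((v) => v / total);
}

// Eight hypothetical experts; this token only runs through the top two.
const gateScores = [0.1, 2.3, -0.7, 1.9, 0.0, -1.2, 0.4, 0.8];
const chosen = topKExperts(gateScores, 2);
const weights = routingWeights(chosen);

console.log(chosen.map((c) => c.index)); // experts 1 and 3
console.log(weights);
```

The other six experts do no work for this token, which is why the per-token compute tracks the active parameter count rather than the total.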

GLM-4.6 targets coding, agentic workflows, and long-context reasoning. The most visible change from GLM-4.5 to GLM-4.6 is the wider context window, up from 128K to 200K tokens, which matters for multi-file code edits and long agent traces that overflow a smaller window.

GLM-4.6 Specs at a Glance

Every number below comes from Z.ai's model card on Hugging Face or its documentation. Where Z.ai has not published a value (such as GLM-4.6 active parameters), the row is omitted rather than estimated.

| Attribute | Value | Source |
| --- | --- | --- |
| Total parameters | 357B (MoE) | Z.ai model card |
| Context window | 200K tokens | Z.ai docs |
| Max output length | 128K tokens | Z.ai docs |
| License | MIT | Z.ai model card |
| Release date | September 30, 2025 | Z.ai release notes |
| Maker | Z.ai / Zhipu AI | Z.ai |
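
The context window and the max output length interact when you size a request. The sketch below assumes that prompt and generated tokens share one context window and that "200K" and "128K" mean a round 200,000 and 128,000 tokens. Both are assumptions to confirm against Z.ai's docs, and `maxOutputBudget` is a hypothetical helper, not part of any SDK.

```typescript
// Assumption: prompt and output tokens share one context window, and the
// published "200K"/"128K" figures are treated as round numbers.
const GLM_46_LIMITS = { contextWindow: 200_000, maxOutput: 128_000 };

/** Largest output budget that still fits next to a prompt of the given size. */
function maxOutputBudget(promptTokens: number, limits = GLM_46_LIMITS): number {
  const remaining = limits.contextWindow - promptTokens;
  return Math.max(0, Math.min(remaining, limits.maxOutput));
}

console.log(maxOutputBudget(50_000));  // 128000: capped by the max output length
console.log(maxOutputBudget(150_000)); // 50000: capped by the remaining context
console.log(maxOutputBudget(250_000)); // 0: the prompt alone overflows the window
```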

A note on the parameter count

The 357B figure is the total parameter count from the GLM-4.6 model card. No active-parameters-per-token figure for GLM-4.6 appears in the sources used for this page, so none is stated here. By contrast, GLM-4.5-Air publishes both: 106B total and 12B active (106B-A12B).

GLM-4.6 vs GLM-4.5-Air

GLM-4.5-Air is the lighter model in the GLM-4.5 series, released July 28, 2025. It is a hybrid-reasoning MoE model with 106B total parameters and 12B active per token, a 128K context window, and the same MIT license as the flagship. Z.ai shipped FP8 and base/reasoning variants of Air.

The choice between them is a cost-versus-capability tradeoff. GLM-4.6 has roughly 3.4x the total parameters and a wider 200K context, so it handles harder reasoning and longer inputs. On the Z.ai API, GLM-4.5-Air costs a third as much for input ($0.20 vs $0.60 per 1M tokens) and half as much for output ($1.10 vs $2.20 per 1M tokens), and it serves faster, which suits high-volume or latency-sensitive work where the flagship is overkill.

| Attribute | GLM-4.6 | GLM-4.5-Air | Best use |
| --- | --- | --- | --- |
| Total parameters | 357B (MoE) | 106B (MoE) | Flagship for hard tasks |
| Active parameters | Not published | 12B | Air cheaper to serve |
| Context window | 200K | 128K | Flagship for long inputs |
| License | MIT | MIT | Both self-hostable |
| Input price (Z.ai API, per 1M tokens) | $0.60 | $0.20 | Air for high volume |
| Output price (Z.ai API, per 1M tokens) | $2.20 | $1.10 | Air for high volume |
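
The price gap depends on the input/output mix of your traffic. Here is a small worked calculation using the per-token prices from the table above. The workload (1M requests of 2,000 input and 500 output tokens each) is a hypothetical chosen for illustration.

```typescript
// Prices in USD per 1M tokens, copied from the comparison table.
interface Pricing {
  inputPerM: number;
  outputPerM: number;
}

const PRICES: Record<"glm-4.6" | "glm-4.5-air", Pricing> = {
  "glm-4.6": { inputPerM: 0.6, outputPerM: 2.2 },
  "glm-4.5-air": { inputPerM: 0.2, outputPerM: 1.1 },
};

/** Cost in USD of one request at list price, ignoring any cache discount. */
function requestCost(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens * p.inputPerM + outputTokens * p.outputPerM) / 1_000_000;
}

// Hypothetical workload: 1M requests, 2,000 input + 500 output tokens each.
const requests = 1_000_000;
const flagshipTotal = requests * requestCost(PRICES["glm-4.6"], 2_000, 500);
const airTotal = requests * requestCost(PRICES["glm-4.5-air"], 2_000, 500);

console.log(flagshipTotal.toFixed(2)); // 2300.00
console.log(airTotal.toFixed(2));      // 950.00
```

On this mix Air comes to roughly 41% of the flagship bill. Output-heavy traffic pushes the ratio toward one half and input-heavy traffic pushes it toward one third.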

A common pattern is to use both: GLM-4.5-Air for routine turns (boilerplate, simple edits, classification) and GLM-4.6 for the harder reasoning turns. That is the same difficulty-based split an LLM router automates across a model tier.
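
A minimal sketch of that split follows. The keyword list and the token cutoff are illustrative assumptions, not Z.ai's or Morph's routing logic, and `pickModel` is a hypothetical helper. A production router would use a trained classifier rather than keywords.

```typescript
type GlmModel = "glm-4.6" | "glm-4.5-air";

interface Turn {
  prompt: string;
  promptTokens: number;
}

// Hypothetical heuristics for illustration only.
const HARD_HINTS = ["refactor", "debug", "architecture", "prove", "race condition"];
const AIR_CONTEXT_TOKENS = 128_000; // Air's context window, treated as a round number

/** Send routine turns to Air and hard or oversized turns to the flagship. */
function pickModel(turn: Turn): GlmModel {
  // A prompt that cannot fit in Air's window has to go to the flagship.
  if (turn.promptTokens > AIR_CONTEXT_TOKENS) return "glm-4.6";
  const text = turn.prompt.toLowerCase();
  return HARD_HINTS.some((hint) => text.includes(hint)) ? "glm-4.6" : "glm-4.5-air";
}

console.log(pickModel({ prompt: "Label this ticket as billing or support", promptTokens: 300 }));
console.log(pickModel({ prompt: "Debug the race condition in this worker pool", promptTokens: 4_000 }));
```

The returned string can be passed directly as the `model` field of a chat-completions request.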

Verified Benchmarks

Z.ai reports benchmark scores per model on its model cards. For GLM-4.5-Air, it reports a GPQA Diamond score of 71.72 and an MMLU-Pro score of 81.4. These are the only GLM-4.5-Air scores this page states.

| Benchmark | GLM-4.5-Air | Measures |
| --- | --- | --- |
| GPQA Diamond | 71.72 | Graduate-level science reasoning |
| MMLU-Pro | 81.4 | Broad multi-domain knowledge |

Why GLM-4.6 coding scores are not tabled here

This page states only benchmark numbers it could verify against a primary source. A directly comparable GLM-4.6 SWE-bench Verified or LiveCodeBench figure was not among them, so it is omitted rather than estimated. For an open-source coding-model comparison built on verified scores across models, see Best Open-Source Coding Model 2026.

License and Open Weights

GLM-4.6 is released under the MIT license, per the Z.ai model card. The MIT license is one of the most permissive open-source licenses: it allows commercial use, modification, fine-tuning, redistribution, and self-hosting, with no per-use royalty owed to Z.ai. GLM-4.5-Air carries the same MIT license and is explicitly usable commercially.

This puts GLM in the same open-weights bracket as DeepSeek (MIT) and Qwen (Apache 2.0), and apart from closed API-only frontier models like Claude. If you need to run the model inside your own infrastructure for data-residency, latency, or cost-control reasons, an MIT license removes the legal blocker. You still need the GPUs to serve a 357B MoE model, which is the practical constraint.

API Pricing

Z.ai prices the GLM models on its hosted API. The numbers below are from Z.ai's pricing page. Self-hosting the open weights changes the cost equation entirely (you pay for GPUs and operations instead of per-token), so these prices apply to the Z.ai-hosted API specifically.

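| Model | Input (USD per 1M tokens) | Output (USD per 1M tokens) | Notes |
| --- | --- | --- | --- |
| GLM-4.6 | $0.60 | $2.20 | Cached input 85% off |
| GLM-4.5-Air | $0.20 | $1.10 | Lighter, faster |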

GLM-4.6 cached input is billed at an 85% discount versus standard input, per Z.ai's pricing page. For agent workloads that resend a large stable system prompt and context across turns, that cache discount is a material part of the real bill, not a rounding detail.
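
To see how much the cache discount moves the input bill, here is a short calculation using the $0.60 input price and the 85% discount quoted above. The turn size and the cached share are hypothetical, and how Z.ai decides which tokens count as cache hits is not modeled here.

```typescript
// Figures from this section: $0.60 per 1M input tokens, cached input 85% off.
const INPUT_PER_M = 0.6;
const CACHE_DISCOUNT = 0.85;

/** Input-side cost in USD when `cachedTokens` of `inputTokens` are cache hits. */
function inputCost(inputTokens: number, cachedTokens: number): number {
  const freshTokens = inputTokens - cachedTokens;
  const cachedRatePerM = INPUT_PER_M * (1 - CACHE_DISCOUNT);
  return (freshTokens * INPUT_PER_M + cachedTokens * cachedRatePerM) / 1_000_000;
}

// Hypothetical agent turn: 40,000 input tokens, 36,000 of them a cached prefix.
console.log(inputCost(40_000, 0).toFixed(4));      // 0.0240
console.log(inputCost(40_000, 36_000).toFixed(4)); // 0.0056
```

With 90% of the input served from cache, the input cost of the turn drops by about three quarters.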

GLM-4.6 input: $0.60/1M
GLM-4.6 output: $2.20/1M
GLM-4.5-Air input: $0.20/1M
GLM-4.6 cached-input discount: 85%

Calling the GLM API

Both GLM-4.6 and GLM-4.5-Air speak the OpenAI chat-completions protocol. You initialize the OpenAI SDK with Z.ai's base URL and your Z.ai key, then pass glm-4.6 or glm-4.5-air as the model. Everything else (messages, streaming, temperature) matches a standard OpenAI-SDK call.

Call GLM-4.6 with the OpenAI SDK (TypeScript)

import OpenAI from "openai";

// Z.ai exposes an OpenAI-compatible endpoint.
const client = new OpenAI({
  apiKey: process.env.ZAI_API_KEY,
  baseURL: "https://api.z.ai/api/paas/v4",
});

const response = await client.chat.completions.create({
  model: "glm-4.6", // or "glm-4.5-air" for the lighter model
  messages: [
    { role: "system", content: "You are a senior backend engineer." },
    { role: "user", content: "Refactor this handler to use async/await." },
  ],
});

console.log(response.choices[0].message.content);

Because the protocol is OpenAI-compatible, swapping GLM-4.6 in for another model is a base-URL and model-name change, not a rewrite. The same property lets a router sit in front of multiple OpenAI-compatible backends and pick a model per request. Verify the exact base URL against Z.ai's current API docs before deploying.
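
One way to keep that swap to a one-line change is a small backend table. In this sketch the Z.ai base URL and model names are the ones used in this article. The `other-provider` entry, the env-var names, and the `resolveBackend` helper are hypothetical placeholders.

```typescript
interface Backend {
  baseURL: string;
  model: string;
  apiKeyEnv: string;
}

const BACKENDS = {
  "glm-flagship": { baseURL: "https://api.z.ai/api/paas/v4", model: "glm-4.6", apiKeyEnv: "ZAI_API_KEY" },
  "glm-air": { baseURL: "https://api.z.ai/api/paas/v4", model: "glm-4.5-air", apiKeyEnv: "ZAI_API_KEY" },
  // Hypothetical placeholder for any other OpenAI-compatible provider.
  "other-provider": { baseURL: "https://llm.example.com/v1", model: "example-model", apiKeyEnv: "EXAMPLE_API_KEY" },
} as const satisfies Record<string, Backend>;

type BackendName = keyof typeof BACKENDS;

/** Resolve the SDK constructor options and model name for a named backend. */
function resolveBackend(name: BackendName, env: Record<string, string | undefined>) {
  const backend = BACKENDS[name];
  const apiKey = env[backend.apiKeyEnv];
  if (!apiKey) throw new Error(`Missing environment variable ${backend.apiKeyEnv}`);
  return { clientOptions: { apiKey, baseURL: backend.baseURL }, model: backend.model };
}

// Usage with the OpenAI SDK (not executed here):
//   const { clientOptions, model } = resolveBackend("glm-air", process.env);
//   const client = new OpenAI(clientOptions);
//   await client.chat.completions.create({ model, messages });
const resolved = resolveBackend("glm-air", { ZAI_API_KEY: "test-key" });
console.log(resolved.model); // glm-4.5-air
```

Passing `env` in explicitly keeps the helper easy to test and avoids reading `process.env` at import time.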

Morph Runs a GLM-Family Model

Morph builds inference infrastructure for AI coding agents. On its production fleet, Morph serves glm51-754b, a member of the GLM family, alongside qwen35-397b (~120 tok/s), minimax27-230b (~140 tok/s), and others. This is a first-hand datapoint: the GLM line is capable enough that Morph runs a member of it directly for coding-agent workloads.

Morph exposes its fleet through an OpenAI-compatible router at api.morphllm.com/v1. The router classifies prompt difficulty in ~430ms and sends each request to the right model tier, which is the same difficulty-based split described above for pairing GLM-4.5-Air with GLM-4.6. Cheap models handle easy turns, the larger model handles hard ones, and the classification costs about $0.001 per request.
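
Whether routing pays off depends on how the classification fee compares with the price gap between tiers. The sketch below uses the roughly $0.001 per-request classification cost quoted above. The per-request model costs are hypothetical numbers chosen for illustration, not Morph prices.

```typescript
const CLASSIFY_COST = 0.001; // USD per request, as quoted above

/** Average per-request cost when `cheapShare` of traffic goes to the cheap tier. */
function blendedCost(cheap: number, large: number, cheapShare: number): number {
  return cheapShare * cheap + (1 - cheapShare) * large + CLASSIFY_COST;
}

/**
 * Smallest cheap-tier share at which routing beats sending everything to the
 * large model: blended < large  <=>  cheapShare > CLASSIFY_COST / (large - cheap).
 */
function breakEvenShare(cheap: number, large: number): number {
  if (large <= cheap) return Infinity; // routing can never pay off
  return CLASSIFY_COST / (large - cheap);
}

// Hypothetical per-request costs: $0.0062 on the cheap tier, $0.0164 on the large one.
console.log(breakEvenShare(0.0062, 0.0164).toFixed(3));   // 0.098
console.log(blendedCost(0.0062, 0.0164, 0.6).toFixed(5)); // 0.01128
```

For very small requests the fixed classification fee is a larger share of the bill, so the break-even share rises.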

GLM-family on the fleet

Morph serves glm51-754b, a GLM-family model, on its own production GPU fleet for coding-agent workloads.

OpenAI-compatible router

One endpoint at api.morphllm.com classifies prompt difficulty in ~430ms and routes to the right model tier.

Cost-aware routing

40-70% API cost savings by reserving the largest model for hard turns and routing routine turns to cheaper models.

GLM-4.6 vs Claude and DeepSeek for Coding

The cleanest distinction is licensing. GLM-4.6 is MIT-licensed and self-hostable. Claude is closed and API-only: you cannot run the weights yourself at any price. If self-hosting or weight access is a hard requirement, GLM-4.6 and Claude are not in the same category, regardless of benchmark scores.

Against DeepSeek, the comparison is open-weights to open-weights. DeepSeek-V3.2-Exp is a 685B MoE model, MIT-licensed, with a 160K context, and it scores 67.8 on SWE-bench Verified and 74.1 on LiveCodeBench per its model card. GLM-4.6 brings a wider 200K context window and a smaller 357B total parameter count. A directly comparable GLM-4.6 SWE-bench figure was not in this verified set, so the honest comparison is on the attributes that are verified rather than a claimed coding-score winner.

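| Model | Total params | Context | License | SWE-bench Verified |
| --- | --- | --- | --- | --- |
| GLM-4.6 | 357B (MoE) | 200K | MIT | Not stated here |
| GLM-4.5-Air | 106B (MoE) | 128K | MIT | Not stated here |
| DeepSeek-V3.2-Exp | 685B (MoE) | 160K | MIT | 67.8 |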

For an open-source coding-model comparison built entirely on verified per-model scores, see Best Open-Source Coding Model 2026. For routing across these models through one API, see LLM Router.

Frequently Asked Questions

How many parameters does GLM-4.6 have?

GLM-4.6 has 357B total parameters and uses a Mixture-of-Experts architecture, per Z.ai's model card. Z.ai has not published an active-parameters-per-token figure for GLM-4.6, so only the 357B total is stated. The smaller GLM-4.5-Air is a 106B-A12B model: 106B total with 12B active per token.

What is the context window of GLM-4.6?

GLM-4.6 has a 200K-token context window, up from 128K in GLM-4.5. Its maximum output length is 128K tokens. GLM-4.5-Air has a 128K-token context window.

Is GLM-4.6 open source? What license is it under?

Yes. GLM-4.6 is released under the MIT license, per the Z.ai model card. MIT permits commercial use, fine-tuning, and self-hosting with no per-use fee to Z.ai. GLM-4.5-Air is also MIT-licensed and usable commercially.

What is the difference between GLM-4.6 and GLM-4.5-Air?

GLM-4.6 is the 357B flagship with a 200K context, released September 30, 2025. GLM-4.5-Air is a 106B-A12B model with a 128K context, part of the GLM-4.5 series from July 28, 2025. The flagship handles harder reasoning and longer inputs; Air is cheaper and faster. On the Z.ai API, GLM-4.6 is $0.60/$2.20 per 1M input/output tokens versus $0.20/$1.10 for Air.

How do I use the GLM API?

Both models are OpenAI-compatible. Point the OpenAI SDK at Z.ai's base URL, set the model to glm-4.6 or glm-4.5-air, and send a standard chat-completions request. The message shape is unchanged, so existing OpenAI-SDK integrations can swap GLM in without a rewrite.

How does GLM-4.6 compare to Claude and DeepSeek for coding?

GLM-4.6 is MIT-licensed and self-hostable; Claude is API-only and closed. DeepSeek-V3.2-Exp (MIT, 160K context, 67.8 on SWE-bench Verified) is also open-weights, so the verified difference there is GLM-4.6's larger 200K context and smaller 357B total parameter count. A directly comparable GLM-4.6 SWE-bench figure is not stated on this page, so compare verified per-model scores rather than assuming a winner.

Who makes GLM-4.6?

GLM-4.6 is made by Z.ai, the model arm of Zhipu AI, a Chinese AI company. The GLM (General Language Model) series includes GLM-4.5, GLM-4.5-Air, and GLM-4.6. Weights are published on Hugging Face under the zai-org organization.

Run GLM-Family Models Through One OpenAI-Compatible API

Morph serves a GLM-family model on its production fleet and exposes an OpenAI-compatible router at api.morphllm.com. Classify prompt difficulty in ~430ms, route easy turns to cheap models and hard turns to the large one, and cut API costs 40-70%.