---
title: "GLM-5.2: An Open Model That Codes Like a Closed One"
url: "https://www.morphllm.com/blog/glm-5-2"
description: "Z.ai's GLM-5.2 lands within a few points of Claude Opus 4.8 on agent benchmarks at a fraction of the cost. Open weights don't serve themselves — here's how we make it fast, and how to reach it on Morph and OpenRouter."
date: "2026-07-01"
author: "Tejas Bhakta"
---
# GLM-5.2: An Open Model That Codes Like a Closed One

On June 13, Z.ai shipped GLM-5.2. Open weights, MIT license, a 744B mixture-of-experts with a 1M-token context window. The part that made people stop scrolling was the eval sheet: on the agent benchmarks that decide whether a model can actually drive a coding loop, it lands a few points behind Claude Opus 4.8 and ahead of GPT-5.5 — for a sixth of the price.

Here are Z.ai's numbers, run at max thinking effort:

| Benchmark | GLM-5.2 | GPT-5.5 | Claude Opus 4.8 |
|---|---|---|---|
| SWE-bench Pro | 62.1 | 58.6 | — |
| Terminal-Bench 2.1 | 81.0 | — | 85.0 |
| MCP-Atlas (tool use) | 77.0 | 75.3 | 77.8 |
| FrontierSWE | 74.4 | 72.6 | 75.1 |

These are self-reported, on the vendor's own harness; independent evals are still landing, so read them as directional. But the shape is the story. A year ago the open-weight field trailed the closed frontier by a generation on anything agentic. On these numbers GLM-5.2 trails Claude Opus 4.8 by under a point on MCP-Atlas and FrontierSWE and by four points on Terminal-Bench 2.1, and you can download the weights.

The catch is the one nobody puts on the launch slide: **open weights don't serve themselves.**

## A 744B MoE at batch 1 is a bandwidth problem

Download GLM-5.2, point vLLM at it, and you'll get a model that's correct and slow. Not because the weights are slow — because the default serving stack was written for a different workload than the coding agent.

An agent issues one request at a time, waiting on the last token before it can plan the next tool call. That's batch 1, and at batch 1 decode is memory-bound: the matmul reads every active expert's weights once per token and does almost no arithmetic per byte. Tokens per second is, at best, memory bandwidth divided by bytes read per token. The only lever is the bytes, and the stock kernels give the win back. They store the weights in FP4, then dequantize to bf16 in HBM before the tensor core ever sees them, handing back the 4x they just saved.
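To put rough numbers on the bandwidth-over-bytes rule above, here is a back-of-the-envelope sketch. The active-parameter count and the bandwidth figure are hypothetical placeholders, not GLM-5.2 or hardware specs, and the bound ignores KV-cache reads and activations.

```typescript
// Upper bound on batch-1 decode speed when decode is memory-bound:
//   tokens/s <= memory bandwidth / weight bytes read per token.
// All concrete numbers here are illustrative assumptions.

interface DecodeConfig {
  activeParams: number;         // parameters read per token (active experts + shared layers)
  bitsPerWeight: number;        // 4 for FP4, 16 for bf16
  bandwidthBytesPerSec: number; // aggregate memory bandwidth available to the model
}

function maxTokensPerSecond(cfg: DecodeConfig): number {
  const bytesPerToken = cfg.activeParams * (cfg.bitsPerWeight / 8);
  return cfg.bandwidthBytesPerSec / bytesPerToken;
}

// Hypothetical: 40B active parameters, 8 TB/s of aggregate bandwidth.
const base = { activeParams: 40e9, bandwidthBytesPerSec: 8e12 };

const fp4Bound = maxTokensPerSecond({ ...base, bitsPerWeight: 4 });
const bf16Bound = maxTokensPerSecond({ ...base, bitsPerWeight: 16 });

// Reading the same weights at 16 bits instead of 4 costs 4x the bytes,
// so the ceiling drops by the same factor.
const fp4Advantage = fp4Bound / bf16Bound;
```

Whatever the real inputs are, the ratio between the two ceilings stays at 4, which is the saving a dequantize-to-bf16 step gives back.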

We wrote about the full stack in [Optimizing Models to Be Fast at Codegen](/blog/codegen-inference-research). The three things the open stack won't do, applied to GLM-5.2:

- **Train the speculator.** A small draft model guesses the next few tokens; the target checks them in one pass. A generic draft off the internet gets [1.93x; one trained on the target's own coding output gets 3.07x](https://arxiv.org/html/2503.01840v1). Code is the highest-speedup task there is — generated code reuses the symbols already on screen, and an edit is mostly a copy of the file it edits. We train one drafter per open model on coding traces, not web text.
- **Keep FP4 at 4 bits.** A real FP4 decode kernel streams the packed weights straight to Blackwell's tensor cores and never materializes bf16 — a grouped GEMM over only the active experts, persistent so every SM stays resident at batch 1. Our [warp-decode kernels](https://github.com/morphllm/fp4-warp-decode) run an 80B MoE at 162 tok/s on a $7K card, past a $25K H100's 120, no accuracy loss.
- **Write the interconnect.** The affordable GPUs have no NVLink, and PCIe offers 14x less bandwidth. We replace NCCL's ring all-reduce with a one-shot push over PCIe P2P, overlap it with the next layer's GEMM, and span boxes that have no NVLink between them with a prefix cache over plain TCP.

Same GLM-5.2 weights you can pull from Hugging Face. Ours are faster because the speculator riding them was trained, by us, on the work.
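
The draft-and-verify loop behind the speculator can be sketched in a few lines. This is a toy, greedy version under stated assumptions, not Morph's implementation: each "model" is a plain function from context to next token, and the names `speculativeDecode`, `NextToken`, and `at` are made up for illustration. A real engine scores all drafted positions in one batched forward pass; the toy calls the target once per position but counts each round as a single pass.

```typescript
// A "model" is a deterministic next-token function (greedy decoding).
type NextToken = (context: readonly string[]) => string;

interface SpecResult {
  tokens: string[];     // generated tokens, excluding the prompt
  targetPasses: number; // verification rounds run by the target
}

function speculativeDecode(
  target: NextToken,
  draft: NextToken,
  prompt: readonly string[],
  maxNewTokens: number,
  k: number, // draft tokens proposed per round
): SpecResult {
  const out: string[] = [];
  let targetPasses = 0;

  while (out.length < maxNewTokens) {
    // 1. The cheap draft proposes k tokens autoregressively.
    const proposal: string[] = [];
    for (let i = 0; i < k; i++) {
      proposal.push(draft([...prompt, ...out, ...proposal]));
    }

    // 2. The target checks the proposal and stops at the first mismatch.
    targetPasses++;
    let accepted = 0;
    let correction: string | null = null;
    for (let i = 0; i < k; i++) {
      const expected = target([...prompt, ...out, ...proposal.slice(0, i)]);
      if (expected === proposal[i]) {
        accepted++;
      } else {
        correction = expected; // the target's own token replaces the miss
        break;
      }
    }

    // 3. Keep the accepted prefix plus one token from the target, so every
    //    round makes progress and the output matches plain greedy decoding.
    out.push(...proposal.slice(0, accepted));
    out.push(correction ?? target([...prompt, ...out]));
  }

  return { tokens: out.slice(0, maxNewTokens), targetPasses };
}

// Toy edit: the target wants the edited file, the draft simply copies the original.
const original = ["const", "x", "=", "1", ";", "return", "x", ";"];
const edited = ["const", "x", "=", "2", ";", "return", "x", ";"];
const editPrompt: string[] = ["<edit>"];
const at = (seq: string[]): NextToken => (ctx) =>
  seq[ctx.length - editPrompt.length] ?? "<eos>";

const specResult = speculativeDecode(at(edited), at(original), editPrompt, edited.length, 4);
```

In this toy the draft is wrong at exactly one position, so the eight-token edit completes in two target rounds instead of eight sequential decode steps, and the output is still exactly what the target would have produced alone.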

## Why this model, specifically, is worth tuning for

Every layer of that stack points at one workload: the coding agent. GLM-5.2 is a good bet to point it at, for two reasons that show up right on its eval sheet.

The 1M context means whole repositories and long agent trajectories fit without eviction — the [37x-to-2,494x prompt-to-output ratio](https://arxiv.org/html/2407.00023v2) of programming traffic is exactly the regime where a hot prefix cache pays for itself. And the benchmarks it's strongest on — MCP-Atlas, Tool-Decathlon — are tool-use benchmarks, not chat. That's the direction everything is going. [Every agent becomes a coding agent](/blog/all-agents-coding-agents): the moment an agent needs to touch a real system, the general path is to write and run code. A model that scores 77 on tool orchestration and holds a repo in context is a model built for that loop.
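
To see why a hot prefix cache matters for this traffic shape, here is a deliberately naive sketch. Production engines typically key the cache on hashed token blocks or a radix tree; the linear scan, the `PrefixCache` class, and the token counts below are all illustrative assumptions.

```typescript
// Naive prefix cache: remembers past prompts and reports how much of a new
// prompt is already covered by one of them.
class PrefixCache {
  private entries: string[][] = [];

  // Length of the longest prefix of `tokens` shared with any cached prompt.
  longestHit(tokens: readonly string[]): number {
    let best = 0;
    for (const entry of this.entries) {
      let n = 0;
      while (n < entry.length && n < tokens.length && entry[n] === tokens[n]) n++;
      best = Math.max(best, n);
    }
    return best;
  }

  insert(tokens: readonly string[]): void {
    this.entries.push([...tokens]);
  }
}

// Number of prompt tokens that still need prefill compute on this turn.
function prefillCost(cache: PrefixCache, promptTokens: readonly string[]): number {
  const hit = cache.longestHit(promptTokens);
  cache.insert(promptTokens);
  return promptTokens.length - hit;
}

// Hypothetical agent session: 1,000 tokens of repo context, then two turns.
const repo = Array.from({ length: 1000 }, (_, i) => `tok${i}`);
const turn1 = [...repo, "user:", "fix", "the", "bug"];
const turn2 = [...turn1, "tool:", "tests", "passed"];

const cache = new PrefixCache();
const coldPrefill = prefillCost(cache, turn1); // nothing cached yet: whole prompt
const warmPrefill = prefillCost(cache, turn2); // only the new suffix
```

The first turn pays for all 1,004 prompt tokens; the second pays for 3. The longer the shared prefix relative to the new suffix, the more of each turn the cache absorbs.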

## Using it on Morph

It's one OpenAI-compatible endpoint. Change the model string, keep your client:

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.morphllm.com/v1",
  apiKey: process.env.MORPH_API_KEY,
});

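// Same request shape as any OpenAI-compatible chat completion; only the model string changes.
const res = await client.chat.completions.create({
  model: "morph-glm52-744b",
  messages: [{ role: "user", content: "Refactor this module..." }],
});

// The reply text lives on the first choice.
console.log(res.choices[0]?.message.content);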
```

Served per token, billed at $1.10 / 1M input and $4.10 / 1M output. No per-seat fees, no minimums. The [full lineup](/products/models) — Qwen 3.5, MiniMax M3, DeepSeek V4 Flash, GLM-5.2 — sits behind the same base URL, so switching models is a one-string diff.
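
For budgeting, the per-token prices above reduce to one line of arithmetic. The helper name and the token counts in the example are hypothetical; the two prices are the ones quoted in this post.

```typescript
// Prices quoted above, in USD per 1M tokens.
const PRICE_PER_M_INPUT_USD = 1.1;
const PRICE_PER_M_OUTPUT_USD = 4.1;

function requestCostUsd(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens * PRICE_PER_M_INPUT_USD + outputTokens * PRICE_PER_M_OUTPUT_USD) /
    1_000_000
  );
}

// Hypothetical agent turn: 200k tokens of context in, 2k tokens out.
const turnCostUsd = requestCostUsd(200_000, 2_000); // roughly $0.23
```

If the endpoint returns the standard OpenAI-style `usage` object, `res.usage?.prompt_tokens` and `res.usage?.completion_tokens` can be passed straight in. At prompt-heavy ratios like this one, input tokens make up nearly all of the bill even though output tokens cost more each.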

## On OpenRouter, too

[OpenRouter](https://openrouter.ai/z-ai/glm-5.2) lists 13-plus providers for GLM-5.2, and the router sends your request to whichever one fits your price or latency constraint. The providers are not interchangeable: published output speeds for the same weights range from the low 300s to 471 tok/s, because serving is where the model gets fast or stays slow. That is the exact gap this post is about.

Morph is already a [top-25 provider on OpenRouter](https://openrouter.ai/morph). Now the open models we tune for codegen — GLM-5.2 included — are reachable there as well as directly. If you're routing through OpenRouter, pin Morph and you get the codegen-tuned path: the trained speculator, the FP4 kernels, the hot prefix cache. If you're calling us directly, you already have it.
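
Pinning a provider on OpenRouter is done through the `provider` routing field in the request body. A minimal sketch follows; the model slug is taken from the OpenRouter URL above, while the provider slug `"morph"` and the helper name are assumptions, so check the provider's listing page for the exact identifier.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Builds a chat-completions body that routes to one provider only.
function buildPinnedRequest(messages: ChatMessage[], providerSlug: string) {
  return {
    model: "z-ai/glm-5.2", // slug from the OpenRouter model URL
    messages,
    provider: {
      order: [providerSlug],  // try this provider first
      allow_fallbacks: false, // fail rather than silently route elsewhere
    },
  };
}

// "morph" is an assumed slug, used here for illustration.
const pinnedBody = buildPinnedRequest(
  [{ role: "user", content: "Refactor this module..." }],
  "morph",
);
```

Send `JSON.stringify(pinnedBody)` as a POST to `https://openrouter.ai/api/v1/chat/completions` with your OpenRouter key in the `Authorization: Bearer` header. Leaving `allow_fallbacks` at its default trades the guarantee of the tuned path for higher availability.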

Open weights closed the quality gap. A team still defaulting to a closed frontier model isn't protecting quality anymore — the eval sheet took that argument away. It's paying six times the price for a habit. The speed gap that's left is a serving problem, and serving is a choice. [Morph's inference stack is one import away.](https://docs.morphllm.com)