Qwen3-Coder: Specs, Benchmarks, and API Access (480B + 30B)

June 18, 2026 · 2 min read

Qwen3-Coder is Alibaba's open-weight coding model family. The flagship Qwen3-Coder-480B-A35B-Instruct is a Mixture-of-Experts model with 480B total parameters and 35B active, a 256K native context window extendable to 1M tokens with YaRN, released under Apache 2.0. A 30.5B-A3B variant applies the same MoE recipe at a scale that fits a single high-memory GPU. Both can be served behind OpenAI-compatible APIs, and Morph serves Qwen-family models behind a routing layer.

Total / active parameters (MoE): 480B / 35B
Native context: 256K (1M via YaRN)
License: Apache 2.0 (commercial use OK)
Total / active experts per token: 160 / 8

What Is Qwen3-Coder

Qwen3-Coder is the coding-specialized line within the Qwen3 family, developed by the Qwen Team at Alibaba. The flagship checkpoint is Qwen3-Coder-480B-A35B-Instruct. It is a causal language model that runs in non-thinking mode, meaning it does not emit a separate chain-of-thought block before its answer.

The model was pretrained on 7.5 trillion tokens with a 70% code ratio, per the Qwen team blog. That code-heavy corpus is the core difference from general-purpose Qwen3 checkpoints. The model is tuned for repository-scale edits, multi-file changes, and agentic tool use rather than broad conversation.

Two instruct variants are public: the 480B-A35B flagship and a smaller 30B-A3B model. Both are released on Hugging Face under Apache 2.0, so the weights can be downloaded, modified, and deployed commercially. There is no closed-API gate on the weights themselves.

What 'A35B' means

The suffix in Qwen3-Coder-480B-A35B-Instruct encodes the architecture: 480B total parameters, with A35B meaning 35B activated per token. In a Mixture-of-Experts model, only a subset of experts fires for each token, so per-token compute tracks the active count (35B), while the memory needed to hold the weights still tracks the total (480B). The 30B-A3B variant activates just 3.3B.
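The naming convention can be read mechanically. The sketch below is a minimal illustration, assuming names always follow the `<family>-<total>B-A<active>B-<variant>` shape; the `MoeModelName` type and `parseMoeModelName` function are hypothetical helpers, not part of any Qwen tooling.

```typescript
// Parsed view of a checkpoint name (illustrative type, not an official schema).
interface MoeModelName {
  family: string;        // e.g. "Qwen3-Coder"
  totalParamsB: number;  // nominal total parameters, in billions
  activeParamsB: number; // nominal parameters activated per token, in billions
  variant: string;       // e.g. "Instruct"
}

// Parse names shaped like "<family>-<total>B-A<active>B-<variant>".
// Returns null when the name does not follow that shape.
function parseMoeModelName(name: string): MoeModelName | null {
  const match = /^(.+)-(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B-(.+)$/.exec(name);
  if (match === null) return null;
  const [, family, total, active, variant] = match;
  return {
    family,
    totalParamsB: Number(total),
    activeParamsB: Number(active),
    variant,
  };
}

const flagship = parseMoeModelName("Qwen3-Coder-480B-A35B-Instruct");
console.log(flagship?.totalParamsB, flagship?.activeParamsB); // 480 35
```

The numbers in a name are nominal: parsing Qwen3-Coder-30B-A3B-Instruct yields 30 and 3, while the model card lists 30.5B total and 3.3B active.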

Qwen3-Coder Variants and Specs

Two variants share the same recipe at different scales. The table below maps each to its parameters, expert count, context window, and license. All numbers are from the official Hugging Face model cards.

| Spec | Qwen3-Coder-480B-A35B | Qwen3-Coder-30B-A3B |
| --- | --- | --- |
| Total parameters | 480B | 30.5B |
| Active parameters | 35B | 3.3B |
| Experts (total / active) | 160 / 8 | 128 / 8 |
| Layers | 62 | 48 |
| Attention heads (Q / KV) | 96 / 8 (GQA) | 32 / 4 (GQA) |
| Native context | 256K (262,144) | 256K (262,144) |
| Extended context (YaRN) | 1M | 1M |
| Mode | Non-thinking | Non-thinking |
| License | Apache 2.0 | Apache 2.0 |

The 30B-A3B model is not a distillation of the 480B model. It is a separately released checkpoint with its own expert count (128 vs 160) and layer depth (48 vs 62). The shared trait is the 256K native context and Apache 2.0 license, so a team can prototype on the small model and scale to the large one without changing the serving contract.

Parameters and Architecture

Qwen3-Coder-480B-A35B is a fine-grained Mixture-of-Experts transformer. Of its 160 experts, 8 are routed per token, so each forward pass touches 35B of the 480B total parameters. This is the standard MoE tradeoff: the model holds frontier-scale knowledge in its weights but pays mid-size inference cost per token.
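To make "8 of 160 experts routed per token" concrete, here is a generic top-k gating sketch. It shows the common softmax, select, renormalize pattern used in many MoE layers; whether Qwen3-Coder's router matches this exactly is not stated in the sources for this page, so treat `routeTopK` as an illustration of the idea rather than the model's implementation.

```typescript
interface RoutedExpert {
  expert: number; // index of the selected expert
  weight: number; // mixing weight after renormalization
}

// Generic top-k routing: softmax the router logits, keep the k largest
// probabilities, then renormalize them so the kept weights sum to 1.
function routeTopK(logits: number[], k: number): RoutedExpert[] {
  const max = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e, expert) => ({ expert, weight: e / sum }));
  const top = probs.sort((a, b) => b.weight - a.weight).slice(0, k);
  const topSum = top.reduce((a, p) => a + p.weight, 0);
  return top.map((p) => ({ expert: p.expert, weight: p.weight / topSum }));
}

// Made-up router logits for one token across 160 experts.
const exampleLogits = Array.from({ length: 160 }, (_, i) => i / 160);
const routed = routeTopK(exampleLogits, 8);
console.log(routed.map((r) => r.expert)); // the 8 experts that run for this token
```

Only the selected experts' feed-forward weights participate in that token's forward pass, which is why per-token compute follows the active parameter count.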

The 62-layer stack uses grouped-query attention with 96 query heads and 8 key/value heads. The 12:1 query-to-KV ratio shrinks the KV cache, which is what makes the 256K and 1M context windows affordable to serve. The 30B variant follows the same GQA pattern at 32 query heads and 4 KV heads.
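The effect of the 12:1 ratio on cache size can be estimated with simple arithmetic. The sketch below assumes a head dimension of 128 and 2-byte (BF16) cache entries; neither value is stated in this article, so check the model's config before relying on the result.

```typescript
interface AttentionShape {
  layers: number;
  kvHeads: number;
  headDim: number;       // ASSUMPTION: 128, not stated in this article
  bytesPerValue: number; // ASSUMPTION: 2 bytes per cached value (BF16)
}

// Bytes of KV cache one token occupies: keys plus values,
// for every layer and every key/value head.
function kvBytesPerToken(s: AttentionShape): number {
  return 2 * s.layers * s.kvHeads * s.headDim * s.bytesPerValue;
}

const gqa: AttentionShape = { layers: 62, kvHeads: 8, headDim: 128, bytesPerValue: 2 };
// Hypothetical comparison: one KV head per query head (no grouping).
const mha: AttentionShape = { ...gqa, kvHeads: 96 };

const contextTokens = 262_144;
const gib = (bytes: number): number => bytes / 2 ** 30;

console.log(gib(kvBytesPerToken(gqa) * contextTokens)); // 62
console.log(gib(kvBytesPerToken(mha) * contextTokens)); // 744
```

Under these assumptions a single full 256K sequence needs about 62 GiB of cache with grouped-query attention, versus 744 GiB if every query head kept its own keys and values.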

Running in non-thinking mode is a deliberate choice for agentic coding. The model returns edits and tool calls directly without spending tokens on a visible reasoning trace, which keeps latency and token cost down inside multi-turn agent loops where every turn pays for the full context.

Fine-grained MoE

160 experts with 8 active (480B model) means the model stores frontier-scale capacity but activates only 35B per token. Per-token compute tracks the active count; weight memory still tracks the total.

Grouped-query attention

96 query heads share 8 key/value heads. The 12:1 ratio compresses the KV cache, which is what makes 256K and 1M context windows economical to serve.

Non-thinking mode

No separate reasoning block. The model emits edits and tool calls directly, cutting token cost inside agentic loops where every turn re-pays for the full context.

Code-heavy pretraining

7.5T tokens at a 70% code ratio. This is the difference from general Qwen3 checkpoints and the reason the model is tuned for repository-scale, multi-file edits.

Context Window: 256K Native, 1M Extended

Both Qwen3-Coder variants ship with a 262,144-token native context window (256K). For most coding tasks, including multi-file refactors and reading a moderate-sized repository into the prompt, 256K is sufficient without any configuration change.

The context window extends to 1M tokens using YaRN, a RoPE position-scaling method that lets the model attend over sequences longer than it was trained on. YaRN is not free: it has to be enabled in the serving config, and it can cost a little quality on shorter prompts, so enable it only when you actually need windows past 256K.

Tradeoff: enable YaRN only when you need it

The 1M context is an extension, not the native window. Enabling YaRN globally can slightly degrade quality on the short and medium prompts that make up most of a coding session. Keep the native 256K window for normal work and switch to YaRN-extended serving for the specific workloads (very large monorepos, long agentic sessions) that exceed 256K.
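One way to apply that advice is to choose the serving profile per request. The sketch below is a minimal example under assumptions: the profile names are hypothetical, and the 1,000,000-token ceiling is a stand-in for "1M" because the exact extended limit depends on how the deployment is configured.

```typescript
const NATIVE_CONTEXT = 262_144;     // native window, per the model card
const EXTENDED_CONTEXT = 1_000_000; // ASSUMPTION: stand-in for the "1M" extended window

type ServingProfile = "native" | "yarn-extended";

// Pick a profile from the prompt size plus the room reserved for generated tokens.
function pickServingProfile(promptTokens: number, maxOutputTokens: number): ServingProfile {
  const needed = promptTokens + maxOutputTokens;
  if (needed <= NATIVE_CONTEXT) return "native";
  if (needed <= EXTENDED_CONTEXT) return "yarn-extended";
  throw new RangeError(`request needs ${needed} tokens, beyond the extended window`);
}

console.log(pickServingProfile(100_000, 4_096)); // "native"
console.log(pickServingProfile(300_000, 4_096)); // "yarn-extended"
```

Routing on request size keeps ordinary prompts on the native deployment and sends only oversized ones to the YaRN-enabled one.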

For background on why context length matters and how it interacts with cost and quality, see LLM Context Window.

License: Apache 2.0

Qwen3-Coder-480B-A35B-Instruct and Qwen3-Coder-30B-A3B-Instruct are both released under the Apache 2.0 license. This permits commercial use, modification, redistribution, and self-hosting with no per-token license fee and no usage cap.

Apache 2.0 is more permissive than the source-available or modified licenses that ship with some other open-weight models. For comparison, DeepSeek-R1 is MIT, while Kimi K2 and MiniMax-M2 each ship under a modified MIT license. Apache 2.0 and MIT are both genuinely permissive; modified MIT variants add usage clauses that legal review should check before deployment.

The practical consequence: a team can download the Qwen3-Coder weights, fine-tune them on a private codebase, and serve the result behind their own product, paying only for GPU compute. That is the central reason open-weight coding models are pulling work away from closed APIs.

Coding Benchmarks

The Qwen team reports Qwen3-Coder-480B-A35B as state-of-the-art among open-source models on SWE-bench Verified without test-time scaling. SWE-bench Verified measures whether a model can resolve real GitHub issues by editing a repository and passing the project's test suite, which makes it the closest public proxy for agentic coding ability.

The table below places Qwen3-Coder against other open coding models on SWE-bench Verified using each model's official-card number. Scores are not always measured under identical harnesses (tool access, scaffolding, and number of attempts vary), so treat small gaps as noise and read the ranking, not the decimal.

| Model | Total / active params | SWE-bench Verified | License |
| --- | --- | --- | --- |
| Kimi K2 Thinking | 1T / 32B | 71.3 (with tools) | Modified MIT |
| MiniMax-M2 | 230B / 10B | 69.4 | Modified MIT |
| DeepSeek-V3.2-Exp | 685B / n/a | 67.8 | MIT |
| gpt-oss-120b | 117B / 5.1B | 62.4 | Apache 2.0 |
| Qwen3-Coder-480B | 480B / 35B | SOTA open (per Qwen) | Apache 2.0 |
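Reading "the ranking, not the decimal" can be made explicit by grouping scores into tiers. The sketch below uses the published figures from the table above; the 2.0-point noise threshold is an arbitrary illustration, not a measured error bar, and `tierByGap` is a hypothetical helper.

```typescript
interface Score {
  model: string;
  sweBenchVerified: number;
}

// Published figures quoted in the table above. Qwen3-Coder is left out
// because no single number is given for it.
const scores: Score[] = [
  { model: "Kimi K2 Thinking", sweBenchVerified: 71.3 },
  { model: "MiniMax-M2", sweBenchVerified: 69.4 },
  { model: "DeepSeek-V3.2-Exp", sweBenchVerified: 67.8 },
  { model: "gpt-oss-120b", sweBenchVerified: 62.4 },
];

// Sort descending, then start a new tier only when the gap to the previous
// model is at least `noise` points.
function tierByGap(input: Score[], noise = 2.0): Score[][] {
  const sorted = [...input].sort((a, b) => b.sweBenchVerified - a.sweBenchVerified);
  const tiers: Score[][] = [];
  for (const s of sorted) {
    const current = tiers[tiers.length - 1];
    const prev = current?.[current.length - 1];
    if (current && prev && prev.sweBenchVerified - s.sweBenchVerified < noise) {
      current.push(s);
    } else {
      tiers.push([s]);
    }
  }
  return tiers;
}

console.log(tierByGap(scores).map((tier) => tier.map((s) => s.model)));
```

With this threshold the first three models fall into one tier and gpt-oss-120b into a second, which is the kind of coarse reading the harness differences justify.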

Why Qwen3-Coder's score is reported as a claim, not a number

The official Qwen3-Coder model card and blog state that the model is state-of-the-art among open models on SWE-bench Verified without test-time scaling, but the card itself does not publish a single canonical SWE-bench number. This page reports only verified figures, so the exact decimal is omitted rather than guessed. The comparison models above show published figures for context.

Qwen3-Coder vs DeepSeek vs Claude

The honest framing is a three-way tradeoff between accuracy on the hardest tasks, cost per token, and control. Qwen3-Coder and DeepSeek are open-weight and self-hostable. Claude is a closed frontier API that leads on the hardest reasoning tasks but costs more per token and cannot be self-hosted.

DeepSeek-V3.2-Exp (685B MoE, MIT) scores 67.8 on SWE-bench Verified and adds DeepSeek Sparse Attention for long-context efficiency, with a 160K context window. Qwen3-Coder counters with a larger native context (256K, 1M extended) and an Apache 2.0 license. Both trail closed frontier models on the most complex multi-step debugging.

For a coding agent, the cost-correct answer is rarely a single model. Send the hardest architectural and debugging turns to a frontier model, and route the high-volume easy and medium turns (boilerplate, renames, docstrings, simple edits) to a Qwen3-Coder-class open model. This is exactly what an LLM router automates.
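The routing idea can be sketched in a few lines. This is a toy under stated assumptions: the keyword heuristic stands in for a real difficulty classifier, and the model identifiers are placeholders to replace with whatever your providers expose.

```typescript
type Difficulty = "easy" | "medium" | "hard";

// Hypothetical model identifiers, for illustration only.
const MODEL_FOR: Record<Difficulty, string> = {
  easy: "qwen3-coder-30b-a3b-instruct",
  medium: "qwen3-coder-480b-a35b-instruct",
  hard: "frontier-model-placeholder",
};

// Toy keyword heuristic. A production router would use a trained classifier.
function classify(prompt: string): Difficulty {
  const text = prompt.toLowerCase();
  if (/\b(architecture|race condition|deadlock|debug|root cause)\b/.test(text)) return "hard";
  if (/\b(rename|docstring|boilerplate|typo|format)\b/.test(text)) return "easy";
  return "medium";
}

// Map a prompt to the model that should handle it.
function routeModel(prompt: string): string {
  return MODEL_FOR[classify(prompt)];
}

console.log(routeModel("Rename the variable foo to bar"));       // easy tier
console.log(routeModel("Find the root cause of this deadlock")); // hard tier
```

Because every target speaks the same chat-completions shape, the only thing the router changes per request is the model name (and, if needed, the base URL).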

| Dimension | Qwen3-Coder-480B | DeepSeek-V3.2-Exp | Closed frontier (Claude) |
| --- | --- | --- | --- |
| Weights | Open (Apache 2.0) | Open (MIT) | Closed |
| Self-host | Yes | Yes | No |
| Native context | 256K (1M YaRN) | 160K | 200K+ |
| SWE-bench Verified | SOTA open (per Qwen) | 67.8 | Higher on hardest tasks |
| Cost per token | Compute only | Compute only / low API | Highest |

For a deeper ranking of open coding models and how to combine them, see Best Open-Source Coding Model 2026 and Best AI Model for Coding.

How to Run Qwen3-Coder

Qwen3-Coder is served through OpenAI-compatible chat completions endpoints. You can hit a hosted provider, self-host the Hugging Face weights with vLLM or SGLang, or use the qwen-code CLI agent the Qwen team ships alongside the model. Because the API is OpenAI-shaped, any OpenAI SDK works by changing the base URL and model name.

Call Qwen3-Coder via an OpenAI-compatible API

import OpenAI from "openai"

// Point the OpenAI SDK at any Qwen3-Coder endpoint
// (DashScope, a third-party provider, or your own vLLM/SGLang server)
const client = new OpenAI({
  apiKey: process.env.QWEN_API_KEY,
  baseURL: "https://your-qwen3-coder-endpoint/v1",
})

const response = await client.chat.completions.create({
  // Model id is provider-specific; a vLLM server defaults to the Hugging Face repo id.
  model: "qwen3-coder-480b-a35b-instruct",
  messages: [
    { role: "system", content: "You are a coding agent. Return a unified diff." },
    { role: "user", content: "Add input validation to the createUser handler." },
  ],
  max_tokens: 4096,
  temperature: 0,
})

console.log(response.choices[0].message.content)

For an agentic command-line workflow, the Qwen team maintains qwen-code, a terminal agent forked from Gemini CLI and adapted to the Qwen3-Coder prompt format and tool-calling conventions. It runs the model in a read-edit-test loop against a local repository.

Self-host the open weights with vLLM

# 480B model: needs multiple GPUs (about 960 GB of weights in BF16), FP8 recommended; 30B-A3B fits far smaller
pip install vllm

vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 262144

# For the small variant (3.3B active), a single high-memory GPU is enough:
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --max-model-len 262144

# Both expose an OpenAI-compatible /v1/chat/completions endpoint on :8000
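Once a server is up, a client only needs the URL and a JSON body. The sketch below builds that request for the small variant. It assumes the server listens on localhost:8000 and was started without a custom served model name, so the model id is the Hugging Face repo id; `buildLocalChatRequest` is a hypothetical helper.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface LocalChatRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Build the URL, headers, and JSON body for a chat completion against a
// local vLLM server (ASSUMPTION: default port 8000, default model name).
function buildLocalChatRequest(messages: ChatMessage[], maxTokens = 1024): LocalChatRequest {
  return {
    url: "http://localhost:8000/v1/chat/completions",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "Qwen/Qwen3-Coder-30B-A3B-Instruct",
      messages,
      max_tokens: maxTokens,
      temperature: 0,
    }),
  };
}

const req = buildLocalChatRequest([
  { role: "user", content: "Write a function that reverses a linked list." },
]);
console.log(req.url, req.body);
```

To send it, pass the pieces to `fetch(req.url, { method: "POST", headers: req.headers, body: req.body })` and read `choices[0].message.content` from the JSON response.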

256K vs 1M at serve time

The commands above serve the native 256K window. To reach 1M tokens, enable YaRN in the serving config (a rope_scaling override). Do this only for workloads that exceed 256K, since YaRN can slightly reduce quality on shorter prompts.
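As a rough guide to what such an override contains, the sketch below builds a YaRN entry using the key names from the Hugging Face `rope_scaling` config convention. The rounded-up integer factor is an assumption for illustration, and how the override is passed to the server varies by engine and version, so confirm both against the model card and your serving engine's documentation.

```typescript
// Shape of a Hugging Face-style rope_scaling entry for YaRN.
interface YarnRopeScaling {
  rope_type: "yarn";
  factor: number;
  original_max_position_embeddings: number;
}

// Derive a YaRN override that stretches the native window to cover the target.
// ASSUMPTION: the factor is the target/native ratio rounded up to an integer.
function yarnOverride(nativeContext: number, targetContext: number): YarnRopeScaling {
  if (targetContext <= nativeContext) {
    throw new RangeError("YaRN is only needed past the native window");
  }
  return {
    rope_type: "yarn",
    factor: Math.ceil(targetContext / nativeContext),
    original_max_position_embeddings: nativeContext,
  };
}

console.log(JSON.stringify(yarnOverride(262_144, 1_000_000)));
// {"rope_type":"yarn","factor":4,"original_max_position_embeddings":262144}
```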

API Cost and Serving

Because Qwen3-Coder is Apache 2.0 open-weight, there is no single official price. Self-hosting costs only your GPU compute. Hosted providers set their own per-token rates. For a sense of scale, other open coding models on their official APIs run roughly $0.20-0.60 per 1M input tokens (GLM-4.5-Air at $0.20/M input, GLM-4.6 at $0.60/M input, MiniMax-M2 at $0.30/M input).
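Per-million rates translate into session cost once you account for agent loops resending the conversation on every turn. The sketch below assumes no prompt caching and linear context growth; the turn count and token sizes are made-up inputs, and only the $0.20 and $0.60 rates come from the range quoted above.

```typescript
// Total input tokens billed across an agent session in which every turn
// resends the whole conversation.
// ASSUMPTIONS: no prompt caching, context grows by a fixed amount per turn.
function sessionInputTokens(turns: number, basePromptTokens: number, growthPerTurn: number): number {
  let total = 0;
  for (let turn = 0; turn < turns; turn++) {
    total += basePromptTokens + turn * growthPerTurn;
  }
  return total;
}

// Input-token cost in USD at a given per-million-token rate.
function inputCostUsd(tokens: number, usdPerMillionTokens: number): number {
  return (tokens / 1_000_000) * usdPerMillionTokens;
}

// Hypothetical session: 20 turns, 10K-token starting prompt, 2K new tokens per turn.
const sessionTokens = sessionInputTokens(20, 10_000, 2_000);
console.log(sessionTokens);                                // 580000
console.log(inputCostUsd(sessionTokens, 0.2).toFixed(3)); // "0.116"
console.log(inputCostUsd(sessionTokens, 0.6).toFixed(3)); // "0.348"
```

The total grows quadratically with turn count, which is why long agentic sessions are dominated by input-token cost.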

The dominant per-token cost drivers for the 480B model are the 35B active parameter count and the KV cache for long contexts, not the 480B total, although the full 480B must still fit in GPU memory. The 30B-A3B variant activates 3.3B, roughly a tenth as many parameters per token, which makes it far cheaper to serve and a better fit for high-volume, latency-sensitive endpoints.
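The two quantities behave differently, which a little arithmetic makes clear. The sketch below uses the parameter counts from the spec table; treating the active-parameter ratio as a proxy for per-token compute is a simplification that ignores attention and memory-bandwidth effects.

```typescript
interface Variant {
  totalParamsB: number;  // billions of parameters held in memory
  activeParamsB: number; // billions of parameters used per token
}

// Figures from the spec table.
const large: Variant = { totalParamsB: 480, activeParamsB: 35 };
const small: Variant = { totalParamsB: 30.5, activeParamsB: 3.3 };

// Ratio of active parameters: a rough proxy for relative per-token compute.
function activeRatio(a: Variant, b: Variant): number {
  return a.activeParamsB / b.activeParamsB;
}

// Weight memory in GB (10^9 bytes): billions of parameters times bytes per parameter.
function weightMemoryGb(v: Variant, bytesPerParam: number): number {
  return v.totalParamsB * bytesPerParam;
}

console.log(activeRatio(large, small).toFixed(1)); // "10.6"
console.log(weightMemoryGb(large, 2));             // 960 (BF16)
console.log(weightMemoryGb(large, 1));             // 480 (FP8)
console.log(weightMemoryGb(small, 2));             // 61 (BF16)
```

These figures cover weights only; the KV cache and activations need additional memory on top.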

Morph serves Qwen-family models on its production fleet. The qwen35-397b checkpoint runs at ~120 tok/s on Morph's fleet behind an OpenAI-compatible router at api.morphllm.com. The router classifies each prompt's difficulty in ~430ms into four tiers and sends only the hard coding turns to a large Qwen3-Coder-class model, routing the rest to cheaper models. That cuts API spend 40-70% without changing your client code.
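The savings range follows from how much traffic is easy and how large the price gap is. The sketch below is a back-of-the-envelope model, not Morph's cost math: it assumes every turn uses the same number of tokens, and the prices in the example are hypothetical.

```typescript
// Fraction of spend saved by routing, relative to sending every turn to the
// expensive model. ASSUMPTION: all turns consume the same number of tokens.
function routingSavings(hardShare: number, cheapPrice: number, expensivePrice: number): number {
  const blended = hardShare * expensivePrice + (1 - hardShare) * cheapPrice;
  return 1 - blended / expensivePrice;
}

// Hypothetical mix: 30% hard turns, $0.30 vs $3.00 per 1M input tokens.
console.log(routingSavings(0.3, 0.3, 3.0).toFixed(2)); // "0.63"

// With 60% hard turns at the same prices the saving shrinks.
console.log(routingSavings(0.6, 0.3, 3.0).toFixed(2)); // "0.36"
```

The saving is bounded by the share of turns that can safely go to the cheaper model, so classifier accuracy matters as much as the price gap.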

qwen35-397b on Morph's fleet: ~120 tok/s
Router difficulty classification: ~430ms
API cost savings with routing: 40-70%
Active params, 30B-A3B variant: 3.3B

For the full routing mechanism and cost math, see What Is an LLM Router and LLM Cost Optimization.

Frequently Asked Questions

How many parameters is Qwen3-Coder-480B?

Qwen3-Coder-480B-A35B-Instruct has 480B total parameters with 35B activated per token. It is a Mixture-of-Experts model with 160 total experts (8 active per token) across 62 layers. The smaller sibling, Qwen3-Coder-30B-A3B-Instruct, has 30.5B total and 3.3B active.

What is the Qwen3-Coder context window?

262,144 tokens (256K) native, extendable to 1M tokens with YaRN. Both the 480B and 30B variants share this. The 256K window covers most repository-scale work; the 1M extension targets very large monorepos but requires enabling YaRN in the serving config.

Is Qwen3-Coder free and open source?

Yes. Both variants are Apache 2.0, which permits commercial use, modification, and self-hosting with no per-token license fee. The weights are published on Hugging Face by the Qwen Team at Alibaba. You pay only for the compute that runs them.

How do I use Qwen3-Coder?

Through an OpenAI-compatible chat completions endpoint: Alibaba's DashScope, a third-party inference provider, or your own vLLM/SGLang server running the Hugging Face weights. The Qwen team also ships qwen-code, a terminal agent forked from Gemini CLI. Point any OpenAI SDK at the base URL and pass the model name.

Qwen3-Coder vs DeepSeek vs Claude for coding?

Qwen3-Coder-480B is reported by the Qwen team as state-of-the-art among open models on SWE-bench Verified without test-time scaling. DeepSeek-V3.2-Exp scores 67.8 on SWE-bench Verified. Both are open-weight and self-hostable but trail closed frontier models like Claude Opus on the hardest tasks. The cost-correct pattern is routing the hardest turns to a frontier model and the rest to a Qwen3-Coder-class open model.

What is the difference between Qwen3-Coder-480B and the 30B variant?

The 480B-A35B model uses 160 experts (8 active) across 62 layers with 35B active parameters. The 30B-A3B model uses 128 experts (8 active) across 48 layers with 3.3B active. Both keep the 256K native context and Apache 2.0 license. The 480B model is stronger on hard tasks; the 30B model activates roughly a tenth as many parameters per token, so it is far cheaper to serve.

How much does the Qwen3-Coder API cost?

There is no single official rate because the weights are Apache 2.0 open. Self-hosting costs only GPU compute. Hosted providers price per token; comparable open coding models run roughly $0.20-0.60 per 1M input tokens. Morph serves Qwen-family models behind a router that cuts spend 40-70% by sending easy turns to cheaper models.

Route Hard Coding Turns to Qwen3-Coder, Cheap Turns Elsewhere

Morph serves Qwen-family models on its production fleet behind an OpenAI-compatible router at api.morphllm.com. Classify prompt difficulty in ~430ms, send hard coding turns to a Qwen3-Coder-class model and easy turns to cheaper models. 40-70% API cost savings, no client rewrite.