GPT-OSS (120B and 20B): OpenAI's Open-Weight Models, Specs and Benchmarks

OpenAI released gpt-oss-120b and gpt-oss-20b on August 5, 2025 under Apache 2.0. Both are Mixture-of-Experts models: 116.83B total / 5.13B active and 20.91B total / 3.61B active, each with a 131,072-token context window and three reasoning-effort levels. gpt-oss-120b runs on a single 80GB GPU; gpt-oss-20b fits in 16GB.

June 18, 2026 · 2 min read
GPT-OSS is OpenAI's first open-weight release since GPT-2: gpt-oss-120b and gpt-oss-20b, shipped August 5, 2025 under Apache 2.0. Both are Mixture-of-Experts models with a 131,072-token context and three reasoning-effort levels. gpt-oss-120b has 116.83B total / 5.13B active parameters and runs on a single 80GB GPU; gpt-oss-20b has 20.91B total / 3.61B active and fits within 16GB. Morph serves open models like these on its fleet.

License (commercial use, no cap): Apache 2.0
120b total / active params: 116.83B / 5.13B
20b total / active params: 20.91B / 3.61B
Context window (both models): 128k

What GPT-OSS Is

GPT-OSS is a pair of open-weight language models from OpenAI, released August 5, 2025: gpt-oss-120b and gpt-oss-20b. It is OpenAI's first open-weight release since GPT-2 in 2019. The weights are public, downloadable, and licensed permissively, which is a different posture from the closed GPT-4 and GPT-5 API models.

Both models are Mixture-of-Experts (MoE) transformers. In an MoE model, a router selects a small subset of expert sub-networks for each token, so the active parameter count (and therefore the compute per token) is far smaller than the total. This is why gpt-oss-120b, with 116.83B total parameters, spends about as much compute per token as a dense model of roughly 5B parameters.
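To make the routing step concrete, here is a minimal pure-Python sketch of top-k expert selection. It is an illustration under stated assumptions, not the gpt-oss implementation: the toy scalar experts, the random router logits, and the choice to softmax-normalize only the selected logits are all assumptions made for the example.

```python
import math
import random


def top_k_route(router_logits, k=4):
    """Pick the k highest-scoring experts and normalize their weights."""
    # Indices of the k largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (max subtracted for stability).
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, router_logits, k=4):
    """Weighted sum of the k selected experts; the other experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Toy setup (hypothetical): 32 experts, each just scales its input.
random.seed(0)
experts = [lambda x, s=s: s * x for s in range(32)]
logits = [random.gauss(0, 1) for _ in range(32)]

routes = top_k_route(logits, k=4)
print(routes)                           # four (expert_index, weight) pairs
print(moe_layer(1.0, experts, logits))  # output mixes only those four experts
```

Only 4 of the 32 toy experts execute per call, which is the property that keeps active compute small regardless of how many experts exist in total.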

The pair targets two deployment points. gpt-oss-120b is the higher-accuracy model that still fits on a single datacenter GPU. gpt-oss-20b is the smaller model built to run on consumer hardware and laptops. Same architecture family, same context window, same reasoning-effort controls, different size and hardware footprint.

Open-weight, not fully open-source

OpenAI calls these open-weight models. The Apache 2.0 license covers the released weights and inference code. The training dataset and full training pipeline are not published, so you can run and fine-tune the model but you cannot reproduce its training from scratch.

Is GPT-OSS Really Open Source?

In licensing terms, yes: the weights ship under Apache 2.0. Apache 2.0 is one of the most permissive widely used open-source licenses. It grants the right to use, modify, distribute, and sublicense the work, including in closed-source commercial products, with an explicit patent grant and no copyleft requirement. There is no revenue cap and no acceptable-use clause embedded in the license text.

This is a stronger grant than the license on some other open models. Meta's Llama 3.1 models ship under the Llama 3.1 Community License, which adds an acceptable-use policy and a clause requiring a separate license once a product crosses 700 million monthly active users. Apache 2.0 has neither restriction.

The practical consequence: you can fork gpt-oss, fine-tune it on private data, serve it to paying customers, and never send a request to OpenAI. The tradeoff is that OpenAI publishes weights and inference code but not the training corpus, so "open source" here means open weights plus a permissive license, not a fully reproducible training run.

For a broader look at permissively-licensed coding models, see the best open-source coding model in 2026.

Parameters and Architecture

Both models are MoE transformers with top-4 expert routing. The difference is scale: gpt-oss-120b has more experts and more layers, so it carries more total knowledge while keeping active compute small.

| Spec | gpt-oss-120b | gpt-oss-20b |
| --- | --- | --- |
| Total parameters | 116.83B | 20.91B |
| Active parameters / token | 5.13B | 3.61B |
| Total experts | 128 | 32 |
| Active experts / token | top-4 | top-4 |
| Layers | 36 | 24 |
| Context window | 131,072 (128k) | 131,072 (128k) |

Active parameters drive inference cost. gpt-oss-120b activates 5.13B parameters per token and gpt-oss-20b activates 3.61B. A dense 116B model would need roughly 23x more compute per token than gpt-oss-120b (116.83B / 5.13B ≈ 22.8) on the same hardware, which is the entire point of the MoE design: frontier-adjacent knowledge at small-model throughput.
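The ratios behind that comparison can be checked with a few lines of arithmetic. This sketch uses only the parameter counts quoted above. Treating the total-to-active ratio as a proxy for per-token compute is an assumption: it ignores attention cost and memory bandwidth, so it is not a wall-clock speed prediction.

```python
# Parameter counts in billions, as quoted in the spec table.
TOTAL_B = {"gpt-oss-120b": 116.83, "gpt-oss-20b": 20.91}
ACTIVE_B = {"gpt-oss-120b": 5.13, "gpt-oss-20b": 3.61}


def active_fraction(model):
    """Share of the weights that participate in each token."""
    return ACTIVE_B[model] / TOTAL_B[model]


def dense_compute_ratio(model):
    """How many times more parameters a same-size dense model would touch per token."""
    return TOTAL_B[model] / ACTIVE_B[model]


for name in TOTAL_B:
    print(f"{name}: {active_fraction(name):.1%} active, "
          f"{dense_compute_ratio(name):.1f}x dense-equivalent")
# gpt-oss-120b: 4.4% active, 22.8x dense-equivalent
# gpt-oss-20b: 17.3% active, 5.8x dense-equivalent
```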

Both models use a 131,072-token context window, large enough for whole-file edits, multi-file diffs, and long agent transcripts. For how context length affects cost and recall, see the LLM context window guide.

Reasoning-Effort Levels

Both gpt-oss models expose three reasoning-effort levels: low, medium, and high. You set the level in the system prompt, so a single checkpoint serves both fast cheap turns and slow careful turns without swapping models.

Higher effort spends more tokens on chain-of-thought before the final answer. That raises accuracy on hard math and reasoning tasks but increases latency and output token cost. The published headline benchmarks (for example AIME 2025 97.9% on gpt-oss-120b) use high reasoning effort with tools. Low effort returns faster and cheaper at the cost of some accuracy.

Low effort

Minimal chain-of-thought. Fastest, cheapest output. Best for boilerplate, simple edits, and high-throughput batch work where a few accuracy points do not matter.

Medium effort

Balanced reasoning budget. The default for general use where you want solid accuracy without paying for full deliberation on every turn.

High effort

Maximum chain-of-thought. Produces the published benchmark scores. Best for hard math, complex debugging, and multi-step reasoning where accuracy dominates cost.
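Because the level lives in the system prompt, switching effort per request is a matter of building a different message list. The helper below is a small sketch: the function name and the validation are hypothetical, and the "Reasoning: <level>" string follows the system-prompt format used in the Transformers example in this article.

```python
VALID_EFFORTS = ("low", "medium", "high")


def build_messages(user_prompt, effort="medium"):
    """Return a chat message list with the reasoning effort set in the system prompt."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    return [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": user_prompt},
    ]


# Same checkpoint, two budgets: a cheap rename and a careful debugging turn.
quick = build_messages("Rename the variable tmp to buffer.", effort="low")
careful = build_messages("Find the off-by-one error in this loop.", effort="high")
print(quick[0]["content"])    # Reasoning: low
print(careful[0]["content"])  # Reasoning: high
```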

Benchmarks

The numbers below are from OpenAI's gpt-oss technical report. Math and agentic scores use high reasoning effort with tools where noted. The 120b model leads on knowledge-heavy benchmarks; the gap narrows sharply on math and coding.

| Benchmark | gpt-oss-120b | gpt-oss-20b |
| --- | --- | --- |
| MMLU | 90.0% | 85.3% |
| GPQA Diamond | 80.9% | 74.2% |
| SWE-Bench Verified | 62.4% | 60.7% |
| AIME 2024 (tools, high) | 96.6% | 96.0% |
| AIME 2025 (tools, high) | 97.9% | 98.7% |
| Codeforces Elo (high) | 2622 | 2516 |
| HealthBench | 57.6% | 42.5% |
| Tau-Bench Retail | 67.8% | 54.8% |
| HLE (tools, high) | 19.0% | 17.3% |

Two patterns stand out. First, on AIME 2025 the 20b model (98.7%) slightly edges the 120b (97.9%), which shows that for well-specified math with tools, the smaller model is not a meaningful step down. Second, the gap widens on knowledge and agentic tasks: HealthBench (57.6% vs 42.5%) and Tau-Bench Retail (67.8% vs 54.8%) favor the 120b by double digits, plausibly because broad world knowledge scales with total parameters.
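The two patterns are easy to verify by computing the per-benchmark gap from the table. This sketch uses only the percentage scores listed above. Codeforces Elo is left out because it is a rating rather than a percentage.

```python
# (gpt-oss-120b, gpt-oss-20b) scores in percent, copied from the table above.
SCORES = {
    "MMLU": (90.0, 85.3),
    "GPQA Diamond": (80.9, 74.2),
    "SWE-Bench Verified": (62.4, 60.7),
    "AIME 2024": (96.6, 96.0),
    "AIME 2025": (97.9, 98.7),
    "HealthBench": (57.6, 42.5),
    "Tau-Bench Retail": (67.8, 54.8),
    "HLE": (19.0, 17.3),
}

# Positive gap means the 120b leads; negative means the 20b leads.
gaps = {name: round(big - small, 1) for name, (big, small) in SCORES.items()}
double_digit = [name for name, gap in gaps.items() if gap >= 10]
smaller_wins = [name for name, gap in gaps.items() if gap < 0]

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {gap:+.1f}")
print("double-digit gaps:", double_digit)
print("20b ahead on:", smaller_wins)
```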

Tradeoff: knowledge breadth vs hardware

The 20b is not a free win. On HealthBench it scores 42.5% against the 120b's 57.6%, a 15-point gap. If your workload is knowledge-heavy (medical, legal, broad factual recall), the 120b is worth the larger GPU. If your workload is math, code, or tool use, the 20b is close enough to run on a laptop.

gpt-oss-120b vs gpt-oss-20b

The choice between the two models is a hardware-and-knowledge tradeoff. The 120b needs a datacenter GPU and leads on broad knowledge. The 20b runs on consumer hardware and stays close on math and code.

| Attribute | gpt-oss-120b | gpt-oss-20b |
| --- | --- | --- |
| Total parameters | 116.83B | 20.91B |
| Active parameters | 5.13B | 3.61B |
| Context window | 131,072 (128k) | 131,072 (128k) |
| Minimum hardware | Single 80GB GPU | 16GB memory |
| Example GPU | H100 / MI300X | Consumer GPU / laptop |
| MMLU | 90.0% | 85.3% |
| SWE-Bench Verified | 62.4% | 60.7% |
| License | Apache 2.0 | Apache 2.0 |

Pick the 120b when you have an 80GB GPU and want the strongest knowledge and agentic performance. Pick the 20b when you want to run on a single consumer card or laptop and your workload is dominated by math, code, and tool use, where it stays within a few points of the larger model.
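That rule of thumb can be written down as a small selection function. This is a sketch of the guidance above, not an official sizing tool: the function name and the boolean workload flag are hypothetical, and the 80GB and 16GB thresholds are the minimums quoted in the comparison table.

```python
def pick_model(memory_gb, knowledge_heavy):
    """Choose a gpt-oss checkpoint from available memory and workload type.

    memory_gb: GPU (or unified) memory available on a single device.
    knowledge_heavy: True for medical, legal, or broad factual workloads.
    Returns a model name, or None if neither checkpoint fits.
    """
    if memory_gb >= 80 and knowledge_heavy:
        return "gpt-oss-120b"   # worth the larger GPU for knowledge breadth
    if memory_gb >= 16:
        return "gpt-oss-20b"    # close on math, code, and tool use
    return None


print(pick_model(80, knowledge_heavy=True))    # gpt-oss-120b
print(pick_model(24, knowledge_heavy=True))    # gpt-oss-20b (120b does not fit)
print(pick_model(80, knowledge_heavy=False))   # gpt-oss-20b
print(pick_model(8, knowledge_heavy=False))    # None
```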

Hardware to Run Each

OpenAI applies MXFP4 quantization to the Mixture-of-Experts weights, which is the reason both models fit on far less hardware than their total parameter counts suggest. MXFP4 stores those weights, which make up the bulk of each model, at about 4.25 bits per parameter (4-bit values plus a shared scale per block), cutting their memory by nearly 4x versus FP16.
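A back-of-the-envelope estimate shows why the quantized checkpoints fit. The sketch below assumes 4.25 bits per MXFP4 weight (a 4-bit value plus an 8-bit scale shared across a block of 32) and 16 bits for everything else. The 95% MoE share is a placeholder assumption rather than a published figure, and the estimate covers weights only, not KV cache or activations.

```python
def weight_gb(total_params_b, moe_fraction, moe_bits=4.25, other_bits=16.0):
    """Estimate weight memory in GB (1e9 bytes) for a partly quantized model.

    total_params_b: total parameters, in billions.
    moe_fraction: share of parameters stored in MXFP4 (assumed, not published here).
    """
    moe_gb = total_params_b * moe_fraction * moe_bits / 8
    other_gb = total_params_b * (1 - moe_fraction) * other_bits / 8
    return moe_gb + other_gb


ASSUMED_MOE_FRACTION = 0.95  # placeholder assumption for illustration

for name, total_b, budget_gb in [("gpt-oss-120b", 116.83, 80),
                                 ("gpt-oss-20b", 20.91, 16)]:
    fp16 = weight_gb(total_b, moe_fraction=0.0)  # every weight at 16 bits
    quant = weight_gb(total_b, ASSUMED_MOE_FRACTION)
    print(f"{name}: FP16 ~{fp16:.0f} GB, MXFP4 MoE ~{quant:.0f} GB, "
          f"budget {budget_gb} GB")
```

Under these assumptions the 120b lands under 80GB and the 20b under 16GB, while the all-FP16 figures (about 234GB and 42GB) would not fit.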

Single GPU for gpt-oss-120b: 80GB
Memory for gpt-oss-20b: 16GB
MoE weight quantization: MXFP4
Context window served: 128k

gpt-oss-120b runs on a single 80GB GPU (NVIDIA H100 or AMD MI300X). That is a one-card deployment, not the multi-GPU rig a 116B dense model would require. gpt-oss-20b runs within 16GB of memory, which covers high-end consumer GPUs and laptops with enough unified memory, so you can run it without renting a datacenter instance.

For the broader picture of inference cost and throughput, see AI inference and LLM inference optimization.

How to Run GPT-OSS Locally

Both checkpoints load directly into the three common open-weight runtimes. Ollama is the fastest path for a local test. vLLM gives a production OpenAI-compatible server. Transformers gives maximum control for custom pipelines and fine-tuning.

Ollama: one command to run gpt-oss

# gpt-oss-20b runs within 16GB memory (consumer GPU / laptop)
ollama run gpt-oss:20b

# gpt-oss-120b needs a single 80GB GPU
ollama run gpt-oss:120b

vLLM: OpenAI-compatible server

# Serve gpt-oss-20b as an OpenAI-compatible endpoint
vllm serve openai/gpt-oss-20b

# Then call it like any OpenAI chat endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Refactor this function"}],
    "reasoning_effort": "medium"
  }'

Hugging Face Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "Prove the sum of the first n odd numbers is n^2."},
]
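# add_generation_prompt=True appends the assistant header so the model answers
# instead of continuing the user turn; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))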

The reasoning-effort level is set in the system prompt or, in vLLM, via the reasoning_effort field. The same checkpoint handles low, medium, and high without reloading.
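The curl call above translates directly to Python. This is a minimal standard-library sketch that assumes the vLLM server from the earlier command is listening on localhost:8000. The helper names are hypothetical, and the response parsing assumes the usual OpenAI chat-completions response shape.

```python
import json
import urllib.request


def build_chat_request(prompt, effort="medium",
                       model="openai/gpt-oss-20b",
                       base_url="http://localhost:8000/v1"):
    """Build (but do not send) a chat-completions request with a reasoning effort."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, effort="medium"):
    """Send the request and return the assistant text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, effort)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, once `vllm serve openai/gpt-oss-20b` is up:
#   print(chat("Refactor this function", effort="low"))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call happens.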

GPT-OSS vs Other Open Models

gpt-oss is one of several strong open-weight models released in 2025. Its distinguishing trait is the small active-parameter count: 5.13B active on the 120b means it serves cheaply relative to models with 30B or more active parameters, while staying competitive on coding benchmarks.

| Model | SWE-Bench Verified | Active params | License |
| --- | --- | --- | --- |
| gpt-oss-120b | 62.4% | 5.13B | Apache 2.0 |
| gpt-oss-20b | 60.7% | 3.61B | Apache 2.0 |
| Kimi K2 Thinking | 71.3% (tools) | 32B | Modified MIT |
| DeepSeek-V3.2-Exp | 67.8% | MoE (685B total) | MIT |
| MiniMax-M2 | 69.4% | 10B | Modified MIT |

On raw SWE-Bench Verified, larger open models lead: Kimi K2 Thinking at 71.3% (with tools), MiniMax-M2 at 69.4%, and DeepSeek-V3.2-Exp at 67.8%. gpt-oss-120b sits below them at 62.4% but activates 5.13B parameters versus 32B for Kimi K2 and 10B for MiniMax-M2, so it is cheaper per token to serve. On GPQA Diamond, gpt-oss-120b (80.9%) edges DeepSeek-V3.2-Exp (79.9%).

The takeaway: gpt-oss-120b is not the top open model on coding accuracy, but it is one of the most efficient. If you are choosing a coding model on a cost-per-quality basis, compare it against the field in the best open-source coding model in 2026 and the best AI model for coding.

Serving and Routing GPT-OSS

An open-weight model is only useful once it is served. You have two options: self-host the weights on your own GPUs, or access the model through an OpenAI-compatible router that serves it alongside other open models. The second option avoids the operational cost of running and scaling GPU nodes.

Morph builds inference infrastructure for AI coding agents. It serves open models on its fleet, including glm51-754b, qwen35-397b (~120 tok/s), qwen36-27b, and minimax27-230b (~140 tok/s), all through one OpenAI-compatible API at api.morphllm.com/v1. The same surface that serves these models can route to a cheap open model for easy turns and a frontier model for hard ones.

This is where routing pays off with open weights. A small open model like gpt-oss-20b handles boilerplate, renames, and simple edits at a fraction of the cost; only hard turns reach a larger model. Morph's LLM router classifies prompt difficulty in ~430ms (~$0.001 per classification) and routes accordingly, cutting API cost 40-70% with under 2% quality loss on hard tasks. For the gateway pattern, see LLM gateway.
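The routing pattern itself is simple to sketch. The code below is an illustration only and does not reflect how Morph's router works: the keyword heuristic stands in for a real difficulty classifier, and the tier-to-model mapping (including the "frontier-model" placeholder name) is an assumption made for the example.

```python
# Hypothetical mapping from difficulty tier to the model that serves it.
TIER_TO_MODEL = {
    "easy": "gpt-oss-20b",
    "medium": "gpt-oss-120b",
    "hard": "frontier-model",   # placeholder name, not a real model id
}

HARD_HINTS = ("prove", "debug", "race condition", "architecture")


def classify_difficulty(prompt):
    """Placeholder heuristic; a production router would use a trained classifier."""
    text = prompt.lower()
    if any(hint in text for hint in HARD_HINTS):
        return "hard"
    if len(text.split()) > 40:
        return "medium"
    return "easy"


def route(prompt):
    """Return the model name that should handle this prompt."""
    return TIER_TO_MODEL[classify_difficulty(prompt)]


print(route("Rename variable foo to bar"))                  # gpt-oss-20b
print(route("Debug this race condition in the scheduler"))  # frontier-model
```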

Morph measured: routing economics on the fleet

On Morph's fleet, the router classifies difficulty in ~430ms at ~$0.001 per classification and sends roughly 60% of coding-session prompts (the easy tier) to a small cheap model. Routing the easy and medium turns to open models while reserving a frontier model for the hard 15% cuts API cost 40-70% with under 2% quality loss on hard tasks.
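The shape of that saving can be reproduced with simple arithmetic. In the sketch below, the 60% easy and 15% hard shares and the ~$0.001 classification cost come from the figures above, and the 25% medium share is what remains. The per-turn prices are hypothetical placeholders, so the resulting percentage illustrates the calculation and is not a measurement.

```python
# Share of prompts per tier: easy and hard from the text, medium is the remainder.
SHARES = {"easy": 0.60, "medium": 0.25, "hard": 0.15}

# Hypothetical per-turn prices in dollars (assumptions for illustration only).
OPEN_MODEL_PRICE = 0.003
FRONTIER_PRICE = 0.010
CLASSIFY_COST = 0.001  # ~$0.001 per classification, as quoted above


def routed_cost_per_turn(shares, classify_cost=CLASSIFY_COST):
    """Average cost per turn when easy and medium turns go to an open model."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1")
    price = {"easy": OPEN_MODEL_PRICE, "medium": OPEN_MODEL_PRICE,
             "hard": FRONTIER_PRICE}
    return classify_cost + sum(shares[t] * price[t] for t in shares)


baseline = FRONTIER_PRICE                 # every turn sent to the frontier model
routed = routed_cost_per_turn(SHARES)
savings = 1 - routed / baseline
print(f"routed: ${routed:.5f} per turn, savings: {savings:.1%}")
```

With these placeholder prices the classification fee is a visible part of the routed cost, which is why the per-classification price matters as much as the model prices when the cheap tier is already inexpensive.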

Frequently Asked Questions

Is gpt-oss really open source?

gpt-oss-120b and gpt-oss-20b are released under Apache 2.0, a permissive open-source license. You can download the weights, run them on your own hardware, fine-tune them, redistribute them, and ship them in a commercial product without paying OpenAI or asking permission. There is no revenue cap or acceptable-use clause in the license, unlike Meta's Llama Community License. OpenAI calls them open-weight: the weights are public, but the training data and full training pipeline are not.

What is the difference between gpt-oss-120b and gpt-oss-20b?

gpt-oss-120b has 116.83B total parameters and 5.13B active per token (128 experts, top-4, 36 layers). gpt-oss-20b has 20.91B total and 3.61B active (32 experts, top-4, 24 layers). Both share a 131,072-token context and three reasoning-effort levels. The 120b leads on knowledge tasks (90.0% vs 85.3% MMLU) and needs an 80GB GPU; the 20b runs within 16GB and stays close on math and code.

How many parameters and active parameters does gpt-oss have?

gpt-oss-120b: 116.83B total, 5.13B active per token. gpt-oss-20b: 20.91B total, 3.61B active per token. Both are Mixture-of-Experts models, so only a small fraction of the total weights run on each token, which is what makes them fast for their size.

How do I run gpt-oss locally?

Use Ollama (ollama run gpt-oss:20b or ollama run gpt-oss:120b) for the simplest path, vLLM (vllm serve openai/gpt-oss-20b) for a production OpenAI-compatible server, or Hugging Face Transformers for custom pipelines. MXFP4 quantization on the MoE weights means gpt-oss-20b fits in 16GB and gpt-oss-120b fits on a single 80GB GPU.

How do gpt-oss benchmarks compare to other open models?

gpt-oss-120b scores 62.4% on SWE-Bench Verified and 80.9% on GPQA Diamond. Larger open models lead on coding accuracy (Kimi K2 Thinking 71.3% with tools, MiniMax-M2 69.4%, DeepSeek-V3.2-Exp 67.8%), but gpt-oss-120b activates only 5.13B parameters, so it is cheaper per token to serve. On GPQA Diamond it edges DeepSeek-V3.2-Exp (80.9% vs 79.9%).

What hardware do I need to run gpt-oss-120b?

A single 80GB GPU such as an NVIDIA H100 or AMD MI300X, thanks to MXFP4 quantization on the MoE weights. That is a one-card deployment rather than the multi-GPU setup a 116B dense model would need. gpt-oss-20b is lighter and runs within 16GB.

What are the gpt-oss reasoning-effort levels?

Low, medium, and high, set in the system prompt. High effort spends more tokens on chain-of-thought and produces the published benchmark scores (for example AIME 2025 97.9% on gpt-oss-120b with tools); low effort returns faster and cheaper at a small accuracy cost. One checkpoint covers all three.

Serve and Route Open Models Like GPT-OSS Through One API

Morph serves open models on its fleet through an OpenAI-compatible API at api.morphllm.com. Its router classifies prompt difficulty in ~430ms ($0.001 per classification) and routes easy turns to cheap open models, hard turns to frontier models. 40-70% API cost savings with under 2% quality loss on hard tasks.