Open Source LLMs (2026): What They Are, Top Models, and How to Run Them

An open source LLM is a large language model whose trained weights you can download, inspect, fine-tune, and run on your own hardware instead of only calling a vendor API. Most are open-weight, not strictly OSI open source: the weights ship but the training data does not. Leading families in 2026 include Qwen3, DeepSeek, Llama 4, GLM-4.6, Mistral, Kimi K2, gpt-oss, and MiniMax M2. You run them locally with Ollama, serve them with vLLM or SGLang, or reach them through an OpenAI-compatible router.

69.6%: Qwen3-Coder-480B on SWE-bench Verified

Apache-2.0: license for Qwen3, Qwen3-Coder, and gpt-oss

1T / 32B: Kimi K2 total / active parameters

128K-1M: context window range

What Is an Open Source LLM?

An open source LLM is a large language model whose trained weights are published for download. You can load them into your own serving stack, inspect the architecture, fine-tune on your data, and run inference on your own GPUs. With a closed model like GPT-5 or Claude, you only ever touch an API. The provider holds the weights, sets the rate limits, and bills per token.

This page is the broad overview: what these models are, how the licensing actually works, the major families, and how to run them. If you want a ranked answer to which open model is best for coding, see Best Open Source Coding Model (2026), which scores them head to head.

The term "open source LLM" is used loosely. Almost every model people call open source is more precisely open-weight. The next section draws the line exactly, because it changes what you are legally and practically allowed to do with the model.

Open Source vs Open Weight

The OSI Open Source AI Definition (OSAID) sets a strict bar. A model is open source only if it ships three things: the model parameters and weights, the complete training and inference source code, and sufficiently detailed information about the training data to build a substantially equivalent system.

Open-weight models release the trained parameters for download but typically withhold the training data and sometimes the training code. By the strict OSI definition, almost no leading "open" model qualifies as open source. The models that do meet the standard, like OLMo and Pythia, are not the ones topping benchmark leaderboards.

The practical takeaway: both open-weight and truly open-source models can be self-hosted, inspected, and fine-tuned. The difference is licensing freedom and how much of the training pipeline is disclosed. When you read "open source LLM" on a model card or a leaderboard, assume it means open-weight unless the page explicitly cites the OSI definition.

Property	OSI Open Source	Open Weight	Closed (API)
Weights downloadable	Yes	Yes	No
Training code published	Yes	Sometimes	No
Training data disclosed	Yes	Rarely	No
Self-hostable	Yes	Yes	No
Fine-tunable	Yes	Yes	Limited / no
Example	OLMo, Pythia	Qwen3, DeepSeek, Llama 4	GPT-5, Claude
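The table's logic can be written as a small classifier. This is a simplified sketch: the `ReleaseProfile` field names are illustrative assumptions rather than any official schema, and a real OSAID assessment also depends on license terms that three booleans cannot capture.

```typescript
// Classify a model release using the first three rows of the table above.
// Field names are hypothetical; they are not part of any official schema.
interface ReleaseProfile {
  weightsDownloadable: boolean;
  trainingCodePublished: boolean;
  trainingDataDisclosed: boolean;
}

type Openness = "OSI open source" | "open weight" | "closed (API)";

function classifyRelease(profile: ReleaseProfile): Openness {
  // Without downloadable weights there is nothing to self-host.
  if (!profile.weightsDownloadable) {
    return "closed (API)";
  }
  // The OSI bar needs weights, code, and training-data information together.
  if (profile.trainingCodePublished && profile.trainingDataDisclosed) {
    return "OSI open source";
  }
  return "open weight";
}

// A typical leading "open" release: weights and code, but no training data.
const typicalOpenWeight = classifyRelease({
  weightsDownloadable: true,
  trainingCodePublished: true,
  trainingDataDisclosed: false,
});
console.log(typicalOpenWeight); // "open weight"
```

Publishing the training code alone does not change the result, which matches the "Sometimes" entry in the open-weight column.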

Why almost nothing is strictly open source

Training data is the missing piece. Releasing trillions of tokens of web, code, and licensed text raises copyright and competitive concerns, so makers publish weights and a model card but not the corpus. That single omission is what drops most "open source" models down to open-weight under the OSI definition.

The Leading Open Model Families

The leading open-weight models in 2026 are predominantly Mixture-of-Experts: large total parameter counts with a small active slice firing per token. That architecture is why a model with hundreds of billions of total parameters can serve at reasonable cost. Below is the landscape, with the maker, latest size, context window, and license for each family.

Model	Maker	Size (total / active)	Context	License
Qwen3-Coder-480B	Alibaba	480B / 35B (MoE)	256K (1M via YaRN)	Apache-2.0
Qwen3-235B-A22B	Alibaba	235B / 22B (MoE)	256K (~1M via YaRN)	Apache-2.0
DeepSeek-V3.2	DeepSeek-AI	685B (MoE)	128K	MIT
DeepSeek-R1	DeepSeek-AI	671B / 37B (MoE)	128K	MIT
Llama 4 Maverick	Meta	400B / 17B (MoE)	1M	Llama 4 Community
GLM-4.6	Zhipu AI / Z.ai	357B (MoE)	200K	MIT
Kimi K2	Moonshot AI	1T / 32B (MoE)	128K	Modified MIT
Codestral 25.01	Mistral AI	Code specialist	256K	Mistral (MNPL)
gpt-oss-120b	OpenAI	117B / 5.1B (MoE)	128K	Apache-2.0
MiniMax-M2	MiniMax AI	230B / 10B (MoE)	Not stated (coding/agentic focus)	Modified MIT
Gemma 3 27B	Google	27B (dense)	128K	Gemma license
Cohere Command A	Cohere	111B (dense)	256K	CC-BY-NC
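To make the "small active slice" concrete, the sketch below computes the share of parameters that fire per token for the rows above that list both a total and an active count. The figures are copied from the table; the type and function names are illustrative.

```typescript
// Sizes in billions of parameters, transcribed from the table above.
interface MoeSize {
  model: string;
  totalB: number;
  activeB: number;
}

const moeSizes: MoeSize[] = [
  { model: "Qwen3-Coder-480B", totalB: 480, activeB: 35 },
  { model: "Qwen3-235B-A22B", totalB: 235, activeB: 22 },
  { model: "DeepSeek-R1", totalB: 671, activeB: 37 },
  { model: "Llama 4 Maverick", totalB: 400, activeB: 17 },
  { model: "Kimi K2", totalB: 1000, activeB: 32 },
  { model: "gpt-oss-120b", totalB: 117, activeB: 5.1 },
  { model: "MiniMax-M2", totalB: 230, activeB: 10 },
];

// Fraction of the model's parameters used for each generated token.
function activeFraction(size: MoeSize): number {
  return size.activeB / size.totalB;
}

for (const size of moeSizes) {
  const pct = (activeFraction(size) * 100).toFixed(1);
  console.log(`${size.model}: ${pct}% of parameters active per token`);
}
```

Every model in the list activates under 10% of its parameters per token, which is why per-token compute tracks the active count rather than the headline size.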

Three reference points anchor the leaderboard. On SWE-bench Verified, Qwen3-Coder-480B scores 69.6% and MiniMax-M2 scores 69.4%, both state-of-the-art among open models on agentic coding. Kimi K2 reaches 65.8% single-attempt and 71.6% with multiple attempts. DeepSeek-R1 leads open reasoning with a Codeforces rating of 2029 (96.3rd percentile) and 65.9% Pass@1-COT on LiveCodeBench.

Coding leaders

Qwen3-Coder-480B (69.6% SWE-bench Verified) and MiniMax-M2 (69.4%) top open coding. Kimi K2 hits 71.6% with agentic multi-attempt. GLM-4.6 reports parity with Claude Sonnet on several coding leaderboards per Z.ai docs.

Reasoning leader

DeepSeek-R1: Codeforces 2029 (96.3rd percentile), 65.9% LiveCodeBench Pass@1-COT, 49.2% SWE-bench Verified. It is MIT licensed, which permits commercial use and distillation into other models.

Small / local leader

Gemma 3 27B is dense, has a 128K context window, and fits on a single GPU. It scores 48.8% on HumanEval (0-shot) and 65.6% on MBPP (3-shot). The Gemma license is custom, not OSI-approved.

Code-completion specialist

Mistral Codestral 25.01 leads fill-in-the-middle with 95.3% pass@1 on HumanEval FIM, 71.4% average on HumanEval, and a 256K context for whole-repo completion.

Why Teams Self-Host

Teams self-host open-weight LLMs to control cost, keep data private, customize via fine-tuning, and optimize inference for their own workloads instead of routing every request through a closed API. The decision is rarely about a single benchmark. It is about who holds the model and the data.

Cost at volume

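A self-hosted MoE model has a roughly fixed GPU cost regardless of token volume. The crossover comes where your daily token volume multiplied by the API's per-token price exceeds the daily cost of the hardware, which typically means millions of tokens per day. Past that point, owning the hardware beats per-token API pricing. Below it, an API is usually cheaper.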
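The crossover can be estimated with simple arithmetic. The sketch below is illustrative only: the GPU rate and API price are hypothetical placeholders, and it ignores operational overhead and whether the hardware can actually sustain the break-even throughput.

```typescript
// Hypothetical inputs; substitute your own quotes and measured throughput.
interface CostInputs {
  gpuCostPerDayUsd: number;            // fixed daily cost of the serving hardware
  apiPricePerMillionTokensUsd: number; // blended input/output API price
}

// Daily token volume at which the API bill equals the fixed GPU bill.
function breakEvenTokensPerDay(inputs: CostInputs): number {
  return (inputs.gpuCostPerDayUsd / inputs.apiPricePerMillionTokensUsd) * 1e6;
}

// Which option is cheaper at a given daily volume.
function cheaperOption(inputs: CostInputs, tokensPerDay: number): "self-host" | "api" {
  const apiCostPerDayUsd = (tokensPerDay / 1e6) * inputs.apiPricePerMillionTokensUsd;
  return apiCostPerDayUsd > inputs.gpuCostPerDayUsd ? "self-host" : "api";
}

// Assumed example: 8 GPUs at $2.50 per GPU-hour versus $2 per million API tokens.
const exampleCosts: CostInputs = {
  gpuCostPerDayUsd: 8 * 2.5 * 24, // 480 USD per day
  apiPricePerMillionTokensUsd: 2,
};

console.log(breakEvenTokensPerDay(exampleCosts)); // 240000000
console.log(cheaperOption(exampleCosts, 50e6));   // "api"
console.log(cheaperOption(exampleCosts, 500e6));  // "self-host"
```

Under these assumed prices the break-even sits at 240 million tokens per day. A cheaper API or a more expensive node pushes it higher.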

Data privacy

Self-hosting keeps prompts and completions inside your own network. Regulated workloads (healthcare, finance, government) often cannot send data to a third-party API at all, which makes open weights the only option.

Customization

You can fine-tune open weights on your domain, quantize them for your hardware, or attach a LoRA adapter. A closed model that you can only call over HTTP offers, at most, whatever limited fine-tuning its vendor chooses to expose.

No rate limits or deprecation

A self-hosted model has no provider-imposed rate limits and never gets deprecated out from under you. The weights you pinned in production stay byte-for-byte identical until you choose to upgrade.

The tradeoff is real

Self-hosting trades a per-token bill for a fixed GPU bill plus operational burden. You own the autoscaling, the OOM debugging, the speculative-decoding tuning, and the uptime. For low or spiky volume, a closed API or a hosted open-model endpoint is cheaper and far less work than running your own fleet.

License Gotchas

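The license, not the benchmark, decides whether you can ship a model in your product. Three buckets cover the field: permissive licenses (Apache-2.0, MIT), licenses with added conditions (Modified MIT, Llama 4 Community, Gemma), and non-commercial licenses (CC-BY-NC). The gap between them is large.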

License	Models	Commercial use	Notable condition
Apache-2.0	Qwen3, Qwen3-Coder, gpt-oss-120b	Yes, freely	Patent grant, attribution
MIT	DeepSeek-R1, DeepSeek-V3.2, GLM-4.6	Yes, freely	Permits distillation into other models
Modified MIT	Kimi K2, MiniMax-M2	Yes	Light added conditions per model card
Llama 4 Community	Llama 4 Maverick	Yes, with limits	Monthly-active-user clause for very large deployers
Gemma license	Gemma 3 27B	Yes, with use policy	Custom terms, not OSI-approved
CC-BY-NC	Cohere Command A	No	Non-commercial only, plus acceptable-use policy

The Llama clause catches large deployers. Llama 4 is released under the Llama 4 Community License, not an OSI-approved license. Its monthly-active-user threshold means a product above the cutoff must request a separate license from Meta. For most teams this never binds, but it is the reason Llama 4 is open-weight rather than open source.

Cohere Command A is the trap. CC-BY-NC bars commercial use entirely. Command A is strong on SQL generation and code translation, but you cannot legally put it inside a product you sell. Apache-2.0 and MIT models carry no such restriction, which is why Qwen3 and DeepSeek dominate commercial open-model deployments.
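One way to keep these rules from being forgotten is to encode the table as a lookup that a build or review step can consult. This is a minimal sketch: the entries are transcribed from the table above in simplified form, the function names are illustrative, and it is no substitute for reading the actual license text.

```typescript
type CommercialUse = "yes" | "yes, with conditions" | "no";

interface LicenseTerms {
  commercialUse: CommercialUse;
  note: string;
}

// Simplified transcription of the license table in this article.
const licenseTable: Record<string, LicenseTerms> = {
  "Apache-2.0": { commercialUse: "yes", note: "Patent grant, attribution" },
  "MIT": { commercialUse: "yes", note: "Permits distillation into other models" },
  "Modified MIT": { commercialUse: "yes, with conditions", note: "Light added conditions per model card" },
  "Llama 4 Community": { commercialUse: "yes, with conditions", note: "Monthly-active-user clause" },
  "Gemma license": { commercialUse: "yes, with conditions", note: "Custom terms and use policy" },
  "CC-BY-NC": { commercialUse: "no", note: "Non-commercial only" },
};

// Returns undefined for licenses outside the table, so unknown
// licenses are never silently approved.
function lookupLicense(license: string): LicenseTerms | undefined {
  return Object.prototype.hasOwnProperty.call(licenseTable, license)
    ? licenseTable[license]
    : undefined;
}

function canShipInProduct(license: string): boolean {
  const terms = lookupLicense(license);
  return terms !== undefined && terms.commercialUse !== "no";
}

console.log(canShipInProduct("Apache-2.0")); // true
console.log(canShipInProduct("CC-BY-NC"));   // false
```

Treating an unknown license as not shippable is a deliberate default: a model with unfamiliar terms should prompt a review rather than pass by omission.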

How to Run an Open Source LLM

There are three serving paths, matched to scale. A laptop runs a quantized model with one command. A production service needs batched, high-throughput inference. And if you do not want to run infrastructure at all, you call open models through an OpenAI-compatible endpoint.

Local: Ollama or llama.cpp

For a single machine, Ollama wraps llama.cpp and pulls a quantized model with one command. This is the fastest way to try an open model and the right choice for offline or privacy-sensitive local use. The tradeoff is throughput: a quantized model on consumer hardware handles only a request or two at a time, not fleet-scale traffic.

Run an open model locally with Ollama

# Pull and run a quantized open model on your laptop
ollama pull qwen3
ollama run qwen3 "Write a Python function to parse JSON safely"

# Or call it over the local OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3",
    "messages": [{"role": "user", "content": "Explain MoE routing"}]
  }'

Production: vLLM or SGLang

For a real service, vLLM and SGLang give high-throughput batched inference with continuous batching, paged attention, and tensor parallelism across GPUs. Both expose an OpenAI-compatible HTTP server, so your client code does not change when you swap the backend. SGLang adds cache-aware routing and aggressive speculative decoding for multi-node fleets.

Serve an open model with vLLM

# Start an OpenAI-compatible server for a downloaded open model
vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 262144

# Point any OpenAI client at it: base_url=http://your-host:8000/v1
# model="Qwen/Qwen3-Coder-480B-A35B-Instruct"

The serving choice has real tradeoffs. A 1M-token context window costs memory at serve time, so the practical context limit on your own hardware is usually lower than the model's stated maximum. MoE models need enough GPU memory for all experts even though only a slice fires per token. Plan capacity around total parameters, not active parameters.
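Both effects can be estimated before buying hardware. The sketch below is a rough back-of-envelope model, not a sizing tool: the attention shape is a hypothetical configuration rather than any listed model's real architecture, the KV-cache formula assumes standard grouped-query attention, and it ignores activations, framework overhead, and batching.

```typescript
// Weight memory in decimal GB. Every expert must be resident,
// so the input is TOTAL parameters, not active parameters.
function weightMemoryGB(totalParamsBillions: number, bytesPerParam: number): number {
  return totalParamsBillions * bytesPerParam; // 1e9 params * bytes / 1e9 bytes per GB
}

// Hypothetical attention configuration, used only for illustration.
interface AttentionShape {
  layers: number;
  kvHeads: number;
  headDim: number;
  bytesPerElement: number; // 2 for FP16/BF16 cache entries
}

// KV cache for ONE sequence: a key and a value vector per layer, per KV head, per token.
function kvCacheGB(shape: AttentionShape, contextTokens: number): number {
  const bytesPerToken =
    2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerElement;
  return (bytesPerToken * contextTokens) / 1e9;
}

const hypotheticalShape: AttentionShape = {
  layers: 60,
  kvHeads: 8,
  headDim: 128,
  bytesPerElement: 2,
};

console.log(weightMemoryGB(480, 2)); // 960 GB for 480B parameters at 2 bytes each
console.log(weightMemoryGB(35, 2));  // 70 GB if you wrongly planned around active parameters
console.log(kvCacheGB(hypotheticalShape, 262144).toFixed(1));  // "64.4"
console.log(kvCacheGB(hypotheticalShape, 1000000).toFixed(1)); // "245.8"
```

Under these assumptions a single 1M-token sequence needs roughly four times the cache of a 256K one, on top of the weights, which is why the practical context limit on your own hardware usually lands below the advertised maximum.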

Reaching Open Models Through a Router

The third path skips infrastructure entirely. An OpenAI-compatible router or API lets you call open models by changing the base URL and model name, not your client code. This is the right choice when you want open-model economics without operating a GPU fleet, or when you want to mix open and closed models behind one endpoint.

Morph serves open models on its production fleet, including glm51-754b, qwen35-397b (~120 tok/s), and minimax27-230b (~140 tok/s), reachable through one OpenAI-compatible API at api.morphllm.com/v1. Morph's Model Router classifies each prompt's difficulty in ~430ms and routes cheap turns to open models, reserving the frontier model for hard turns, which cuts API cost 40-70% on typical coding workloads.
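The arithmetic behind a savings range like this is straightforward to sketch. The example below is not Morph's routing algorithm or pricing: the routed share and both prices are hypothetical placeholders, chosen only to show how a blended price falls as more traffic goes to a cheaper model.

```typescript
// Hypothetical traffic mix and prices (USD per million tokens).
interface RoutingMix {
  openShare: number;            // fraction of tokens routed to the open model, 0..1
  openPricePerMTok: number;     // assumed open-model price
  frontierPricePerMTok: number; // assumed frontier-model price
}

// Average price paid per million tokens across both models.
function blendedPricePerMTok(mix: RoutingMix): number {
  return (
    mix.openShare * mix.openPricePerMTok +
    (1 - mix.openShare) * mix.frontierPricePerMTok
  );
}

// Savings relative to sending every token to the frontier model.
function savingsVsFrontierOnly(mix: RoutingMix): number {
  return 1 - blendedPricePerMTok(mix) / mix.frontierPricePerMTok;
}

const exampleMix: RoutingMix = {
  openShare: 0.6,
  openPricePerMTok: 0.5,
  frontierPricePerMTok: 5,
};

const savedPct = (savingsVsFrontierOnly(exampleMix) * 100).toFixed(0);
console.log(`${savedPct}% cheaper than frontier-only`); // "54% cheaper than frontier-only"
```

Savings scale with both the routed share and the price gap, so a workload with few easy turns will see less benefit than one dominated by them.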

Call an open model through an OpenAI-compatible API

import OpenAI from 'openai'

// Point the standard OpenAI client at an open-model endpoint
const client = new OpenAI({
  baseURL: 'https://api.morphllm.com/v1',
  apiKey: process.env.MORPH_API_KEY,
})

const response = await client.chat.completions.create({
  model: 'morph-minimax27-230b',   // an open model on Morph's fleet
  messages: [{ role: 'user', content: 'Refactor this module to use DI' }],
})

// Print the assistant's reply
console.log(response.choices[0].message.content)

// Same client, same SDK shape. Only the base URL and model name change.

Router or self-host is the same client code

Because vLLM, SGLang, Ollama, and Morph's API all speak the OpenAI HTTP format, your application code is identical across them. You can prototype against a local Ollama model, ship on a router, and later move to a self-hosted vLLM fleet without rewriting the client. The only change is the base URL and the model string.
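A small sketch makes the portability claim concrete. It builds the chat-completions request for three backends without sending anything over the network. The base URLs and model names are the ones used earlier in this article, `your-host` is a placeholder, and the `Backend` type and `buildChatRequest` helper are illustrative names, not part of any SDK.

```typescript
interface Backend {
  label: string;
  baseUrl: string;
  model: string;
}

interface ChatRequest {
  url: string;
  body: { model: string; messages: { role: string; content: string }[] };
}

// Endpoints as shown in this article; "your-host" is a placeholder.
const ollamaBackend: Backend = {
  label: "local Ollama",
  baseUrl: "http://localhost:11434/v1",
  model: "qwen3",
};
const vllmBackend: Backend = {
  label: "self-hosted vLLM",
  baseUrl: "http://your-host:8000/v1",
  model: "Qwen/Qwen3-Coder-480B-A35B-Instruct",
};
const routerBackend: Backend = {
  label: "hosted router",
  baseUrl: "https://api.morphllm.com/v1",
  model: "morph-minimax27-230b",
};

// The request shape is identical; only baseUrl and model vary.
function buildChatRequest(backend: Backend, prompt: string): ChatRequest {
  return {
    url: backend.baseUrl + "/chat/completions",
    body: {
      model: backend.model,
      messages: [{ role: "user", content: prompt }],
    },
  };
}

for (const backend of [ollamaBackend, vllmBackend, routerBackend]) {
  const request = buildChatRequest(backend, "Explain MoE routing");
  console.log(`${backend.label}: POST ${request.url} (model ${request.body.model})`);
}
```

Keeping the backend in configuration rather than in code is what lets you move from prototype to router to self-hosted fleet with a config change.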

Do Open Models Match GPT and Claude?

On coding benchmarks the gap has narrowed sharply. Qwen3-Coder-480B scores 69.6% on SWE-bench Verified and MiniMax-M2 scores 69.4%, both achieving state-of-the-art results among open models on agentic, multi-turn coding without test-time scaling. GLM-4.6 reports performance on par with Claude Sonnet on several coding leaderboards per Z.ai docs.

The honest caveat: closed frontier models still tend to lead on the hardest reasoning and the broadest general-knowledge tasks, and the strictest-OSI open models (OLMo, Pythia) are not the ones at the top of these leaderboards. The open models winning benchmarks are open-weight, with permissive but not always fully open licenses.

Where open models clearly win is cost, privacy, and control. You can self-host, fine-tune, pin a version for as long as you like, and route high-volume cheap work to an open model with near-zero marginal cost while keeping a frontier model for the hardest turns. That combination is impossible with a closed-only stack.

69.6%: Qwen3-Coder on SWE-bench Verified

71.6%: Kimi K2, agentic multi-attempt

2029: DeepSeek-R1 Codeforces rating

2622: gpt-oss-120b Codeforces Elo

Frequently Asked Questions

What is an open source LLM?

An open source LLM is a large language model whose trained weights you can download, inspect, fine-tune, and run on your own hardware instead of only calling a vendor API. Examples include Qwen3 (Apache-2.0), DeepSeek-R1 (MIT), GLM-4.6 (MIT), and Kimi K2 (Modified MIT). In practice most are open-weight rather than strictly OSI open-source, because the weights are released but the training data is not.

What is the difference between open source and open weight?

The OSI Open Source AI Definition requires the weights, the full training and inference code, and enough training-data detail to rebuild a substantially equivalent model. Open-weight models release only the trained parameters and usually withhold the training data, so they fail the strict OSI test. Both kinds can be self-hosted, inspected, and fine-tuned. The difference is licensing freedom and how much of the pipeline is disclosed. Most popular "open source" LLMs are technically open-weight.

What is the best open source LLM?

It depends on the task. For coding, Qwen3-Coder-480B (69.6% on SWE-bench Verified) and MiniMax-M2 (69.4%) lead, with Kimi K2 reaching 71.6% under agentic multi-attempt settings. DeepSeek-R1 is a leading open reasoning model (Codeforces 2029, 96.3rd percentile). Gemma 3 27B is a strong single-GPU local option. See Best Open Source Coding Model (2026) for the full ranked leaderboard.

Can I use open source LLMs commercially?

Often yes, but read the license. Apache-2.0 (Qwen3, Qwen3-Coder, gpt-oss-120b) and MIT (DeepSeek, GLM-4.6) permit commercial use freely. Modified MIT (Kimi K2, MiniMax-M2) adds light conditions. Llama 4 ships under the Llama 4 Community License with a monthly-active-user clause for very large deployers, Gemma 3 under the Gemma license, and Cohere Command A under CC-BY-NC, which bars commercial use entirely.

How do I run an open source LLM?

Three paths. For a laptop or single GPU, use Ollama or llama.cpp to run a quantized model locally with one command. For production serving, use vLLM or SGLang for high-throughput batched inference with an OpenAI-compatible endpoint. To skip infrastructure, call open models through an OpenAI-compatible router or API; you change the base URL and model name, not your client code.

Do open source LLMs match GPT and Claude on quality?

On coding benchmarks the gap has narrowed sharply. Qwen3-Coder-480B scores 69.6% on SWE-bench Verified and MiniMax-M2 scores 69.4%, both state-of-the-art among open models on agentic coding, and GLM-4.6 reports parity with Claude Sonnet on several coding leaderboards per Z.ai docs. Closed frontier models still tend to lead on the hardest reasoning and broadest general tasks, but open models win on cost, privacy, and the ability to self-host.

What context windows do open source LLMs support?

They range from 128K (DeepSeek-R1, Kimi K2, Gemma 3, gpt-oss-120b) through 200K (GLM-4.6) and 256K (Qwen3-Coder, Codestral, Command A) up to 1M tokens (Llama 4 Maverick, and Qwen3 with YaRN extrapolation). Larger context costs more memory at serving time, so the practical limit on your own hardware is usually lower than the model's maximum. See LLM Context Windows for the full comparison.

Run Open Models Without Running a Fleet

Morph serves open models (glm51-754b, qwen35-397b, minimax27-230b) on its production fleet through one OpenAI-compatible API at api.morphllm.com. The Model Router classifies each prompt in ~430ms and routes cheap turns to open models, cutting API cost 40-70%. Change the base URL, not your client code.
