The Best Open Source LLMs (2026): Ranked by Benchmark, Size, and Use Case

The best open source LLM for coding right now is Qwen3-Coder-480B (69.6% SWE-bench Verified, Apache-2.0). DeepSeek-V3.2 hits ~70% under MIT, MiniMax-M2 scores 69.4%, and Kimi K2 reaches 71.6% under agentic multi-attempt. This page ranks the best open-weight models overall and by category, each with the verified score that justifies the pick.

June 18, 2026 · 2 min read

Most of these models are open-weight, not OSI open-source. Morph serves several of them on its fleet through one OpenAI-compatible API.

69.6%: Qwen3-Coder SWE-bench Verified
71.6%: Kimi K2 SWE-bench (multi-attempt)
2029: DeepSeek-R1 Codeforces rating
1M: Llama 4 Maverick context window

The Best Open Source LLMs, Ranked

This is the ranked shortlist of the best open-weight LLMs for general and coding use, each with the verified benchmark that justifies the pick. Sources are each model's Hugging Face card or maker announcement.

  1. Qwen3-Coder-480B-A35B (Alibaba). Best overall for coding. 69.6% on SWE-bench Verified, state-of-the-art among open models without test-time scaling. Mixture-of-Experts, 480B total / 35B active, 256K context (extendable to 1M via YaRN), Apache-2.0.
  2. DeepSeek-V3.2 (DeepSeek-AI). Best permissive-license generalist. Roughly 70% on SWE-bench Verified, 15.56% on SWE-bench Pro. 685B total parameters, 128K context, standard MIT License.
  3. Kimi K2 (Moonshot AI). Best agentic coder under multi-attempt. 65.8% SWE-bench Verified single attempt, 71.6% with multiple attempts; 53.7% LiveCodeBench v6. 1T total / 32B active, 128K context, Modified MIT License.
  4. GLM-4.6 (Zhipu AI / Z.ai). Best Claude-Sonnet alternative. Z.ai reports parity with Claude Sonnet 4 on several coding leaderboards; 9.67 SWE-bench Pro, 24.5 Terminal-Bench 2.0. 357B total, 200K context, MIT.
  5. MiniMax-M2 (MiniMax AI). Best for tool-use agents. 69.4% SWE-bench Verified, 46.3 Terminal-Bench, 83 LiveCodeBench. 230B total / 10B active, Modified MIT License.
  6. Llama 4 Maverick (Meta). Best long-context. 1M-token context window, 80.5 MMLU Pro, 43.4 LiveCodeBench. 400B total / 17B active, 128 experts. Released under the Llama 4 Community License (not OSI open-source).
  7. gpt-oss-120b (OpenAI). Best competition coder. 2622 Elo on Codeforces, matching or exceeding o4-mini; 16.2 SWE-bench Pro. 117B total / 5.1B active, 128K context, Apache-2.0.
  8. Mistral Codestral 25.01 (Mistral AI). Best code-completion specialist. 95.3% pass@1 on HumanEval FIM (fill-in-the-middle), 71.4% HumanEval average, 256K context.

How to read these scores

SWE-bench Verified measures real GitHub-issue resolution in a multi-turn agentic loop, the closest proxy for production coding ability. Codeforces Elo measures competition algorithm coding. HumanEval and MBPP measure isolated function generation. LiveCodeBench is contamination-resistant. A model can lead one benchmark and trail another; rank by the benchmark closest to your workload.
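Ranking by the benchmark closest to your workload can be sketched as a small lookup. The scores below are the ones quoted on this page; the `rankBy` helper and the data shape are illustrative assumptions, not part of any library. DeepSeek-V3.2 is left out because its SWE-bench Verified figure is only given approximately, and the LiveCodeBench numbers may come from different benchmark versions, so treat cross-model comparisons on that column as rough.

```typescript
// Scores copied from this page. A model that does not report a benchmark
// simply omits that key.
type Benchmark = "sweBenchVerified" | "liveCodeBench" | "codeforces";

interface ModelScores {
  name: string;
  scores: Partial<Record<Benchmark, number>>;
}

const MODELS: ModelScores[] = [
  { name: "Qwen3-Coder-480B", scores: { sweBenchVerified: 69.6 } },
  { name: "MiniMax-M2", scores: { sweBenchVerified: 69.4, liveCodeBench: 83 } },
  { name: "Kimi K2", scores: { sweBenchVerified: 65.8, liveCodeBench: 53.7 } }, // single attempt
  { name: "DeepSeek-R1", scores: { sweBenchVerified: 49.2, liveCodeBench: 65.9, codeforces: 2029 } },
  { name: "gpt-oss-120b", scores: { codeforces: 2622 } },
  { name: "Llama 4 Maverick", scores: { liveCodeBench: 43.4 } },
];

// Hypothetical helper: keep only models that report the benchmark,
// then sort from highest score to lowest. filter() returns a new array,
// so the original list is not mutated by sort().
function rankBy(benchmark: Benchmark, models: ModelScores[] = MODELS): ModelScores[] {
  return models
    .filter((m) => m.scores[benchmark] !== undefined)
    .sort((a, b) => (b.scores[benchmark] ?? 0) - (a.scores[benchmark] ?? 0));
}

console.log(rankBy("sweBenchVerified").map((m) => m.name));
console.log(rankBy("codeforces").map((m) => m.name));
```

The same model list produces a different leader for each benchmark, which is the point: choose the column first, then the model.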

Comparison Table: Size, License, Benchmark

Every model side by side: total parameters, license, the single benchmark it leads on, and the category it wins. Active-parameter counts (the driver of per-token inference cost) are listed where the model is Mixture-of-Experts. Context windows are compared after the table.

| Model | Size (total / active) | License | Best benchmark | Best for |
| --- | --- | --- | --- | --- |
| Qwen3-Coder-480B | 480B / 35B | Apache-2.0 | 69.6% SWE-bench Verified | Coding overall |
| DeepSeek-V3.2 | 685B | MIT | ~70% SWE-bench Verified | Permissive generalist |
| Kimi K2 | 1T / 32B | Modified MIT | 71.6% SWE-bench (multi) | Agentic coding |
| GLM-4.6 | 357B | MIT | 9.67 SWE-bench Pro | Claude-Sonnet alt |
| MiniMax-M2 | 230B / 10B | Modified MIT | 69.4% SWE-bench Verified | Tool-use agents |
| DeepSeek-R1 | 671B / 37B | MIT | 2029 Codeforces | Reasoning |
| Llama 4 Maverick | 400B / 17B | Llama 4 Community | 80.5 MMLU Pro | Long context (1M) |
| gpt-oss-120b | 117B / 5.1B | Apache-2.0 | 2622 Codeforces Elo | Competition coding |
| Gemma 3 27B | 27B (dense) | Gemma license | 48.8% HumanEval | Small / local |
| Codestral 25.01 | code specialist | Mistral | 95.3% HumanEval FIM | Code completion |

Context windows vary widely. DeepSeek-R1, Kimi K2, Gemma 3, and gpt-oss-120b cap at 128K. Qwen3-Coder, Codestral, and Cohere Command A reach 256K. Llama 4 Maverick reaches 1M, and Qwen3-235B extends to roughly 1M via YaRN extrapolation. GLM-4.6 expanded from 128K to 200K. Longer context costs more memory per request, so pick the smallest window that fits your repository.
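Picking the smallest window that fits can be sketched as follows. The window sizes are the ones listed above (with K read as 1,000 tokens); the 4-characters-per-token ratio and the 8,000-token output reserve are assumptions for illustration, since real tokenizers and workloads vary. The helper names are hypothetical.

```typescript
// Context windows in tokens, as listed on this page (K read as 1,000).
const CONTEXT_WINDOWS: Record<string, number> = {
  "DeepSeek-R1": 128_000,
  "Kimi K2": 128_000,
  "GLM-4.6": 200_000,
  "Qwen3-Coder-480B": 256_000,
  "Llama 4 Maverick": 1_000_000,
};

// ASSUMPTION: roughly 4 characters per token. This is a rule of thumb only;
// measure with the model's own tokenizer before relying on it.
const CHARS_PER_TOKEN = 4;

function estimateTokens(totalChars: number): number {
  return Math.ceil(totalChars / CHARS_PER_TOKEN);
}

// Return the models whose window holds the prompt plus a reserve for the
// model's output, ordered from smallest window to largest.
// ASSUMPTION: the 8,000-token default reserve is an arbitrary example value.
function modelsThatFit(promptTokens: number, reservedOutputTokens = 8_000): string[] {
  const needed = promptTokens + reservedOutputTokens;
  return Object.entries(CONTEXT_WINDOWS)
    .filter(([, size]) => size >= needed)
    .sort(([, a], [, b]) => a - b)
    .map(([name]) => name);
}

// Example: about 600,000 characters of source files in the prompt.
const promptTokens = estimateTokens(600_000);
console.log(promptTokens, modelsThatFit(promptTokens));
```

Under these assumptions a 600,000-character prompt needs about 158,000 tokens of window, which rules out the 128K models and makes GLM-4.6 the smallest fit.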

Best for Coding

For agentic coding, rank by SWE-bench Verified, which measures resolving real GitHub issues through a multi-turn tool loop. Among open models, Qwen3-Coder-480B (69.6%) and MiniMax-M2 (69.4%) lead on single-attempt scores, with Kimi K2 reaching 71.6% under agentic multi-attempt settings.

Qwen3-Coder-480B: 69.6%

Highest open single-attempt SWE-bench Verified without test-time scaling. Apache-2.0, 256K context (1M via YaRN), 480B total / 35B active. The default pick when license freedom and raw coding ability both matter.

Kimi K2: 71.6%

65.8% single attempt, 71.6% with multiple attempts on SWE-bench Verified. 53.7% LiveCodeBench v6, 85.7% MultiPL-E. 1T total / 32B active. Strongest when your agent can retry.

DeepSeek-V3.2: ~70%

Roughly 70% SWE-bench Verified and 15.56% on the harder SWE-bench Pro, under the standard MIT License. 685B total parameters, 128K context. The cleanest license-to-performance ratio.

For fill-in-the-middle completion (the autocomplete pattern inside editors), Mistral Codestral 25.01 leads at 95.3% pass@1 on HumanEval FIM, with a 256K context window. It is a code-specialist model, not a general agent, so use it for completion rather than multi-step planning.

For the coding-specific deep dive, including how these models pair with a fast-apply edit model, see Best Open Source Coding Model 2026 and Best AI Model for Coding.

Best Small / Local (<=27B)

The leaders above need multiple high-memory GPUs. For a model that runs on a single GPU or a workstation, the pick is Gemma 3 27B: a dense 27B model with a 128K context window, scoring 48.8% on HumanEval (0-shot) and 65.6% on MBPP (3-shot).

Dense matters here. Gemma 3 27B activates all 27B parameters per token, which is simpler to serve than a Mixture-of-Experts model and predictable on a single device. The tradeoff is the custom Gemma license, which is not OSI-approved; read it before commercial deployment.

On Morph's fleet, the smaller served model qwen36-27b runs alongside the large MoE models, so a router can drop routine turns to a 27B-class model and reserve the 200B+ models for hard work. This is the same size class as Gemma 3 27B, served behind one API.

Small model tradeoff

A 27B model resolves far fewer SWE-bench issues than a 480B model. Gemma 3 27B is for local privacy, low cost, and offline use, not for matching the agentic-coding leaders. If you need 69%+ SWE-bench Verified, you need one of the large MoE models and the hardware (or hosted API) to run it.

Best Reasoning

For step-by-step reasoning and competition-grade problem solving, DeepSeek-R1 is the leading open model. It holds a Codeforces rating of 2029 (96.3rd percentile) and scores 65.9 on LiveCodeBench (Pass@1-COT), while resolving 49.2% on SWE-bench Verified. It ships under the MIT License, which explicitly permits commercial use and distillation for training other LLMs.

DeepSeek-R1 is 671B total parameters with 37B activated, 128K context. The MIT distillation clause is why so many smaller reasoning models are trained on R1 outputs: the license allows it. If you need an open reasoning model whose outputs you can legally build on, R1 is the answer.

For competition coding specifically, gpt-oss-120b reaches 2622 Elo on Codeforces, matching or exceeding OpenAI o4-mini, under Apache-2.0. It is smaller (117B total / 5.1B active) and cheaper to serve than R1 while leading on algorithmic contests.

Best for Agentic / Tool Use

Agentic workloads (multi-file edits, run-fix loops, terminal commands) reward models tuned for tool calling and long action sequences. MiniMax-M2 is built for exactly this: 69.4% SWE-bench Verified, 46.3 on Terminal-Bench, 83 on LiveCodeBench, with explicit support for multi-file edits and coding-run-fix loops. It is 230B total / 10B active, the lowest active-parameter count among the coding leaders, which keeps per-token inference cost down.

MiniMax-M2

69.4% SWE-bench Verified, 46.3 Terminal-Bench, 83 LiveCodeBench. 230B / 10B active, Modified MIT. Lowest active-parameter count among coding leaders, so cheapest per token. Built for multi-file edits and run-fix loops.

GLM-4.6

9.67 SWE-bench Pro, 24.5 Terminal-Bench 2.0, 200K context, MIT. Z.ai reports parity with Claude Sonnet 4 on several coding leaderboards. Strong default when you want a permissive license and Sonnet-class agentic behavior.

Morph measured throughput in production agent loops on its fleet: qwen35-397b serves at ~120 tok/s and minimax27-230b at ~140 tok/s. At this class of model, throughput is the practical bottleneck for agentic loops, where a single task fires dozens of tool calls. A faster open model that finishes each turn sooner often beats a marginally more accurate but slower one.
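To see how throughput compounds across a task, here is a back-of-the-envelope sketch. The tok/s figures are the ones quoted above; the 40 tool calls and 300 output tokens per call are assumed example values, and the function ignores prompt processing, tool execution, and network latency, so it is a lower bound on wall-clock time rather than a prediction.

```typescript
// Seconds spent generating output tokens for one agentic task.
// Counts decode time only: prefill, tool execution, and network
// round trips are deliberately left out.
function generationSeconds(
  toolCalls: number,
  outputTokensPerCall: number,
  tokensPerSecond: number,
): number {
  return (toolCalls * outputTokensPerCall) / tokensPerSecond;
}

// ASSUMPTION: 40 tool calls at 300 output tokens each (12,000 tokens total).
const TOOL_CALLS = 40;
const TOKENS_PER_CALL = 300;

const at120 = generationSeconds(TOOL_CALLS, TOKENS_PER_CALL, 120);
const at140 = generationSeconds(TOOL_CALLS, TOKENS_PER_CALL, 140);

console.log(`120 tok/s: ${at120.toFixed(1)} s`);
console.log(`140 tok/s: ${at140.toFixed(1)} s`);
console.log(`saved per task: ${(at120 - at140).toFixed(1)} s`);
```

With these assumed numbers the task spends 100 seconds generating at 120 tok/s and about 85.7 seconds at 140 tok/s, a gap that repeats on every task the agent runs.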

Open Weight vs Open Source

Most models called open source are open-weight. The OSI Open Source AI Definition requires three components: the model parameters, the complete training and inference source code, and training-data information detailed enough to rebuild a substantially equivalent system. Releasing only the weights does not meet that bar.

By the strict OSI definition, almost no leading open model qualifies as open source. The models that do meet the standard (OLMo, Pythia) are not the ones topping benchmark leaderboards. The benchmark leaders on this page are open-weight, not OSI open-source.

For most teams the distinction is academic. Both open-weight and truly open-source models can be self-hosted, inspected, and fine-tuned. The difference is licensing freedom and how much of the training pipeline is disclosed. Teams self-host open-weight LLMs to control cost, keep data private, customize via fine-tuning, and optimize inference for their own workloads instead of sending data to a closed API. See Open Source LLMs: What They Are and How to Run Them for the full overview.

Tradeoffs: License and Hardware

Three of the strongest models carry licenses that restrict deployment. Know those restrictions before you build on them.

| Model | License | Commercial use | Note |
| --- | --- | --- | --- |
| Qwen3-Coder, gpt-oss-120b | Apache-2.0 | Yes, unrestricted | Most permissive common license |
| DeepSeek-R1/V3.2, GLM-4.6 | MIT | Yes, unrestricted | R1 explicitly allows distillation |
| Kimi K2, MiniMax-M2 | Modified MIT | Yes, with conditions | Read the modifications before shipping |
| Llama 4 Maverick | Llama 4 Community | Conditional | Not OSI-approved; acceptable-use terms apply |
| Gemma 3 27B | Gemma license | Conditional | Custom, not OSI-approved |
| Cohere Command A | CC-BY-NC | No | Non-commercial only |

Cohere Command A is a capable 111B model with a 256K context and strong SQL generation, but CC-BY-NC forbids commercial use. It is a research-and-evaluation model, not a production one. Llama 4 and Gemma 3 ship under custom licenses with acceptable-use clauses; they are usable commercially in most cases but are not OSI open-source and require reading the terms.
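If model choice is automated, the license table can be encoded as data and used as a gate before anything else. The sketch below mirrors the table on this page; the three-level `CommercialUse` type and the `allowedModels` helper are illustrative assumptions, and "Yes, with conditions" and "Conditional" are both folded into a single `conditional` level. It is not a substitute for reading the license text.

```typescript
type CommercialUse = "unrestricted" | "conditional" | "none";

interface LicenseEntry {
  model: string;
  license: string;
  commercialUse: CommercialUse;
}

// Mirrors the license table on this page.
const LICENSES: LicenseEntry[] = [
  { model: "Qwen3-Coder", license: "Apache-2.0", commercialUse: "unrestricted" },
  { model: "gpt-oss-120b", license: "Apache-2.0", commercialUse: "unrestricted" },
  { model: "DeepSeek-R1", license: "MIT", commercialUse: "unrestricted" },
  { model: "DeepSeek-V3.2", license: "MIT", commercialUse: "unrestricted" },
  { model: "GLM-4.6", license: "MIT", commercialUse: "unrestricted" },
  { model: "Kimi K2", license: "Modified MIT", commercialUse: "conditional" },
  { model: "MiniMax-M2", license: "Modified MIT", commercialUse: "conditional" },
  { model: "Llama 4 Maverick", license: "Llama 4 Community", commercialUse: "conditional" },
  { model: "Gemma 3 27B", license: "Gemma license", commercialUse: "conditional" },
  { model: "Cohere Command A", license: "CC-BY-NC", commercialUse: "none" },
];

// Hypothetical helper: list the models whose commercial-use level is
// one of the levels the caller is willing to accept.
function allowedModels(accepted: CommercialUse[]): string[] {
  return LICENSES.filter((e) => accepted.includes(e.commercialUse)).map((e) => e.model);
}

// A team that has not reviewed any custom terms yet:
console.log(allowedModels(["unrestricted"]));
// A team whose counsel has cleared the conditional licenses:
console.log(allowedModels(["unrestricted", "conditional"]));
```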

Hardware is the other tradeoff. The 200B+ Mixture-of-Experts leaders need multiple high-memory GPUs and a serving stack (vLLM or SGLang). Running the best open model is not free even when the weights are. Self-hosting Kimi K2 (1T total) or DeepSeek-V3.2 (685B) is a real infrastructure project. The cheaper path is a hosted inference API that amortizes the hardware across many users.
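The hardware gap can be estimated with two rules of thumb. Weight memory scales with total parameters times bytes per parameter, while per-token compute scales with active parameters at roughly 2 FLOPs each. Both are approximations: the memory figure covers weights only (KV cache, activations, and runtime overhead come on top), and the helper names and precision constants below are assumptions for illustration.

```typescript
// Bytes per parameter for two common weight formats.
const BF16_BYTES = 2;
const INT4_BYTES = 0.5;

// Weight memory in GB: billions of parameters x bytes per parameter.
// Weights only; KV cache and activations are extra.
function weightMemoryGB(totalParamsBillions: number, bytesPerParam: number): number {
  return totalParamsBillions * bytesPerParam;
}

// Rough forward-pass cost per generated token, in GFLOPs:
// about 2 FLOPs per active parameter.
function gigaFlopsPerToken(activeParamsBillions: number): number {
  return 2 * activeParamsBillions;
}

// Parameter counts are the ones quoted on this page.
console.log("Gemma 3 27B, bf16:", weightMemoryGB(27, BF16_BYTES), "GB");
console.log("Gemma 3 27B, 4-bit:", weightMemoryGB(27, INT4_BYTES), "GB");
console.log("Qwen3-Coder 480B, bf16:", weightMemoryGB(480, BF16_BYTES), "GB");
console.log("Kimi K2 1T, bf16:", weightMemoryGB(1000, BF16_BYTES), "GB");

console.log("Gemma 3 27B (dense):", gigaFlopsPerToken(27), "GFLOPs/token");
console.log("Qwen3-Coder (35B active):", gigaFlopsPerToken(35), "GFLOPs/token");
```

Under these estimates Qwen3-Coder costs about as much compute per token as a dense 27B model (70 versus 54 GFLOPs) but needs roughly 18 times the weight memory at the same precision (960 GB versus 54 GB), which is why the MoE leaders need a multi-GPU cluster even though their active-parameter counts look small.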

How to Run the Best Open Source LLM

Three deployment paths, by model size. Small dense models run locally; large MoE models need a serving stack or a hosted API.

Local (single GPU)

Gemma 3 27B and other models at or below 27B run on one high-memory GPU via Ollama, vLLM, or llama.cpp. Best for privacy, offline use, and low volume. Accept the lower benchmark scores.

Self-hosted cluster

The 200B+ MoE leaders (Qwen3-Coder 480B, DeepSeek 685B, Kimi K2 1T) need multiple GPUs and vLLM or SGLang. Full control, but a real infrastructure and on-call cost.

Hosted API

Skip the hardware entirely. A router picks the best open model per task and serves it behind one OpenAI-compatible endpoint. Morph serves glm51-754b, qwen35-397b, and minimax27-230b on its fleet.

A router is the efficient default when you want the best open model per task without owning the GPUs. It classifies each prompt and sends easy turns to a cheap small model and hard turns to a large one, which is how a 27B-class model and a 200B+ model coexist behind a single API. Morph's router classifies prompt difficulty in ~430ms into four tiers for 40-70% API cost savings. See What Is an LLM Router.
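The routing idea can be sketched with a toy classifier. This is not Morph's router: the tier names, the keyword-and-length heuristic, and the tier-to-model mapping are all assumptions made up for illustration, and a production router would use a trained classifier. The model names are the fleet names mentioned on this page; the exact id strings an API accepts may differ.

```typescript
// ASSUMPTION: four tiers with hypothetical names.
type Tier = "trivial" | "easy" | "medium" | "hard";

// ASSUMPTION: an illustrative mapping from tier to a served model.
const TIER_TO_MODEL: Record<Tier, string> = {
  trivial: "qwen36-27b",
  easy: "qwen36-27b",
  medium: "minimax27-230b",
  hard: "glm51-754b",
};

// Toy heuristic: long prompts and certain keywords suggest harder work.
// A real router would replace this with a learned difficulty classifier.
function classify(prompt: string): Tier {
  const hardSignals = /\b(refactor|architecture|migrate|debug|race condition)\b/i;
  const words = prompt.trim().split(/\s+/).length;
  const hasSignal = hardSignals.test(prompt);

  if (hasSignal && words > 200) return "hard";
  if (hasSignal || words > 200) return "medium";
  if (words > 30) return "easy";
  return "trivial";
}

// Pick the model for a prompt by classifying it and looking up the tier.
function route(prompt: string): string {
  return TIER_TO_MODEL[classify(prompt)];
}

console.log(route("Rename this variable."));
console.log(route("Refactor this module to use dependency injection."));
```

The returned model name would then be passed as the `model` field of a chat completion request, so callers keep a single endpoint while the router decides how much model each turn deserves.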

Call open models through one OpenAI-compatible API

import OpenAI from "openai"

// Morph's API is OpenAI-compatible; point the SDK at it.
const client = new OpenAI({
  apiKey: process.env.MORPH_API_KEY,
  baseURL: "https://api.morphllm.com/v1",
})

// Served open-weight models on the fleet:
//   glm51-754b, qwen35-397b (~120 tok/s),
//   qwen36-27b, minimax27-230b (~140 tok/s), dsv4flash
const res = await client.chat.completions.create({
  model: "morph-minimax27-230b",
  messages: [{ role: "user", content: "Refactor this module to use dependency injection." }],
})

console.log(res.choices[0].message.content)

Frequently Asked Questions

What is the best open source LLM right now?

It depends on the task. For agentic coding, Qwen3-Coder-480B-A35B leads at 69.6% on SWE-bench Verified under Apache-2.0, with DeepSeek-V3.2 at roughly 70% under MIT and Kimi K2 reaching 71.6% under agentic multi-attempt. For reasoning, DeepSeek-R1 holds a Codeforces rating of 2029 (96.3rd percentile). For small/local use, Gemma 3 27B runs on a single GPU.

What is the best open source LLM for coding?

Qwen3-Coder-480B-A35B scores 69.6% on SWE-bench Verified, the strongest open result without test-time scaling, under Apache-2.0 with a 256K context window. DeepSeek-V3.2 reaches about 70% under MIT, MiniMax-M2 scores 69.4%, and Kimi K2 hits 71.6% under agentic multi-attempt. For fill-in-the-middle completion, Mistral Codestral 25.01 leads at 95.3% pass@1 on HumanEval FIM.

What is the best small or local open source LLM?

Gemma 3 27B: a dense 27B model with a 128K context window that runs on a single GPU, scoring 48.8% on HumanEval (0-shot) and 65.6% on MBPP (3-shot). It ships under the custom Gemma license, which is not OSI-approved, so check the terms before commercial deployment.

Are open source LLMs as good as GPT-5 or Claude?

On coding benchmarks the gap has nearly closed. Qwen3-Coder hits 69.6% on SWE-bench Verified and Kimi K2 reaches 71.6% under agentic multi-attempt, neck-and-neck with top closed systems on multi-turn agentic coding. Z.ai reports GLM-4.6 performing on par with Claude Sonnet 4 on several coding leaderboards. The closed frontier still leads on the hardest reasoning, but open models now win on cost and control.

Which open source LLM license is best for commercial use?

Apache-2.0 is the most permissive common license, used by Qwen3, Qwen3-Coder, and gpt-oss-120b. The standard MIT License (DeepSeek-R1, DeepSeek-V3.2, GLM-4.6) is equally safe. Avoid Cohere Command A (CC-BY-NC forbids commercial use), and read the custom terms on Llama 4 (Llama 4 Community License) and Gemma 3 (Gemma license) before shipping.

How do I run the best open source LLM?

Small models like Gemma 3 27B run on a single high-memory GPU. The 200B+ Mixture-of-Experts leaders (Qwen3-Coder 480B, DeepSeek 671B+, Kimi K2 1T) need multiple GPUs and a serving stack like vLLM or SGLang. To skip the hardware, use a hosted inference API: Morph serves glm51-754b, qwen35-397b (~120 tok/s), and minimax27-230b (~140 tok/s) on its fleet through one OpenAI-compatible endpoint.

Why are most open source LLMs actually open weight?

The OSI Open Source AI Definition requires the weights, the full training and inference code, and detailed training-data information. Almost no benchmark-topping model releases all three; they release weights only, making them open-weight. The practical difference is licensing freedom: open-weight models under Apache-2.0 or MIT can still be self-hosted, inspected, fine-tuned, and used commercially.

Run the Best Open Model Per Task, Without the GPUs

Morph's router classifies each prompt in ~430ms and serves the best open model for it (glm51-754b, qwen35-397b at ~120 tok/s, minimax27-230b at ~140 tok/s) through one OpenAI-compatible API. 40-70% cost savings. No cluster to run.