Kimi K2 and K2 Thinking: Specs, Benchmarks, and API (Moonshot AI)

Kimi K2 is Moonshot AI's open-weight Mixture-of-Experts model with 1 trillion total parameters and 32 billion activated per token. The K2 Thinking variant adds long-horizon reasoning and tool use over a 256K context window, scoring 71.3 on SWE-bench Verified and 83.1 on LiveCodeBench V6. Both ship under a Modified MIT License and call through an OpenAI-compatible API at api.moonshot.ai.

1T / 32B: Total / active params (MoE)

256K: K2 Thinking context window

71.3: SWE-bench Verified (with tools)

83.1: LiveCodeBench V6

Kimi K2 at a Glance

Kimi K2 is a sparse Mixture-of-Experts (MoE) language model. The total parameter count is 1 trillion, but only 32 billion parameters activate on any given token. This is the defining property of MoE: capacity scales with the total count while per-token compute scales with the active count. K2 has the knowledge capacity of a trillion-parameter model at roughly the per-token compute of a 32B model, although all 1T parameters must still be held in memory to serve it.

There are two models to keep straight. The original Kimi-K2-Instruct is the direct-answer model with a 128K context window. Kimi K2 Thinking is the reasoning variant: same 1T/32B architecture, a 256K context window, and training for multi-step reasoning with tool calls. When people say "Kimi K2 Thinking," they mean the model that produces a reasoning trace before its answer.

All figures on this page come from Moonshot AI's Hugging Face model cards for Kimi-K2-Instruct and Kimi-K2-Thinking, and from Moonshot's official platform documentation for the API base URL. Numbers that are not in those sources are omitted rather than estimated.

The two Kimi K2 models

Kimi-K2-Instruct: 1T total / 32B active, 128K context, direct answers. Kimi K2 Thinking: 1T total / 32B active, 256K context, explicit reasoning plus interleaved tool calls. K2 Thinking launched November 6, 2025.

Who Makes Kimi (Moonshot AI)

Kimi is built by Moonshot AI, a Chinese AI company. The name "Kimi" and the company's lunar branding are deliberate: Moonshot positions the models as a reach toward a frontier. Kimi started as a long-context chat assistant and grew into the K2 family of open-weight foundation models.

Moonshot AI publishes the K2 weights openly on Hugging Face and serves a hosted API at api.moonshot.ai. This dual distribution, open weights plus a first-party API, is the same approach DeepSeek and Alibaba's Qwen team use. It lets teams choose between self-hosting for control and using the hosted endpoint for convenience.

Kimi K2 Thinking was officially launched on November 6, 2025. The original Kimi-K2-Instruct predates it. Moonshot versions individual checkpoints by date, which is the origin of names like "Kimi K2 0905."

Architecture and Parameters

Kimi K2 Thinking is a Mixture-of-Experts transformer. It has 384 total experts, of which 8 are selected per token along with 1 shared expert that always runs. The network is 61 layers deep, with 1 dense layer, and uses 64 attention heads with Multi-head Latent Attention (MLA). The vocabulary is 160K tokens.
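To make the expert counts concrete, here is a minimal sketch of top-k expert selection for a single token. It is not Moonshot's router: the softmax gating over the selected experts, the function names, and the synthetic router scores are all illustrative assumptions. Only the counts (384 experts, 8 selected, 1 shared) come from the model card.

```typescript
// Illustrative top-k routing for one token. The counts come from the model card;
// the gating details below are assumptions, not Moonshot's actual router.
const NUM_EXPERTS = 384;
const TOP_K = 8;
const NUM_SHARED = 1;

interface RoutedExpert {
  index: number;  // which routed expert was picked
  weight: number; // gate weight, normalized across the selected experts
}

// Pick the k highest-scoring experts and softmax-normalize their scores.
function selectExperts(scores: number[], k: number): RoutedExpert[] {
  const ranked = scores
    .map((score, index) => ({ index, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
  const maxScore = ranked[0].score; // subtract the max for numerical stability
  const exps = ranked.map((e) => Math.exp(e.score - maxScore));
  const total = exps.reduce((sum, x) => sum + x, 0);
  return ranked.map((e, i) => ({ index: e.index, weight: exps[i] / total }));
}

// Synthetic, deterministic router scores standing in for a learned gate's output.
const routerScores = Array.from({ length: NUM_EXPERTS }, (_, i) => Math.sin(i * 12.9898) * 4);

const selected = selectExperts(routerScores, TOP_K);
// Experts that actually run for this token: 8 routed + 1 shared.
const expertsRunPerToken = selected.length + NUM_SHARED;
// Share of the routed experts this token touches: 8 / 384.
const routedFraction = TOP_K / NUM_EXPERTS;
```

Each token touches 8 of 384 routed experts, about 2% of them, which is why per-token compute stays far below what the total parameter count suggests.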

MLA is what makes the 256K context window tractable. It compresses the key-value (KV) cache, the main memory bottleneck in long-context inference, because that cache grows linearly with sequence length. Without this compression, the KV cache alone for a 256K context would demand prohibitive memory.
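A back-of-envelope calculation shows the scale of the problem. The sketch below compares a standard per-head KV cache with a compressed per-token latent cache. The layer and head counts are from the model card; the head dimension, the latent width, the 16-bit cache precision, and reading "256K" as 262,144 tokens are placeholder assumptions chosen for illustration, not figures from Moonshot.

```typescript
// Back-of-envelope KV-cache sizing for one sequence.
// LAYERS and HEADS come from the model card; everything marked "assumption" does not.
const LAYERS = 61;
const HEADS = 64;
const CONTEXT_TOKENS = 256 * 1024; // assumption: "256K" means 262,144 tokens
const HEAD_DIM = 128;              // assumption: not stated in the source
const LATENT_DIM = 576;            // assumption: compressed cache width per token, per layer
const BYTES_PER_VALUE = 2;         // assumption: 16-bit cache entries

// Standard multi-head attention caches one key and one value vector
// per head, per layer, per token.
function standardKvBytes(tokens: number): number {
  return 2 * LAYERS * HEADS * HEAD_DIM * BYTES_PER_VALUE * tokens;
}

// A latent-attention cache stores a single compressed vector per layer, per token.
function latentKvBytes(tokens: number): number {
  return LAYERS * LATENT_DIM * BYTES_PER_VALUE * tokens;
}

const GIB = 1024 * 1024 * 1024;
const standardGiB = standardKvBytes(CONTEXT_TOKENS) / GIB;
const latentGiB = latentKvBytes(CONTEXT_TOKENS) / GIB;
const compressionRatio = standardGiB / latentGiB;
```

Under these assumed dimensions the uncompressed cache comes to 488 GiB for a single full-length sequence, against roughly 17 GiB for the compressed one. The exact numbers depend on the real dimensions, but the order-of-magnitude gap is the point.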

K2 Thinking ships with native INT4 weights produced by Quantization-Aware Training (QAT). Post-training quantization compresses a finished model and can cost accuracy; QAT instead simulates INT4 precision during training, so the weights adapt to it before release. The model card reports that this roughly doubles generation speed compared to the unquantized weights.

Attribute	Value
Total parameters	1 trillion (MoE)
Active parameters	32 billion per token
Total experts	384 (8 selected + 1 shared)
Layers	61 (1 dense)
Attention	64 heads, MLA
Vocabulary	160K
Context window	256K tokens
Quantization	Native INT4 (QAT)

Kimi K2 vs K2 Thinking

The two models share the same MoE backbone. The differences are the context window, the training objective, and the inference behavior. K2 Thinking trades latency (it generates a reasoning trace first) for accuracy on hard, multi-step problems.

Attribute	Kimi-K2-Instruct	Kimi K2 Thinking
Total / active params	1T / 32B	1T / 32B
Context window	128K	256K
Inference style	Direct answer	Reasoning + tool calls
Best for	Latency-sensitive chat, edits	Agentic tasks, hard reasoning, search
License	Modified MIT	Modified MIT
Quantization	Standard weights	Native INT4 (QAT)

The tradeoff is explicit. K2 Thinking is slower per response because it reasons before answering, but it wins on agentic search, hard math, and multi-file coding where a single-pass answer underperforms. For a simple rename or a short chat reply, the base Instruct model is the right call. Reserve the Thinking variant for tasks that actually need the reasoning.

Use Kimi-K2-Instruct when

The task is a direct edit, a short answer, or a latency-sensitive chat turn. 128K context is enough and you do not want to pay for a reasoning trace.

Use Kimi K2 Thinking when

The task is agentic (tool use, search), spans many files, or requires multi-step math or debugging. The 256K window and reasoning trace earn their cost on hard work.
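The selection rule above can be sketched as a small helper. This is one possible encoding, not an official recommendation: the `Task` fields and the token thresholds (reading 128K and 256K as multiples of 1,024) are assumptions, and the Instruct model ID is a placeholder because this page does not state the API identifier for it.

```typescript
// Sketch of the "which model" rule. Only "kimi-k2-thinking" appears in this page's
// API example; the Instruct ID below is a hypothetical placeholder.
const THINKING_MODEL = "kimi-k2-thinking";
const INSTRUCT_MODEL = "YOUR_KIMI_K2_INSTRUCT_MODEL_ID"; // placeholder, look up the real ID

const INSTRUCT_CONTEXT = 128 * 1024; // assumption: "128K" means 131,072 tokens
const THINKING_CONTEXT = 256 * 1024; // assumption: "256K" means 262,144 tokens

interface Task {
  promptTokens: number; // estimated prompt size in tokens
  needsTools: boolean;  // tool use or search
  multiStep: boolean;   // multi-file work, multi-step math or debugging
}

function pickKimiModel(task: Task): string {
  if (task.promptTokens > THINKING_CONTEXT) {
    throw new Error("Prompt exceeds the 256K window; trim or chunk the input first.");
  }
  // Anything that does not fit in 128K has to go to the larger window.
  if (task.promptTokens > INSTRUCT_CONTEXT) return THINKING_MODEL;
  // Agentic or multi-step work is where the reasoning trace earns its cost.
  if (task.needsTools || task.multiStep) return THINKING_MODEL;
  // Direct edits and short chat turns stay on the faster direct-answer model.
  return INSTRUCT_MODEL;
}
```

In practice the thresholds should leave headroom for the model's own output, since the context window covers both prompt and completion.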

Benchmarks

The numbers below are from the Kimi K2 Thinking model card. Coding benchmarks matter most for AI coding tools, so they lead; reasoning and agentic search scores follow.

Benchmark	Score	Notes
SWE-bench Verified	71.3	With tools
LiveCodeBench V6	83.1	Code generation
AIME 2025	94.5	No tools, competition math
GPQA	84.5	No tools, graduate science
Humanity's Last Exam	44.9	With tools, hardest reasoning set
BrowseComp	60.2	Agentic web search
BrowseComp-ZH	62.3	Agentic web search (Chinese)

The 71.3 on SWE-bench Verified is the headline for coding. SWE-bench Verified measures resolving real GitHub issues, so it tracks practical engineering ability more closely than synthetic code-completion tests. The 83.1 on LiveCodeBench V6 confirms the model is strong on fresh, contamination-resistant coding problems.

The tool-augmented scores (44.9 on Humanity's Last Exam with tools, 60.2 on BrowseComp) are where the "Thinking" training shows its value. These benchmarks require planning and executing multi-step tool-use trajectories, a workload that single-pass, direct-answer models handle poorly. This is the workload K2 Thinking was built for.

Kimi K2 vs DeepSeek

Kimi K2 and DeepSeek are the two most prominent open-weight MoE families from Chinese labs. Both can be self-hosted, both have first-party APIs, and both compete with closed frontier models on coding. The main differences are parameter scale, context window, license terms, and benchmark results.

Attribute	Kimi K2 Thinking	DeepSeek-V3.2-Exp
Total parameters	1 trillion	685 billion
Active parameters	32B	Not stated on card
Context window	256K	160K (163,840)
SWE-bench Verified	71.3 (with tools)	67.8
LiveCodeBench	83.1 (V6)	74.1
AIME 2025	94.5 (no tools)	89.3
GPQA-Diamond	84.5 (no tools)	79.9
License	Modified MIT	MIT

On the model-card numbers, Kimi K2 Thinking leads DeepSeek-V3.2-Exp on every shared coding and reasoning benchmark listed: 71.3 vs 67.8 on SWE-bench Verified, 83.1 vs 74.1 on LiveCodeBench, 94.5 vs 89.3 on AIME 2025, and 84.5 vs 79.9 on GPQA-Diamond. The caveat is that benchmark conditions differ (Kimi reports several scores with tools), so treat these gaps as indicative and read both model cards before settling on a ranking.
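For readers who want the gaps as numbers, the short sketch below computes the per-benchmark difference from the table's figures. It uses only the scores listed above and says nothing about whether the two sets of results were measured under comparable conditions.

```typescript
// Scores copied from the comparison table (Kimi K2 Thinking vs DeepSeek-V3.2-Exp).
const scores: Array<{ benchmark: string; kimi: number; deepseek: number }> = [
  { benchmark: "SWE-bench Verified", kimi: 71.3, deepseek: 67.8 },
  { benchmark: "LiveCodeBench", kimi: 83.1, deepseek: 74.1 },
  { benchmark: "AIME 2025", kimi: 94.5, deepseek: 89.3 },
  { benchmark: "GPQA-Diamond", kimi: 84.5, deepseek: 79.9 },
];

// Round to one decimal place so floating-point noise (e.g. 3.4999999) does not leak through.
const gaps = scores.map((s) => ({
  benchmark: s.benchmark,
  gap: Math.round((s.kimi - s.deepseek) * 10) / 10,
}));

// True only if Kimi K2 Thinking is ahead on every listed benchmark.
const kimiLeadsEverywhere = gaps.every((g) => g.gap > 0);
```

The gaps range from 3.5 points on SWE-bench Verified to 9.0 on LiveCodeBench.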

DeepSeek's advantage is a smaller total footprint (685B vs 1T) and the MIT license without the modified-attribution clause. For self-hosting on constrained hardware, the smaller model is cheaper to serve. See the best open-source coding models of 2026 for a fuller cross-model comparison.

How to Call the Kimi API

Moonshot AI serves Kimi models through an OpenAI-compatible API. The base URL is https://api.moonshot.ai/v1, confirmed from Moonshot's official platform documentation. Because the request and response schema match OpenAI's, any client built for the OpenAI SDK works by changing the base URL and the API key.

Calling Kimi K2 Thinking via the OpenAI-compatible API

import OpenAI from "openai";

// Moonshot AI exposes an OpenAI-compatible endpoint.
const client = new OpenAI({
  apiKey: process.env.MOONSHOT_API_KEY,
  baseURL: "https://api.moonshot.ai/v1",
});

const response = await client.chat.completions.create({
  model: "kimi-k2-thinking",
  messages: [
    { role: "system", content: "You are a senior engineer." },
    { role: "user", content: "Refactor this module to use dependency injection." },
  ],
});

console.log(response.choices[0].message.content);

Pricing not stated here

Moonshot AI's official platform pricing page did not publish a verifiable per-token price for the Kimi K2 Thinking API at the time this page was written. Rather than quote an unconfirmed number, check the current rate on Moonshot's pricing page directly. The base URL and OpenAI-compatible schema above are confirmed from Moonshot's documentation.

The same pattern works from Python: instantiate the OpenAI client, set base_url to https://api.moonshot.ai/v1, and call chat.completions.create with the Kimi model name. No SDK swap is needed because the wire format is OpenAI-compatible.
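To show what "OpenAI-compatible wire format" means without any SDK, here is a minimal sketch of the raw HTTP request. The `/chat/completions` path and Bearer-token header follow the OpenAI convention and are assumptions here; confirm them against Moonshot's API reference. The helper names and the `FetchLike` type are illustrative, and the API key is a placeholder.

```typescript
// Minimal sketch of the request an OpenAI-compatible client sends.
// Path and auth header follow the OpenAI convention (assumed, not confirmed by this page).
const BASE_URL = "https://api.moonshot.ai/v1";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build the request without sending it, so it can be inspected or logged.
function buildChatRequest(apiKey: string, model: string, messages: ChatMessage[]): HttpRequest {
  return {
    url: `${BASE_URL}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, messages }),
  };
}

// Minimal shape of the fetch function this sketch needs; the global fetch fits it.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Send the request and return the text of the first choice.
async function sendChat(fetchImpl: FetchLike, request: HttpRequest): Promise<string> {
  const response = await fetchImpl(request.url, {
    method: request.method,
    headers: request.headers,
    body: request.body,
  });
  if (!response.ok) {
    throw new Error(`Kimi API request failed with HTTP ${response.status}`);
  }
  const data = (await response.json()) as {
    choices: Array<{ message: { content: string } }>;
  };
  return data.choices[0].message.content;
}

const exampleRequest = buildChatRequest("YOUR_MOONSHOT_API_KEY", "kimi-k2-thinking", [
  { role: "user", content: "Summarize the tradeoffs of dependency injection." },
]);
// To send it: sendChat(fetch, exampleRequest).then((text) => console.log(text));
```

Passing the fetch function in as a parameter keeps the sketch testable with a stub and avoids tying it to one runtime.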

License and Open Weights

Kimi K2 and K2 Thinking are released under a Modified MIT License. The base MIT permissions (use, modify, distribute, including commercially) hold, with modifications around attribution. The weights are downloadable from Hugging Face, so the models are genuinely open weight rather than API-only.

Open weights mean you can self-host. For a team that needs data residency, custom serving, or freedom from a single vendor's rate limits, downloading and serving the weights is an option that closed models do not offer. The tradeoff is that running a 1T-parameter MoE model requires substantial GPU memory even with INT4 quantization and MoE sparsity.
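A rough estimate makes "substantial GPU memory" concrete. The sketch below computes weight storage only, under the simplifying assumption that every parameter is stored at the given bit width; real checkpoints mix precisions and add quantization scales, and serving also needs memory for activations and the KV cache. The 80 GB accelerator size is a hypothetical figure used purely for scale.

```typescript
// Rough weight-storage estimate. Ignores quantization scales, mixed-precision
// tensors, activations, and the KV cache, so treat the results as lower bounds.
const TOTAL_PARAMS = 1e12; // 1 trillion, from the model card

function weightGigabytes(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8 / 1e9; // bits -> bytes -> decimal GB
}

const int4Gb = weightGigabytes(TOTAL_PARAMS, 4);  // if every weight were stored in 4 bits
const bf16Gb = weightGigabytes(TOTAL_PARAMS, 16); // 16-bit baseline for comparison

// Hypothetical accelerator size, used only to show the scale of a deployment.
const GPU_MEMORY_GB = 80;
const minGpusForInt4Weights = Math.ceil(int4Gb / GPU_MEMORY_GB);
```

Under these assumptions the weights alone take about 500 GB at INT4 versus 2,000 GB at 16 bits, or at least seven 80 GB devices before any cache or activation memory. MoE sparsity reduces compute per token but not this footprint, because every expert has to be resident to be selectable.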

What Modified MIT permits

Commercial use is permitted. You can fine-tune, redistribute, and deploy the model in products. The "modified" clauses concern attribution conditions. Review the exact license text on the Hugging Face model card before shipping a commercial deployment.

Running Kimi-Class Models Through a Router

A Kimi-class model is the right tool for hard, agentic turns. It is the wrong tool for adding a TODO comment or renaming a variable. Sending every request to a frontier MoE model wastes money on the majority of prompts that a smaller model handles just as well.

Morph runs an OpenAI-compatible model router that classifies each prompt's difficulty in ~430ms into four tiers (easy, medium, hard, needs_info) and routes to the cheapest model that can handle it. The classification costs about $0.001 per request and cuts API spend 40-70% by reserving the expensive model for the hard turns. You mix a Kimi-class model with cheaper models per turn instead of paying frontier prices for boilerplate.
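The routing idea can be sketched in a few lines. This is a toy illustration, not Morph's classifier or API: the four tier names come from the description above, while the keyword heuristic, the interpretation of `needs_info` as "ask the user for more detail", and the two cheaper model IDs are hypothetical stand-ins.

```typescript
// Illustrative difficulty router. Tier names come from the text above;
// the heuristic and the non-Kimi model IDs are hypothetical.
type Tier = "easy" | "medium" | "hard" | "needs_info";

const MODEL_BY_TIER: Record<Exclude<Tier, "needs_info">, string> = {
  easy: "small-cheap-model", // hypothetical model ID
  medium: "mid-size-model",  // hypothetical model ID
  hard: "kimi-k2-thinking",
};

// Toy classifier: a production router would use a trained model, not keyword matching.
function classify(prompt: string): Tier {
  const text = prompt.trim().toLowerCase();
  if (text.length < 10) return "needs_info"; // too little to act on
  if (/\b(debug|architect|refactor|prove|multi-file)\b/.test(text)) return "hard";
  if (/\b(rename|todo|comment|typo|format)\b/.test(text)) return "easy";
  return "medium";
}

type Route = { action: "ask_user" } | { action: "call_model"; model: string };

// Map a prompt to either a clarifying question or the cheapest adequate model.
function route(prompt: string): Route {
  const tier = classify(prompt);
  if (tier === "needs_info") return { action: "ask_user" };
  return { action: "call_model", model: MODEL_BY_TIER[tier] };
}
```

With this shape, `route("Add a TODO comment above the parser")` goes to the cheap model while `route("Refactor this module to use dependency injection")` goes to the reasoning model, so the expensive model is only paid for on turns that need it.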

Morph also serves frontier-scale MoE models on its own fleet through one OpenAI-compatible API at api.morphllm.com. On Morph's fleet, the 230B-parameter minimax27-230b serves at ~140 tok/s and the 754B-parameter glm51-754b is available alongside it. That lets a router point hard turns at a large open-weight MoE model and easy turns at a small one, all behind a single endpoint.

One OpenAI-compatible API

Access many models, including frontier-scale MoE models, through a single endpoint at api.morphllm.com. No per-provider SDK juggling.

Route by difficulty

The router classifies each prompt in ~430ms and sends easy turns to cheap models, hard turns to a Kimi-class model. 40-70% cost savings.

Mix per turn

A single agent loop can use a large reasoning model for planning and a small model for execution turns, switching automatically per request.

Frequently Asked Questions

How many parameters is Kimi K2?

Kimi K2 is a Mixture-of-Experts model with 1 trillion total parameters and 32 billion activated per token. It has 384 total experts (8 selected plus 1 shared per token) across 61 layers. Only the activated 32B fire on a given token, so per-token compute tracks a 32B model rather than the full trillion.

What is the difference between Kimi K2 and Kimi K2 Thinking?

Both share the 1T total / 32B active MoE architecture. The original Kimi-K2-Instruct has a 128K context window and answers directly. Kimi K2 Thinking extends context to 256K tokens, is trained for long-horizon reasoning with tool calls, and produces an explicit reasoning trace before its answer. K2 Thinking scores 71.3 on SWE-bench Verified with tools.

What is the Kimi K2 context window?

Kimi K2 Thinking has a 256K token context window. The original Kimi-K2-Instruct model has a 128K token context window. Both figures are from Moonshot AI's Hugging Face model cards.

How do I use the Kimi API?

Moonshot AI serves Kimi models through an OpenAI-compatible API at base URL https://api.moonshot.ai/v1. Point any OpenAI SDK client at that base URL, supply your Moonshot API key, and call chat completions with the Kimi model name. Existing OpenAI-shaped code works with only a base-URL and key change.

Is Kimi K2 open source?

Kimi K2 and K2 Thinking are released as open weights under a Modified MIT License by Moonshot AI. The license permits commercial use with attribution conditions. The weights are downloadable from Hugging Face, so the models can be self-hosted rather than used only via the hosted API.

How does Kimi K2 compare to DeepSeek?

Kimi K2 has 1T total / 32B active parameters; DeepSeek-V3.2-Exp has 685B total. On SWE-bench Verified, Kimi K2 Thinking scores 71.3 (with tools) versus DeepSeek-V3.2-Exp's 67.8. On LiveCodeBench, Kimi K2 Thinking scores 83.1 (V6) versus 74.1. Both are MoE models under permissive open-weight licenses.

What are Kimi K2 0905 and Kimi K2.5?

Kimi K2 0905 refers to a dated checkpoint release of the Kimi K2 line; Moonshot AI versions model snapshots by date, and newer dated checkpoints follow the same convention. Kimi K2.5 is not covered by the model cards this page draws on, so no figures for it are given here. This page reports only the verified Kimi-K2-Instruct (1T/32B, 128K) and Kimi K2 Thinking (1T/32B, 256K) model cards.

Mix Kimi-Class Models With Cheaper Ones Per Turn

Morph's router classifies prompt difficulty in ~430ms and routes each turn to the right model tier through one OpenAI-compatible API at api.morphllm.com. Reserve a frontier MoE model for hard turns, send boilerplate to a cheap model, and cut API spend 40-70%.
