Summary
Kimi K2.5 at a Glance (Jan 2026)
- What: 1T parameter open-weight model with 32B activated (MoE), 256K context, multimodal vision+text+code
- Key innovation: Agent Swarm via PARL training, up to 100 parallel sub-agents, 1,500 coordinated steps
- Benchmarks: 76.8% SWE-bench Verified, 74.9% BrowseComp (78.4% with Swarm), 85.0% LiveCodeBench
- Cost: ~$0.81/M tokens blended, open weights under Modified MIT License
Sequential execution is the bottleneck in AI coding agents. An agent that searches, reads, searches again, and reads again wastes most of its time waiting. Kimi K2.5 addresses this at the model level: the model itself learns to decompose tasks and run sub-agents in parallel, rather than relying on application-layer orchestration.
The result is a model that completes complex research and coding tasks 3x to 4.5x faster than sequential execution, while matching or exceeding proprietary models on agentic benchmarks. It does this with open weights at a fraction of proprietary API costs.
Benchmarks
Kimi K2.5's benchmark profile is unusual. It leads on agentic tasks (BrowseComp, HLE) while remaining competitive on coding (SWE-bench) and strong on vision (MMMU Pro). Most models optimize for one category. K2.5 is competitive across all three.
| Benchmark | Kimi K2.5 | Best Competitor | Notes |
|---|---|---|---|
| SWE-bench Verified | 76.8% | Claude Opus 4.5: 72.7% | Open-source SOTA |
| BrowseComp | 74.9% (78.4% Swarm) | GPT-5.2: 57.8% | +17.1pp over GPT-5.2 |
| HLE (full set) | 50.2% | Claude Opus 4.5: 26.6% | Global SOTA |
| LiveCodeBench | 85.0% | GPT-5.2: 64.0% | +21pp gap |
| MMMU Pro (vision) | 78.5% | GPT-5.2: 72.5% | Open-source SOTA |
| VideoMMMU | 86.6% | Gemini 3 Pro: 81.7% | Video understanding |
| SWE-bench Multilingual | 73.0% | -- | Non-English codebases |
Benchmark Context
These are Moonshot AI's reported numbers from their technical report (arXiv:2602.02276). Independent reproduction may vary. BrowseComp Swarm mode uses a main agent with max 15 steps and sub-agents with max 100 steps each. SWE-bench Verified scores are on the standard 500-problem subset.
Agent Swarm Architecture
Most multi-agent systems are built at the application layer. You write an orchestrator that calls the same model multiple times with different prompts. The model itself has no concept of parallelism. Kimi K2.5 is different: the parallel decomposition behavior is trained directly into the model weights via PARL.
How It Works
A trainable orchestrator agent decomposes incoming tasks into parallelizable subtasks. Each subtask is assigned to a dynamically instantiated sub-agent. Sub-agents are frozen copies of the base model, each with independent tool access (search, code execution, file I/O). The orchestrator manages task assignment and result aggregation. Up to 100 sub-agents run concurrently across up to 1,500 coordinated steps.
Task Decomposition
The orchestrator analyzes the input task and identifies subtasks that can run independently. No predefined roles or hand-crafted workflows. The decomposition is learned from training.
Parallel Execution
Up to 100 frozen sub-agents execute simultaneously, each with its own context window and tool access. Sub-agents search, generate, analyze, and organize information independently.
Result Aggregation
The orchestrator collects sub-agent outputs, resolves conflicts, and synthesizes a final result. Most of the runtime savings come from parallelizing the search-read-search-read cycle.
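The decompose-execute-aggregate cycle above can be sketched at the application level. To be clear about what is an assumption: in K2.5 this behavior lives in the model weights, not in code like this, and the helper names (`decompose`, `run_subagent`, `aggregate`) are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for learned behaviors: in K2.5 the orchestrator's
# decomposition and each sub-agent's tool use are trained into the model,
# not written as application code.
def decompose(task: str) -> list[str]:
    # Split a wide-search task into independent subtasks.
    return [f"{task} :: part {i}" for i in range(4)]

def run_subagent(subtask: str) -> str:
    # Each sub-agent runs with its own context and tool access.
    return f"result for ({subtask})"

def aggregate(results: list[str]) -> str:
    # The orchestrator resolves conflicts and synthesizes a final answer.
    return "\n".join(results)

def swarm(task: str, max_agents: int = 100) -> str:
    subtasks = decompose(task)
    workers = min(max_agents, len(subtasks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_subagent, subtasks))
    return aggregate(results)

print(swarm("survey uses of authenticate() in the codebase"))
```

The key structural point the sketch captures: subtasks share no state with each other, so they can overlap fully; only decomposition and aggregation are sequential.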
Performance Gains
Agent Swarm reduces the minimum critical steps to reach target performance by 3x to 4.5x compared to single-agent execution. In Moonshot's internal evaluations, this translates to an 80% reduction in end-to-end runtime for wide-search tasks like research, data collection, and codebase analysis.
The 4.5x speedup is not from running the model faster. It is from eliminating sequential dependencies. When 100 sub-agents each search a different part of the problem space simultaneously, the wall-clock time approaches that of a single search rather than 100 sequential searches.
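The arithmetic behind that claim: for N independent searches of roughly t seconds each, sequential wall-clock time is N×t while parallel time approaches t plus coordination overhead. A toy calculation with illustrative numbers (not Moonshot's measurements):

```python
def wall_clock(n_searches: int, t_search: float, parallel: bool,
               overhead: float = 0.0) -> float:
    """Idealized wall-clock time for n independent searches."""
    if parallel:
        return t_search + overhead   # all searches overlap
    return n_searches * t_search     # each search waits on the last

seq = wall_clock(100, 5.0, parallel=False)                 # 500.0 s
par = wall_clock(100, 5.0, parallel=True, overhead=20.0)   # 25.0 s
print(f"sequential: {seq:.0f}s, parallel: {par:.0f}s, speedup: {seq / par:.0f}x")
```

The idealized speedup is far larger than the reported 3x to 4.5x because real decompositions are not fully independent: aggregation, conflict resolution, and subtasks that depend on earlier results all reintroduce sequential steps.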
PARL: Parallel-Agent Reinforcement Learning
Training a model to orchestrate parallel agents is harder than it sounds. The core challenge is credit assignment: when 100 agents contribute to a final result, which agents helped and which wasted compute? Standard RL struggles with this.
The Serial Collapse Problem
Without explicit parallelism incentives, the orchestrator defaults to running one agent at a time. Moonshot calls this "serial collapse." It is the safe strategy from the RL reward signal's perspective: sequential execution is easier to credit-assign, so the model learns to avoid parallelism even when it would be faster.
How PARL Solves It
PARL uses staged reward shaping with two phases:
- Early training: A `reward_parallel` signal explicitly incentivizes sub-agent instantiation and concurrent scheduling. This forces the model to explore parallel execution strategies rather than collapsing to sequential behavior.
- Late training: The parallelism reward is gradually reduced, shifting focus entirely to task success. By this point, the model has learned that parallelism produces better outcomes faster, and maintains the behavior without explicit incentive.
The orchestrator is trainable while sub-agents are frozen. This simplifies the training problem: only one component needs to learn, and the sub-agents provide a stable execution environment.
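A minimal sketch of the staged reward schedule. The exact annealing function is not published; the linear decay, the cap at 100 agents, and the function name here are assumptions for illustration.

```python
def parl_reward(task_success: float, n_parallel_agents: int,
                step: int, total_steps: int) -> float:
    """Staged reward shaping: parallelism bonus early, task success only late."""
    # Linearly anneal the parallelism weight from 1.0 down to 0.0 (assumed schedule).
    alpha = max(0.0, 1.0 - step / total_steps)
    # Bonus for actually instantiating concurrent sub-agents, which counters
    # the "serial collapse" failure mode of running one agent at a time.
    reward_parallel = min(n_parallel_agents, 100) / 100.0
    return task_success + alpha * reward_parallel

# Early in training, spawning agents is rewarded even before tasks succeed.
early = parl_reward(task_success=0.0, n_parallel_agents=50, step=0, total_steps=1000)
# Late in training, only task success matters.
late = parl_reward(task_success=1.0, n_parallel_agents=50, step=1000, total_steps=1000)
```

By the time alpha reaches zero, parallelism persists only if it genuinely improves task success, which is the behavior the staging is designed to lock in.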
Visual Coding: Screenshots to Working Code
Kimi K2.5 is a native multimodal model trained on ~15 trillion mixed visual and text tokens. The "visual coding" capability converts screenshots, wireframes, Figma mockups, and video demonstrations into production frontend code.
The Feedback Loop
The process is not one-shot generation. K2.5 generates code from a visual input, renders the output, compares against the original design, identifies discrepancies, and generates corrections. This visual debugging loop continues autonomously until the output meets quality thresholds. No human in the loop.
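The generate-render-compare-correct loop can be sketched generically. Everything below is a hand-rolled illustration, not K2.5's API: `visual_refine` and the stubbed comparison (which simply halves a discrepancy score each pass) are assumptions standing in for the model's internal visual debugging.

```python
def visual_refine(design, generate, render, compare, threshold=0.02, max_iters=10):
    """Generate code from a visual design, then iterate: render the output,
    measure discrepancy against the design, and regenerate with that
    feedback until the result is close enough. No human in the loop."""
    code = generate(design, feedback=None)
    for _ in range(max_iters):
        diff = compare(design, render(code))  # visual discrepancy score
        if diff <= threshold:
            break
        code = generate(design, feedback=diff)  # correct based on the diff
    return code

# Stubbed example: each correction pass halves the discrepancy.
state = {"diff": 0.32}

def fake_compare(design, rendered):
    state["diff"] /= 2
    return state["diff"]

code = visual_refine(
    design="mockup.png",
    generate=lambda d, feedback: f"<div><!-- fidelity {state['diff']:.2f} --></div>",
    render=lambda c: c,
    compare=fake_compare,
)
```

The structure, not the stubs, is the point: termination is driven by a measured visual discrepancy rather than a fixed number of generations.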
Visual Input Processing
Accepts UI screenshots, wireframes, video walkthroughs, and design mockups. Recognizes layout structure, component hierarchies, styling patterns, and interaction behaviors from visual input alone.
Code Generation and Iteration
Generates production React/HTML with responsive design, animations (parallax, scroll-triggered effects, transitions), and cross-browser compatibility. Then renders, compares, and iterates until visual fidelity matches the input.
| Benchmark | Kimi K2.5 | Category |
|---|---|---|
| MMMU Pro | 78.5% | Visual reasoning |
| VideoMMMU | 86.6% | Video understanding |
| LiveCodeBench | 85.0% | Competitive programming |
Moonshot describes this as lowering the threshold for intent communication. Instead of writing detailed specifications, you circle a spot on a screenshot and say "this isn't right." The model interprets the visual context and generates the fix.
Model Architecture
Kimi K2.5 is built on top of Kimi-K2-Base via continual pretraining on approximately 15 trillion mixed visual and text tokens.
| Parameter | Value |
|---|---|
| Total parameters | 1 trillion |
| Activated parameters | 32 billion (MoE) |
| Architecture | 61 layers (1 dense + 60 MoE) |
| Attention hidden dim | 7,168 |
| MoE hidden dim | 2,048 per expert |
| Context window | 256K tokens |
| Modalities | Text, images, video |
| License | Modified MIT |
| Weights | HuggingFace: moonshotai/Kimi-K2.5 |
The mixture-of-experts architecture means only 32B of the 1T parameters activate per forward pass. This keeps inference costs low relative to the total parameter count. The model supports both "instant" and "thinking" modes, as well as conversational and agentic paradigms.
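A back-of-the-envelope comparison makes the economics concrete. Using the common approximation of ~2 FLOPs per active parameter per generated token (an assumption, not a figure from the report):

```python
total_params = 1_000e9   # 1T total parameters
active_params = 32e9     # 32B activated per forward pass via MoE routing

# Rough approximation: ~2 FLOPs per active parameter per token.
flops_dense = 2 * total_params   # hypothetical dense 1T model
flops_moe = 2 * active_params    # MoE with 32B active

print(f"dense 1T:       {flops_dense / 1e12:.2f} TFLOPs/token")
print(f"MoE 32B active: {flops_moe / 1e12:.3f} TFLOPs/token")
print(f"per-token compute ratio: {flops_dense / flops_moe:.2f}x")
```

The full 1T weights still have to sit in memory, so the savings are in compute per token, not in deployment footprint; that is why quantized variants matter for local use.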
How It Compares
Kimi K2.5 occupies an unusual position: open-weight model with frontier-class agentic performance. Most open models trade off either scale or capability.
| Metric | Kimi K2.5 | Claude Opus 4.5 | GPT-5.2 |
|---|---|---|---|
| SWE-bench Verified | 76.8% | 72.7% | 74.9% |
| BrowseComp | 74.9% (78.4% Swarm) | 65.8% | 57.8% |
| HLE (full) | 50.2% | 26.6% | -- |
| Open weights | Yes (Modified MIT) | No | No |
| Parallel agents | 100 (model-level) | Agent Teams (app-level) | No native support |
| Cost (per 1M tok) | ~$0.81 blended | $5/$25 in/out | Varies |
| Context window | 256K | 200K (1M beta) | 128K |
Important Caveats
Benchmark comparisons use scores reported by each model's developers. Independent benchmarks may differ. Claude Opus 4.5 was the comparison target at K2.5's launch (Jan 2026). Claude Opus 4.6 and GPT-5.3 have since been released with higher scores on several benchmarks. K2.5's BrowseComp and HLE leads remain significant as of March 2026.
What This Means for Coding Agents
Kimi K2.5's Agent Swarm validates a thesis we have been writing about: intelligence organizes into hierarchies under resource constraints. When a single agent cannot fit the full problem into its context window, the efficient solution is to spawn specialist sub-agents with dedicated context per subtask.
Anthropic reported a 90% improvement from multi-agent approaches. Cognition measured 60% of agent time spent on search overhead. Kimi K2.5 attacks this from the training side rather than the infrastructure side. The model itself learns to parallelize, rather than relying on external orchestration.
The Subagent Paradigm
Three approaches to multi-agent coding are now live in production:
- Application-layer orchestration: Claude Code Agent Teams, Codex multi-thread. The application spawns multiple model calls and coordinates them. The model has no awareness of the parallelism.
- Model-level orchestration: Kimi K2.5 Agent Swarm. The model itself decomposes tasks and manages sub-agents. Parallelism is a trained behavior, not an external wrapper.
- Hybrid: Using K2.5 as the base model inside an application-layer agent framework. The model's native parallel decomposition combines with the framework's tool integration and persistence.
The hybrid approach is where K2.5 becomes most interesting for coding agent builders. A model that natively understands parallel task decomposition, combined with a framework that provides codebase search, file editing, and test execution, could reduce the search-read-search-read bottleneck that dominates current coding agent performance.
Using K2.5 with WarpGrep
K2.5's Agent Swarm can dispatch multiple parallel search queries through tools like WarpGrep. Instead of sequential codebase searches, where each query waits on the previous result, 10-20 sub-agents can search different parts of the codebase simultaneously. On SWE-bench tasks where search overhead dominates, this alone could recover much of the 60% of agent time that Cognition measured going to search.
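The fan-out pattern looks like this at the tool layer. The `search_codebase` function below is a hypothetical stand-in for a WarpGrep/MCP search call, not its actual client API:

```python
from concurrent.futures import ThreadPoolExecutor

def search_codebase(query: str) -> list[str]:
    # Hypothetical stand-in for a WarpGrep search call via MCP;
    # swap in your real tool client here. Returns matching file paths.
    return [f"src/match_{abs(hash(query)) % 100}.py"]

def parallel_search(queries: list[str], max_workers: int = 16) -> dict[str, list[str]]:
    """Fan out independent search queries instead of running them serially."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(queries, pool.map(search_codebase, queries)))

hits = parallel_search([
    "def authenticate",
    "class SessionStore",
    "TODO: rate limit",
])
```

With network-bound searches of a few seconds each, the wall-clock cost of the batch approaches the cost of the slowest single query.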
Frequently Asked Questions
What is Kimi K2.5?
A 1-trillion-parameter open-weight multimodal model from Moonshot AI (Beijing), released January 27, 2026. It uses mixture-of-experts (32B activated parameters), supports text/image/video inputs, has a 256K context window, and is trained with PARL to orchestrate up to 100 parallel sub-agents. Weights available on HuggingFace under Modified MIT License.
What is Agent Swarm?
Kimi K2.5's parallel execution system. A trained orchestrator decomposes tasks into subtasks, spawns up to 100 frozen sub-agents that execute independently with their own tool access, and aggregates results. This reduces wall-clock time by up to 4.5x on wide-search tasks compared to sequential execution.
How does PARL training work?
Parallel-Agent Reinforcement Learning trains only the orchestrator (sub-agents are frozen). Staged reward shaping initially incentivizes parallel sub-agent instantiation to prevent "serial collapse" (where the model defaults to sequential execution), then gradually shifts to task-success-only rewards once parallel behavior is established.
How does Kimi K2.5 compare to Claude and GPT for coding?
K2.5 scores 76.8% on SWE-bench Verified (vs Claude Opus 4.5's 72.7% at launch). It leads BrowseComp at 74.9-78.4% (vs GPT-5.2's 57.8%). Claude Opus 4.6 and GPT-5.3, released after K2.5, have higher scores on some benchmarks. K2.5's cost advantage ($0.81/M tokens vs $5-25/M for Claude Opus) makes it compelling for high-volume agent workloads.
Can I run Kimi K2.5 locally?
The full 1T model requires significant infrastructure. Quantized versions (FP4, GGUF) from NVIDIA and Unsloth are available on HuggingFace for more accessible deployment. The 32B activated parameter count per forward pass makes inference more tractable than the total parameter count suggests.
What is Kimi Code?
Moonshot released Kimi Code alongside K2.5 as an open-source coding tool. It uses K2.5 as its backend model and provides IDE-like features for code generation, debugging, and refactoring powered by the model's visual coding and agent capabilities.
Parallel Search for Parallel Agents
WarpGrep provides the codebase search layer that agent swarms need. 8 parallel tool calls per turn, sub-6s latency, works as an MCP server inside any coding agent. Pair it with K2.5's Agent Swarm for parallel search across your entire codebase.
Sources
- Kimi K2.5 Tech Blog: Visual Agentic Intelligence
- Kimi K2.5: Visual Agentic Intelligence (Technical Report)
- moonshotai/Kimi-K2.5 on HuggingFace
- MoonshotAI/Kimi-K2.5 on GitHub
- TechCrunch: China's Moonshot releases Kimi K2.5 and a coding agent
- InfoQ: Moonshot AI Releases Open-Weight Kimi K2.5 with Agent Swarm
- Kimi Agent Swarm: 100 Sub-Agents at Scale
- DataCamp: Kimi K2.5 and Agent Swarm Guide