Benchmarks

Multi-Agent Benchmarks

One model shouldn't run the whole loop. We benchmark planner + executor pairs on end-to-end app builds — the planner model runs at the start of a conversation and after every compaction, the executor runs everything in between.

Planner · Claude Fable 5

Owns the first turn and every turn after compaction. Writes the spec, decomposes work into subtasks, reviews integration.

Executor · Kimi K2.7 Code

Runs the long middle: file edits, tests, tool calls. 12x cheaper per output token than the planner.

PLANEXECUTIONRE-PLANEXECUTIONPlan · Execution · Compaction · Re-plan

Compaction boundary

Context gets summarized and detail is lost. Quality degrades after two or three compactions when one model runs the whole loop. The planner re-reads state and re-plans before execution resumes.

Fig 1: Mixed-model agent loop

Leaderboard

The Multi-Agent tab weights accuracy 2x against price and speed. Performance, Price, and Speed rank single models on each axis alone.

#	Planner	Executor	Accuracy	$ / Build	Tok/s	Score
1	Claude Fable 5	GLM-5.2	92.6%	$66.20	72	87.5
2	Claude Fable 5	Kimi K2.7 Code	91.6%	$64.60	44	62.8
3	GPT-5.5 Codex	DeepSeek V4 Pro	87.6%	$32.40	67	59.4
4	Claude Fable 5	DeepSeek V4 Flash	85.9%	$53.60	80	54.7
5	Claude Fable 5	— solo baseline	92.4%	$185.00	48	52.2
6	GPT-5.5 Codex	Kimi K2.7 Code	88.9%	$39.60	46	51.7
7	Claude Fable 5	MiniMax M3	89.0%	$56.30	47	50.1
8	Claude Opus 4.8	Kimi K2.7 Code	87.4%	$38.40	49	44.6
9	Claude Opus 4.8	MiniMax M3	85.1%	$30.00	51	33.8
10	Kimi K2.7 Code	— solo baseline	84.2%	$16.90	43	25.0

Fig 2: Weighted composite — accuracy ×2, price ×1, speed ×1

Tasks

40 end-to-end app builds — scaffold, feature, refactor, ship. A build counts as a success only when the app runs and passes its acceptance checks.

Mixed models

The planner model runs at the start of a conversation and after every compaction; the executor runs everything in between. Token profile per build: 4M in / 250K out on the planner, 10M in / 650K out on the executor.

Score

Accuracy, cost per build, and throughput are min-max normalized to 0-100 across all pairs, then combined 2:1:1 — accuracy counts double, price and speed count equally.

Data

Pricing and p50 throughput from the OpenRouter API, June 2026. Speed uses the fastest generally-available provider per model.

Run mixed-model agents in production

Morph serves the executor side of this table — fast apply, search, and compaction APIs that drop into any agent loop.

Get Started Book a Call

Kimi K3

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Multi-Agent Benchmarks

Leaderboard

Run mixed-model agents in production