Benchmarks

Multi-Agent Benchmarks

One model shouldn't run the whole loop. We benchmark planner + executor pairs on end-to-end app builds — the planner model runs at the start of a conversation and after every compaction, the executor runs everything in between.

Planner · Claude Fable 5

Owns the first turn and every turn after compaction. Writes the spec, decomposes work into subtasks, reviews integration.

Executor · Kimi K2.7 Code

Runs the long middle: file edits, tests, tool calls. 12x cheaper per output token than the planner.

Plan · Execution · Compaction · Re-plan
Compaction boundary

Context gets summarized and detail is lost. Quality degrades after two or three compactions when one model runs the whole loop. The planner re-reads state and re-plans before execution resumes.

Fig 1: Mixed-model agent loop

Leaderboard

The Multi-Agent tab weights accuracy 2x against price and speed. Performance, Price, and Speed rank single models on each axis alone.

#PlannerExecutorAccuracy$ / BuildTok/sScore
1Claude Fable 5Kimi K2.7 Code91.6%$64.6044
63.9
2GPT-5.5 CodexDeepSeek V4 Pro87.6%$32.4067
59.9
3Claude Fable 5DeepSeek V4 Flash85.9%$53.6080
54.9
4Claude Fable 5— solo baseline92.4%$185.0048
53.4
5GPT-5.5 CodexKimi K2.7 Code88.9%$39.6046
52.3
6Claude Fable 5MiniMax M389.0%$56.3047
50.8
7Claude Opus 4.8Kimi K2.7 Code87.4%$38.4049
45.1
8Claude Opus 4.8MiniMax M385.1%$30.0051
33.9
9Kimi K2.7 Code— solo baseline84.2%$16.9043
25.0
Fig 2: Weighted composite — accuracy ×2, price ×1, speed ×1
Tasks

40 end-to-end app builds — scaffold, feature, refactor, ship. A build counts as a success only when the app runs and passes its acceptance checks.

Mixed models

The planner model runs at the start of a conversation and after every compaction; the executor runs everything in between. Token profile per build: 4M in / 250K out on the planner, 10M in / 650K out on the executor.

Score

Accuracy, cost per build, and throughput are min-max normalized to 0-100 across all pairs, then combined 2:1:1 — accuracy counts double, price and speed count equally.

Data

Pricing and p50 throughput from the OpenRouter API, June 2026. Speed uses the fastest generally-available provider per model.

Run mixed-model agents in production

Morph serves the executor side of this table — fast apply, search, and compaction APIs that drop into any agent loop.