Multi-Agent Benchmarks
One model shouldn't run the whole loop. We benchmark planner + executor pairs on end-to-end app builds — the planner model runs at the start of a conversation and after every compaction, the executor runs everything in between.
Owns the first turn and every turn after compaction. Writes the spec, decomposes work into subtasks, reviews integration.
Runs the long middle: file edits, tests, tool calls. 12x cheaper per output token than the planner.
Context gets summarized and detail is lost. Quality degrades after two or three compactions when one model runs the whole loop. The planner re-reads state and re-plans before execution resumes.
Leaderboard
The Multi-Agent tab weights accuracy 2x against price and speed. Performance, Price, and Speed rank single models on each axis alone.
| # | Planner | Executor | Accuracy | $ / Build | Tok/s | Score |
|---|---|---|---|---|---|---|
| 1 | Claude Fable 5 | Kimi K2.7 Code | 91.6% | $64.60 | 44 | 63.9 |
| 2 | GPT-5.5 Codex | DeepSeek V4 Pro | 87.6% | $32.40 | 67 | 59.9 |
| 3 | Claude Fable 5 | DeepSeek V4 Flash | 85.9% | $53.60 | 80 | 54.9 |
| 4 | Claude Fable 5 | — solo baseline | 92.4% | $185.00 | 48 | 53.4 |
| 5 | GPT-5.5 Codex | Kimi K2.7 Code | 88.9% | $39.60 | 46 | 52.3 |
| 6 | Claude Fable 5 | MiniMax M3 | 89.0% | $56.30 | 47 | 50.8 |
| 7 | Claude Opus 4.8 | Kimi K2.7 Code | 87.4% | $38.40 | 49 | 45.1 |
| 8 | Claude Opus 4.8 | MiniMax M3 | 85.1% | $30.00 | 51 | 33.9 |
| 9 | Kimi K2.7 Code | — solo baseline | 84.2% | $16.90 | 43 | 25.0 |
40 end-to-end app builds — scaffold, feature, refactor, ship. A build counts as a success only when the app runs and passes its acceptance checks.
The planner model runs at the start of a conversation and after every compaction; the executor runs everything in between. Token profile per build: 4M in / 250K out on the planner, 10M in / 650K out on the executor.
Accuracy, cost per build, and throughput are min-max normalized to 0-100 across all pairs, then combined 2:1:1 — accuracy counts double, price and speed count equally.
Pricing and p50 throughput from the OpenRouter API, June 2026. Speed uses the fastest generally-available provider per model.
Run mixed-model agents in production
Morph serves the executor side of this table — fast apply, search, and compaction APIs that drop into any agent loop.