Multi-Agent Model Routing: Planner + Executor Pairs

Every turn of a single-model coding agent hits the same frontier model. File reads, tool call parsing, typo fixes, and import additions all pay $5/M input tokens. The planner-executor split fixes this: frontier model for reasoning (4M tokens / build), cheaper model for execution (10M tokens / build). Benchmarked on 40 real app builds. Median cost reduction: 4x on execution-side spend.

4.4×

Cost reduction vs monolithic agents (PEAR benchmark)

40

End-to-end app builds in Morph benchmark

14M

Tokens per build (4M planner + 10M executor)

2:1:1

Accuracy : price : speed score weighting

The Single-Model Cost Problem

A coding agent running Opus 4.8 on every turn processes roughly 14M input tokens per end-to-end app build. At $5/M input tokens, that is $70 in input costs alone, before output, before caching, and before the multiple builds needed to get a task right. A 20-developer team running 10 builds per day per developer spends $14,000 per day on input tokens, or roughly $280K/month assuming 20 working days.
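The arithmetic in this paragraph can be reproduced with a short sketch. The function names are illustrative, and the 20 working days per month is an assumption rather than a benchmark figure.

```typescript
// Input-token spend for an agent that sends every turn to one model.
interface TeamProfile {
  developers: number;
  buildsPerDevPerDay: number;
  workingDaysPerMonth: number; // assumption for illustration
}

function inputCostPerBuild(inputTokens: number, pricePerMillion: number): number {
  return (inputTokens / 1_000_000) * pricePerMillion;
}

function monthlyInputSpend(team: TeamProfile, costPerBuild: number): number {
  const buildsPerMonth =
    team.developers * team.buildsPerDevPerDay * team.workingDaysPerMonth;
  return buildsPerMonth * costPerBuild;
}

// 14M input tokens at $5 per million tokens
const perBuild = inputCostPerBuild(14_000_000, 5); // 70
const monthly = monthlyInputSpend(
  { developers: 20, buildsPerDevPerDay: 10, workingDaysPerMonth: 20 },
  perBuild,
); // 280000
console.log(perBuild, monthly);
```

Changing `workingDaysPerMonth` or the build rate shows how quickly input spend scales with team size.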

Most of that spend goes to turns that do not need a frontier model. File reads, tool call parsing, mechanical edits, and test command invocations are execution tasks. They require the model to follow instructions and produce valid structured output, not to reason about novel problems. A mid-tier model handles them just as well at roughly 1.7x to 5x lower token cost (Sonnet 4.6 at $3/M and Haiku 4.5 at $1/M input, versus $5/M for Opus 4.8).

The planning turns are different. Reading the task, understanding the codebase, decomposing the problem, deciding the approach — these require the full reasoning capacity of a frontier model. The mistake is applying frontier-model pricing to execution turns that a cheaper model handles equally well.

Where the tokens go in a coding agent session

Analysis of 40 app builds: ~30% of tokens go to planning turns (task decomposition, synthesis, compaction summaries). ~70% go to execution turns (tool calls, file reads, mechanical edits, test runs). The planning tokens justify frontier pricing. The execution tokens do not.

How Planner + Executor Routing Works

The planner-executor split assigns a different model to each role in the agent loop:

Planner (Frontier Model)

Fires at the start of a session and after every context compaction. Reads the task, understands the codebase state, decomposes the problem, and produces a plan. Typically 4M input / 250K output tokens per build. Uses Opus 4.8, Fable 5 (currently suspended, see note), or GPT-5.5.

Executor (Mid-Tier Model)

Handles every turn in between compaction events. Follows the plan, makes tool calls, reads files, applies edits, runs tests. Typically 10M input / 650K output per build. Uses Sonnet 4.6, Haiku 4.5, or Gemini 3.1 Flash.

Compaction Trigger

When context approaches the model's compaction threshold, the planner re-fires. It reads the compacted summary and updates the plan for the next execution phase. This prevents context rot while keeping planning on the frontier model.

Score Composition

Morph's benchmark weights accuracy 2x, price 1x, speed 1x. The top-scoring pairs are not the cheapest or the fastest alone — they are the ones where the executor is cheap enough to shift the weighted cost materially without degrading build success rate.

The architecture is a thin routing layer, not a full orchestration framework. Your agent loop checks which phase it is in (planning vs execution) and sets the model accordingly. The actual LLM calls go directly to the provider API. No proxy, no middleware overhead.
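As a minimal sketch of such a routing layer, the phase check can be driven by context size rather than turn count. The model identifiers, the `ContextBudget` shape, and the 0.8 trigger fraction are all assumptions made for illustration.

```typescript
type Phase = "planning" | "executing";

// Hypothetical model identifiers; substitute the IDs your provider exposes.
const MODELS: Record<Phase, string> = {
  planning: "frontier-model",
  executing: "mid-tier-model",
};

interface ContextBudget {
  usedTokens: number;        // tokens currently in the conversation
  windowTokens: number;      // context window of the executor model
  compactAtFraction: number; // assumed trigger point, e.g. 0.8
}

// Plan on the first turn and whenever the context nears the compaction point.
function nextPhase(isFirstTurn: boolean, budget: ContextBudget): Phase {
  if (isFirstTurn) return "planning";
  const limit = budget.windowTokens * budget.compactAtFraction;
  return budget.usedTokens >= limit ? "planning" : "executing";
}

function modelFor(phase: Phase): string {
  return MODELS[phase];
}

const budget: ContextBudget = {
  usedTokens: 170_000,
  windowTokens: 200_000,
  compactAtFraction: 0.8,
};
console.log(modelFor(nextPhase(false, budget))); // frontier-model
```

The returned model name is passed straight to the provider call, so the layer adds no network hop of its own.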

Benchmark: 40 Real App Builds

Morph tested 44 planner-executor pairs on 40 end-to-end app builds across four task types: scaffold a new app, add a feature, refactor a module, ship to production. A build counts as successful only when the app runs and passes its acceptance checks — no partial credit.

Benchmark Methodology

Dimension	Detail
Task set	40 end-to-end app builds (scaffold, feature, refactor, ship)
Success criterion	App runs + passes all acceptance checks
Model pairs tested	44 planner-executor combinations
Token profile	4M in / 250K out (planner), 10M in / 650K out (executor)
Pricing source	OpenRouter API, June 2026
Score formula	Accuracy (×2) + Price (×1) + Speed (×1), min-max normalized to 0-100

The full leaderboard — ranked by the combined 2:1:1 score and filterable by accuracy-only, price-only, and speed-only — is at morphllm.com/benchmarks/multiagent.
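One way to read the score formula is sketched below. The details are assumptions: each dimension is min-max normalized across the tested pairs, price and time are inverted so that lower is better, and the 2:1:1 weighted sum is divided by 4 to land in 0-100. The sample pairs and their numbers are invented for illustration and are not benchmark results.

```typescript
interface PairResult {
  name: string;
  accuracy: number;        // build success rate, 0..1 (higher is better)
  costPerBuild: number;    // dollars (lower is better)
  secondsPerBuild: number; // wall-clock time (lower is better)
}

// Min-max normalize to 0..1; a constant column maps to 1 for every entry.
function minMax(values: number[], lowerIsBetter: boolean): number[] {
  const min = Math.min(...values);
  const max = Math.max(...values);
  if (max === min) return values.map(() => 1);
  return values.map((v) =>
    lowerIsBetter ? (max - v) / (max - min) : (v - min) / (max - min),
  );
}

function scorePairs(results: PairResult[]): { name: string; score: number }[] {
  const acc = minMax(results.map((r) => r.accuracy), false);
  const price = minMax(results.map((r) => r.costPerBuild), true);
  const speed = minMax(results.map((r) => r.secondsPerBuild), true);
  return results
    .map((r, i) => ({
      name: r.name,
      // Weights 2:1:1, divided by the weight sum so the score is 0..100.
      score: (100 * (2 * acc[i] + price[i] + speed[i])) / 4,
    }))
    .sort((a, b) => b.score - a.score);
}

// Invented sample data, not benchmark results.
const ranked = scorePairs([
  { name: "pair-a", accuracy: 0.9, costPerBuild: 66, secondsPerBuild: 600 },
  { name: "pair-b", accuracy: 0.8, costPerBuild: 40, secondsPerBuild: 500 },
  { name: "pair-c", accuracy: 0.6, costPerBuild: 12, secondsPerBuild: 900 },
]);
console.log(ranked.map((r) => r.name)); // pair-b, pair-a, pair-c
```

In this sample the cheapest pair ranks last and the most accurate pair ranks second, which matches the point that neither price nor accuracy alone decides the ranking.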

Why accuracy weights 2x

A build that fails costs more than the model tokens: it also costs developer time to diagnose and retry. On a $2/build task, a 10-point drop in success rate wastes only about $0.20 per attempt in tokens, while at $100/hr developer rates each failed build also burns developer time that quickly exceeds that. Weighting accuracy double reflects that the failure cost dominates the token cost on most real workloads.
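The trade-off can be made concrete with a small expected-cost sketch. The 15 minutes of rework per failed build is an assumption chosen for illustration; the $2 token cost and $100/hr rate come from the paragraph above.

```typescript
// Expected cost of one build attempt: tokens are always paid,
// rework is paid only when the build fails.
function expectedCostPerAttempt(
  tokenCost: number,     // dollars per attempt
  successRate: number,   // 0..1
  reworkMinutes: number, // developer time per failed build (assumed)
  hourlyRate: number,    // dollars per hour
): number {
  const reworkCost = (reworkMinutes / 60) * hourlyRate;
  return tokenCost + (1 - successRate) * reworkCost;
}

const at90 = expectedCostPerAttempt(2, 0.9, 15, 100); // about 4.50
const at80 = expectedCostPerAttempt(2, 0.8, 15, 100); // about 7.00
const extraPerAttempt = at80 - at90;                  // about 2.50
console.log(at90.toFixed(2), at80.toFixed(2), extraPerAttempt.toFixed(2));
```

Under these assumptions a 10-point drop in success rate adds about $2.50 per attempt in expected rework, more than the full $2 token cost of the attempt.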

Token Profile: Where the Spend Goes

Across 40 builds, the median token breakdown per build:

Token Profile per Build (Median, 40 Builds)

Turn Type	Input Tokens	Output Tokens	Cost at Opus 4.8 ($5/$25/M)	Cost at Sonnet 4.6 ($3/$15/M)
Planner turns	4M	250K	$20.00 + $6.25 = $26.25	$12.00 + $3.75 = $15.75
Executor turns	10M	650K	$50.00 + $16.25 = $66.25	$30.00 + $9.75 = $39.75
Total (all Opus 4.8)	14M	900K	$92.50	—
Total (Opus plan + Sonnet exec)	14M	900K	$26.25 + $39.75 = $66.00	—
Total (Opus plan + Haiku exec)	14M	900K	$26.25 + $13.25 = $39.50	—

Switching the executor from Opus 4.8 to Sonnet 4.6 saves $26.50 per build (about 29%). Switching to Haiku 4.5 ($1/$5 per M) saves $53.00 per build (57%). The planner cost is fixed at $26.25 regardless of executor choice because it always runs on the frontier model.

Cache pricing compounds these savings. Anthropic charges 10% of the base input price for cache reads. With a stable system prompt and long file context, execution turns often hit 60-80% cache hit rates. A 70% cache hit rate on Sonnet 4.6 executor input drops the executor input cost from $30.00 to $30.00 x (0.3 + 0.7 x 0.1) = $30.00 x 0.37 = $11.10 per 10M tokens. Adding $9.75 of executor output and the $26.25 planner cost gives roughly $47.10 per build, a 49% reduction from the all-Opus baseline (cache-write surcharges not included).
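The per-build figures can be recomputed from the listed prices and the median token profile. This is a sketch: the type and function names are illustrative, and cache-write surcharges are deliberately left out.

```typescript
interface Pricing { inputPerM: number; outputPerM: number } // dollars per million tokens
interface Usage { inputTokens: number; outputTokens: number }

// Prices as listed in the table above.
const OPUS: Pricing = { inputPerM: 5, outputPerM: 25 };
const SONNET: Pricing = { inputPerM: 3, outputPerM: 15 };
const HAIKU: Pricing = { inputPerM: 1, outputPerM: 5 };

// Median token profile per build.
const PLANNER_USAGE: Usage = { inputTokens: 4_000_000, outputTokens: 250_000 };
const EXECUTOR_USAGE: Usage = { inputTokens: 10_000_000, outputTokens: 650_000 };

// cacheHitRate is the fraction of input tokens billed at the cache-read rate.
function roleCost(
  p: Pricing,
  u: Usage,
  cacheHitRate = 0,
  cacheReadMultiplier = 0.1,
): number {
  const inputShare = 1 - cacheHitRate + cacheHitRate * cacheReadMultiplier;
  const input = (u.inputTokens / 1e6) * p.inputPerM * inputShare;
  const output = (u.outputTokens / 1e6) * p.outputPerM;
  return input + output;
}

function buildCost(planner: Pricing, executor: Pricing, executorCacheHitRate = 0): number {
  return (
    roleCost(planner, PLANNER_USAGE) +
    roleCost(executor, EXECUTOR_USAGE, executorCacheHitRate)
  );
}

const allOpus = buildCost(OPUS, OPUS);                 // 92.50
const opusSonnet = buildCost(OPUS, SONNET);            // 66.00
const opusHaiku = buildCost(OPUS, HAIKU);              // 39.50
const opusSonnetCached = buildCost(OPUS, SONNET, 0.7); // about 47.10
console.log(allOpus, opusSonnet, opusHaiku, opusSonnetCached.toFixed(2));
```

Keeping the planner and executor as separate `roleCost` calls makes it easy to test other pairs or apply a cache hit rate to the planner as well.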

Model Tier Selection

Not all model pairs perform equally. The executor must handle the agent's full tool call schema (file read, file write, bash, search) and produce valid JSON on every turn. Models that drop tool call arguments or produce malformed JSON on complex schemas fail builds regardless of their benchmark scores on clean chat tasks.
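A pre-execution check along these lines can catch dropped arguments and malformed JSON before a tool runs. This is a minimal sketch: the tool names follow the list above, but the required-argument map is a hypothetical schema, not the benchmark's.

```typescript
// Hypothetical minimal schema: tool name -> required argument names.
const REQUIRED_ARGS = new Map<string, string[]>([
  ["file_read", ["path"]],
  ["file_write", ["path", "content"]],
  ["bash", ["command"]],
  ["search", ["query"]],
]);

type Validation = { ok: true } | { ok: false; reason: string };

function validateToolCall(toolName: string, rawArguments: string): Validation {
  const required = REQUIRED_ARGS.get(toolName);
  if (!required) return { ok: false, reason: `unknown tool: ${toolName}` };

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArguments);
  } catch {
    return { ok: false, reason: "arguments are not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, reason: "arguments must be a JSON object" };
  }

  const args = parsed as Record<string, unknown>;
  const missing = required.filter((key) => !(key in args));
  return missing.length === 0
    ? { ok: true }
    : { ok: false, reason: `missing arguments: ${missing.join(", ")}` };
}

console.log(validateToolCall("file_write", '{"path":"a.ts"}'));
// { ok: false, reason: "missing arguments: content" }
```

Counting validation failures per executor model over a few builds gives a quick signal of whether a candidate executor can handle the full tool schema.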

Recommended Model Pairs (June 2026)

Pair	Planner	Executor	Cost / Build (est.)	Notes
Max quality	Claude Fable 5 ($10/$50/M, currently suspended, see note)	Sonnet 4.6 ($3/$15/M)	~$92	Highest accuracy, 2x planner cost vs Opus 4.8
Balanced	Opus 4.8 ($5/$25/M)	Sonnet 4.6 ($3/$15/M)	~$66	Recommended starting point
Cost-optimized	Opus 4.8 ($5/$25/M)	Haiku 4.5 ($1/$5/M)	~$40	Best cost/accuracy tradeoff in benchmark
Open-weight	Qwen3.5 235B (via OpenRouter)	Qwen3.5 30B	~$12	No proprietary API dependency
Speed-optimized	GPT-5.5 ($3/$15/M)	Gemini 3.1 Flash ($0.10/$0.40/M)	~$17	Lowest latency pair in benchmark

See the full ranked leaderboard — filtered by accuracy, price, or speed — at benchmarks/multiagent.

Implementation

The planner-executor split is a routing decision made on each turn, not a framework change. Your agent loop tracks phase state and switches models at session start and at compaction boundaries.

Planner-executor routing (TypeScript)

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Model assignments
const PLANNER_MODEL = "claude-opus-4-8-20260528";   // Frontier: planning only
const EXECUTOR_MODEL = "claude-sonnet-4-6-20260601"; // Mid-tier: execution turns

type Phase = "planning" | "executing";

interface AgentState {
  phase: Phase;
  turnsSinceCompaction: number;
  compactionThreshold: number; // e.g. 80 turns
}

async function agentTurn(
  messages: Anthropic.MessageParam[],
  state: AgentState,
  tools: Anthropic.Tool[],
): Promise<{ response: Anthropic.Message; state: AgentState }> {
  // Determine model based on phase
  const model =
    state.phase === "planning" ? PLANNER_MODEL : EXECUTOR_MODEL;

  const response = await anthropic.messages.create({
    model,
    max_tokens: 8192,
    tools,
    messages,
  });

  // After a planning turn, hand off to the executor; otherwise count the turn
  const nextState: AgentState = { ...state };
  if (state.phase === "planning") {
    nextState.phase = "executing";
    nextState.turnsSinceCompaction = 0;
  } else {
    nextState.turnsSinceCompaction += 1;
    // Trigger compaction + re-plan at threshold
    if (nextState.turnsSinceCompaction >= state.compactionThreshold) {
      nextState.phase = "planning"; // Next turn fires planner
    }
  }

  return { response, state: nextState };
}

// Session start: always plan first
const initialState: AgentState = {
  phase: "planning",
  turnsSinceCompaction: 0,
  compactionThreshold: 80,
};

The compaction threshold determines how often the planner re-fires. Lower thresholds (every 30-50 turns) give the planner more opportunity to re-evaluate and correct drift at higher cost. Higher thresholds (80-120 turns) are cheaper but risk the executor drifting from the original plan.
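To see how the threshold changes the planner's share of turns, the transition rule can be simulated without calling any model. This is a sketch under the assumption that a session is a fixed number of turns; the 400-turn session length is invented for illustration.

```typescript
// Counts planner and executor turns for a session of totalTurns, using a
// turn-count rule: one planning turn, then `threshold` executor turns,
// then plan again.
function countTurns(
  totalTurns: number,
  threshold: number,
): { planner: number; executor: number } {
  let phase: "planning" | "executing" = "planning";
  let sinceCompaction = 0;
  let planner = 0;
  let executor = 0;
  for (let turn = 0; turn < totalTurns; turn++) {
    if (phase === "planning") {
      planner++;
      phase = "executing";
      sinceCompaction = 0;
    } else {
      executor++;
      sinceCompaction++;
      if (sinceCompaction >= threshold) phase = "planning";
    }
  }
  return { planner, executor };
}

const frequent = countTurns(400, 40); // { planner: 10, executor: 390 }
const sparse = countTurns(400, 100);  // { planner: 4, executor: 396 }
console.log(frequent, sparse);
```

In this example, dropping the threshold from 100 to 40 raises the number of frontier-model turns from 4 to 10 over the same session, which is the cost side of correcting drift more often.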

Integrating with Morph context compaction

import Anthropic from "@anthropic-ai/sdk";
import { morph } from "morph";

// Before firing the planner, compact accumulated executor context
async function compactAndReplan(
  executorHistory: Anthropic.MessageParam[],
  originalTask: string,
): Promise<Anthropic.MessageParam[]> {
  // Compact executor turns: ~33,000 tok/s, 50-70% reduction
  const { compacted } = await morph.compact({
    messages: executorHistory,
    targetRatio: 0.4, // Reduce to 40% of original token count
  });

  // Planner receives the original task plus the compacted summary in a single
  // user message, so the conversation ends on a user turn and the planner
  // replies with an updated plan instead of continuing an assistant message.
  return [
    {
      role: "user",
      content: `${originalTask}\n\nProgress so far: ${compacted.summary}\n\nUpdate the plan and continue.`,
    },
  ];
}

Single-Request vs Multi-Agent Routing

Two complementary approaches, not competing ones:

Single-request routing (Morph Router)

Classifies each individual prompt and routes to the cheapest capable model per-turn. Easy turns (boilerplate, formatting) go to Haiku. Hard turns (debugging, architecture) go to Opus. ~430ms classification at $0.001/request. Best for interactive single-agent sessions where task difficulty varies unpredictably.

Multi-agent routing (planner-executor)

Assigns a fixed model to each architectural role. Planner always gets frontier. Executor always gets mid-tier. No per-turn classification overhead. Best for long autonomous builds where the executor role is well-defined and execution turns are reliably mechanical.

They stack: apply single-request routing within the executor to further reduce cost on the easiest executor turns. An executor turn that only performs a trivial file read can use Haiku even if the executor default is Sonnet. This delivers planner-executor savings plus per-turn routing savings on top.
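A sketch of the stacked arrangement follows. Everything here is hypothetical: the model identifiers, the `PlanStep` shape, and the keyword-style classifier, which stands in for a real per-turn router and does not represent the Morph Router API.

```typescript
type Phase = "planning" | "executing";
type Difficulty = "trivial" | "standard";

// Hypothetical model identifiers.
const PLANNER = "frontier-model";
const EXECUTOR_DEFAULT = "mid-tier-model";
const EXECUTOR_TRIVIAL = "small-model";

// Hypothetical description of the plan step the executor is about to run.
interface PlanStep {
  kind: "read_file" | "list_files" | "edit" | "run_tests" | "debug";
}

// Stand-in for a per-turn classifier: only pure lookups count as trivial.
function classifyStep(step: PlanStep): Difficulty {
  return step.kind === "read_file" || step.kind === "list_files"
    ? "trivial"
    : "standard";
}

function selectModel(phase: Phase, step: PlanStep): string {
  // Role routing first: planning turns are never downgraded.
  if (phase === "planning") return PLANNER;
  // Per-turn routing second: only inside the executor role.
  return classifyStep(step) === "trivial" ? EXECUTOR_TRIVIAL : EXECUTOR_DEFAULT;
}

console.log(selectModel("executing", { kind: "read_file" })); // small-model
console.log(selectModel("executing", { kind: "edit" }));      // mid-tier-model
console.log(selectModel("planning", { kind: "read_file" }));  // frontier-model
```

Checking the role before the difficulty keeps the two mechanisms independent: the per-turn classifier can be swapped or removed without touching the planner assignment.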

For single-request routing details and the classification API, see LLM Router. For total cost reduction across all levers (routing, compaction, caching, batching), see LLM Cost Optimization.

Frequently Asked Questions

What is multi-agent model routing?

Multi-agent model routing assigns different LLM models to different roles in an agent loop. The planner uses a frontier model for reasoning. The executor uses a cheaper, faster model for mechanical turns. On a typical 14M-token build at the prices used here, this cuts cost by roughly 29% to 57% compared to running the frontier model everywhere, depending on the executor.

What is the planner-executor split?

The planner fires at session start and after each context compaction: it reads the task, understands codebase state, and produces a plan. The executor handles every turn in between: tool calls, file reads, edits, test runs. Token profile per build: 4M input / 250K output on the planner, 10M input / 650K output on the executor.

How much does planner-executor routing reduce costs?

Switching the executor from Opus 4.8 to Sonnet 4.6 saves about 29% per build. Switching to Haiku 4.5 saves 57%. With prompt caching at a 70% hit rate on Sonnet 4.6 executor input, total savings reach roughly 49% versus the all-Opus baseline. The PEAR benchmark reported a 4.4x cost reduction for Plan-Execute versus Reflexion on the same task set ($1.24 vs $5.12 per task).

What models work best as planner vs executor?

Planner: Claude Opus 4.8, Claude Fable 5 (currently suspended, see note), GPT-5.5. Executor: Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 3.1 Flash. The executor constraint is reliable tool call JSON: models that drop arguments or produce malformed structured output on complex schemas fail builds regardless of chat benchmark scores. See the full leaderboard for tested pairs.

How is this different from Morph Router?

Morph Router classifies each individual prompt and routes per-turn. Planner-executor routing assigns a fixed model to each architectural role, with no per-turn classification overhead. They stack: use Morph Router within the executor to further reduce cost on trivially easy executor turns.

Where can I see the full benchmark results?

Full leaderboard with 44 model pairs ranked by accuracy, price, speed, and combined score: morphllm.com/benchmarks/multiagent.
