Multi-Agent Model Routing: Planner + Executor Pairs That Cut Costs 4x

A coding agent running Opus 4.8 for every turn costs $5/M input on 14M tokens per build. Split planner and executor: frontier model for planning (4M tokens), cheaper model for execution (10M tokens). Benchmarked on 40 real app builds. Data from June 2026.

June 15, 2026 · 2 min read

Every turn of a single-model coding agent hits the same frontier model. File reads, tool call parsing, typo fixes, and import additions all pay $5/M input tokens. The planner-executor split fixes this: frontier model for reasoning (4M tokens / build), cheaper model for execution (10M tokens / build). Benchmarked on 40 real app builds. Median cost reduction: 4x on execution-side spend.

4.4× cost reduction vs monolithic agents (PEAR benchmark)
40 end-to-end app builds in Morph benchmark
14M tokens per build (4M planner + 10M executor)
2:1:1 accuracy : price : speed score weighting

The Single-Model Cost Problem

A coding agent running Opus 4.8 on every turn processes roughly 14M input tokens per end-to-end app build. At $5/M input tokens, that is $70 in input costs alone, before output, before caching, before the multiple builds needed to get a task right. A 20-developer team running 10 builds per day per developer spends $14,000 per day on input tokens, or about $280K over a 20-working-day month.

Most of that spend goes to turns that do not need a frontier model. File reads, tool call parsing, mechanical edits, and test command invocations are execution tasks. They require the model to follow instructions and produce valid structured output, not to reason about novel problems. A cheaper model handles them just as well at 1.7x to 5x lower input price (Sonnet 4.6 at $3/M or Haiku 4.5 at $1/M, against Opus 4.8 at $5/M).

The planning turns are different. Reading the task, understanding the codebase, decomposing the problem, deciding the approach — these require the full reasoning capacity of a frontier model. The mistake is applying frontier-model pricing to execution turns that a cheaper model handles equally well.

Where the tokens go in a coding agent session

Analysis of 40 app builds: ~30% of tokens go to planning turns (task decomposition, synthesis, compaction summaries). ~70% go to execution turns (tool calls, file reads, mechanical edits, test runs). The planning tokens justify frontier pricing. The execution tokens do not.

How Planner + Executor Routing Works

The planner-executor split assigns a different model to each role in the agent loop:

Planner (Frontier Model)

Fires at the start of a session and after every context compaction. Reads the task, understands the codebase state, decomposes the problem, and produces a plan. Typically 4M input / 250K output tokens per build. Uses Opus 4.8, Fable 5, or GPT-5.5.

Executor (Mid-Tier Model)

Handles every turn in between compaction events. Follows the plan, makes tool calls, reads files, applies edits, runs tests. Typically 10M input / 650K output per build. Uses Sonnet 4.6, Haiku 4.5, or Gemini 3.1 Flash.

Compaction Trigger

When context approaches the model's compaction threshold, the planner re-fires. It reads the compacted summary and updates the plan for the next execution phase. This prevents context rot while keeping planning on the frontier model.

Score Composition

Morph's benchmark weights accuracy 2x, price 1x, speed 1x. The top-scoring pairs are not the cheapest or the fastest alone — they are the ones where the executor is cheap enough to shift the weighted cost materially without degrading build success rate.

The architecture is a thin routing layer, not a full orchestration framework. Your agent loop checks which phase it is in (planning vs execution) and sets the model accordingly. The actual LLM calls go directly to the provider API. No proxy, no middleware overhead.
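The phase check can be sketched as a pure function over token counts. This is a minimal sketch, not Morph's implementation: the `ContextBudget` shape, the `triggerFraction` parameter, and the `nextPhase` name are assumptions made for illustration.

```typescript
type Phase = "planning" | "executing";

interface ContextBudget {
  contextLimitTokens: number; // assumed context window of the executor model
  triggerFraction: number;    // assumed: re-plan once this share of the window is used
}

// Decide which phase the next turn runs in, given the input token count
// of the request that just completed.
function nextPhase(
  current: Phase,
  lastInputTokens: number,
  budget: ContextBudget,
): Phase {
  // A planning turn always hands off to the executor.
  if (current === "planning") return "executing";
  const limit = budget.contextLimitTokens * budget.triggerFraction;
  // The executor keeps going until the context nears the threshold.
  return lastInputTokens >= limit ? "planning" : "executing";
}

const budget: ContextBudget = { contextLimitTokens: 200_000, triggerFraction: 0.8 };
console.log(nextPhase("executing", 120_000, budget)); // "executing"
console.log(nextPhase("executing", 165_000, budget)); // "planning"
```

Keeping the decision in one small function means the rest of the loop only needs to map the returned phase to a model name.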

Benchmark: 40 Real App Builds

Morph tested 44 planner-executor pairs on 40 end-to-end app builds across four task types: scaffold a new app, add a feature, refactor a module, ship to production. A build counts as successful only when the app runs and passes its acceptance checks — no partial credit.

| Dimension | Detail |
| --- | --- |
| Task set | 40 end-to-end app builds (scaffold, feature, refactor, ship) |
| Success criterion | App runs + passes all acceptance checks |
| Model pairs tested | 44 planner-executor combinations |
| Token profile | 4M in / 250K out (planner), 10M in / 650K out (executor) |
| Pricing source | OpenRouter API, June 2026 |
| Score formula | Accuracy (×2) + Price (×1) + Speed (×1), min-max normalized to 0-100 |

The full leaderboard — ranked by the combined 2:1:1 score and filterable by accuracy-only, price-only, and speed-only — is at morphllm.com/benchmarks/multiagent.
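One way to read the 2:1:1 score formula is sketched below. The article does not spell out the exact computation, so this rests on assumptions: each metric is min-max normalized to 0-100 across all pairs, price and speed are treated as lower-is-better, and the weighted sum is divided by the total weight of 4. The field names and function names are hypothetical.

```typescript
interface PairResult {
  name: string;
  accuracy: number;        // build success rate, 0..1 (higher is better)
  costPerBuild: number;    // USD (lower is better)
  secondsPerBuild: number; // wall-clock time (lower is better)
}

// Build a min-max normalizer onto 0-100 for one metric.
// When every pair has the same value, each one scores 100.
function normalizer(values: number[], higherIsBetter: boolean): (v: number) => number {
  const min = Math.min(...values);
  const max = Math.max(...values);
  return (v) => {
    if (max === min) return 100;
    const scaled = ((v - min) / (max - min)) * 100;
    return higherIsBetter ? scaled : 100 - scaled;
  };
}

// Rank pairs by the combined score, best first.
function scorePairs(results: PairResult[]): { name: string; score: number }[] {
  const acc = normalizer(results.map((r) => r.accuracy), true);
  const price = normalizer(results.map((r) => r.costPerBuild), false);
  const speed = normalizer(results.map((r) => r.secondsPerBuild), false);
  return results
    .map((r) => ({
      name: r.name,
      // 2:1:1 weighting, divided by the total weight to stay on a 0-100 scale
      score: (2 * acc(r.accuracy) + price(r.costPerBuild) + speed(r.secondsPerBuild)) / 4,
    }))
    .sort((a, b) => b.score - a.score);
}

// Made-up numbers for illustration only; these are not benchmark results.
const sample: PairResult[] = [
  { name: "pair-A", accuracy: 0.9, costPerBuild: 92.5, secondsPerBuild: 600 },
  { name: "pair-B", accuracy: 0.85, costPerBuild: 39.5, secondsPerBuild: 400 },
  { name: "pair-C", accuracy: 0.7, costPerBuild: 12, secondsPerBuild: 300 },
];
console.log(scorePairs(sample));
```

With these made-up inputs, the pair that is neither the most accurate nor the cheapest ranks first, which is the behavior the weighting is meant to produce.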

Why accuracy weights 2x

A failed build costs more than its tokens: a developer has to diagnose and retry. On a $2/build task, a 10-point drop in success rate adds roughly $0.20 per attempt in expected re-run token cost alone. Developer time comes on top: at $100/hr, every six minutes spent diagnosing a failed build adds another $1.00 per attempt in expectation. Weighting accuracy double reflects that the failure cost dominates the token cost on most real workloads.
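The failure-cost argument can be written as a small expected-value calculation. This is a sketch under assumptions: the function name and the `devMinutesPerFailure` input are illustrative, and the article does not state how long a failed build takes to diagnose.

```typescript
// Expected extra cost per attempt caused by a drop in build success rate.
function expectedFailureCost(
  successRateDrop: number,      // e.g. 0.10 for a 10-point drop
  tokenCostPerBuild: number,    // USD of tokens spent re-running one build
  devMinutesPerFailure: number, // assumed time to diagnose and retry
  devRatePerHour: number,       // USD per developer hour
): number {
  const devTimeCost = (devMinutesPerFailure / 60) * devRatePerHour;
  return successRateDrop * (tokenCostPerBuild + devTimeCost);
}

// Token cost only: a 10-point drop on a $2 build.
console.log(expectedFailureCost(0.1, 2, 0, 100).toFixed(2)); // "0.20"
// Same drop with six minutes of developer time at $100/hr per failure.
console.log(expectedFailureCost(0.1, 2, 6, 100).toFixed(2)); // "1.20"
```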

Token Profile: Where the Spend Goes

Across 40 builds, the median token breakdown per build:

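| Turn Type | Input Tokens | Output Tokens | Cost at Opus 4.8 ($5/$25/M) | Cost at Sonnet 4.6 ($3/$15/M) |
| --- | --- | --- | --- | --- |
| Planner turns | 4M | 250K | $20.00 + $6.25 = $26.25 | $12.00 + $3.75 = $15.75 |
| Executor turns | 10M | 650K | $50.00 + $16.25 = $66.25 | $30.00 + $9.75 = $39.75 |

| Configuration | Input Tokens | Output Tokens | Cost per Build |
| --- | --- | --- | --- |
| Total (all Opus 4.8) | 14M | 900K | $92.50 |
| Total (Opus plan + Sonnet exec) | 14M | 900K | $26.25 + $39.75 = $66.00 |
| Total (Opus plan + Haiku exec) | 14M | 900K | $26.25 + $13.25 = $39.50 |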

Switching executor from Opus 4.8 to Sonnet 4.6 saves $26.50 per build (29%). Switching to Haiku 4.5 ($1/$5 per M) saves $53.00 per build (57%). The planner cost is fixed at $26.25 regardless of executor choice, because it always runs on the frontier model.
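The per-build totals follow directly from the token profile and the list prices quoted in this section. A minimal sketch that reproduces the arithmetic; the type and function names are illustrative, not part of any Morph or provider API.

```typescript
interface Price { inputPerM: number; outputPerM: number } // USD per million tokens
interface Usage { inputM: number; outputM: number }       // millions of tokens

// List prices as quoted in this section.
const PRICES: Record<"opus" | "sonnet" | "haiku", Price> = {
  opus: { inputPerM: 5, outputPerM: 25 },
  sonnet: { inputPerM: 3, outputPerM: 15 },
  haiku: { inputPerM: 1, outputPerM: 5 },
};

// Median token profile per build from the benchmark.
const PLANNER_USAGE: Usage = { inputM: 4, outputM: 0.25 };
const EXECUTOR_USAGE: Usage = { inputM: 10, outputM: 0.65 };

function roleCost(usage: Usage, price: Price): number {
  return usage.inputM * price.inputPerM + usage.outputM * price.outputPerM;
}

// Total cost of one build for a given planner/executor price pair.
function buildCost(planner: Price, executor: Price): number {
  return roleCost(PLANNER_USAGE, planner) + roleCost(EXECUTOR_USAGE, executor);
}

const allOpus = buildCost(PRICES.opus, PRICES.opus);
console.log(`all Opus: $${allOpus.toFixed(2)}`);
for (const exec of ["sonnet", "haiku"] as const) {
  const cost = buildCost(PRICES.opus, PRICES[exec]);
  const savedPct = ((allOpus - cost) / allOpus) * 100;
  console.log(`Opus + ${exec}: $${cost.toFixed(2)} (${savedPct.toFixed(0)}% lower)`);
}
```

Swapping in your own token profile or prices shows quickly whether the split pays off for a workload that is less execution-heavy than the benchmark's.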

Cache pricing compounds these savings. Anthropic charges 10% of the input price for cache reads. With a stable system prompt and long file context, execution turns often hit 60-80% cache hit rates. A 70% cache hit rate on Sonnet 4.6 executor input drops the executor input cost from $30 to $30 x (0.3 + 0.7 x 0.1) = $30 x 0.37 = $11.10 per 10M tokens (ignoring cache-write surcharges). With the $9.75 executor output cost and the $26.25 planner cost unchanged, the total comes to roughly $47.10 per build, a 49% reduction from the all-Opus baseline.
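The cache adjustment is a weighted average of full-price tokens and cache-read tokens. A minimal sketch of that formula, assuming the 10% cache-read rate quoted above and ignoring any cache-write surcharge; the function name is illustrative.

```typescript
// Effective input cost when a share of input tokens is served from cache.
function cachedInputCost(
  uncachedCost: number,      // USD for the input tokens at list price
  cacheHitRate: number,      // fraction of input tokens read from cache, 0..1
  cacheReadMultiplier = 0.1, // cache reads billed at 10% of list price
): number {
  if (cacheHitRate < 0 || cacheHitRate > 1) {
    throw new RangeError("cacheHitRate must be between 0 and 1");
  }
  return uncachedCost * ((1 - cacheHitRate) + cacheHitRate * cacheReadMultiplier);
}

// Sonnet 4.6 executor input: 10M tokens at $3/M = $30 before caching.
console.log(cachedInputCost(30, 0.7).toFixed(2)); // "11.10"
```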

Model Tier Selection

Not all model pairs perform equally. The executor must handle the agent's full tool call schema (file read, file write, bash, search) and produce valid JSON on every turn. Models that drop tool call arguments or produce malformed JSON on complex schemas fail builds regardless of their benchmark scores on clean chat tasks.
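A quick way to screen a candidate executor is to validate every tool call it emits before running it. The sketch below is an assumption-heavy illustration: the tool names mirror the four listed above, but the required-argument lists and the `validateToolCall` helper are hypothetical and much simpler than a full JSON Schema check.

```typescript
// Hypothetical, simplified schema: tool name -> required argument names.
const REQUIRED_ARGS: Record<string, string[]> = {
  file_read: ["path"],
  file_write: ["path", "content"],
  bash: ["command"],
  search: ["query"],
};

type Validation = { ok: true } | { ok: false; reason: string };

// Check one raw tool call emitted by the executor model.
function validateToolCall(toolName: string, rawArgs: string): Validation {
  const required = REQUIRED_ARGS[toolName];
  if (!required) return { ok: false, reason: `unknown tool: ${toolName}` };

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArgs);
  } catch {
    return { ok: false, reason: "arguments are not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, reason: "arguments must be a JSON object" };
  }

  // Report every required argument the model dropped.
  const args = parsed as Record<string, unknown>;
  const missing = required.filter((key) => !(key in args));
  return missing.length === 0
    ? { ok: true }
    : { ok: false, reason: `missing arguments: ${missing.join(", ")}` };
}

console.log(validateToolCall("file_write", '{"path": "src/app.ts"}'));
console.log(validateToolCall("bash", '{"command": "npm test"}'));
```

Counting validation failures per hundred turns gives a rough reliability figure for an executor candidate before it is trusted with a full build.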

| Pair | Planner | Executor | Cost / Build (est.) | Notes |
| --- | --- | --- | --- | --- |
| Max quality | Claude Fable 5 ($10/$50/M) | Sonnet 4.6 ($3/$15/M) | ~$92 | Highest accuracy, 2x planner cost vs Opus 4.8 |
| Balanced | Opus 4.8 ($5/$25/M) | Sonnet 4.6 ($3/$15/M) | ~$66 | Recommended starting point |
| Cost-optimized | Opus 4.8 ($5/$25/M) | Haiku 4.5 ($1/$5/M) | ~$40 | Best cost/accuracy tradeoff in benchmark |
| Open-weight | Qwen3.5 235B (via OpenRouter) | Qwen3.5 30B | ~$12 | No proprietary API dependency |
| Speed-optimized | GPT-5.5 ($3/$15/M) | Gemini 3.1 Flash ($0.10/$0.40/M) | ~$17 | Lowest latency pair in benchmark |

See the full ranked leaderboard — filtered by accuracy, price, or speed — at benchmarks/multiagent.

Implementation

The planner-executor split is a routing decision inside your agent loop, not a framework change. The loop starts in the planning phase, tracks phase state, and switches models at compaction boundaries.

Planner-executor routing (TypeScript)

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Model assignments
const PLANNER_MODEL = "claude-opus-4-8-20260528";   // Frontier: planning only
const EXECUTOR_MODEL = "claude-sonnet-4-6-20260601"; // Mid-tier: execution turns

type Phase = "planning" | "executing";

interface AgentState {
  phase: Phase;
  turnsSinceCompaction: number;
  compactionThreshold: number; // e.g. 80 turns
}

async function agentTurn(
  messages: Anthropic.MessageParam[],
  state: AgentState,
  tools: Anthropic.Tool[],
): Promise<{ response: Anthropic.Message; state: AgentState }> {
  // Determine model based on phase
  const model =
    state.phase === "planning" ? PLANNER_MODEL : EXECUTOR_MODEL;

  const response = await anthropic.messages.create({
    model,
    max_tokens: 8192,
    tools,
    messages,
  });

  // After planning turn: switch to execution
  const nextState: AgentState = { ...state };
  if (state.phase === "planning") {
    nextState.phase = "executing";
    nextState.turnsSinceCompaction = 0;
  } else {
    nextState.turnsSinceCompaction += 1;
    // Trigger compaction + re-plan at threshold
    if (nextState.turnsSinceCompaction >= state.compactionThreshold) {
      nextState.phase = "planning"; // Next turn fires planner
    }
  }

  return { response, state: nextState };
}

// Session start: always plan first
const initialState: AgentState = {
  phase: "planning",
  turnsSinceCompaction: 0,
  compactionThreshold: 80,
};

The compaction threshold determines how often the planner re-fires. Lower thresholds (every 30-50 turns) give the planner more opportunity to re-evaluate and correct drift at higher cost. Higher thresholds (80-120 turns) are cheaper but risk the executor drifting from the original plan.
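To see how the threshold shifts turns between the two models, the transition rules from the routing loop above can be replayed without any API calls. This is a sketch: `countTurns` is a hypothetical helper and the 243-turn build length is an arbitrary example, not a benchmark figure.

```typescript
type Phase = "planning" | "executing";

// Replay the phase transitions for a build of `totalTurns` turns and count
// how many turns land on the planner versus the executor.
function countTurns(
  totalTurns: number,
  compactionThreshold: number,
): { plannerTurns: number; executorTurns: number } {
  let phase: Phase = "planning"; // session start: always plan first
  let turnsSinceCompaction = 0;
  let plannerTurns = 0;
  let executorTurns = 0;

  for (let turn = 0; turn < totalTurns; turn++) {
    if (phase === "planning") {
      plannerTurns += 1;
      phase = "executing";
      turnsSinceCompaction = 0;
    } else {
      executorTurns += 1;
      turnsSinceCompaction += 1;
      // At the threshold, the next turn goes back to the planner.
      if (turnsSinceCompaction >= compactionThreshold) phase = "planning";
    }
  }
  return { plannerTurns, executorTurns };
}

console.log(countTurns(243, 80)); // { plannerTurns: 3, executorTurns: 240 }
console.log(countTurns(243, 40)); // { plannerTurns: 6, executorTurns: 237 }
```

Halving the threshold doubles the number of frontier-model turns in this example, which is the cost side of the drift trade-off described above.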

Integrating with Morph context compaction

import Anthropic from "@anthropic-ai/sdk";
import { morph } from "morph";

// Before firing the planner, compact accumulated executor context
async function compactAndReplan(
  executorHistory: Anthropic.MessageParam[],
  originalTask: string,
): Promise<Anthropic.MessageParam[]> {
  // Compact executor turns: ~33,000 tok/s, 50-70% reduction
  const { compacted } = await morph.compact({
    messages: executorHistory,
    targetRatio: 0.4, // Reduce to 40% of original token count
  });

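  // Planner receives the original task plus the compacted summary in a single
  // user message, so the conversation ends on a user turn and the planner
  // replies with an updated plan instead of continuing an assistant prefill.
  return [
    {
      role: "user",
      content: `${originalTask}\n\nProgress so far: ${compacted.summary}\n\nUpdate the plan and continue.`,
    },
  ];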
}

Single-Request vs Multi-Agent Routing

Two complementary approaches, not competing ones:

Single-request routing (Morph Router)

Classifies each individual prompt and routes to the cheapest capable model per-turn. Easy turns (boilerplate, formatting) go to Haiku. Hard turns (debugging, architecture) go to Opus. ~430ms classification at $0.001/request. Best for interactive single-agent sessions where task difficulty varies unpredictably.

Multi-agent routing (planner-executor)

Assigns a fixed model to each architectural role. Planner always gets frontier. Executor always gets mid-tier. No per-turn classification overhead. Best for long autonomous builds where the executor role is well-defined and execution turns are reliably mechanical.

They stack: apply single-request routing within the executor to further reduce cost on the easiest executor turns. An executor turn that only performs a trivial file read can be routed to Haiku even if the executor default is Sonnet. This delivers planner-executor savings plus per-turn routing savings on top.
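Stacking the two approaches amounts to a role check followed by a per-turn check. The sketch below uses a static tool-name rule as a stand-in for a real per-turn classifier; it does not call Morph Router, and the model labels, the `TRIVIAL_TOOLS` set, and `pickModel` are all hypothetical.

```typescript
type Phase = "planning" | "executing";

interface Turn {
  phase: Phase;
  toolName?: string; // tool the executor is about to call, if known
}

// Hypothetical labels; substitute real model IDs from your provider.
const MODELS = {
  planner: "frontier-model",
  executor: "mid-tier-model",
  trivial: "small-model",
} as const;

// Assumed set of tools whose turns are mechanical enough for the smallest model.
const TRIVIAL_TOOLS = new Set<string>(["file_read", "search"]);

function pickModel(turn: Turn): string {
  // Role-level routing: the planner always gets the frontier model.
  if (turn.phase === "planning") return MODELS.planner;
  // Per-turn routing inside the executor role.
  if (turn.toolName !== undefined && TRIVIAL_TOOLS.has(turn.toolName)) {
    return MODELS.trivial;
  }
  return MODELS.executor;
}

console.log(pickModel({ phase: "planning" }));                         // "frontier-model"
console.log(pickModel({ phase: "executing", toolName: "file_read" })); // "small-model"
console.log(pickModel({ phase: "executing", toolName: "bash" }));      // "mid-tier-model"
```

A learned classifier would replace the `TRIVIAL_TOOLS` lookup; the role check in front of it stays the same.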

For single-request routing details and the classification API, see LLM Router. For total cost reduction across all levers (routing, compaction, caching, batching), see LLM Cost Optimization.

Frequently Asked Questions

What is multi-agent model routing?

Multi-agent model routing assigns different LLM models to different roles in an agent loop. The planner uses a frontier model for reasoning. The executor uses a cheaper, faster model for mechanical turns. In the 40-build benchmark the median reduction on execution-side spend is 4x; on the 14M-token profile, an Opus 4.8 planner with a Haiku 4.5 executor costs $39.50 per build against $92.50 for Opus everywhere.

What is the planner-executor split?

The planner fires at session start and after each context compaction: it reads the task, understands codebase state, and produces a plan. The executor handles every turn in between: tool calls, file reads, edits, test runs. Token profile per build: 4M input / 250K output on the planner, 10M input / 650K output on the executor.

How much does planner-executor routing reduce costs?

Switching executor from Opus 4.8 to Sonnet 4.6 saves 29% per build ($92.50 to $66.00). Switching to Haiku 4.5 saves 57% ($92.50 to $39.50). With prompt caching at 70% hit rate on Sonnet 4.6 executor input, total savings reach about 49% versus the all-Opus baseline. The PEAR benchmark reported a 4.4x cost reduction for Plan-Execute versus Reflexion on the same task set ($1.24 vs $5.12 per task).

What models work best as planner vs executor?

Planner: Claude Opus 4.8, Claude Fable 5, GPT-5.5. Executor: Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 3.1 Flash. The executor constraint is reliable tool call JSON: models that drop arguments or produce malformed structured output on complex schemas fail builds regardless of chat benchmark scores. See the full leaderboard for tested pairs.

How is this different from Morph Router?

Morph Router classifies each individual prompt and routes per-turn. Planner-executor routing assigns a fixed model to each architectural role, with no per-turn classification overhead. They stack: use Morph Router within the executor to further reduce cost on trivially easy executor turns.

Where can I see the full benchmark results?

Full leaderboard with 44 model pairs ranked by accuracy, price, speed, and combined score: morphllm.com/benchmarks/multiagent.

Benchmark Your Model Pair

See how your planner-executor combination scores on 40 real app builds. Full leaderboard with 44 pairs ranked by accuracy, cost, and speed at morphllm.com/benchmarks/multiagent.