---
title: "Claude Code cost: keep your AI bill flat while usage grows"
url: "https://www.morphllm.com/blog/claude-code-cost"
description: "Claude Code cost climbs because the whole agent loop runs on one frontier model. Here is what you actually pay for, why spend caps do not help, and how per-call model routing keeps the bill flat while token usage keeps growing."
date: "2026-06-27"
author: "Tejas Bhakta"
---
# Claude Code cost: keep your AI bill flat while usage grows

Brian Armstrong posted Coinbase's actual AI bill last week. Stacked dollar bars by org, a line for total tokens. Both climb together for a year. Then the bars come down while the line keeps going. They cut AI spend nearly in half while token usage kept growing.

![AI spend (bars) vs token usage (line) at Coinbase](/images/blog/coinbase-ai-spend.png)

*Coinbase AI spend by org vs total company tokens. Source: [Brian Armstrong](https://x.com/brian_armstrong/status/2070670644577280109).*

If you run Claude Code across a team, you have seen the same shape. Adoption is good, the agents are doing real work, and the bill is the part nobody planned for. The instinct is to assume usage is the problem. It usually is not. The problem is that every token goes to the same model.

## Why the Claude Code bill climbs

An agent loop runs every token through one model, chosen once at the start and never revisited. The model that plans a refactor also renames the variable, reads the test output, and fixes the import. Planning needs a frontier model. The other three do not, and they are most of the requests.

So the bill tracks the most expensive thing the agent does, applied to everything it does. A session that is 90% mechanical still pays frontier prices on all of it. Multiply that across a team and the curve is steep before anyone has done anything wrong.

## What you are actually paying for

The first step to controlling a number is knowing what moves it. Claude Code spend comes down to a few things:

- **Tokens, input and output.** You pay for the whole conversation on every turn, not just the answer. A long agent run re-reads its own context repeatedly, so input tokens dominate. Output is the small part.
- **Subscription vs API.** Pro, Max, Team, and Enterprise seats bundle a usage allotment; the Anthropic API bills per token. Team and Enterprise seats include Claude Code and bill any extra usage at standard API rates. A subscription smooths the cost until a heavy user blows through the allotment, then the overage is back to per-token.
- **Extra usage and overage.** The line that surprises people. Once a power user passes the seat allotment, the rest of the month meters at API rates with no ceiling unless an admin sets one.
- **Cache.** Anthropic bills cached prompt tokens at a fraction of fresh ones. Whether your repeated context hits that cache is the difference between a cheap session and an expensive one, and most setups leave it on the floor.

Run `/cost` inside a Claude Code session and you see the per-session version of all of this. The team version is the chart above.

## Caps and alerts are the wrong lever

When the bill grows, the reflex is friction. Lower the caps, turn on spend alerts, make engineers ask permission. Coinbase checked first: 91% of their engineers never hit a cap. Caps would have produced alerts, not savings, and slowed down the 91% to discipline the few.

Friction also fights the thing you want. You bought the agents to make people faster. Rationing them by dollar is a tax on the exact behavior you were trying to encourage.

## Three levers that actually work

So Coinbase went the other way. Cheaper defaults, cache-aware requests, and routing. The first two are habits a team can adopt this week. Routing is the one you have to build, and it is the one that bends the curve.

## Routing: pick a model per call, not per session

Brian's sharpest line is the one most teams skip: "humans shouldn't be choosing models." A person picks a model at session start out of habit and rides it for an hour. A router reads each request and picks per call. Hard prompts go to the frontier model. The rest go to a cheaper model that answers them just as well. Coinbase defaults to GLM 5.2 for exactly this.

The decision is not one knob. A useful router scores each turn on how hard it is and how clearly it is specified, then maps that to a model and a reasoning effort. Trivial edits and lookups go to the small model at low effort. Architecture and tricky debugging go to the frontier model. Everything in between lands where it should instead of at the top.

![Morph's model router: a difficulty by ambiguity matrix mapping each turn to a model and reasoning effort, editable company-wide](/images/blog/model-router-dashboard.png)

*Set the policy once and it applies to everyone. If spend spikes, bias the grid toward cheaper models company-wide; the change reaches every developer's agent within the hour.*

Cache decides whether routing helps or hurts. Switch models the naive way and you throw away a warm prefix, so the cheaper model arrives with a cold cache and costs more than the model you left. A cache-aware router keeps the prefix warm and only moves work when the move pays for itself. Coinbase took LibreChat from a 5% to a 60% cache hit rate. Most of a repeated prompt stops getting billed.

## What this looks like in practice

This is what `auto` does on Morph. One model name. You send the request, it reads the prompt, and it routes across [the open-weight models we serve](/products/models), GLM 5.2 among them, cache-aware, at $0.005 a request. The hard prompts still get a frontier-class answer. The easy ones stop paying frontier prices.

For Claude Code specifically, the same router runs in front of the agent, so the matrix above is the control surface for your whole team's spend. Engineers keep using Claude Code exactly as they do today. The bill is what changes.

## Common questions

**How much does Claude Code cost?** It depends entirely on the mix of work and which model serves it. A seat with a subscription has a fixed floor; heavy use past the allotment meters per token. The number most teams quote is dominated by frontier-model tokens spent on requests that did not need a frontier model.

**Does Claude Code cost money beyond the subscription?** Yes, once usage passes the seat allotment. Extra usage bills at API rates, which is why a few heavy users can drive most of the bill.

**Why is it more expensive than I expected per month?** Input tokens. The agent re-reads its context every turn, so a long session pays for the same code many times. Cache and routing both attack this.

**Claude Code vs Cursor cost?** Both bill on tokens underneath, whether through a subscription or the API. The lever that moves either one is the same: stop sending every token to the most expensive model, and keep the cache warm.

**How do I see what it costs?** `/cost` for a session. For a team, you want per-request logging by model so you can see where the spend actually goes, which is the precondition for routing it.

The teams whose bills flatten are not sending fewer tokens. They stopped sending all of them to the same model.
