---
title: "Keep AI spend flat while token usage grows"
url: "https://www.morphllm.com/blog/keep-ai-spend-flat"
description: "Coinbase cut its AI bill nearly in half while token usage kept growing. The lever that did the most work was routing: send each request to the cheapest model that can handle it, cache-aware, instead of running the whole agent loop on one frontier model."
date: "2026-06-26"
author: "Tejas Bhakta"
---
# Keep AI spend flat while token usage grows

Brian Armstrong posted Coinbase's actual AI bill last week. Stacked dollar bars by org, a line for total tokens. Both climb together for a year. Then the bars come down while the line keeps going. They cut AI spend nearly in half while token usage kept growing.

![AI spend (bars) vs token usage (line) at Coinbase](/images/blog/coinbase-ai-spend.png)

*Coinbase AI spend by org vs total company tokens. Source: [Brian Armstrong](https://x.com/brian_armstrong/status/2070670644577280109).*

When the bill grows, the reflex is friction. Lower the caps, turn on spend alerts, make engineers ask permission. Coinbase checked first: 91% of their engineers never hit a cap. Caps would have produced alerts, not savings.

So they went the other way. Cheaper defaults, cache-aware requests, and routing. The first two are habits a team can adopt this week. Routing is the one you have to build, and it's the one that bends the curve.

Every token in an agent loop hits one model, chosen once at the start and never revisited. The model that plans a refactor also renames the variable, reads the test output, and fixes the import. Planning needs a frontier model. The other three don't, and they're most of the requests.

Brian's sharpest line is the one most teams skip: "humans shouldn't be choosing models." A person picks a model at session start out of habit and rides it for an hour. A router reads each request and picks per call. Hard prompts go to the frontier model. The rest go to an open-weight model that costs a fraction and answers them just as well. Coinbase defaults to GLM-5.2 for exactly this.
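A minimal sketch of per-call routing, to make the idea concrete. Everything here is an assumption for illustration: the model names are placeholders, and the keyword check stands in for whatever classifier a real router would use. It is not how Coinbase or Morph decide.

```python
from dataclasses import dataclass

# Hypothetical model identifiers, not real API model names.
FRONTIER = "frontier-model"
OPEN_WEIGHT = "open-weight-model"

# Illustrative keywords suggesting a request needs planning-level reasoning.
# A real router would use a learned classifier, not a keyword list.
HARD_HINTS = ("plan", "refactor", "design", "architecture", "migrate")


@dataclass
class Request:
    prompt: str


def pick_model(req: Request) -> str:
    """Choose a model for this one call instead of for the whole session."""
    text = req.prompt.lower()
    if any(hint in text for hint in HARD_HINTS):
        return FRONTIER
    return OPEN_WEIGHT


# The four steps of the agent loop described earlier in the post.
loop = [
    Request("Plan a refactor of the billing module"),
    Request("Rename the variable usr to user"),
    Request("Read this test output and summarize the failure"),
    Request("Fix the import in utils.py"),
]
choices = [pick_model(r) for r in loop]
```

In this toy loop only the planning step lands on the frontier model. The other three go to the cheaper one.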

Cache decides whether routing helps or hurts. Switch models the naive way and you throw away a warm prefix: the cheaper model arrives with a cold cache, pays full input price to reprocess the whole context, and can cost more on that call than the model you left. A cache-aware router keeps the prefix warm and only moves work when the move pays for itself. Coinbase took LibreChat from a 5% to a 60% cache hit rate. Most of a repeated prompt stops getting billed at full price.
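The "pays for itself" check can be sketched as a cost comparison. The prices below are made-up placeholders in dollars per million input tokens, not any provider's real rates, and the model is simplified: it ignores output tokens and assumes the prefix stays the same size on every call.

```python
# Hypothetical prices, dollars per million input tokens.
FRONTIER_PRICES = {"input": 3.00, "cached": 0.30}
OPEN_WEIGHT_PRICES = {"input": 0.60, "cached": 0.10}


def call_cost(prefix_tokens, new_tokens, prices, warm):
    """Input cost in dollars for one call.

    A warm prefix is billed at the cached rate, a cold one at the full rate.
    New (uncached) tokens are always billed at the full rate.
    """
    prefix_rate = prices["cached"] if warm else prices["input"]
    return (prefix_tokens * prefix_rate + new_tokens * prices["input"]) / 1_000_000


def should_switch(prefix_tokens, new_tokens, current, candidate, calls_remaining):
    """Switch only if the candidate is cheaper over the remaining calls,
    counting one cold-prefix call on the candidate before its cache warms up."""
    stay = calls_remaining * call_cost(prefix_tokens, new_tokens, current, warm=True)
    first = call_cost(prefix_tokens, new_tokens, candidate, warm=False)
    rest = (calls_remaining - 1) * call_cost(prefix_tokens, new_tokens, candidate, warm=True)
    return first + rest < stay


# 100k-token warm prefix, 1k new tokens per call.
one_call = should_switch(100_000, 1_000, FRONTIER_PRICES, OPEN_WEIGHT_PRICES, 1)
ten_calls = should_switch(100_000, 1_000, FRONTIER_PRICES, OPEN_WEIGHT_PRICES, 10)
```

Under these assumed prices, moving for a single call loses money because the cold prefix costs more than the warm frontier call. Moving with ten calls left wins, because the one cold call is amortized over the rest.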

This is what `auto` does on Morph. One model name. You send the request, it reads the prompt, and it routes across [the open-weight models we serve](/products/models), GLM-5.2 among them, cache-aware, at $0.005 a request. The hard prompts still get a frontier-class answer. The easy ones stop paying frontier prices.

The teams whose bills flatten aren't sending fewer tokens. They stopped sending all of them to the same model.