Claude Code Token Usage: What Burns Tokens and How to Cut It

Claude Code does not send one prompt per task. It makes dozens of API calls per task, and each call re-sends the entire conversation so far. Token usage is not the size of what you typed. It is the size of everything in context, multiplied by the number of turns. That multiplication is why usage balloons, and it is the thing you have to attack to bring it down.

Typical input / output / cache-read tokens per session: 45K / 13K / 38K
Cost of a 200K conversation vs a 20K one, per turn: 10x
Cache read price vs fresh input: 0.1x (90% off)
Token reduction from compaction: 50-70%

Why Token Usage Compounds

The unit you are billed on is not the prompt. It is the full context window sent on every API call. Claude Code is stateless between turns at the API level: to continue a conversation, the client re-sends the whole history each time. Your first message, Claude's first response, the file it read in turn 3, the grep output from turn 7, all of it ships again on turn 20.

This makes the cost grow quadratically with conversation length, not linearly. A 20-turn session does not cost 20x the first turn. The early messages are re-sent on every later turn, so the cumulative token count grows with the square of the turn count. A 200K-token conversation costs roughly 10x a 20K one on every single turn it survives.
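The quadratic growth can be checked with a few lines of arithmetic. The sketch below is a simplified model, not a description of Claude Code internals: it assumes every turn adds the same number of tokens to the history (the hypothetical `tokens_added_per_turn`) and that nothing is compacted or cached.

```python
def cumulative_input_tokens(turns: int, tokens_added_per_turn: int) -> int:
    """Total input tokens billed when every call re-sends the full history.

    Call k carries k * tokens_added_per_turn tokens of context, so the
    session total is that increment times 1 + 2 + ... + turns.
    """
    return tokens_added_per_turn * turns * (turns + 1) // 2


# Assumed increment: 1,000 new tokens of history per turn.
first_turn = cumulative_input_tokens(1, 1_000)
twenty_turns = cumulative_input_tokens(20, 1_000)
forty_turns = cumulative_input_tokens(40, 1_000)

print(first_turn)    # 1000
print(twenty_turns)  # 210000
print(forty_turns)   # 820000
```

Under these assumptions a 20-turn session bills 210x the first turn rather than 20x, and doubling the session from 20 to 40 turns roughly quadruples the total.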

The fixed overhead compounds the same way. A 2,000-token system prompt sent with every call across a 200-call session is 400,000 input tokens spent on identical repeated text. A 57KB CLAUDE.md (~15,000 tokens) costs 15,000 tokens on every message; a 100-message session pays 1.5M tokens just to re-read the same instructions.

The compounding number that surprises people

One reported 30-day window (Issue #24147 on the claude-code repo) logged 5,092,500,074 cache-read tokens against 3,887,759 tokens of actual input and output. That is 1,310 cache reads per productive token. The complete instruction set is re-sent as cached context with every message, so longer sessions scale non-linearly: in that report, cache reads grew 2.8x while productive I/O grew 1.7x over a single day. This is also why a usage cap can feel artificially depleted: most of the quota goes to architectural overhead, not generated code.

This page is about token consumption, not rate limits. If you are hitting a wall on the five-hour session cap or the weekly cap, see Claude Code usage limits for the plan mechanics. Here the goal is the opposite question: where the tokens go and how to send fewer of them.

What Counts as Token Usage

Usage splits into four billed categories. Knowing which one is growing tells you which lever to pull.

Input tokens

Everything Claude reads on a turn: your messages, file contents, tool results, plus the system prompt and CLAUDE.md if not cached. The largest and most controllable category, because file reads pile in here and stay for the rest of the session.

Output tokens

Everything Claude writes: responses, code, explanations, and extended-thinking blocks. Output bills at 5x the input rate for every model in the pricing table below. Whole-file rewrites and verbose explanations are the main drivers; thinking tokens count as output too.

Cache write tokens

Billed once, the first time a content block (system prompt, CLAUDE.md, a large file) is written to the prompt cache. Priced at a 1.25x premium over base input. You pay this so that later turns read cheap.

Cache read tokens

Billed every subsequent turn that reuses a cached block, at 0.1x input price. In a long session these dominate the raw token count, but at 90% off they are the cheapest tokens you spend. Volume here is high; dollar weight is low.

The key mental model: a token is not paid for once. A file read on turn 5 of a 40-turn session is re-sent on each of the 35 turns that follow. If it is cached, those 35 re-sends bill at 0.1x; if it is not, they bill at full input price. Either way, the bytes you let into context early are the bytes you pay for repeatedly.
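To put a number on that repetition, the sketch below prices the re-sends of a single block. The function name, the 8,000-token file, and the 30 re-sends are illustrative assumptions; the $3 per million input rate and the 0.1x cache-read multiplier are the Sonnet figures used on this page. The one-time cache write premium is left out for simplicity.

```python
def resend_cost_usd(block_tokens: int, resends: int,
                    input_price_per_mtok: float, cached: bool) -> float:
    """Cost of shipping one context block again on `resends` later turns."""
    multiplier = 0.1 if cached else 1.0  # cache reads bill at 0.1x input
    return block_tokens * resends * input_price_per_mtok * multiplier / 1_000_000


# Hypothetical example: an 8,000-token file that rides along for 30 more
# turns on a model priced at $3 per million input tokens.
uncached = resend_cost_usd(8_000, 30, 3.00, cached=False)
cached = resend_cost_usd(8_000, 30, 3.00, cached=True)

print(f"uncached re-sends: ${uncached:.3f}")  # $0.720
print(f"cached re-sends:   ${cached:.3f}")    # $0.072
```

Caching changes the price of each re-send, not the number of re-sends, which is why keeping the file out of context in the first place saves more than caching it.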

Cache Reads vs Cache Writes: Why Both Exist

Prompt caching exists because of the re-send problem above. Without it, every turn would re-pay full input price for the system prompt, the CLAUDE.md, and the conversation prefix. Caching lets Claude Code store the processed representation of those stable blocks once and reuse them.

Anthropic prices the two operations differently. The first call that caches a block pays a 1.25x write premium over base input. Every later call that hits the same cached prefix pays a cache read at 0.1x base input, a 90% discount. The trade is deliberate: pay slightly more once, then pay almost nothing on every repeat.

Per-MTok pricing by token type (Claude models)

Model	Input	Output	Cache write (1.25x)	Cache read (0.1x)
Opus 4.6	$5	$25	$6.25	$0.50
Sonnet 4.6	$3	$15	$3.75	$0.30
Haiku 4.5	$1	$5	$1.25	$0.10

Reading the numbers: on Sonnet 4.6, a 2,000-token system prompt sent 200 times without caching is 400,000 input tokens at $3/M, or $1.20. With caching, the first call writes at $3.75/M and the next 199 read at $0.30/M, totaling roughly $0.13. Caching cut that line item by ~89%.
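The arithmetic above can be reproduced directly. This is a minimal sketch that uses the Sonnet 4.6 prices from the table; the function and variable names are made up for illustration, and it assumes every call after the first hits the cache.

```python
# Sonnet 4.6 prices from the table above, in USD per million tokens.
INPUT_PRICE = 3.00
CACHE_WRITE_PRICE = 3.75
CACHE_READ_PRICE = 0.30


def repeated_prefix_cost_usd(prefix_tokens: int, calls: int, cached: bool) -> float:
    """Cost of sending the same prefix on every one of `calls` API calls."""
    if not cached:
        return prefix_tokens * calls * INPUT_PRICE / 1_000_000
    write = prefix_tokens * CACHE_WRITE_PRICE            # first call writes the cache
    reads = prefix_tokens * (calls - 1) * CACHE_READ_PRICE  # later calls read it
    return (write + reads) / 1_000_000


uncached = repeated_prefix_cost_usd(2_000, 200, cached=False)
cached = repeated_prefix_cost_usd(2_000, 200, cached=True)
saving = 1 - cached / uncached

print(f"uncached: ${uncached:.2f}")  # $1.20
print(f"cached:   ${cached:.2f}")    # $0.13
print(f"saving:   {saving:.0%}")     # 89%
```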

This is why a high cache-read count is not automatically bad. The raw number looks alarming (billions of cache reads in a heavy month), but those tokens are billed at a tenth of input price. The thing to watch in /cost is the cache hit rate: a high hit rate means your fixed overhead is being served cheap. A low hit rate means content keeps changing (often a churning CLAUDE.md or shifting tool set), forcing expensive re-writes.

What Burns the Most Tokens

Your prompts and Claude's visible answers are a minority of token flow. The bulk is context you control through configuration and workflow. Ranked by typical impact:

1. Codebase search (file reads)

The single largest variable cost. The agent reads 10-20 files to find one function, and every file stays in context for the rest of the session, re-sent on every turn. Cognition measured agents spending 60% of their time on search.

2. Conversation history

Tool results, prior responses, and reasoning accumulate and get reprocessed every turn. A 100-turn session re-sends turn 1's content 99 more times. This is the quadratic term.

3. CLAUDE.md and memory files

Loaded at launch and shipped with every request. A 57KB CLAUDE.md is ~15K tokens per message. Auto memory (on by default since v2.1.59) adds the first 200 lines or 25KB of MEMORY.md every session.

4. MCP tool definitions

Every connected MCP server injects its tool schemas into context on every turn, whether or not you call them. Five servers with ten tools each is fifty schemas riding along constantly.

The pattern: fixed overhead (CLAUDE.md, MCP, system prompt) is cheap per turn but constant, and prompt caching mostly absorbs it. The variable cost (file reads and growing history) is what actually scales out of control, because it grows every turn and is the hardest to cache (file contents change session to session). That is why the highest-leverage reductions target search and history, not the system prompt.

Measure It: /cost and /context

You cannot reduce what you do not measure. Claude Code ships two built-in commands that answer two different questions: how much have I spent this session, and what is eating my context right now.

/cost: dollars and token totals for the session

Run /cost for a breakdown of input tokens, output tokens, cache read, cache creation, total tokens, and an estimated dollar cost computed locally from published rates. Claude Code v2.1.92 rebuilt the command to add a per-model breakdown, a color-coded cache hit rate (green / amber / red), and rate-limit utilization, with a one-line reminder that cache reads cost roughly 10x less than fresh input.

/cost output (illustrative shape)

> /cost

Session cost: ~$0.34

Tokens
  Input (fresh)      45,200
  Output             13,100
  Cache write         8,400
  Cache read         38,600
  Total             105,300

Cache hit rate: 82%  (green — cache reads bill at 0.1x input)

By model
  claude-sonnet-4-6   $0.29
  claude-haiku-4-5    $0.05

Rate-limit utilization: 31% of 5-hour session
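The source does not state how /cost derives the hit rate. One plausible definition, assumed here, is cache reads as a share of all cache traffic (reads plus writes); it happens to reproduce the 82% in the illustrative output above.

```python
def cache_hit_rate(cache_read_tokens: int, cache_write_tokens: int) -> float:
    """Assumed definition: share of cached-prefix tokens served as reads."""
    total = cache_read_tokens + cache_write_tokens
    return cache_read_tokens / total if total else 0.0


# Token counts taken from the illustrative /cost output above.
rate = cache_hit_rate(cache_read_tokens=38_600, cache_write_tokens=8_400)
print(f"{rate:.0%}")  # 82%
```

If the rate you compute this way drops sharply between sessions, a changing CLAUDE.md or tool set is the first thing to check.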

/context: where the current window is going

/context breaks the active context window into components, so you can see which one is filling it. No setup, works in any session.

/context breakdown (typical healthy session, 200K window)

Component	Tokens	Share
System prompt	2.6K	1.3%
System tools (Read, Edit, Bash, Grep, Glob)	17.6K	8.8%
MCP tools	907	0.5%
Custom agents	935	0.5%
Memory files (CLAUDE.md, auto-memory)	302	0.2%
Messages (history + tool output)	30.5K	15.3%
Free space	114K	57.0%
Autocompact buffer	33K	16.5%

Read this table top to bottom before a long session, not after the lockout. If Messages is climbing fast, you are accumulating file reads and tool output; compact or scope your search. If MCP tools is large, prune servers. If Memory files is large, trim CLAUDE.md. The 200K window here is the standard Sonnet context; the full 1M-token window is available at standard pricing with no surcharge tier above 200K.

Track Usage Over Time: ccusage

/cost and /context are per-session and in-the-moment. To see patterns across days, weeks, and projects, use ccusage, an open-source CLI (by @ryoppippi) that parses Claude Code's local JSONL session logs and reports historical totals. It tracks cache creation and cache read tokens separately, runs offline with pre-cached pricing data, and can group usage by project for multi-repo work.

Track historical usage with ccusage

# Run without installing (reads ~/.claude local JSONL logs)
npx ccusage@latest

# Daily breakdown
npx ccusage@latest daily

# Monthly totals
npx ccusage@latest monthly

# Group daily usage by project to find which repo burns the most
npx ccusage@latest daily --instances

The split of responsibilities: /context for real-time visibility inside a session, /cost for the running session tab, and ccusage for trend analysis over time. Watch the daily series. A steady climb usually means a growing CLAUDE.md or a new MCP server you forgot to remove; a spiky series usually means a few long search-heavy sessions.

Reduce Usage: Compaction

Compaction attacks the variable cost directly: it shrinks the conversation before it is re-sent. The instinct is to wait for Claude Code's auto-compact to fire at the context cliff. By then the agent has already paid full freight for 100+ turns of bloated context. Compacting proactively keeps the conversation lean from the start, and the savings compound because the compacted history is what gets re-sent on every remaining turn.

Most compaction is summarization: the model rewrites the history in fewer words. That introduces hallucination risk. File paths become "a config file," error codes become "an error occurred," function signatures get blurred. The agent then re-reads the same files to recover what the summary threw away, spending the tokens you just saved.

Morph Compact uses verbatim deletion instead. It removes low-signal tokens (redundant formatting, boilerplate, noise) while keeping every surviving sentence character-for-character identical. File paths, error codes, and exact numbers survive intact. The result is 50-70% token reduction at 33,000 tok/s with a 0% hallucination rate, not a low one.

Tokens per second (Morph Compact): 33,000
Typical token reduction: 50-70%
Time to compact a representative payload: ~2.5s
Hallucination rate (verbatim deletion): 0%

The cost math is direct. Compact a 200K-token conversation to 80K (60% reduction) and the input cost of the next call drops 60%, and so does every call after it until the context grows again. Over a 200-turn session, compacting early is the difference between paying for 200K-per-turn and 80K-per-turn for most of the run. See LLM cost optimization for how this stacks with the other levers.
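A rough sketch of that math, under simplifying assumptions: the context stays at a fixed size for the remaining calls, and every token is priced at the fresh-input rate ($3 per million, the Sonnet 4.6 figure). The 100 remaining calls and the helper name are hypothetical. With caching the dollar amounts shrink, but the ratio between the two runs stays the same.

```python
def input_cost_usd(context_tokens: int, calls: int, price_per_mtok: float) -> float:
    """Input cost of re-sending a fixed-size context on every call."""
    return context_tokens * calls * price_per_mtok / 1_000_000


REMAINING_CALLS = 100  # assumed number of calls left in the session
PRICE = 3.00           # USD per million input tokens

before = input_cost_usd(200_000, REMAINING_CALLS, PRICE)
after = input_cost_usd(80_000, REMAINING_CALLS, PRICE)

print(f"without compaction: ${before:.2f}")   # $60.00
print(f"with compaction:    ${after:.2f}")    # $24.00
print(f"saved: {1 - after / before:.0%}")     # 60%
```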

Reduce Usage: Scope and Cache the Fixed Overhead

The fixed overhead is cheap per turn but constant, and it is the easiest to cut because it is configuration, not workflow.

Scope the context that loads every session

Keep CLAUDE.md under 200 lines (Anthropic's own guidance). A 15K-token CLAUDE.md costs 15K tokens on every message. Audit @-imports; each one loads at launch. See the CLAUDE.md guide.
Scope rules with paths frontmatter. A rule in .claude/rules/ with paths: ["src/api/**/*.ts"] loads only when Claude touches matching files, not every session.
Remove MCP servers you are not actively using. Their tool schemas ride on every turn whether you call them or not.
If you do not use auto memory, disable it by setting CLAUDE_CODE_DISABLE_AUTO_MEMORY=1. By default since v2.1.59, it loads the first 200 lines or 25KB of MEMORY.md every session.

Cache the stable prefix

Whatever fixed overhead remains should be cached so it bills at 0.1x instead of full input. Claude Code caches the system prompt, tool definitions, and CLAUDE.md automatically when they are stable. The thing that breaks caching is churn: if your CLAUDE.md or MCP set changes between turns, the cache misses and you pay the 1.25x write premium again. A high, stable cache hit rate in /cost is the signal that your fixed overhead is being served cheap.

Order of operations

Scope first, then cache. Trimming a 15K-token CLAUDE.md to 3K before caching means the cache stores fewer tokens and serves faster. Caching a bloated file just makes the bloat cheaper, not smaller. The reductions stack, but only if you shrink before you cache.
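The difference between cheaper and smaller shows up when dollar cost and window share are computed side by side. The sketch below assumes Sonnet 4.6 prices, a 200K window, a 100-message session, and one cache write followed by cache reads; all names are illustrative.

```python
CONTEXT_WINDOW = 200_000
CACHE_WRITE_PRICE = 3.75  # USD per million tokens (Sonnet 4.6)
CACHE_READ_PRICE = 0.30


def cached_file_cost_usd(file_tokens: int, messages: int) -> float:
    """One cache write on the first message, cache reads on the rest."""
    write = file_tokens * CACHE_WRITE_PRICE
    reads = file_tokens * (messages - 1) * CACHE_READ_PRICE
    return (write + reads) / 1_000_000


def window_share(file_tokens: int) -> float:
    """Fraction of the context window the file occupies, cached or not."""
    return file_tokens / CONTEXT_WINDOW


for label, tokens in [("bloated", 15_000), ("trimmed", 3_000)]:
    cost = cached_file_cost_usd(tokens, messages=100)
    print(f"{label}: ${cost:.2f} per session, {window_share(tokens):.1%} of window")
# bloated: $0.50 per session, 7.5% of window
# trimmed: $0.10 per session, 1.5% of window
```

Caching lowers the first number only. The window share is fixed by the file's size, so trimming is the only way to recover that space.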

Reduce Usage: Keep Search and Edits Out of the Conversation

The largest variable cost is file reading during search, and the cleanest fix is to not let those tokens enter the Claude conversation at all. File-by-file search is brutal on usage: the agent reads 10-20 files to locate one function, every one stays in context for the rest of the session, and every byte is re-sent on every later turn.

WarpGrep runs codebase search on a dedicated trained model that finds the target code in 3.8 steps on average (0.73 F1). The search happens on WarpGrep's model, so the 15 files it read to find the answer never touch your Claude context. Claude gets the answer, not the haystack. Pricing: $0 for 100k requests (free), $1 per 1M requests (Pro).

The same logic applies to writing edits. Whole-file rewrites are expensive output tokens, and output bills at 5x input on the models priced above. Morph Fast Apply merges a short lazy diff into the file at 10,500 tok/s on morph-v3-fast, so Claude emits a few changed lines instead of re-emitting the whole file as output. Less output per edit, more edits per session.

For routine turns that do not need a frontier model at all, a router sends them to a cheaper one. Morph Router classifies prompt difficulty at $0.001/request in ~430ms and routes 60-80% of routine requests (formatting, simple edits, boilerplate) to a cheaper model, cutting weighted cost 40-70% without changing output on the hard turns. The combined effect of compaction, scoping, caching, and routing is detailed in the LLM cost optimization guide.
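The weighted savings follow from a simple blend of two prices. The sketch below is an illustration under stated assumptions, not the router's actual accounting: it pairs Sonnet 4.6 ($3 input) with Haiku 4.5 ($1 input) from the pricing table, assumes routine and hard turns carry the same token volume, and ignores the $0.001 per-request routing fee.

```python
def blended_input_price(frontier: float, cheap: float, routed_fraction: float) -> float:
    """Average price per million tokens when a fraction of turns is routed down."""
    return (1 - routed_fraction) * frontier + routed_fraction * cheap


def weighted_savings(frontier: float, cheap: float, routed_fraction: float) -> float:
    """Fractional saving versus sending every turn to the frontier model."""
    return 1 - blended_input_price(frontier, cheap, routed_fraction) / frontier


for routed in (0.6, 0.8):
    saved = weighted_savings(frontier=3.00, cheap=1.00, routed_fraction=routed)
    print(f"{routed:.0%} routed: {saved:.0%} saved")
# 60% routed: 40% saved
# 80% routed: 53% saved
```

A wider price gap between the two models pushes the saving higher for the same routed fraction: with a $5 frontier price and the same $1 cheap price, routing 80% of turns saves 64%.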

Search off the conversation

WarpGrep finds target code in 3.8 steps on a dedicated model. The files it reads never enter your Claude context. $0 for 100k requests.

Short diffs, not rewrites

Morph Fast Apply merges a lazy diff at 10,500 tok/s, so Claude emits changed lines instead of whole files. Cuts output tokens, the 5x category.

Route routine turns down

Morph Router classifies difficulty at $0.001/request and sends 60-80% of routine turns to a cheaper model. 40-70% weighted savings.

Frequently Asked Questions

How do I check token usage in Claude Code?

Run /cost for input, output, cache read, cache creation, total tokens, and an estimated dollar cost for the current session. v2.1.92 added a per-model breakdown, a color-coded cache hit rate, and rate-limit utilization. Run /context to see how the window is split across system prompt, tools, MCP, memory, skills, and history. For trends across sessions, the open-source ccusage CLI reads Claude Code's local JSONL logs and reports daily, monthly, and per-project totals.

What counts as token usage in Claude Code?

Everything in context, not just what you type: your messages, Claude's output, the system prompt, every CLAUDE.md and memory file, MCP tool schemas, and every file read and tool result. It splits into fresh input, output, cache write, and cache read. Because each turn re-sends the full conversation, the same content is billed again on every subsequent turn.

What is the difference between cache read and cache write tokens?

Cache write is billed once, the first time a block is cached, at a 1.25x premium over base input. Cache read is billed every later turn that reuses that block, at 0.1x input (90% off). In long sessions cache reads dominate the raw count: one reported 30-day window logged 5.09 billion cache reads against 3.9 million input/output tokens, because the full instruction set is re-sent as cached context on every message.

Why does Claude Code use so many tokens?

It re-sends the entire conversation on every turn, so a 20-turn session pays for the early messages 20 times. A 2,000-token system prompt across 200 calls is 400,000 input tokens. File-by-file search is the largest contributor: 10-20 files read to find one function, all staying in context. A 57KB CLAUDE.md (~15K tokens) costs 15K cache reads on every message.

How do I reduce token usage in Claude Code?

Compact the conversation before each call (Morph Compact: 50-70% reduction, verbatim deletion, zero hallucination), scope the fixed overhead (CLAUDE.md under 200 lines, prune MCP servers, path-scope rules), cache stable prefixes so they bill at 0.1x, and offload search and edits to dedicated models so those tokens never enter the Claude conversation.

What does the /context command show?

It breaks the active window into system prompt, system tools, MCP tools, custom agents, memory files, skills, conversation messages, free space, and the autocompact buffer. A healthy 200K-window session might show ~2.6K system prompt, ~17.6K system tools, ~30.5K messages, ~114K free. It is the fastest way to find which component is eating context.

Does file reading count toward Claude Code token usage?

Yes, and it is usually the largest variable cost. Every file read enters the conversation and is re-sent on every subsequent turn. Reading 15 files to find one function means those 15 files ride along for the rest of the session. Offloading search to WarpGrep keeps those file bytes out of the Claude conversation.

Spend Your Tokens on Generation, Not File Reads

WarpGrep finds the right code in 3.8 steps on a dedicated search model, so search tokens never enter your Claude conversation. Morph Compact cuts conversation tokens 50-70% by verbatim deletion. 100k WarpGrep requests free.
