Every Claude rate limit across API tiers, consumer plans, and Claude Code, in one place. Updated for March 2026, including the Max plan changes and token bucket mechanics.
How Token Bucket Rate Limiting Works
Anthropic uses a token bucket algorithm, not fixed-window resets. Your bucket has a maximum capacity equal to your per-minute limit. Tokens flow in at a constant rate. Each API request removes tokens. If the bucket is empty, you get a 429 rate_limit_error.
The practical effect: short bursts above your sustained rate are fine, as long as you have accumulated capacity. You do not need to wait for a reset window. Capacity replenishes continuously, at a rate proportional to your tier limit.
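The mechanics above can be sketched in a few lines. This is an illustrative model of a token bucket, not Anthropic's actual implementation; the capacity and refill numbers are taken from the tier tables below.

```python
import time

class TokenBucket:
    """Minimal token bucket: capacity equals the per-minute limit,
    and capacity refills continuously rather than at fixed resets."""

    def __init__(self, per_minute_limit: int):
        self.capacity = per_minute_limit
        self.tokens = float(per_minute_limit)        # start full
        self.refill_rate = per_minute_limit / 60.0   # tokens per second
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now

    def try_consume(self, cost: int) -> bool:
        """Spend `cost` tokens if available; False models a would-be 429."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(per_minute_limit=30_000)  # e.g. Tier 1 ITPM for Sonnet
assert bucket.try_consume(25_000)       # burst within accumulated capacity is fine
assert not bucket.try_consume(10_000)   # bucket nearly empty: throttled
```

Note that a full bucket allows a one-off burst up to the entire per-minute limit at once, which a fixed-window counter would also allow, but here capacity starts flowing back immediately instead of at the top of the next minute.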
The system tracks three independent dimensions. Hitting any one triggers throttling:
RPM
Requests per minute. Total API calls regardless of size. The bluntest constraint.
ITPM
Input tokens per minute. Your prompts, system messages, and context. Only uncached tokens count.
OTPM
Output tokens per minute. Model-generated text. The hardest to predict in advance.
Key optimization
Cached input tokens do not count toward ITPM. With 80% cache hit rate, your effective ITPM is 5x the nominal limit. Enable prompt caching before upgrading tiers.
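The effective-throughput claim is simple arithmetic: if only uncached tokens count, throughput scales by 1/(1 − hit rate). A quick check, using the Tier 2 Sonnet figure from the table below:

```python
def effective_itpm(nominal_itpm: int, cache_hit_rate: float) -> int:
    """Only uncached tokens count toward ITPM, so effective
    throughput scales by 1 / (1 - hit_rate)."""
    return int(nominal_itpm / (1 - cache_hit_rate))

# Tier 2 Sonnet: 450,000 nominal ITPM at an 80% cache hit rate -> 5x
assert effective_itpm(450_000, cache_hit_rate=0.80) == 2_250_000
```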
API Rate Limits by Tier
Tier upgrades are based on cumulative spend, not monthly spend. Once you reach a tier, you stay there. Each model has independent limits within the same tier.
Claude Sonnet 4
| Tier | Spend Required | RPM | ITPM | OTPM |
|---|---|---|---|---|
| Free | $0 | ~5 | Low | Low |
| Tier 1 | $5 | 50 | 30,000 | 10,000 |
| Tier 2 | $40 | 1,000 | 450,000 | 90,000 |
| Tier 3 | $200 | 2,000 | 800,000 | 200,000 |
| Tier 4 | $400 | 4,000 | 2,000,000 | 400,000 |
Claude Opus 4
| Tier | Spend Required | RPM | ITPM | OTPM |
|---|---|---|---|---|
| Tier 1 | $5 | 50 | 30,000 | 10,000 |
| Tier 2 | $40 | 1,000 | 150,000 | 30,000 |
| Tier 3 | $200 | 2,000 | 400,000 | 80,000 |
| Tier 4 | $400 | 4,000 | 800,000 | 200,000 |
Claude Haiku 3.5
| Tier | Spend Required | RPM | ITPM | OTPM |
|---|---|---|---|---|
| Tier 1 | $5 | 50 | 30,000 | 10,000 |
| Tier 2 | $40 | 1,000 | 500,000 | 100,000 |
| Tier 3 | $200 | 2,000 | 1,000,000 | 200,000 |
| Tier 4 | $400 | 4,000 | 4,000,000 | 800,000 |
Model Context and Output Limits
Context window and max output tokens are per-request limits, independent of rate limits. They determine how much you can send and receive in a single API call.
| Model | Context Window | Max Output Tokens |
|---|---|---|
| Claude Opus 4 | 200K tokens | 16,384 tokens |
| Claude Sonnet 4 | 200K tokens | 16,384 tokens |
| Claude Haiku 3.5 | 200K tokens | 8,192 tokens |
| Enterprise (select) | 500K tokens | Varies |
The 200K context window fits roughly 150,000 words or 500 pages of text. In practice, most API calls use far less. Sending full context every turn is the primary cause of ITPM exhaustion in agentic workflows.
Consumer Plan Usage Caps
Consumer plans on claude.ai use rolling message windows, not fixed daily counts. The exact number of messages varies because not all messages cost the same.
| Plan | Price | Window | Approx. Messages | Notes |
|---|---|---|---|---|
| Free | $0 | 5-hour | ~15-40 per window | Varies by demand |
| Pro | $20/mo | 5-hour | ~45 per window | ~5x free tier |
| Max 5x | $100/mo | Weekly | 5x Pro capacity | Includes Claude Code |
| Max 20x | $200/mo | Weekly | 20x Pro capacity | Includes Claude Code |
| Team | $25-30/seat | 5-hour | Similar to Pro | Admin + SSO features |
| Enterprise | Custom | Custom | Custom | Dedicated capacity |
Why message costs vary
The 30th message in a long conversation can cost 5-10x the first message. Each turn reprocesses the full context. Opus 4 messages consume 3-5x more quota than Sonnet 4. Starting fresh conversations resets this accumulation.
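The accumulation is easy to model. The numbers below are illustrative assumptions (a ~2,000-token system prompt and ~500 tokens added per message), not measured values, but they show how a later message lands in the 5-10x range the text describes:

```python
def turn_input_tokens(turn: int,
                      system_tokens: int = 2_000,
                      msg_tokens: int = 500) -> int:
    """Input cost of turn N = system prompt + all prior messages + the new one.
    Every turn resends the full accumulated history."""
    return system_tokens + turn * msg_tokens

assert turn_input_tokens(1) == 2_500     # first message
assert turn_input_tokens(30) == 17_000   # 30th message: ~6.8x the first
```

Starting a fresh conversation drops `turn` back to 1, which is exactly why it resets the cost curve.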
Claude Code Limits
Claude Code operates under three constraints that are tracked independently but enforced simultaneously. The dashboard usage percentage reflects only one of these dimensions. Your API tier limits from above still apply as a separate ceiling.
In March 2026, Max 5x subscribers reported exhausting their rate limit in roughly 90 minutes during normal agentic workloads. Anthropic attributed this to a bug in limit enforcement and temporarily doubled off-peak usage limits through March 28 as mitigation.
Practical tips for Claude Code rate management:
Start fresh sessions
Long sessions accumulate context that inflates token consumption per turn. New sessions reset the context cost curve.
Use subagents for research
Delegate file exploration to subagents. They run in separate context windows and return only relevant results, saving 40%+ of input tokens.
Compact at 70%, not 90%
Proactive compaction before context fills keeps per-turn costs low. Waiting until 90% means several expensive turns before compaction triggers.
Choose models deliberately
Opus 4 consumes 3-5x more quota than Sonnet 4 per message. Use Sonnet for routine tasks and reserve Opus for complex reasoning.
Rate Limits vs Usage Limits
Anthropic enforces two separate ceilings. Confusing them is a common source of unexpected blocks.
Rate limits
Cap throughput per minute: RPM, ITPM, OTPM. You get a 429 error and can retry after a short wait. These are the limits discussed in most of this guide.
Usage limits
Cap total monthly spend. Free: $10/mo. Tier 1: $100/mo. Tier 2: $500/mo. Tier 3: $1,000/mo. Tier 4: $5,000/mo. Exceeding this blocks API access until the next billing cycle.
| Tier | Monthly Spend Cap | How to Increase |
|---|---|---|
| Free | $10 | Add payment method |
| Tier 1 | $100 | Spend $40 cumulative |
| Tier 2 | $500 | Spend $200 cumulative |
| Tier 3 | $1,000 | Spend $400 cumulative |
| Tier 4 | $5,000 | Contact Anthropic sales |
Handling 429 Errors
When you hit a rate limit, the API returns a 429 status with type: "rate_limit_error" and a retry-after header indicating seconds to wait.
Response headers to monitor
Every API response includes rate limit headers, even successful ones:
- x-ratelimit-limit: Maximum allowed for this dimension
- x-ratelimit-remaining: Current remaining capacity
- x-ratelimit-reset: When the limit fully replenishes
- retry-after: Seconds to wait (only on 429 responses)
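A small helper can pull these signals out of each response. The header names below follow this guide; treat the exact spelling as an assumption and check your SDK's response object against the current API reference.

```python
def parse_rate_limit_headers(headers: dict) -> dict:
    """Extract throttling signals from an API response's headers.
    Header names are assumptions taken from this guide."""
    return {
        "limit": int(headers.get("x-ratelimit-limit", 0)),
        "remaining": int(headers.get("x-ratelimit-remaining", 0)),
        "reset": headers.get("x-ratelimit-reset"),
        # retry-after appears only on 429 responses
        "retry_after": (int(headers["retry-after"])
                        if "retry-after" in headers else None),
    }

info = parse_rate_limit_headers({
    "x-ratelimit-limit": "50",
    "x-ratelimit-remaining": "3",
    "retry-after": "12",
})
assert info["remaining"] == 3 and info["retry_after"] == 12
```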
Retry strategy
Implement exponential backoff with jitter. Start at 1 second, double each retry, add random jitter of 0-500ms. Do not retry in tight loops. The official Anthropic SDKs (Python and TypeScript) handle retries automatically with sensible defaults.
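If you are not using an official SDK, the strategy above is a few lines to hand-roll. This is a sketch; `RateLimitError` is a stand-in for whatever 429 exception your HTTP client or SDK raises.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's 429 exception type."""

def call_with_retry(request_fn, max_retries: int = 5, base: float = 1.0):
    """Exponential backoff with jitter: 1s, 2s, 4s, ... plus up to
    base/2 (i.e. 500ms at the default base) of random noise."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except RateLimitError:
            # Sleep base * 2^attempt, plus jitter so clients don't
            # retry in lockstep (the "thundering herd" problem).
            time.sleep(base * (2 ** attempt) + random.uniform(0, base / 2))
    return request_fn()  # final attempt; let the error propagate
```

The jitter matters: without it, every client that hit the limit at the same moment retries at the same moment, producing a fresh burst of 429s.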
Production tip
Build adaptive throttling using x-ratelimit-remaining. When remaining capacity drops below 20%, slow your request rate proactively rather than waiting for 429 errors.
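One way to implement that tip is to compute a pre-request delay from the remaining/limit ratio. The 20% threshold and the delay curve below are illustrative choices, not a prescribed formula:

```python
def adaptive_delay(remaining: int, limit: int,
                   base_delay: float = 0.0, max_delay: float = 2.0) -> float:
    """Return extra seconds to wait before the next request.
    Full speed above 20% headroom; ramp toward max_delay as it shrinks."""
    if limit <= 0:
        return base_delay
    headroom = remaining / limit
    if headroom >= 0.20:
        return base_delay
    # Scale the delay up linearly as headroom falls from 20% toward 0%.
    return base_delay + max_delay * (1 - headroom / 0.20)

assert adaptive_delay(remaining=30, limit=50) == 0.0  # 60% left: full speed
assert adaptive_delay(remaining=5, limit=50) > 0.0    # 10% left: throttle
```

Feed it the `x-ratelimit-remaining` and `x-ratelimit-limit` values from the previous response and sleep for the returned duration before sending the next request.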
Staying Under Limits in Production
Most applications never need Tier 4. The right combination of caching, context management, and model selection keeps production workloads within Tier 2 limits.
Enable prompt caching
Cached input tokens are free against ITPM. With 80% cache hit rate, your effective limit is 5x nominal. This alone can eliminate the need to upgrade tiers.
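Enabling caching is a matter of marking stable content blocks as cacheable in the request body. The sketch below shows the general shape of a Messages API payload with a cached system prompt; the model id is a placeholder from this guide, and the `cache_control` field placement should be verified against Anthropic's current prompt caching reference.

```python
# Sketch of a Messages API request body with prompt caching enabled
# on the system prompt. Field names follow Anthropic's prompt caching
# docs as of this writing; verify against the current API reference.
request = {
    "model": "claude-sonnet-4",   # placeholder model id used in this guide
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a code-review assistant for our monorepo.",
            "cache_control": {"type": "ephemeral"},  # mark block cacheable
        }
    ],
    "messages": [{"role": "user", "content": "Review this diff."}],
}

assert request["system"][0]["cache_control"]["type"] == "ephemeral"
```

Put the large, stable content (system prompt, reference documents, tool definitions) in the cached blocks and keep only the per-turn material uncached.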
Send less context
Most ITPM exhaustion comes from sending entire files when only a few functions are relevant. Send targeted context, not full repositories.
Use smaller models
Haiku 3.5 has 2x the ITPM of Sonnet at the same tier. Route classification, extraction, and simple tasks to Haiku. Reserve Sonnet and Opus for reasoning.
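A routing layer can be as simple as a lookup table. The model ids below are placeholders matching this guide's naming, and the task categories are examples; the point is the default-to-cheapest pattern:

```python
# Route simple task types to Haiku; fall back to Sonnet for everything
# else. Model ids are placeholders matching this guide's naming.
ROUTES = {
    "classification": "claude-haiku-3.5",
    "extraction": "claude-haiku-3.5",
    "reasoning": "claude-opus-4",
}

def pick_model(task_type: str) -> str:
    """Cheapest adequate model wins; Sonnet is the general-purpose default."""
    return ROUTES.get(task_type, "claude-sonnet-4")

assert pick_model("classification") == "claude-haiku-3.5"
assert pick_model("code-review") == "claude-sonnet-4"
```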
For agentic coding workflows specifically, the biggest optimization is reducing tokens per action. WarpGrep runs code searches in separate context windows, returning only relevant line ranges instead of full files. Morph Fast Apply compacts code edits so agents send diffs rather than complete file rewrites. When an agent uses 60% fewer tokens per action, it effectively gets 2.5x the rate limit headroom without changing tiers.
Frequently Asked Questions
What are Claude's API rate limits?
Claude API rate limits vary by tier. Tier 1 (after $5 cumulative spend) provides 50 RPM and 30,000 input tokens per minute for Sonnet. Tier 2 ($40 spend) jumps to 1,000 RPM and 450,000 ITPM. Tier 3 ($200) provides 2,000 RPM and 800,000 ITPM. Tier 4 ($400) reaches 4,000 RPM and 2,000,000 ITPM. Each model has independent limits.
How does Claude's token bucket algorithm work?
Anthropic uses a token bucket that replenishes continuously rather than resetting at fixed intervals. Your bucket capacity equals your per-minute limit. Tokens flow in at a constant rate, and each request removes tokens. Short bursts above sustained rate are allowed if you have accumulated capacity. When the bucket empties, you receive a 429 error.
Do cached tokens count against Claude rate limits?
No. With prompt caching, only uncached input tokens count toward ITPM. An application with 80% cache hit rate effectively gets 5x the nominal throughput. This is the single biggest optimization for high-volume API usage.
What are Claude Pro and Max plan usage limits?
Pro ($20/month) allows approximately 45 messages per 5-hour rolling window. Max 5x ($100/month) provides 5x Pro limits on weekly rolling windows. Max 20x ($200/month) provides 20x Pro limits. Message cost varies: the 30th message in a conversation can cost 5-10x the first due to full context reprocessing.
How do I fix Claude 429 rate limit errors?
Check the retry-after header for seconds to wait. Implement exponential backoff with jitter starting at 1 second. Long-term fixes: enable prompt caching, reduce context size, choose smaller models for simple tasks, and upgrade your API tier if needed.
What are Claude Code rate limits?
Claude Code operates under three constraints that are tracked independently but enforced simultaneously; your dashboard percentage reflects only one dimension, and API tier limits still apply. In March 2026, Max 5x users reported limit exhaustion in roughly 90 minutes, which Anthropic attributed to a bug and temporarily mitigated by doubling off-peak limits.
What is the difference between rate limits and usage limits?
Rate limits cap throughput per minute (RPM, ITPM, OTPM) and result in temporary 429 errors. Usage limits cap total monthly spend ($10 for Free through $5,000 for Tier 4) and block API access entirely until the next billing cycle. They are enforced independently.
Reduce Token Consumption, Stretch Your Rate Limits
Morph Fast Apply and WarpGrep reduce the tokens coding agents consume per action by up to 60%, effectively multiplying your rate limit headroom.