Cerebras Pricing: 2,000 tok/s Inference at What Cost? (2026 Breakdown)

Cerebras inference runs Llama 3.3 70B at 2,300 tok/s and Qwen3 Coder 480B at 2,000 tok/s. Pricing starts at $0.10/MTok for Llama 3.1 8B, with a free tier of 1M tokens/day. Full model pricing, speed benchmarks, Groq/Fireworks comparison, and when the speed premium pays for itself in coding agent loops.

April 5, 2026 · 11 min read

Cerebras API Pricing by Model (April 2026)

All prices per million tokens (MTok). Cerebras runs open-source models only. No proprietary models, no fine-tuned variants (yet, though enterprise customers can bring custom weights). Pricing verified against cerebras.ai/pricing and Artificial Analysis as of April 2026.

Headline prices: Llama 3.1 8B at $0.10/MTok (in/out) · GPT-OSS 120B at $0.35 in / $0.75 out · Qwen3 Coder 480B at $2.00/MTok (in/out).

| Model | Input (per MTok) | Output (per MTok) | Context Window |
|---|---|---|---|
| Llama 3.1 8B | $0.10 | $0.10 | 16K |
| GPT-OSS 120B | $0.35 | $0.75 | 131K |
| Qwen3 235B (A22B) | $0.60 | $1.20 | 262K |
| Qwen3 Coder 480B | $2.00 | $2.00 | 131K |
| GLM-4.7 | $2.25 | $2.75 | 203K |
| K2 Think (32B) | TBD | TBD | — |
| DeepSeek R1 Distill 70B | TBD | TBD | — |

The spread between the cheapest model (Llama 3.1 8B at $0.10) and the most expensive (GLM-4.7 at $2.25/$2.75) is roughly 25x. For coding workloads, Qwen3 Coder 480B at $2.00/MTok is the relevant benchmark. Its input price sits just below Claude Sonnet 4.6's $3.00, but the bigger difference is on output: $2.00 on Cerebras versus $15.00 on Sonnet.

Output tokens are the bargain

Coding agents generate substantial output: code, explanations, tool calls. On Cerebras, output costs the same as input for most models. On Anthropic, output costs 3-5x input. For output-heavy workloads, Cerebras' flat pricing structure is a material advantage.
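
To see how flat in/out pricing plays out, here is a back-of-envelope sketch in Python. Prices come from the tables in this article; the 100K-in / 200K-out token split is an illustrative assumption, not a measured workload.

```python
# Back-of-envelope cost comparison for an output-heavy coding task.
# Prices ($/MTok) come from the tables in this article; the token
# split is an illustrative assumption.

def task_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost in dollars for one task; prices are $ per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# 100K input tokens, 200K output tokens for one large refactoring task
flat = task_cost(100_000, 200_000, 2.00, 2.00)    # Qwen3 Coder 480B on Cerebras
asym = task_cost(100_000, 200_000, 3.00, 15.00)   # Claude Sonnet 4.6

print(f"Cerebras (flat in/out): ${flat:.2f}")   # $0.60
print(f"Sonnet ($3 in/$15 out): ${asym:.2f}")   # $3.30
```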

Speed Benchmarks: Tokens Per Second

This is the headline. Cerebras does not compete on price alone. It competes on speed, and the numbers are not incremental improvements. They are a different order of magnitude from GPU-based providers.

Headline speeds: Llama 4 Maverick 400B at ~2,500 tok/s · Llama 3.3 70B at ~2,300 tok/s · Qwen3 Coder 480B at ~2,000 tok/s · Llama 3.1 405B at ~969 tok/s.

| Model | Parameters | Output Speed | TTFT |
|---|---|---|---|
| Llama 4 Maverick | 400B (MoE) | ~2,500 tok/s | — |
| Llama 4 Scout | 109B (MoE) | ~2,600 tok/s | — |
| Llama 3.3 70B | 70B | ~2,300 tok/s | 170ms |
| GPT-OSS 120B | 120B | ~1,700 tok/s | 280ms |
| Qwen3 Coder 480B | 480B (MoE) | ~2,000 tok/s | — |
| Qwen3 235B | 235B (MoE) | ~1,400 tok/s | — |
| GLM-4.7 | 355B | ~1,100 tok/s | 640ms |
| K2 Think (32B) | 32B | ~2,000 tok/s | — |
| Llama 3.1 405B | 405B | ~969 tok/s | 240ms |
| Llama 3.1 8B | 8B | ~2,200 tok/s | 440ms |
| DeepSeek R1 Distill 70B | 70B | ~1,500 tok/s | — |

For context: a typical GPU provider delivers 50-200 tok/s on a 70B model. Cerebras delivers 2,300. That is 10-40x faster. At 2,000 tok/s, 1,000 lines of JavaScript generate in about 4 seconds. The same generation takes 30 seconds on Gemini 2.5 Flash and 80 seconds on Claude 4 Sonnet.
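
A quick sanity check on that arithmetic, assuming roughly 8 tokens per line of code (a loose heuristic, not a measured constant):

```python
# Rough wall-clock estimate for generating N lines of code at a given
# throughput. The ~8 tokens/line figure is a heuristic assumption;
# real token counts vary with language and style.

TOKENS_PER_LINE = 8  # assumption

def generation_seconds(lines, tokens_per_second):
    return lines * TOKENS_PER_LINE / tokens_per_second

for provider, tps in [("Cerebras (Qwen3 Coder)", 2000),
                      ("Fast GPU provider", 200),
                      ("Standard GPU provider", 80)]:
    print(f"{provider:24s} {generation_seconds(1000, tps):6.1f}s for 1,000 lines")
# Cerebras: 4.0s, fast GPU: 40.0s, standard GPU: 100.0s
```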

TTFT caveat for large models

Time to first token on smaller models (Llama 3.3 70B: 170ms) is excellent. On larger models like Qwen3 Coder 480B, users report higher TTFT (several seconds) due to model loading overhead. Once generation starts, throughput is consistently fast, but the initial wait can add up across many short requests in an agent loop.
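
The effect is easy to model: per-request latency is roughly TTFT plus output tokens divided by throughput. The sketch below uses the 170ms figure from the table and an assumed 3-second TTFT for a large model, to illustrate how first-token latency can dominate a loop of short requests.

```python
# Simple latency model: request time = TTFT + output_tokens / throughput.
# The 3s TTFT for a large model is an assumption based on the user
# reports cited above, not an official spec.

def request_seconds(ttft_s, output_tokens, tok_per_s):
    return ttft_s + output_tokens / tok_per_s

# 50 short tool-call turns of 200 output tokens each in an agent loop
turns, tokens = 50, 200
fast_ttft = turns * request_seconds(0.17, tokens, 2300)  # Llama 3.3 70B
slow_ttft = turns * request_seconds(3.0, tokens, 2000)   # large model (assumed TTFT)

print(f"170ms TTFT: {fast_ttft:.0f}s total")  # ~13s
print(f"3s TTFT:    {slow_ttft:.0f}s total")  # ~155s
```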

Why Wafer-Scale Is Fast

The speed comes from hardware architecture, not software tricks. GPU inference is bottlenecked by memory bandwidth: model weights live in HBM, and the GPU spends most of its time waiting for data to arrive. NVIDIA's H100 has about 3.35 TB/s of memory bandwidth. Fast, but still the limiting factor for inference.

Cerebras' WSE-3 (Wafer-Scale Engine, third generation) takes a different approach. Instead of a small chip that shuttles data from external memory, it is a single silicon wafer the size of a dinner plate: 46,225 mm², 4 trillion transistors, 900,000 AI cores. The key number: 44 GB of SRAM sits directly on the chip, co-located with compute. No off-chip memory fetch. No bandwidth wall.

Key specs: 900,000 AI cores on a single wafer · 44 GB of on-chip SRAM (vs ~50 MB on an H100) · 21 PB/s aggregate memory bandwidth.

The 44 GB of on-chip SRAM is nearly 900x more than an H100's on-chip memory. During inference, model parameters are already positioned next to the cores that need them. The result is 21 petabytes per second of aggregate memory bandwidth across the wafer, compared to 3.35 TB/s on an H100. That roughly 6,000x bandwidth advantage is why a single Cerebras system outperforms racks of GPUs on inference throughput.

The tradeoff: 44 GB of SRAM constrains which models fit on a single chip. Dense models larger than roughly 70B parameters require model parallelism, or mixture-of-experts architectures whose active parameters stay within SRAM capacity. This is why Cerebras' fastest results come on MoE models (Llama 4 Maverick, Qwen3 Coder), where active parameters per forward pass are much smaller than total parameter count.
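
A back-of-envelope roofline makes the bandwidth argument concrete. At batch size 1, each decoded token must stream every active parameter through memory once, so peak decode speed is roughly bandwidth divided by active-weight bytes. This ignores KV-cache traffic, compute limits, and batching, so treat the numbers as ceilings, not predictions of the benchmarks above.

```python
# Roofline ceiling for single-stream decode: tokens/s <= memory
# bandwidth / bytes of active weights. FP16 (2 bytes/param) assumed.

def max_tok_per_s(bandwidth_bytes_per_s, active_params, bytes_per_param=2):
    return bandwidth_bytes_per_s / (active_params * bytes_per_param)

H100_BW = 3.35e12   # ~3.35 TB/s HBM
WSE3_BW = 21e15     # ~21 PB/s aggregate on-chip SRAM

params_70b = 70e9   # dense 70B model: all parameters active per token
print(f"H100 ceiling:  {max_tok_per_s(H100_BW, params_70b):8.0f} tok/s")  # ~24
print(f"WSE-3 ceiling: {max_tok_per_s(WSE3_BW, params_70b):8.0f} tok/s")  # ~150,000
```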

Free Tier and Paid Plans

Cerebras offers three API tiers: Free, Developer, and Enterprise. Plus two subscription plans (Cerebras Code) for coding-specific workloads.

| Tier | Price | Rate Limit | Models | Support |
|---|---|---|---|---|
| Free | $0 | 30 RPM, 1M tok/day | Llama 3.3 70B, Qwen3 32B, Qwen3 235B, GPT-OSS 120B | Discord community |
| Developer | Pay-per-token (from $10) | 10x free tier limits | All models | Standard |
| Enterprise | Contact sales | Highest limits + dedicated queue | All models + custom weights | Dedicated team + SLA |

The free tier is genuine: 1 million tokens per day, no credit card, access to production models including Llama 3.3 70B and Qwen3 235B. For prototyping and evaluation, this is enough to run meaningful tests. The context window starts at 8,192 tokens on the free tier, expandable to 128K on request. Developer tier unlocks up to 262K context (model-dependent) and 10x rate limits.

Free tier context limit

The 8,192-token context limit on the free tier is restrictive for coding agents, which routinely need 32K-128K tokens of context. Evaluation of Cerebras for agent workloads requires at minimum the Developer tier.
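
For evaluation, a minimal smoke test looks like the sketch below, which goes through Cerebras' OpenAI-compatible endpoint. The base URL and model id follow Cerebras' documentation at the time of writing; verify both against the current docs before relying on them, since model ids change.

```python
# Minimal free-tier smoke test via Cerebras' OpenAI-compatible endpoint.
# Requires: pip install openai, and CEREBRAS_API_KEY in the environment.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # check current docs
    api_key=os.environ["CEREBRAS_API_KEY"],
)

resp = client.chat.completions.create(
    model="llama-3.3-70b",  # free-tier model; id may differ, check docs
    messages=[{"role": "user",
               "content": "Write a Python one-liner to reverse a string."}],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```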

Cerebras Code: $50 and $200 Subscription Plans

Separate from the API, Cerebras sells coding-specific subscription plans that work with popular IDE agents like Cline, RooCode, OpenCode, and Crush.

Code Pro — $50/month

Up to 24 million tokens per day ($48/day of API value). 1,000 messages/day. 1M TPM rate limit. Access to top open-source coding models including Qwen3 Coder 480B at up to 2,000 tok/s with 131K context. Suited for indie developers and simple agentic workflows.

Code Max — $200/month

Up to 120 million tokens per day ($240/day of API value). 5,000 messages/day. 1.5M TPM rate limit. Same model access as Pro with higher throughput limits. Built for full-time development, multi-agent systems, and heavy code refactoring.

The value proposition is straightforward. If you consume more than $50/month of API tokens, Code Pro saves money. At $2.00/MTok on Qwen3 Coder, $50 buys 25 million tokens via the API. Code Pro gives you 24 million tokens per day. That is roughly 720 million tokens per month for $50, a massive discount over pay-per-token rates.
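
The break-even math in code, using the $2.00/MTok Qwen3 Coder rate and the plan limits quoted above (a 30-day month is assumed):

```python
# Break-even sketch: Code Pro ($50/mo, up to 24M tokens/day) vs paying
# per token at Qwen3 Coder's $2.00/MTok flat rate.

API_PRICE = 2.00          # $/MTok, Qwen3 Coder 480B, in and out
PLAN_PRICE = 50.0         # $/month, Code Pro
DAILY_CAP = 24_000_000    # tokens/day on Code Pro

breakeven_tokens = PLAN_PRICE / API_PRICE * 1_000_000
print(f"Break-even: {breakeven_tokens / 1e6:.0f}M tokens/month")  # 25M

monthly_cap = DAILY_CAP * 30  # assumes a 30-day month
print(f"Plan ceiling: {monthly_cap / 1e6:.0f}M tokens/month "
      f"(${monthly_cap / 1e6 * API_PRICE:,.0f} at API rates)")    # 720M, $1,440
```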

The catch: you are locked to Cerebras' model selection. No Claude, no GPT-4o, no proprietary models. If your workflow requires mixing proprietary and open-source models, the subscription does not replace your other providers. It supplements them.

Cerebras vs Groq: Speed and Price

Groq is the other name in fast inference, built on custom LPU (Language Processing Unit) silicon. Both companies position themselves as alternatives to GPU-based inference. But the performance profiles differ.

| Factor | Cerebras | Groq |
|---|---|---|
| Llama 3.3 70B speed | ~2,300 tok/s | ~350-500 tok/s |
| GPT-OSS 120B speed | ~1,700 tok/s | ~400-500 tok/s |
| Llama 3.1 8B input price | $0.10/MTok | $0.05/MTok |
| Llama 3.1 8B output price | $0.10/MTok | $0.08/MTok |
| GPT-OSS 120B input price | $0.35/MTok | $0.15/MTok |
| GPT-OSS 120B output price | $0.75/MTok | $0.60/MTok |
| Llama 3.3 70B input price | ~$0.60/MTok | $0.59/MTok |
| Llama 3.3 70B output price | ~$0.60/MTok | $0.79/MTok |
| Free tier | 1M tok/day | Limited free access |
| Prompt caching | Yes | Yes (50% off cached input) |
| Architecture | Wafer-scale (WSE-3) | LPU (Language Processing Unit) |

Groq is cheaper per token on most models. Cerebras is 4-6x faster on throughput. The question is which metric matters for your workload.

Artificial Analysis benchmarks put the speed-normalized price at roughly $0.00017 per token-per-second of delivered throughput on Cerebras versus $0.00135 on Groq: normalized for speed, each Cerebras token costs about 8x less. If your workload is throughput-sensitive (generating large amounts of code, running agent loops, batch processing), Cerebras' higher per-token price is offset by its speed advantage. If you are running short, latency-sensitive requests where time-to-first-token matters more than sustained throughput, Groq is competitive.
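
One way to reproduce that kind of normalization, using the per-model numbers from the comparison table above rather than Artificial Analysis' exact inputs (so the ratio lands near, not exactly on, their 8x figure):

```python
# Speed-normalized price: output $/MTok divided by output tok/s.
# Lower is better. A rough composite, not an official methodology.

providers = {
    # name: (output $/MTok, output tok/s) for a 70B-class model,
    # taken from the Cerebras vs Groq table above
    "Cerebras": (0.60, 2300),
    "Groq":     (0.79,  450),
}

for name, (price, tps) in providers.items():
    print(f"{name:10s} {price / tps:.5f} $/MTok per tok/s")
# Cerebras ~0.00026, Groq ~0.00176: roughly a 7x gap with these
# inputs, in the same ballpark as the ~8x figure cited above.
```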

Groq batch discount

Groq's Batch API provides 50% off for non-urgent requests. If throughput is not critical and you can tolerate async processing, Groq's batch pricing undercuts Cerebras on per-token cost. For real-time agent loops, batch processing is not an option.

Cerebras vs Fireworks vs Together AI

Fireworks and Together AI run inference on GPU clusters (A100, H100, B200). They compete on price and model breadth rather than raw speed. Both support a wider range of models than Cerebras, including fine-tuned variants and custom deployments.

| Factor | Cerebras | Fireworks AI | Together AI | Groq |
|---|---|---|---|---|
| 70B model speed | ~2,300 tok/s | ~100-200 tok/s | ~100-200 tok/s | ~350-500 tok/s |
| 70B input price | ~$0.60/MTok | $0.90/MTok | ~$0.80/MTok | $0.59/MTok |
| 120B input price | $0.35/MTok | $0.15/MTok | $0.15/MTok | $0.15/MTok |
| Custom fine-tunes | Enterprise only | Yes | Yes | No |
| Model breadth | ~10 models | 50+ models | 50+ models | ~10 models |
| Free tier | 1M tok/day | Limited | Limited | Limited |
| Architecture | WSE-3 (custom) | GPU (A100/H100/B200) | GPU (A100/H100) | LPU (custom) |

The pattern is clear. Cerebras and Groq trade price for speed using custom silicon. Fireworks and Together AI offer broader model selection and fine-tuning on GPU infrastructure at competitive per-token rates but 10-20x lower throughput. For production systems that need specific fine-tuned models, Fireworks or Together AI may be the only option. For speed-critical workloads on supported open-source models, Cerebras is in a category of its own.

When Speed Pays for Itself: Coding Agents

A coding agent does not make one LLM call. It makes dozens. Read a file, generate code, run tests, read the error, fix the code, run tests again, read more files, refactor. Each step waits for the previous step's output before it can proceed. This is a serial workload where inference speed directly maps to developer wait time.

Consider a 10-step agent loop where each step generates 2,000 output tokens:

| Provider | Speed | Time per Step | 10-Step Loop |
|---|---|---|---|
| Cerebras | 2,000 tok/s | 1.0s | 10 seconds |
| Groq | 400 tok/s | 5.0s | 50 seconds |
| GPU provider (fast) | 200 tok/s | 10.0s | 100 seconds |
| GPU provider (standard) | 80 tok/s | 25.0s | 250 seconds |

10 seconds versus 4 minutes. That is not a marginal improvement. It changes how developers interact with agents. At 10 seconds, you stay in flow. At 4 minutes, you context-switch to something else and lose the thread.

The cost comparison for this loop (20,000 total output tokens on a 70B model): Cerebras at ~$0.60/MTok output = $0.012. Groq at $0.79/MTok output = $0.016. A GPU provider at $0.90/MTok = $0.018. The price difference is negligible. The time difference is 10x.
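
Both the time and cost figures fall out of a few lines of arithmetic:

```python
# Reproduce the loop numbers above: a 10-step serial agent loop where
# each step emits 2,000 output tokens on a 70B-class model.

STEPS, TOKENS_PER_STEP = 10, 2_000

def loop_time_and_cost(tok_per_s, out_price_per_mtok):
    seconds = STEPS * TOKENS_PER_STEP / tok_per_s
    dollars = STEPS * TOKENS_PER_STEP * out_price_per_mtok / 1_000_000
    return seconds, dollars

for name, tps, price in [("Cerebras", 2000, 0.60),
                         ("Groq", 400, 0.79),
                         ("GPU (standard)", 80, 0.90)]:
    s, c = loop_time_and_cost(tps, price)
    print(f"{name:15s} {s:6.0f}s  ${c:.3f}")
# Cerebras: 10s/$0.012, Groq: 50s/$0.016, standard GPU: 250s/$0.018
```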

At a glance: the same 10-step loop takes 10 seconds on Cerebras, 50 seconds on Groq, and 250 seconds on a standard GPU provider.

Where speed doesn't matter

Batch processing, background code review, nightly test generation, documentation updates. Any workload where the developer is not waiting on the result. For these, optimize on price. Use Groq batch (50% off) or a GPU provider. Reserve Cerebras for interactive agent sessions where a human is in the loop.

Model Availability and Limits

Cerebras runs open-source models. The catalog is smaller than GPU-based providers but covers the models that matter for coding:

Production Models

Llama 3.1 8B, Llama 3.3 70B, Llama 4 Scout (109B MoE), Llama 4 Maverick (400B MoE), OpenAI GPT-OSS 120B, DeepSeek R1 Distill Llama 70B. Stable, fully supported, production-ready rate limits.

Preview Models

Qwen3 235B, Qwen3 Coder 480B, GLM-4.7, K2 Think (32B), K2 Think V2 (70B). Available for evaluation. May be discontinued on short notice. Not recommended for production dependencies.

Notable gaps: no Claude, no GPT-4o/5.x, no Gemini. No fine-tuned model hosting on the Developer tier. If your agent workflow requires a proprietary model for certain tasks, Cerebras cannot be your only provider. The typical production setup routes speed-critical open-source calls to Cerebras and proprietary model calls to Anthropic, OpenAI, or Google.
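
A minimal sketch of that routing split follows. The task labels, model ids, and dictionary-based dispatch are illustrative assumptions, not a real library's API:

```python
# Hypothetical routing sketch for a mixed-provider setup: speed-critical
# open-source calls go to Cerebras, capability-critical calls to a
# proprietary provider. All names below are illustrative.

ROUTES = {
    # task kind: (provider, model)
    "edit":     ("cerebras",  "qwen-3-coder-480b"),
    "scaffold": ("cerebras",  "llama-3.3-70b"),
    "review":   ("anthropic", "claude-sonnet-4-6"),  # capability-critical
}

def route(task_kind: str) -> tuple[str, str]:
    """Pick a provider/model pair, defaulting to the proprietary route."""
    return ROUTES.get(task_kind, ("anthropic", "claude-sonnet-4-6"))

print(route("edit"))    # ('cerebras', 'qwen-3-coder-480b')
print(route("triage"))  # falls back to the proprietary provider
```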

Context windows vary significantly: Llama 3.1 8B supports 16K tokens, GPT-OSS 120B supports 131K, Qwen3 235B supports 262K. For coding agents that accumulate large contexts, confirm the context limit of your target model before building around it.
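
A pre-flight guard for those limits might look like the sketch below. The chars/4 token estimate is a crude heuristic (real tokenizers vary by model and language); use the provider's tokenizer for anything precise.

```python
# Pre-flight context-window check before dispatching a request.
# Limits come from the pricing table above; the chars/4 token
# estimate is a rough heuristic assumption.

CONTEXT_LIMITS = {  # tokens
    "llama-3.1-8b": 16_000,
    "gpt-oss-120b": 131_000,
    "qwen-3-235b": 262_000,
}

def fits_in_context(model: str, prompt: str,
                    reserve_for_output: int = 4_000) -> bool:
    estimated_tokens = len(prompt) // 4  # crude heuristic
    return estimated_tokens + reserve_for_output <= CONTEXT_LIMITS[model]

big_prompt = "x" * 100_000  # ~25K estimated tokens
print(fits_in_context("llama-3.1-8b", big_prompt))   # False: 16K window
print(fits_in_context("gpt-oss-120b", big_prompt))   # True
```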

Frequently Asked Questions

How much does Cerebras inference cost?

Pricing starts at $0.10 per million tokens for Llama 3.1 8B. GPT-OSS 120B costs $0.35 input / $0.75 output. Qwen3 Coder 480B costs $2.00 per MTok. GLM-4.7 costs $2.25 input / $2.75 output. There is a free tier with 1M tokens per day, no credit card required.

How fast is Cerebras inference?

2,300 tok/s on Llama 3.3 70B. 2,000 tok/s on Qwen3 Coder 480B. 2,500 tok/s on Llama 4 Maverick 400B. These are 10-20x faster than GPU-based providers and 4-6x faster than Groq.

Is Cerebras faster than Groq?

On sustained throughput, yes, by roughly 4-6x on the same models. Independent benchmarks from Artificial Analysis confirm this. Groq has competitive time-to-first-token latency, but Cerebras dominates tokens-per-second output.

What models does Cerebras support?

Open-source models only: Llama 3.x/4.x series, OpenAI GPT-OSS 120B, Qwen3 32B/235B/Coder 480B, GLM-4.7, DeepSeek R1 Distill 70B, K2 Think. No proprietary models. Enterprise customers can bring custom weights.

What is the Cerebras free tier?

1 million tokens per day, no credit card. Access to Llama 3.3 70B, Qwen3 32B, Qwen3 235B, and GPT-OSS 120B. 30 requests per minute. Context starts at 8,192 tokens (expandable to 128K on request). Good for evaluation, too limited for production agent workloads.

How does Cerebras Code Pro compare to the API?

Code Pro ($50/month) includes up to 24 million tokens per day. At $2.00/MTok API pricing for Qwen3 Coder, $50 buys 25M tokens total via the API. Code Pro gives you that amount every day. The subscription is dramatically cheaper for heavy usage. Code Max ($200/month) provides 120M tokens per day. Both work with Cline, RooCode, OpenCode, and similar IDE agents.

Why is Cerebras inference so fast?

The WSE-3 chip packs 900,000 cores and 44 GB of SRAM on a single wafer. Model weights live in on-chip memory next to compute cores, eliminating the memory bandwidth bottleneck that limits GPU inference. The chip delivers 21 petabytes per second of memory bandwidth, roughly 6,000x more than an H100. More bandwidth, less waiting, faster token generation.

Should I use Cerebras for coding agents?

If your agent workflow runs on open-source models and speed is a priority, yes. Cerebras is the fastest inference provider available. The 10x speed advantage over GPU providers means agent loops complete in seconds instead of minutes. The limitation is model selection: no proprietary models, no fine-tuning on Developer tier. Most production agent systems use Cerebras for speed-critical calls and a proprietary provider for capability-critical calls. Morph helps route between providers based on task requirements.

Build faster coding agents with Morph

Morph routes between inference providers based on speed, cost, and capability. Use Cerebras for throughput-critical steps, proprietary models for capability-critical steps, and Morph Compact to cut context size by 50-70% across all providers.