Every Major LLM API, Sorted by Input Cost (April 2026)
Prices per million tokens. Input is what you send. Output is what the model returns. Output typically costs 2-8x more than input. For coding agents and long-context workloads, input cost dominates because you send far more tokens than you receive.
| Model | Provider | Input/MTok | Output/MTok | Tier |
|---|---|---|---|---|
| GPT-OSS 20B | OpenAI | $0.03 | $0.10 | Budget |
| Llama 3.1 8B Instant | Groq | $0.05 | $0.08 | Budget |
| GPT-5 Nano | OpenAI | $0.05 | $0.40 | Budget |
| Amazon Nova Lite | AWS Bedrock | $0.06 | $0.24 | Budget |
| Gemini 2.5 Flash-Lite | Google | $0.10 | $0.40 | Budget |
| GPT-4.1 Nano | OpenAI | $0.10 | $0.40 | Budget |
| Llama 4 Scout (17Bx16E) | Groq | $0.11 | $0.34 | Budget |
| GPT-OSS 120B | OpenAI | $0.15 | $0.60 | Budget |
| Mistral Small 3.1 | Mistral | $0.20 | $0.60 | Budget |
| GPT-5.4 Nano | OpenAI | $0.20 | $1.25 | Budget |
| GPT-4.1 Mini | OpenAI | $0.20 | $0.80 | Budget |
| Grok 4.1 Fast | xAI | $0.20 | $0.50 | Budget |
| Qwen3 Coder 480B A35B | Alibaba | $0.22 | $0.90 | Budget |
| Gemini 3.1 Flash-Lite | Google | $0.25 | $1.50 | Budget |
| DeepSeek V3.2 (Chat) | DeepSeek | $0.28 | $0.42 | Budget |
| DeepSeek V3.2 (Reasoner) | DeepSeek | $0.28 | $0.42 | Budget |
| Qwen3 32B | Groq | $0.29 | $0.59 | Budget |
| Gemini 2.5 Flash | Google | $0.30 | $2.50 | Budget |
| Codestral 2508 | Mistral | $0.30 | $0.90 | Budget |
| DeepSeek V4 | DeepSeek | $0.30 | $0.50 | Budget |
| Gemini 3 Flash | Google | $0.50 | $3.00 | Budget |
| o4-mini | OpenAI | $0.55 | $2.20 | Budget |
| DeepSeek R1 | DeepSeek | $0.55 | $2.19 | Budget |
| Llama 3.3 70B Versatile | Groq | $0.59 | $0.79 | Budget |
| Qwen3 Coder Plus | Alibaba | $0.65 | $3.25 | Budget |
| GPT-5.4 Mini | OpenAI | $0.75 | $4.50 | Budget |
| Amazon Nova Pro | AWS Bedrock | $0.80 | $3.20 | Budget |
| Mistral Medium 3 | Mistral | $1.00 | $3.00 | Mid |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | Mid |
| o3-mini | OpenAI | $1.10 | $4.40 | Mid |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 | Mid |
| GPT-5 | OpenAI | $1.25 | $10.00 | Mid |
| GPT-5.2 | OpenAI | $1.75 | $14.00 | Mid |
| Gemini 3.1 Pro | Google | $2.00 | $12.00 | Mid |
| Mistral Large 3 | Mistral | $2.00 | $6.00 | Mid |
| o3 | OpenAI | $2.00 | $8.00 | Mid |
| GPT-4.1 | OpenAI | $2.00 | $8.00 | Mid |
| GPT-5.4 | OpenAI | $2.50 | $15.00 | Mid |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | Mid |
| Grok 4 | xAI | $3.00 | $15.00 | Mid |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 | Premium |
| GPT-5.2 Pro | OpenAI | $10.50 | $84.00 | Premium |
| Claude Opus 4.1 (legacy) | Anthropic | $15.00 | $75.00 | Premium |
| o3 Pro | OpenAI | $20.00 | $80.00 | Premium |
| GPT-5.4 Pro | OpenAI | $30.00 | $180.00 | Premium |
| o1-pro (legacy) | OpenAI | $150.00 | $600.00 | Premium |
Reading this table
Input/MTok = price per million input tokens (your prompts, system messages, context). Output/MTok = price per million output tokens (model responses). A typical coding agent session sends 5-10x more input tokens than it receives in output. For chat applications, the ratio is closer to 1:1.
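The per-request arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a billing library; the token counts in the examples are assumptions chosen to match the agent-vs-chat ratios above.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_mtok: float, output_per_mtok: float) -> float:
    """Dollar cost of one API request, given per-million-token prices."""
    return (input_tokens / 1e6) * input_per_mtok \
         + (output_tokens / 1e6) * output_per_mtok

# Agent-style turn: 80K input, 10K output at $3.00/$15.00 (Claude Sonnet 4.6)
agent_turn = request_cost(80_000, 10_000, 3.00, 15.00)  # $0.24 + $0.15 = $0.39
# Chat-style turn: 1K input, 1K output at the same prices
chat_turn = request_cost(1_000, 1_000, 3.00, 15.00)     # $0.003 + $0.015 = $0.018
```

Note how input dominates the agent turn ($0.24 of $0.39) despite output's 5x higher per-token rate, which is why the rest of this article sorts by input price.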
Budget Tier: Under $1 per Million Input Tokens
This is where most production workloads should start. Models in this range handle classification, extraction, simple code generation, summarization, and formatting. The quality gap between a $0.20/MTok model and a $3.00/MTok model is smaller than the 15x price difference suggests.
DeepSeek V3.2: $0.28/$0.42
The price-performance champion. Competitive with mid-tier models on coding and general benchmarks at roughly a ninth of GPT-5.4's input price. Cache hits drop input to $0.028/MTok. Thinking mode (reasoner) costs the same. 128K context. The main trade-off: DeepSeek's API routes through China, which matters for some compliance requirements.
Gemini 2.5 Flash-Lite: $0.10/$0.40
The cheapest option from a major US provider. Google offers a free tier for prototyping. Context caching at $0.01/MTok makes repeated prompts nearly free. Performance is solid for simple tasks but drops off on complex reasoning compared to Flash ($0.30/$2.50) or Pro ($1.25/$10.00).
GPT-4.1 Nano: $0.10/$0.40
OpenAI's cheapest current-gen model. Good enough for linting, formatting, classification, and short code completions. Not competitive on multi-step reasoning or complex generation. Cached input at $0.01/MTok. Batch at $0.05/$0.20. For high-volume simple tasks, this is the OpenAI pick.
Mistral Small 3.1: $0.20/$0.60
A strong small model from a European provider (data stays in EU if that matters). Competitive with GPT-4.1 Mini on most benchmarks. No prompt caching discount, but the base price is already low. Good default for teams that want a non-US, non-China option.
| Model | Input/MTok | Output/MTok | Cache Hit Input | Best For |
|---|---|---|---|---|
| DeepSeek V3.2 | $0.28 | $0.42 | $0.028 | General coding, reasoning |
| DeepSeek V4 | $0.30 | $0.50 | $0.03 | Latest DeepSeek flagship |
| DeepSeek R1 | $0.55 | $2.19 | $0.14 | Budget reasoning (think + answer) |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | $0.01 | High-volume simple tasks |
| GPT-5.4 Nano | $0.20 | $1.25 | $0.02 | Cheap OpenAI flagship-family |
| GPT-4.1 Nano | $0.10 | $0.40 | $0.01 | Classification, formatting |
| Grok 4.1 Fast | $0.20 | $0.50 | N/A | Cheap xAI option |
| Mistral Small 3.1 | $0.20 | $0.60 | N/A | EU data residency |
| o4-mini | $0.55 | $2.20 | $0.055 | Budget reasoning model |
| Llama 3.1 8B (Groq) | $0.05 | $0.08 | N/A | Lowest possible cost |
| Amazon Nova Lite | $0.06 | $0.24 | N/A | AWS-native workloads |
The surprise in this tier: OpenAI's reasoning model o4-mini at $0.55/$2.20 is cheaper than many general-purpose mid-tier models. It replaced o3-mini and outperforms it on most benchmarks. For tasks that need structured reasoning on a budget, o4-mini undercuts everything except DeepSeek's reasoner mode.
Hidden cost: reasoning tokens
Reasoning models (o4-mini, o3, DeepSeek Reasoner) generate internal chain-of-thought tokens billed as output. A short visible answer might use 2,000 output tokens, but the model burned 10,000 reasoning tokens behind the scenes. Your actual output cost can be 5-10x what the visible response length implies. Factor this into any cost comparison.
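The hidden-token effect is easy to quantify. A rough sketch, with illustrative token counts (the 2K/10K split is an assumption for demonstration, not a measured figure):

```python
def billed_output_cost(visible_tokens: int, hidden_reasoning_tokens: int,
                       output_per_mtok: float) -> float:
    """Reasoning models bill hidden chain-of-thought at the output rate."""
    return (visible_tokens + hidden_reasoning_tokens) / 1e6 * output_per_mtok

# o4-mini at $2.20/MTok output: a 2K-token visible answer that burned
# 10K reasoning tokens behind the scenes costs 6x the naive estimate.
actual = billed_output_cost(2_000, 10_000, 2.20)  # $0.0264
naive = billed_output_cost(2_000, 0, 2.20)        # $0.0044
```

When comparing a reasoning model against a general-purpose one, apply this multiplier to the reasoning model's output column before trusting the sticker price.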
Mid Tier: $1-5 per Million Input Tokens
The workhorses. These models handle complex code generation, multi-step reasoning, long-form writing, and most agentic workflows. The price difference between $1.25/MTok (Gemini 2.5 Pro) and $3.00/MTok (Claude Sonnet 4.6) matters at scale but both are viable for production.
| Model | Input/MTok | Output/MTok | Context Window | Notes |
|---|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | Cheapest Anthropic model worth using for code |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M | Best value frontier-class model |
| GPT-5 | $1.25 | $10.00 | 128K | OpenAI base flagship |
| GPT-5.2 | $1.75 | $14.00 | 128K | Retiring June 2026 |
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M | Latest Google flagship |
| Mistral Large 3 | $2.00 | $6.00 | 128K | Cheapest output in mid-tier |
| o3 | $2.00 | $8.00 | 200K | Best reasoning per dollar |
| GPT-4.1 | $2.00 | $8.00 | 1M | Workhorse for code, instruction following |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | Best coding model per many benchmarks |
Two models stand out. Gemini 2.5 Pro at $1.25/$10.00 offers frontier quality at nearly half the price of Claude Sonnet 4.6. Its 1M context window matches Claude's and costs less. The catch: Gemini is newer for coding agent use cases and the ecosystem (tool use, function calling) is less mature than Anthropic's or OpenAI's.
Mistral Large 3 at $2.00/$6.00 has the cheapest output in this tier. If your workload is output-heavy (long-form generation, detailed code), Mistral Large saves 60% on output versus Claude Sonnet ($6 vs $15) and 25% versus o3 ($6 vs $8).
Inference Providers: Same Model, Different Price
Open-source models (Llama, Qwen, Mixtral, DeepSeek) run on multiple providers. The same model at the same quality can cost 2-5x more depending on where you host it.
| Provider | Input/MTok (Llama 4 Scout) | Output/MTok | Speed |
|---|---|---|---|
| Groq | $0.11 | $0.34 | Fastest TTFB |
| Together AI | $0.27 | $0.85 | Fast |
| Fireworks AI | $0.27 | $0.85 | Fast |
| Cerebras | Contact | Contact | Highest throughput |
Groq: Lowest Latency
Custom LPU silicon built for fast inference. Llama 3.1 8B at $0.05/$0.08 is the cheapest hosted inference available. 241+ tok/s on 70B models. Trade-off: limited model selection compared to Together or Fireworks. Prompt caching available on select models at 50% off.
Together AI: Widest Selection
15+ open-source models including Llama, DeepSeek, Qwen, Mistral, and Gemma. Competitive pricing on mid-size models. Cheaper than Fireworks on 39 of 80 shared models. Best for teams that want one API for many open-source models.
Fireworks AI: Flexible Deployment
Serverless starting at $0.10/MTok for small models. On-demand GPUs from $2.90/hr (A100) to $9.00/hr (B200) for dedicated capacity. Good middle ground between pure serverless and self-hosting.
Cerebras: Throughput King
Wafer-scale engine delivers 1,800 tok/s on Llama 3.1 8B and 969 tok/s on 405B. Free tier available. Enterprise pricing on request. Best for batch workloads where throughput matters more than per-token cost.
For most teams: start with Groq or Together AI for open-source models. If you need dedicated capacity or fine-tuned models, look at Fireworks on-demand or self-hosting.
Batch Discounts and Prompt Caching
Raw per-token pricing is the sticker price. Actual cost depends on whether you use the discount stack every major provider offers.
| Provider | Batch Discount | Cache Hit Discount | Combined Max Savings |
|---|---|---|---|
| Anthropic | 50% off (24h async) | 90% off input | 95% |
| OpenAI | 50% off (24h async) | 90% off input | 95% |
| Google | 50% off (Flex mode) | 90% off input | 95% |
| DeepSeek | N/A | 90% off input | 90% |
| Mistral | N/A | N/A | Base price only |
| Groq | N/A | 50% off (select models) | 50% |
The math on stacking. Take Claude Sonnet 4.6 at $3.00/MTok input:
| Configuration | Price per MTok | Savings |
|---|---|---|
| Standard | $3.00 | 0% |
| Prompt cache hit | $0.30 | 90% |
| Batch only | $1.50 | 50% |
| Batch + cache hit | $0.15 | 95% |
OpenAI's GPT-5.4 follows the same pattern. Standard at $2.50, cached at $0.25, batch at $1.25, batch + cached at $0.13. The effective price of a frontier model with full discount stack is cheaper than the sticker price of most budget models.
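The stack is multiplicative on the input price. A minimal sketch using the 90%/50% discounts from the table above (availability of each discount varies by provider, per the table):

```python
def effective_input_price(base_per_mtok: float, cached: bool = False,
                          batched: bool = False) -> float:
    """Input price after stacking a 90%-off cache hit and a 50%-off batch job."""
    price = base_per_mtok
    if cached:
        price *= 0.10  # cache hit: pay 10% of the standard input price
    if batched:
        price *= 0.50  # batch API: 50% off
    return price

# Claude Sonnet 4.6 at $3.00/MTok input:
effective_input_price(3.00)                             # $3.00 standard
effective_input_price(3.00, cached=True)                # $0.30
effective_input_price(3.00, batched=True)               # $1.50
effective_input_price(3.00, cached=True, batched=True)  # $0.15
```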
DeepSeek's stealth advantage
DeepSeek V3.2 cache hits cost $0.028/MTok. DeepSeek V4 cache hits cost $0.03/MTok. Among flagship-class models, nothing else comes close; only Gemini 2.5 Flash-Lite's $0.01/MTok cached input is lower, and it is a far smaller model. If your workload has high cache hit rates (agent sessions with repeated system prompts and file context), DeepSeek's effective input cost is essentially free.
Quality-Adjusted Cost: Cheap but Bad = Expensive
A $0.05/MTok model that fails 40% of tasks and needs 3 retries per success costs more than a $3.00/MTok model that succeeds on the first attempt. The metric that matters is cost per completed task, not cost per token.
| Model | Price/MTok (blended) | Avg Tokens/Attempt | Success Rate | Effective Cost/Task |
|---|---|---|---|---|
| Llama 3.1 8B | $0.065 | 1.5M | ~35% | $0.28 (2.9 attempts) |
| GPT-4.1 Nano | $0.25 | 1.2M | ~55% | $0.55 (1.8 attempts) |
| DeepSeek V3.2 | $0.35 | 1M | ~75% | $0.47 (1.3 attempts) |
| Gemini 2.5 Flash | $1.40 | 800K | ~80% | $1.40 (1.25 attempts) |
| Claude Sonnet 4.6 | $9.00 | 600K | ~92% | $5.87 (1.09 attempts) |
| Claude Opus 4.6 | $15.00 | 500K | ~95% | $7.89 (1.05 attempts) |
DeepSeek V3.2 at $0.47 per completed task is the value sweet spot. It is 12x cheaper per completed task than Claude Sonnet, with a 75% single-attempt success rate that is good enough for most automated workflows with retry logic. Claude Sonnet's 92% success rate matters when retries are expensive (user-facing, time-sensitive, or when errors compound in multi-step agents).
The cheapest models (Llama 3.1 8B, GPT-4.1 Nano) look cheap per token but burn through tokens on retries. They are cost-effective only for tasks where they reliably succeed on the first attempt: classification, extraction, formatting, simple completions. Using them for complex code generation is a false economy.
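The retry math reduces to a short formula: with independent retries at success probability p, the expected number of attempts is 1/p. A sketch, where the per-attempt token counts are illustrative assumptions for agentic coding tasks, not measured values:

```python
def cost_per_completed_task(blended_per_mtok: float, tokens_per_attempt: float,
                            success_rate: float) -> float:
    """Expected cost to reach one success, assuming independent retries."""
    cost_per_attempt = tokens_per_attempt / 1e6 * blended_per_mtok
    return cost_per_attempt / success_rate  # expected attempts = 1 / success_rate

# DeepSeek V3.2: $0.35/MTok blended, ~1M tokens per attempt, 75% success
deepseek = cost_per_completed_task(0.35, 1_000_000, 0.75)  # ~= $0.47
# Claude Sonnet 4.6: $9.00/MTok blended, ~600K tokens per attempt, 92% success
sonnet = cost_per_completed_task(9.00, 600_000, 0.92)      # ~= $5.87
```

The formula also shows why retries punish weak models disproportionately: halving the success rate doubles the expected cost even before accounting for extra tokens per attempt.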
Local Inference via Ollama: $0 Per Token
Local inference eliminates per-token costs entirely. You pay for hardware upfront and electricity ongoing. Ollama, vLLM, and llama.cpp make self-hosting straightforward.
| Factor | Local (Mac Studio M4) | API (DeepSeek V3.2) | API (GPT-5.4) |
|---|---|---|---|
| Hardware cost | $4,000-8,000 | $0 | $0 |
| Monthly electricity | ~$15 | $0 | $0 |
| Cost at 10M tok/month | $15 + amortized HW | ~$4 | ~$35 |
| Cost at 100M tok/month | $15 + amortized HW | ~$40 | ~$350 |
| Cost at 1B tok/month | $15 + amortized HW | ~$400 | ~$3,500 |
| Best model available | Llama 3.1 70B | DeepSeek V3.2 | GPT-5.4 |
| Quality ceiling | 70-85% of frontier | ~85% of frontier | Frontier |
Break-even is roughly 50 million tokens per month, or about 1.7M tokens per day. Below that, APIs are cheaper. Above that, local inference wins on raw cost, though you lose access to frontier-quality models. A Mac Studio running Llama 3.1 70B under full GPU load draws about 60W, costing under $15/month in most US electricity markets.
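The break-even point can be sketched as follows. The 36-month hardware amortization period and the ~$3.50/MTok blended frontier-API price are assumptions for illustration (the blended price follows from the GPT-5.4 column in the table above):

```python
def local_monthly_cost(hardware_usd: float, electricity_usd: float = 15.0,
                       amortize_months: int = 36) -> float:
    """Monthly cost of self-hosting: amortized hardware plus electricity."""
    return hardware_usd / amortize_months + electricity_usd

def breakeven_tokens(hardware_usd: float, api_per_mtok: float) -> float:
    """Monthly token volume at which local cost equals API cost."""
    return local_monthly_cost(hardware_usd) / api_per_mtok * 1e6

# $6,000 Mac Studio vs a ~$3.50/MTok blended frontier API:
breakeven_tokens(6_000, 3.50)  # ~= 52M tokens/month, near the ~50M figure
```

Against a cheap API like DeepSeek (~$0.40/MTok blended), the break-even volume is roughly 9x higher, which is why local inference competes with frontier APIs long before it competes with budget ones.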
The practical limitation: local models top out at 70B parameters on consumer hardware (120B with quantization tricks). Tasks that require frontier reasoning (complex architecture, nuanced debugging) still need API calls to Claude Opus, GPT-5.4, or Gemini 2.5 Pro. The hybrid approach works best: route simple tasks to local models, escalate hard problems to API.
How Morph Compact Makes Every LLM Cheaper
Every optimization above reduces the price per token. Morph Compact reduces the number of tokens. These are complementary strategies, and they multiply.
Compact compresses LLM context by 50-70% at 33,000 tokens per second. The compression is extractive: every surviving line is character-for-character from the original input. No paraphrasing, no synthesis, no hallucination risk. The model decides which lines are relevant and discards the rest.
| Model | Standard Cost (100K input) | After 60% Compaction (40K input) | Savings |
|---|---|---|---|
| Llama 3.1 8B (Groq) | $0.005 | $0.002 | $0.003 |
| DeepSeek V3.2 | $0.028 | $0.011 | $0.017 |
| Gemini 2.5 Flash | $0.030 | $0.012 | $0.018 |
| GPT-5.4 Nano | $0.020 | $0.008 | $0.012 |
| GPT-5.4 | $0.250 | $0.100 | $0.150 |
| Claude Sonnet 4.6 | $0.300 | $0.120 | $0.180 |
| Claude Opus 4.6 | $0.500 | $0.200 | $0.300 |
| GPT-5.4 Pro | $3.000 | $1.200 | $1.800 |
The savings per request look modest at the cheap end. But coding agents send 20-100 requests per session, each carrying accumulated context. A 30-turn agent session on Claude Sonnet 4.6 with 200K tokens of context per turn sends 6M input tokens, or $18.00 at $3.00/MTok. Compacted by 60%, that drops to 2.4M tokens and $7.20, saving $10.80 per session.
At 10 sessions per day, that is $108/day saved, or $3,240/month. The savings scale linearly with token volume and model cost. Compaction is most valuable on expensive models (where each saved token is worth more) and on long-running agent sessions (where context accumulates the most).
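The session-level arithmetic reduces to one multiplication. A sketch, using the same 30-turn, 200K-per-turn session shape assumed above:

```python
def session_savings(turns: int, context_tokens_per_turn: int,
                    input_per_mtok: float, compaction: float = 0.60) -> float:
    """Dollars saved per agent session by compacting input context."""
    total_input_tokens = turns * context_tokens_per_turn
    return total_input_tokens / 1e6 * input_per_mtok * compaction

per_session = session_savings(30, 200_000, 3.00)  # $10.80 on Claude Sonnet 4.6
per_month = per_session * 10 * 30                 # $3,240 at 10 sessions/day
```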
| Strategy | Sonnet 4.6 Input | GPT-5.4 Input | DeepSeek V3.2 Input |
|---|---|---|---|
| Base price | $3.00 | $2.50 | $0.28 |
| + Prompt caching | $0.30 | $0.25 | $0.028 |
| + Batch API | $0.15 | $0.13 | N/A |
| + 60% compaction | $0.06 effective | $0.05 effective | $0.011 effective |
Claude Sonnet 4.6, a frontier coding model, at $0.06 effective per MTok of original context. That is cheaper than the sticker price of Llama 3.1 8B on Groq ($0.05/$0.08), while delivering 92% first-attempt success rate versus 35%. The discount stack inverts the pricing hierarchy.
Frequently Asked Questions
What is the cheapest LLM API in 2026?
The absolute cheapest is Llama 3.1 8B on Groq at $0.05/$0.08 per million tokens. Amazon Nova Lite at $0.06/$0.24 is close. Among first-party providers, DeepSeek V3.2 at $0.28/$0.42 offers the best quality-to-cost ratio. Gemini 2.5 Flash-Lite at $0.10/$0.40 is cheapest from a major US provider. GPT-4.1 Nano at $0.10/$0.40 is cheapest from OpenAI.
How much does GPT-5.4 cost?
GPT-5.4 costs $2.50 input / $15.00 output per million tokens. Cached input at $0.25. Batch at $1.25/$7.50. GPT-5.4 Mini is $0.75/$4.50. GPT-5.4 Nano is $0.20/$1.25. GPT-5.4 Pro (the expensive one) is $30.00/$180.00.
Is DeepSeek cheaper than ChatGPT?
Yes. DeepSeek V3.2 at $0.28/$0.42 is 9x cheaper on input and 36x cheaper on output compared to GPT-5.4. DeepSeek V4 at $0.30/$0.50 is the latest flagship and still 8x cheaper on input. Cache hits at $0.028-$0.03/MTok are the cheapest cached input among flagship-class models. Quality is competitive for coding and general tasks but below frontier models on the hardest reasoning benchmarks. DeepSeek R1 at $0.55/$2.19 offers chain-of-thought reasoning at a fraction of o3's cost ($2.00/$8.00).
What is the cheapest LLM for coding?
Dedicated coding models: Qwen3 Coder 480B A35B at $0.22/$0.90. Codestral at $0.30/$0.90. DeepSeek V3.2 in reasoner mode at $0.28/$0.42. General models that code well at low cost: GPT-4.1 Mini at $0.20/$0.80, o4-mini at $0.55/$2.20. For the best coding quality, Claude Sonnet 4.6 at $3.00/$15.00 leads most benchmarks but costs 10x more.
How do batch API and prompt caching reduce costs?
Batch APIs (OpenAI, Anthropic, Google) process requests asynchronously within 24 hours at 50% off. Prompt caching stores repeated context so subsequent requests pay 10% of the standard input price (Anthropic, OpenAI, Google). These stack: batch + cache hit gives 95% savings on input. DeepSeek's cache hit price of $0.028/MTok is the cheapest cached input among flagship-class models.
Is running a local LLM with Ollama cheaper than an API?
Above 50 million tokens per month, yes. A Mac Studio M4 running Llama 3.1 70B costs ~$15/month in electricity. Below 50M tokens, APIs are cheaper because you avoid the $4,000-8,000 hardware investment. Quality ceiling is 70-85% of frontier models. Best used in a hybrid setup: local for simple tasks, API for complex reasoning.
How does Morph Compact make LLMs cheaper?
Compact compresses context by 50-70% before sending it to any LLM. A 100K prompt compacted to 40K costs 60% less. The compression is extractive (verbatim lines, no paraphrasing) with zero hallucination risk. Combined with prompt caching and batch processing, it brings Claude Sonnet 4.6's effective input cost to $0.06/MTok, cheaper than the sticker price of most budget models. See how Compact works.
Cut your LLM API costs by 50-70%
Morph Compact compresses context before it reaches any model. Fewer tokens, same quality, lower bill. Works with every provider: OpenAI, Anthropic, Google, DeepSeek, or your own self-hosted models.