LLM API Providers (2026): 12 APIs Compared by Price per 1M Tokens, Rate Limits, and Context

Pricing for 12 LLM API providers, verified June 28, 2026: GPT-5.5 $5/$30, Claude Opus 4.8 $5/$25, Gemini 3.1 Pro $2/$12, GLM-5.2 $1.40/$4.40, DeepSeek V4 Flash $0.14/$0.28 per 1M tokens. Plus rate-limit tiers, OpenAI-compatibility matrix, free tiers, and SWE-bench scores.

June 28, 2026 · 1 min read
LLM API Providers (2026): 12 APIs Compared by Price per 1M Tokens, Rate Limits, and Context
Quick answer

The cheapest LLM API is DeepSeek V4 Flash at $0.14/$0.28 per 1M tokens. The best value is MiniMax M3 ($0.60/$2.40), the cheapest model scoring above 80% on SWE-bench Verified. The top available coding models are Claude Opus 4.8 (88.6%, $5/$25) and GPT-5.5 (88.7%, $5/$30), after Anthropic's Fable 5 (95.0%) was suspended in June 2026. Most production systems do not pick one. They route each request to the cheapest model that can handle it. See the Morph model router and the model list at Morph models.

What Is an LLM API

An LLM API is an HTTP interface to a hosted large language model. You send a prompt (plus optional tools, images, or files) to an endpoint such as /v1/chat/completions and receive generated tokens back, billed per million tokens (MTok) of input and output. It replaces running model weights on your own GPUs with a metered service: no infrastructure, no model updates to manage, pay only for tokens used.

Every provider on this page works that way. The differences are price, rate limits, context window, and model quality, and they are large. As of June 28, 2026, output tokens cost between $0.28/M (DeepSeek V4 Flash) and $30/M among general flagships (GPT-5.5), with OpenAI's GPT-5.5-pro tier reaching $180/M. Five models sit within 0.4 points of each other on SWE-bench Verified (80.2 to 80.6 percent) while spanning a 5x price range.

107x
Output price spread ($0.28 to $30 per MTok, general flagships)
1M
Context window now standard on flagships
88.6%
Top available SWE-bench Verified (Claude Opus 4.8)
$0
GLM-4.7-Flash API price (free)
All prices verified June 28, 2026

Every number below comes from the provider's official pricing or docs page as of June 28, 2026. Updated June 28: Z.AI refreshed to GLM-5.2 ($1.40/$4.40), Alibaba to Qwen3.7 Max and Qwen3.6 Plus, MiniMax M3 to its standard $0.60/$2.40, GPT-5.6's gated preview added, and Claude Fable 5's June 12 suspension reflected across the pricing and benchmark tables. LLM prices change quarterly; check the linked provider before committing to a volume contract.

Pricing Table: Flagship and Cheapest Model per Provider

Prices are $ per 1M tokens, input / output, standard (non-batch) tier. Context is the maximum input window.

ProviderModelInput/MTokOutput/MTokContext
OpenAIGPT-5.5$5.00$30.001M
AnthropicClaude Opus 4.8$5.00$25.001M
OpenAIGPT-5.4$2.50$15.001M
AnthropicClaude Sonnet 4.6$3.00$15.001M
OpenAIGPT-5.3-Codex$1.75$14.00400K
GoogleGemini 3.1 Pro (preview)$2.00$12.001M
GoogleGemini 3.5 Flash$1.50$9.00n/a
AnthropicClaude Haiku 4.5$1.00$5.00200K
OpenAIGPT-5.4-mini$0.75$4.50400K
Z.AIGLM-5.2$1.40$4.401M
MoonshotKimi K2.6$0.95$4.00256K
AlibabaQwen3.7 Max$1.25$3.751M
AlibabaQwen3.6 Plus$0.50$3.001M
GoogleGemini 3 Flash (preview)$0.50$3.00n/a
MiniMaxM3 (standard)$0.60$2.401M
GoogleGemini 3.1 Flash-Lite$0.25$1.50n/a
OpenAIGPT-5.4-nano$0.20$1.25n/a
DeepSeekV4 Pro$0.435$0.871M
DeepSeekV4 Flash$0.14$0.281M
Morphmorph-dsv4flash (DeepSeek V4 Flash, 16-bit)$0.139$0.2781M
Z.AIGLM-4.7-FlashFreeFreen/a

Notes that change effective cost: Gemini 3.1 Pro rises to $4/$18 for prompts over 200K tokens. MiniMax M3's standard rate is $0.60/$2.40, with a launch promo halving it to $0.30/$1.20. Anthropic charges the same per-token rate at any context length (a 900K-token request bills like a 9K one), but models from Opus 4.7 onward use a new tokenizer that can produce up to 35% more tokens for the same text, which inflates effective per-request cost in cross-provider comparisons.

Batch and cache discounts: OpenAI cached input is 10x cheaper ($0.50/M on GPT-5.5). Anthropic batch is 50% off and cache reads are 0.1x base input. Gemini batch is half price. GLM-5.2 cached input is $0.26/M. DeepSeek cache hits drop input to about $0.0036/M, roughly 40x cheaper. If your workload re-sends the same system prompt or file context, the cache column matters more than the headline price.

Provider-by-Provider Breakdown

OpenAI

GPT-5.5 ($5/$30, 1M context, 128K max output) is the GA flagship and scores 88.7% on SWE-bench Verified. GPT-5.4 ($2.50/$15) covers most workloads at half the price, and GPT-5.4-nano ($0.20/$1.25) handles classification and extraction. GPT-5.3-Codex ($1.75/$14, 400K context) was the coding-tuned line and is now superseded by gpt-5.5 in Codex; on Scale's standardized SWE-bench Pro public leaderboard, the gpt-5.4 (xHigh) variant leads at 59.10%. Cached input is billed at 10% of base. GPT-5.6 (variants Sol, Terra, Luna) entered limited preview on June 26, 2026, gated to about 20 government-approved companies at U.S. government request; it is not on the public API pricing page, with general availability promised in the coming weeks. See Codex pricing for the subscription side.

ModelInput/MTokCached InOutput/MTokContext
GPT-5.5$5.00$0.50$30.001M
GPT-5.5-pro$30.00n/a$180.00n/a
GPT-5.4$2.50$0.25$15.001M
GPT-5.3-Codex$1.75$0.175$14.00400K
GPT-5.4-mini$0.75$0.075$4.50400K
GPT-5.4-nano$0.20$0.02$1.25n/a

Anthropic (Claude)

Claude Opus 4.8 ($5/$25, 1M context) is Anthropic's GA flagship: 88.6% on SWE-bench Verified and 69.2% on SWE-bench Pro, the highest of any currently available model. Fable 5 (95.0%) and Mythos 5 Preview (93.9%) scored higher but were suspended on June 12, 2026 under a U.S. export-control directive and are disabled for all customers (see the notice above). An Opus 4.8 fast mode (research preview) delivers about 2.5x faster output at $10/$50, double the standard rate. All current 1M-context Claude models charge no long-context surcharge, but models from Opus 4.7 onward use a new tokenizer that can produce up to 35% more tokens for the same text. Full breakdown: Anthropic API pricing.

ModelInput/MTokOutput/MTokContextMax Output
Claude Fable 5 (suspended)$10.00$50.001M128K
Claude Opus 4.8 / 4.7 / 4.6$5.00$25.001M128K
Claude Sonnet 4.6$3.00$15.001M64K
Claude Haiku 4.5$1.00$5.00200K64K

Google (Gemini)

Gemini 3.1 Pro (preview) is the cheapest frontier-tier API at $2/$12 for prompts up to 200K tokens ($4/$18 above) and scores 80.6% on SWE-bench Verified, with a 1,048,576-token context. Gemini 3.5 Flash ($1.50/$9) is the stable workhorse Google positions for agentic and coding tasks. Gemini 3 Flash (preview, $0.50/$3.00) and Gemini 3.1 Flash-Lite ($0.25/$1.50) are the budget tiers. Batch runs at half price across the line.

DeepSeek

The price floor for frontier-adjacent quality. V4 Flash ($0.14/$0.28) and V4 Pro ($0.435/$0.87) both ship a 1M-token context. V4 (released April 24, 2026) is open weights: V4 Pro is 1.6T total / 49B active parameters, V4 Flash 284B / 13B. DeepSeek-V4-Pro-Max scores 80.6% on SWE-bench Verified, the highest open-weights entry, tied with Gemini 3.1 Pro. The API accepts both OpenAI and Anthropic request formats. Deep dive: DeepSeek V4.

Where you run DeepSeek changes its output. Most serverless providers quantize activations to fp8 to cut cost, which degrades quality away from the reference weights. Morph serves open-source models (DeepSeek V4, GLM-5.2, Qwen, MiniMax) with 16-bit (bf16) activations and does not quantize them, so output matches the released weights. morph-dsv4flash is $0.139/M input and $0.278/M output, and morph-glm52-744b is $1.1/$4.1 with a 1M context. For coding agents specifically, Morph adds codegen-tuned speculative decoding and custom low-level inference kernels, so it is the fastest and highest-fidelity way to run open-source models for codegen. See Morph models and pricing.

Moonshot (Kimi)

Kimi K2.6 ($0.95/$4.00) is Moonshot's open-weight flagship (1T parameters, 32B active) with a 256K-token context and 80.2% on SWE-bench Verified. It does automatic context caching. Kimi K2.5 (76.8%) is the prior version.

Z.AI (GLM)

GLM-5.2 ($1.40/$4.40, cached input $0.26/M, 1M context, 128K max output) is Z.AI's open-weights flagship (744B MoE, about 40B active), live as a standalone pay-per-token API since June 16, 2026. Z.AI published no official SWE-bench number at launch; the third-party llm-stats aggregate puts it at 62.1% on SWE-bench Pro. GLM-4.7-Flash and GLM-4.5-Flash are free, the only free named text models from a major provider. Morph serves morph-glm52-744b at $1.1/$4.1 with 16-bit activations.

MiniMax

MiniMax M3 ($0.60/$2.40 standard, launch promo $0.30/$1.20, 1M context, 512K max output) scores 80.5% on SWE-bench Verified and is open weights (released June 1, 2026). At standard pricing it is the cheapest 80%+ SWE-bench model on a hosted API, about one-tenth of Opus 4.8's output price for a score 8 points lower. MiniMax M2.7 ($0.18/$0.72, 200K context) is the prior generation.

Alibaba (Qwen)

Qwen3.7 Max ($1.25/$3.75, 1M context, 65,536 max output) is Alibaba's current proprietary flagship at 80.4% on SWE-bench Verified. Qwen3.6 Plus ($0.50/$3.00) scores 78.8%, and Qwen3.6-27B is open weights under Apache 2.0 (released April 22, 2026, 77.2%). New Model Studio users get 1M free tokens valid 90 days; batch is half price.

Aggregators and Cloud Resellers

OpenRouter fronts hundreds of models behind one OpenAI-compatible key, useful for evaluation and failover at a small markup. AWS Bedrock and Azure / Microsoft Foundry resell first-party models (Claude Opus 4.8, GPT-5.5) with VPC integration, compliance certifications, and consolidated cloud billing. If you need a routing layer rather than a reseller, see LLM gateways and LLM routers.

Morph (specialized)

Morph serves task-specific models rather than general chat: Fast Apply merges code edits at 10,500 tok/s, and WarpGrep does agentic codebase search ($0 for 100k requests, $1/1M on Pro). Covered in Specialized APIs below.

LLM API Rate Limits by Provider and Tier

Rate limits decide whether your launch survives traffic, and almost no comparison page publishes them. Here is what each provider enforces, from official docs as of June 28, 2026.

Anthropic: spend-based tiers, cache-aware token counting

Tiers advance automatically by cumulative credit purchase: Tier 1 at $5, Tier 2 at $40, Tier 3 at $200, Tier 4 at $400. Monthly spend caps are $500 / $500 / $1,000 / $200,000. Limits are per model class, measured in requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM). Cache reads do not count toward ITPM: with a 2M ITPM limit and an 80% cache hit rate you can process 10M total input tokens per minute.

Model classTier 1 (RPM / ITPM / OTPM)Tier 4 (RPM / ITPM / OTPM)
Claude Opus 4.x50 / 500K / 80K4,000 / 10M / 800K
Claude Sonnet 4.x50 / 30K / 8K4,000 / 2M / 400K
Claude Haiku 4.550 / 50K / 10K4,000 / 4M / 800K

Counterintuitive: Opus gets 16x the Tier 1 input throughput of Sonnet (500K vs 30K ITPM). If you are rate-limit-bound on a new Anthropic account, the expensive model is also the one you can call hardest.

OpenAI: spend-unlocked tiers, per-model limits

TierQualificationMonthly usage cap
FreeAllowed geography$100
Tier 1$5 paid$100
Tier 2$50 paid$500
Tier 3$100 paid$1,000
Tier 4$250 paid$5,000
Tier 5$1,000 paid$200,000

Per-model RPM/TPM numbers are published on each model's page in the OpenAI console rather than a single table, and GPT-5.5 carries a separate limit for long-context requests.

DeepSeek: concurrency caps instead of token budgets

DeepSeek publishes no RPM/TPM limits. It caps concurrent requests: 2,500 in flight on V4 Flash, 500 on V4 Pro. For batch-style pipelines this is friendlier than token-per-minute budgets; for bursty single requests it makes no difference.

Others

Moonshot, Z.AI, MiniMax, and Alibaba scale limits with account spend and publish them in their consoles rather than public tables. MiniMax sells a priority tier for latency-sensitive traffic. Bedrock and Azure enforce cloud-account-level quotas you raise through support tickets.

OpenAI-Compatibility Matrix

Most providers accept OpenAI-format /v1/chat/completions requests, so switching is a base_url and api_key change. The exceptions matter when you build against provider-specific features.

ProviderOpenAI chat formatAnthropic formatStreamingTool calls
OpenAINativeNoYesYes
AnthropicVia OpenAI SDK compat layerNativeYesYes
Google GeminiCompat endpointNoYesYes
DeepSeekYesYesYesYes
Moonshot (Kimi)YesNoYesYes
Z.AI (GLM)YesNoYesYes
MiniMaxYesNoYesYes
Alibaba (Qwen)Compat modeNoYesYes
OpenRouterNative (aggregator)NoYesYes
MorphYesNoYesn/a (task models)

DeepSeek is the only first-party provider that natively speaks both OpenAI and Anthropic formats, which means it drops into Claude-Code-style agents without a proxy. If you point a coding agent at a custom provider, the agent must support it: Codex, for example, configures custom providers in config.toml via model_providers entries with base_url, env_key, and wire_api = "responses" (the only wire API it supports), covered in Codex provider configuration.

Benchmarks vs Price: What You Get per Dollar

SWE-bench Verified (real GitHub issues, verified fixes) is the most cited coding benchmark. Scores below are from the llm-stats tracker, June 2026; prices are official API rates.

ModelSWE-bench VerifiedInput/MTokOutput/MTokValue ($/MTok out per point)
Claude Fable 5 (suspended)95.0%$10.00$50.00suspended
Claude Mythos 5 Preview (suspended)93.9%restrictedrestrictedn/a
GPT-5.588.7%$5.00$30.00$0.34
Claude Opus 4.888.6%$5.00$25.00$0.28
Claude Opus 4.787.6%$5.00$25.00$0.29
DeepSeek V4 Pro Max80.6%open weightsopen weightsopen weights
Gemini 3.1 Pro80.6%$2.00$12.00$0.15
MiniMax M380.5%$0.60$2.40$0.03
Qwen3.7 Max80.4%$1.25$3.75$0.05
Kimi K2.680.2%$0.95$4.00$0.05
The 80% club spans a 5x price range

Five models score between 80.2% and 80.6% on SWE-bench Verified: DeepSeek V4 Pro Max, Gemini 3.1 Pro, MiniMax M3, Qwen3.7 Max, and Kimi K2.6. Within that 0.4-point band, standard output prices run from $2.40/M (MiniMax M3) to $12/M (Gemini 3.1 Pro), and DeepSeek V4 Pro Max ships it in open weights. The next 8 points (GPT-5.5 at 88.7%, Opus 4.8 at 88.6%) cost $25 to $30/M output, a 10x step. Whether those points are worth it depends on whether your tasks live in the gap.

On the harder SWE-bench Pro (1,865 tasks, 41 professional repos), Scale's standardized public leaderboard tops out at gpt-5.4 (xHigh) 59.10%, Claude Opus 4.6 (thinking) 51.90%, and Gemini 3.1 Pro (thinking) 46.10%. Vendor self-reported aggregates run much higher (llm-stats lists Claude Opus 4.8 at 69.2% and GLM-5.2 at 62.1%), so compare scores only within the same harness.

Read SWE-bench numbers as vendor claims

The SWE-bench Verified figures above are vendor self-reported. The llm-stats tracker lists 102 self-reported results and 0 independently verified, so treat the rankings as vendor claims. Scale's SWE-bench Pro public set runs on a single standardized harness (Pass@1), which is why its numbers are much lower and more directly comparable across models.

LLM API Free Tiers: Exact Amounts

ProviderFree offerLimit
Z.AIGLM-4.7-Flash, GLM-4.5-FlashFree models, no token charge
Alibaba Model Studio1M tokens for new users90-day validity
OpenAI APIFree tier in allowed regions$100/month usage cap
OpenAI CodexIncluded with ChatGPT FreeLowest 5-hour-window message limits
Google Gemini APIFree development tierReduced rate limits
AnthropicNoneTier 1 starts at $5 credit
Morph WarpGrep100k requestsThen $1 per 1M requests on Pro

For experimentation, the practical order is: GLM Flash models (unlimited free), Qwen's 1M tokens, then Gemini's free tier. For coding agents specifically, Codex CLI works with a free ChatGPT sign-in, with the lowest usage limits.

Cost Calculator: Real Workloads

Per-token prices mean nothing until mapped to usage. Two reference workloads, 30-day months, no cache discounts applied (caching reduces all of these).

Coding agent: 50M input / 5M output tokens per day

ModelDaily costMonthly cost
GPT-5.5$400$12,000
Claude Opus 4.8$375$11,250
Claude Sonnet 4.6$225$6,750
GPT-5.4$200$6,000
Gemini 3.1 Pro (≤200K prompts)$160$4,800
GLM-5.2$92$2,760
MiniMax M3$42$1,260
Qwen3.6 Plus$40$1,200
DeepSeek V4 Flash$8.40$252

Support chatbot: 20M input / 5M output tokens per day

ModelDaily costMonthly cost
Claude Haiku 4.5$45.00$1,350
GPT-5.4-mini$37.50$1,125
Gemini 3 Flash (preview)$25.00$750
Gemini 3.1 Flash-Lite$12.50$375
GPT-5.4-nano$10.25$308
DeepSeek V4 Flash$4.20$126
The 48x gap on identical traffic

The same coding-agent workload costs $12,000/month on GPT-5.5 and $252/month on DeepSeek V4 Flash, a 48x gap. The production answer is rarely either extreme: route routine edits to a cheap model, escalate multi-file refactors to a frontier one, and cache aggressively (DeepSeek cache hits bill input at about $0.0036/M; Anthropic cache reads are 0.1x and do not count against rate limits). Model the routing split with the LLM cost calculator.

Latency and Throughput

Two metrics matter: time to first token (how fast streaming starts) and tokens per second (how fast it finishes). Provider speed claims vary with load and region, so measure on your own traffic.

Measured medians from public endpoint benchmarks (Artificial Analysis, June 2026). General-purpose chat models cluster in the tens-to-low-hundreds of tokens per second; task-specific models built for one operation run far higher.

ModelOutput tok/s (median)TTFTProvider
Morph Fast Apply10,500sub-secondMorph

Why Morph Fast Apply runs two orders of magnitude above frontier chat models: applying a code edit (merging a lazy edit snippet into the full file) does not need frontier reasoning, so the model is tuned for throughput with codegen-specific speculative decoding. For published per-model output-speed and TTFT figures across the general-purpose providers, see the live Artificial Analysis benchmark.

What the official docs also establish:

  • Reasoning adds latency by design. DeepSeek V4 exposes separate thinking and non-thinking modes so you can opt out per request; Claude Opus 4.8 runs adaptive thinking, and thinking tokens are generated and billed before visible output.
  • Anthropic sells speed explicitly: Opus 4.8 fast mode (research preview) delivers roughly 2.5x faster output at $10/$50 per MTok, double the standard rate. The previous generation's fast mode (Opus 4.6/4.7) costs $30/$150.
  • MiniMax sells a priority tier for faster scheduling on M3.
  • Specialized models break the general-purpose ceiling: Morph Fast Apply sustains 10,500 tok/s on code-edit application, two orders of magnitude above frontier chat models, because the task (merging an edit into a file) does not need frontier reasoning.

Specialized APIs: When General-Purpose Falls Short

Coding agents spend most of their compute on two operations: searching codebases for context and applying edits to files. Cognition (the team behind Devin) measured 60% of agent time on search alone. Both operations run through general-purpose LLMs by default, at general-purpose prices and speeds.

Morph Fast Apply

Code-edit application at 10,500 tok/s with 98% accuracy. The agent outputs a lazy edit snippet; Fast Apply merges it into the full file in 1-3 seconds. OpenAI-compatible /v1/chat/completions endpoint.

Morph WarpGrep

RL-trained agentic codebase search: 8 parallel tool calls per turn, 4 turns, sub-6s searches, 0.73 F1. 100k requests free, then $1 per 1M requests on Pro. Ships as an MCP server for any agent.

The pattern generalizes: a frontier model reasons and decides what to change; narrow, fast models execute the mechanical steps. The frontier model's output shrinks (edit snippets instead of whole files), which is exactly the token class that costs $12-30/M. See Fast Apply and WarpGrep.

Best LLM API by Use Case

One ranking does not fit every job. Pick by the constraint that binds you. Each pick below comes from the data already on this page.

Best for reasoning and hard coding

Claude Opus 4.8 ($5/$25) at 88.6% on SWE-bench Verified and GPT-5.5 (88.7%, $5/$30) are the top available picks after Fable 5 (95.0%) was suspended. On the harder SWE-bench Pro, gpt-5.4 (xHigh) leads Scale's standardized public set at 59.10%. Use these for multi-file refactors and tasks where a cheaper model measurably fails.

Best value

MiniMax M3 ($0.60/$2.40) scores 80.5% on SWE-bench Verified, the cheapest model above 80%, at roughly $0.03 of output per benchmark point versus $0.28 for Opus 4.8. DeepSeek V4 Pro Max (80.6%, open weights) matches it for self-hosting. Gemini 3.1 Pro ($2/$12) is the cheapest frontier-tier API when you need a hosted name.

Best for speed

For code edits, Morph Fast Apply is the uncontested pick: 10,500 tok/s with 98% accuracy, two orders of magnitude above frontier chat models, because merging an edit into a file does not need frontier reasoning. For frontier chat speed, Anthropic sells an Opus 4.8 fast mode (research preview) at roughly 2.5x faster output for $10/$50.

Morph Fast Apply: best for code edits

10,500 tok/s, 98% accuracy. The frontier model decides what to change; Fast Apply merges the edit snippet into the full file in 1-3 seconds, behind an OpenAI-compatible endpoint. See /products/compact.

Morph model router

Route each request to the cheapest model that can handle it, roughly 430ms of routing overhead at about $0.001 per request. One key, OpenAI-compatible. See /llm-router.

Best on a budget

DeepSeek V4 Flash ($0.14/$0.28, 1M context, about $0.0036/M on cache hits) is the price floor. GLM-4.7-Flash and GLM-4.5-Flash are free on the Z.AI API. For high-volume simple traffic, start here and escalate only the requests that measurably need a frontier model.

How to Choose an LLM API

Decision framework
  • 1. Establish your quality floor cheaply. Run your real prompts through DeepSeek V4 Flash ($0.28/M out), MiniMax M3 ($2.40/M), and GPT-5.4-nano ($1.25/M). If one passes, you are done at a fraction of frontier cost.
  • 2. Escalate only measured gaps. Move to Gemini 3.1 Pro ($12/M), GPT-5.4 ($15/M), Opus 4.8 ($25/M), or GPT-5.5 ($30/M) for the tasks where the cheap tier measurably fails.
  • 3. Check the constraint that binds you. Rate-limited on day one? Anthropic Tier 1 gives Opus 500K ITPM vs Sonnet's 30K. Need self-hosting or data control? DeepSeek V4 and Qwen 3.6 are open weights. Compliance? Bedrock or Foundry.
  • 4. Use specialized APIs for mechanical steps. Edit application, search, embeddings, and reranking all have purpose-built models that beat $15-30/M generalists on both speed and cost.

The most common mistake is anchoring on one provider's flagship and never testing down. Five models clear 80% on SWE-bench Verified for $2.40 to $12 per million output; the GA frontier (GPT-5.5, Opus 4.8) costs $25 to $30. The second most common is ignoring caching: at an 80% cache hit rate, Anthropic bills cached reads at 0.1x and exempts them from rate limits, and DeepSeek drops cached input to about $0.0036/M. For long-context workloads, compare windows in detail at LLM context window comparison.

Frequently Asked Questions

What is an LLM API?

An HTTP interface to a hosted large language model. You POST a prompt to an endpoint such as /v1/chat/completions and receive generated tokens, billed per million tokens of input and output. It replaces self-hosted GPU inference with a metered service.

What are the best LLM API providers in 2026?

First-party: OpenAI (GPT-5.5), Anthropic (Claude Opus 4.8), Google (Gemini 3.1 Pro), DeepSeek (V4), Moonshot (Kimi K2.6), Z.AI (GLM-5.2), MiniMax (M3), and Alibaba (Qwen3.7 Max, Qwen3.6 Plus). Aggregator: OpenRouter. Cloud resellers: Bedrock and Azure / Microsoft Foundry. Which is best depends on the binding constraint: benchmark score (Anthropic, OpenAI), price (DeepSeek, MiniMax), free tier (Z.AI), or compliance (cloud resellers).

What is the cheapest LLM API in 2026?

DeepSeek V4 Flash: $0.14/M input, $0.28/M output, about $0.0036/M on cache hits, with a 1M-token context. MiniMax M3 ($0.60/$2.40) is the cheapest model above 80% on SWE-bench Verified. GLM-4.7-Flash is free outright.

Which LLM API is best for coding?

Claude Opus 4.8 (88.6%, $5/$25) and GPT-5.5 (88.7%, $5/$30) are the top available picks; Anthropic's Fable 5 led at 95.0% but was suspended on June 12, 2026. On Scale's standardized SWE-bench Pro, gpt-5.4 (xHigh) leads the public set at 59.10%. For budget coding, MiniMax M3 (80.5%) and open-weights DeepSeek V4 Pro Max (80.6%) are the value picks. For applying edits an agent has already decided on, Morph Fast Apply runs at 10,500 tok/s with 98% accuracy.

What rate limits do LLM APIs have?

Anthropic: tiered by cumulative deposit ($5 to $400); Tier 1 gives Opus 4.x 50 RPM / 500K input tokens per minute, Tier 4 gives 4,000 RPM / 10M ITPM, and cached tokens are exempt. OpenAI: tiers unlock at $5 to $1,000 paid with $100 to $200,000 monthly usage caps; per-model RPM/TPM live on each model page. DeepSeek: concurrency caps of 2,500 (V4 Flash) and 500 (V4 Pro) instead of token budgets.

Which LLM API has the largest context window?

1M tokens is the 2026 flagship standard: Claude Opus 4.8/4.7/4.6 and Sonnet 4.6 (all with no long-context surcharge), GPT-5.5, GPT-5.4, Gemini 3.1 Pro (1,048,576), DeepSeek V4 Pro and Flash, MiniMax M3 (1,048,576), Qwen3.7 Max, Qwen3.6 Plus, and GLM-5.2. Below 1M: GPT-5.3-Codex and GPT-5.4-mini at 400K, Kimi K2.6 at 256K, Claude Haiku 4.5 and MiniMax M2.7 at 200K.

Are LLM APIs interchangeable / OpenAI-compatible?

DeepSeek, Moonshot, Z.AI, MiniMax, Qwen, OpenRouter, and Morph accept OpenAI-format requests directly; Google exposes a compatibility endpoint; Anthropic offers an OpenAI SDK compatibility layer over its native Messages API. DeepSeek also accepts Anthropic-format requests. In practice, switching providers is a base URL and key change, with re-testing for tool-calling behavior.

Should I use one provider or multiple?

Multiple, behind a router. Send high-volume simple traffic to a sub-$2.40/M model and escalate hard tasks to a frontier model. The 48x spread between GPT-5.5 and V4 Flash on identical traffic is the budget you are leaving on the table with a single-provider setup. See LLM gateways for the plumbing.

Which LLM API is best by use case?

Reasoning and hard coding: Claude Opus 4.8 (88.6%, $5/$25) and GPT-5.5 (88.7%, $5/$30). Value: MiniMax M3 ($0.60/$2.40), the cheapest model above 80%, with open-weights DeepSeek V4 Pro Max (80.6%) for self-hosting. Speed: Morph Fast Apply for code edits at 10,500 tok/s. Budget: DeepSeek V4 Flash ($0.14/$0.28) or the free GLM Flash models on Z.AI. Most teams route across these with the Morph model router.

Sources

Every price and benchmark on this page traces to a primary source, checked June 28, 2026:

Related Resources

Code Editing at 10,500 tok/s

Frontier LLM APIs bill $12-30 per million output tokens to rewrite whole files. Morph Fast Apply merges edit snippets into files at 10,500 tok/s with 98% accuracy, behind an OpenAI-compatible API.