DeepSeek API: Models, Pricing, and How to Call It (2026)

The DeepSeek API serves deepseek-v4-flash and deepseek-v4-pro behind an OpenAI-compatible endpoint at api.deepseek.com. Flash costs $0.14/1M input tokens (cache miss) and $0.28/1M output tokens, with a 1M-token context. The legacy deepseek-chat and deepseek-reasoner aliases are deprecated on 2026/07/24. This page covers every model, exact per-1M pricing with cache-hit rates, context windows, benchmarks, and a working OpenAI-SDK call.

June 18, 2026 · 2 min read
The DeepSeek API serves two Mixture-of-Experts models behind an OpenAI-compatible endpoint at api.deepseek.com: deepseek-v4-flash and deepseek-v4-pro, both with a 1M-token context. Flash costs $0.14/1M input tokens (cache miss) and $0.28/1M output tokens. The legacy deepseek-chat and deepseek-reasoner aliases are deprecated on 2026/07/24. Morph runs a separate OpenAI-compatible router that includes a DeepSeek-class model, dsv4flash.

deepseek-v4-flash input (cache miss): $0.14/1M
deepseek-v4-flash output: $0.28/1M
API context window: 1M tokens
Flash input on cache hit: $0.0028/1M

The DeepSeek API in One Paragraph

DeepSeek is an AI lab that ships open-weight Mixture-of-Experts models and a hosted API to call them. The API endpoint is api.deepseek.com and it speaks the OpenAI wire format, so any OpenAI SDK works after you change the base URL and model name. Two models are live: deepseek-v4-flash for cost-sensitive workloads and deepseek-v4-pro for higher capability.

Both API models carry a 1M-token context window. Pricing is per million tokens and split by whether the input prefix is a cache hit or a cache miss. A cache hit (a repeated system prompt or few-shot block already in DeepSeek's context cache) is priced 50x lower than a cache miss on flash.

The open-weight side is where the architecture lives. DeepSeek-V3.2-Exp is 685B total MoE parameters under an MIT license, and DeepSeek-R1 is 671B total with 37B activated. These weights are downloadable for self-hosting, which is the main reason DeepSeek shows up in cost comparisons against closed APIs.

What changed in 2026

The deepseek-chat and deepseek-reasoner model names are legacy aliases of deepseek-v4-flash (non-thinking and thinking modes, respectively). They are deprecated on 2026/07/24 at 15:59 UTC per DeepSeek's pricing page. Migrate calls to deepseek-v4-flash before that date and toggle thinking in the request rather than switching model names.

DeepSeek Models and Pricing

The table below lists the DeepSeek API models with context, per-1M input pricing (cache miss), per-1M output pricing, and notes. All figures are from DeepSeek's pricing page as of June 2026.

| Model | Context | Input (cache miss, per 1M) | Output (per 1M) | Notes |
| --- | --- | --- | --- | --- |
| deepseek-v4-flash | 1M | $0.14 | $0.28 | Cost tier. Cache hit input $0.0028. |
| deepseek-v4-pro | 1M | $0.435 | $0.87 | Capability tier. Cache hit input $0.003625. |
| deepseek-chat (legacy) | 1M | $0.14 | $0.28 | Non-thinking alias of flash. Deprecates 2026/07/24. |
| deepseek-reasoner (legacy) | 1M | $0.14 | $0.28 | Thinking alias of flash. Deprecates 2026/07/24. |

deepseek-v4-flash is the default for most work. At $0.14/1M input and $0.28/1M output it is one of the cheapest 1M-context models with frontier-adjacent coding scores. deepseek-v4-pro costs roughly 3.1x the input and 3.1x the output of flash and is the choice when flash misses on harder reasoning.
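To make the flash-versus-pro gap concrete, here is a minimal sketch of a per-request cost estimate. It assumes every input token is billed at the cache-miss rate; the function name and the token counts are illustrative, and the prices are the ones quoted in this section.

```python
# Per-1M-token prices in USD, copied from this section (cache-miss input).
PRICES = {
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
    "deepseek-v4-pro": {"input": 0.435, "output": 0.87},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request with no cache hits."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical request: 20K input tokens, 2K output tokens.
flash = request_cost("deepseek-v4-flash", 20_000, 2_000)
pro = request_cost("deepseek-v4-pro", 20_000, 2_000)
print(f"flash ${flash:.5f}  pro ${pro:.5f}  ratio {pro / flash:.2f}x")
```

Because both the input and output prices scale by the same factor, the ratio stays near 3.1x regardless of the input/output mix.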

The two legacy aliases are the same model

deepseek-chat and deepseek-reasoner are not separate models. They are the non-thinking and thinking modes of deepseek-v4-flash, exposed under distinct names for compatibility. After the 2026/07/24 deprecation, you call deepseek-v4-flash and set thinking through the request body.
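One way to prepare for the cutover is to resolve legacy names in a single place. The sketch below is an assumption about how you might structure that; `Target` and `migrate` are hypothetical helpers, and the request-body field that actually toggles thinking is not specified in this article, so check DeepSeek's API reference before wiring the flag into a request.

```python
from typing import NamedTuple


class Target(NamedTuple):
    model: str      # model name to send after migration
    thinking: bool  # whether thinking mode should be enabled in the request


# Mapping taken from this section: both aliases resolve to deepseek-v4-flash.
LEGACY_ALIASES = {
    "deepseek-chat": Target("deepseek-v4-flash", thinking=False),
    "deepseek-reasoner": Target("deepseek-v4-flash", thinking=True),
}


def migrate(model: str) -> Target:
    """Map a legacy alias to its replacement; other names pass through
    unchanged with thinking off (an arbitrary default for this sketch)."""
    return LEGACY_ALIASES.get(model, Target(model, thinking=False))


print(migrate("deepseek-reasoner"))
```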

Cache Hit vs Cache Miss Pricing

DeepSeek prices input tokens differently depending on whether the prefix is already in its context cache. A cache hit means a leading chunk of your prompt (system instructions, few-shot examples, a long document you keep re-sending) matches a recently processed prefix. DeepSeek charges the cache-hit rate for those tokens and the cache-miss rate for the rest.

| Model | Input (cache hit, per 1M) | Input (cache miss, per 1M) | Hit vs miss ratio |
| --- | --- | --- | --- |
| deepseek-v4-flash | $0.0028 | $0.14 | 50x cheaper on hit |
| deepseek-v4-pro | $0.003625 | $0.435 | 120x cheaper on hit |

The practical effect is large for agent workloads. A coding agent re-sends the same system prompt and tool definitions on every turn. Those tokens land as cache hits, so the marginal input cost of each turn drops toward the cache-hit rate. Structure prompts so the stable prefix comes first and the per-request content comes last to maximize cache hits.
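The prefix-first ordering can be sketched as a small message builder. This is an illustration, not a DeepSeek requirement: the prompt strings and the `build_messages` helper are hypothetical, and the only point is that the leading message is byte-identical on every turn while the changing content sits at the end.

```python
# Stable content: keep it identical across turns (no timestamps, no request IDs).
SYSTEM_PROMPT = "You are a coding agent. Follow the repository conventions."
TOOL_GUIDE = "Available tools: read_file(path), run_tests()."


def build_messages(history: list[dict], user_input: str) -> list[dict]:
    """Return messages with the stable prefix first and per-request content last."""
    stable_prefix = [
        {"role": "system", "content": SYSTEM_PROMPT + "\n\n" + TOOL_GUIDE},
    ]
    return stable_prefix + history + [{"role": "user", "content": user_input}]


turn_1 = build_messages([], "Fix the failing test in utils.py.")
turn_2 = build_messages(turn_1[1:], "Now add a regression test.")
print(turn_1[0] == turn_2[0])  # the shared prefix is unchanged between turns
```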

The tradeoff: cache behavior is not guaranteed across every request, and pricing assumes a recently seen prefix. Cold prefixes, frequently changing system prompts, and one-shot calls pay the full cache-miss rate. Do not budget your whole input at the cache-hit price.
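A blended estimate is a safer budgeting figure than either extreme. The sketch below assumes you can estimate the fraction of input tokens that land as cache hits; `hit_fraction` is a planning input you supply, not something this article says the API reports, and the prices are the flash rates quoted above.

```python
def blended_input_cost(
    input_tokens: int,
    hit_fraction: float,
    hit_price_per_1m: float,
    miss_price_per_1m: float,
) -> float:
    """USD input cost when hit_fraction of the tokens are cache hits."""
    if not 0.0 <= hit_fraction <= 1.0:
        raise ValueError("hit_fraction must be between 0 and 1")
    hit_tokens = input_tokens * hit_fraction
    miss_tokens = input_tokens - hit_tokens
    return (hit_tokens * hit_price_per_1m + miss_tokens * miss_price_per_1m) / 1_000_000


# Flash rates from this section: $0.0028 (hit) and $0.14 (miss) per 1M tokens.
for fraction in (0.0, 0.5, 0.8, 1.0):
    cost = blended_input_cost(1_000_000, fraction, 0.0028, 0.14)
    print(f"hit fraction {fraction:.1f}: ${cost:.4f} per 1M input tokens")
```

Even at an 80% hit fraction, the remaining cache-miss tokens account for most of the input bill, which is why budgeting at the pure cache-hit price understates cost.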

deepseek-chat vs deepseek-reasoner

deepseek-chat answers directly. deepseek-reasoner emits a chain-of-thought before the answer, which raises accuracy on math and multi-step coding at the cost of more output tokens and higher latency. Both are modes of the same underlying deepseek-v4-flash model and both legacy names retire on 2026/07/24.

deepseek-chat (non-thinking)

Direct answers, fewer output tokens, lower latency. Use for edits, formatting, summarization, and straightforward generation where the answer does not need a reasoning trace.

deepseek-reasoner (thinking)

Emits a reasoning trace before the answer. Higher accuracy on AIME-style math and multi-step debugging, but more output tokens and slower. Use for hard reasoning where the extra cost pays off.

Because reasoner produces more output tokens, its effective cost per answer is higher even at the same $0.28/1M output rate. Reserve thinking mode for prompts where a chain-of-thought measurably improves the result, and route easy prompts to non-thinking. A router that classifies difficulty handles this split automatically, which is the same pattern Morph's LLM router applies across providers.
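As a rough illustration of the split, the sketch below uses a keyword heuristic to decide whether a prompt should go to thinking mode. It is a deliberately naive stand-in: the hint list, the length threshold, and the `needs_thinking` name are all assumptions, and a production router (Morph's included) would use a trained classifier rather than string matching.

```python
# Hypothetical signals that a prompt may benefit from a reasoning trace.
HARD_HINTS = ("prove", "debug", "step by step", "optimize", "why does")


def needs_thinking(prompt: str, long_prompt_chars: int = 2000) -> bool:
    """Return True when the prompt looks hard enough to justify thinking mode."""
    text = prompt.lower()
    return len(text) > long_prompt_chars or any(hint in text for hint in HARD_HINTS)


for prompt in ("Rename this variable for clarity.", "Debug this intermittent deadlock."):
    mode = "thinking" if needs_thinking(prompt) else "non-thinking"
    print(f"{mode}: {prompt}")
```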

Context Window and Architecture

The hosted API models expose a 1M-token context window. The open-weight checkpoints expose smaller windows: DeepSeek-V3.2-Exp supports 163,840 tokens (160K) and DeepSeek-R1 supports 128K. If you self-host the weights, plan around the checkpoint window, not the 1M API figure.

| Checkpoint | Total params | Context | License | Released |
| --- | --- | --- | --- | --- |
| DeepSeek-V3.2-Exp | 685B (MoE) | 163,840 (160K) | MIT | Nov 17, 2025 |
| DeepSeek-R1 | 671B / 37B active | 128K | MIT | Jan 22, 2025 |
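When planning a self-hosted deployment, a simple budget check against the checkpoint window can catch oversized prompts early. The sketch below makes two assumptions that are not stated in this article: that prompt and output tokens share the same window, and that "128K" means 128,000 tokens (read conservatively). The `fits` helper is hypothetical.

```python
CHECKPOINT_WINDOWS = {
    "DeepSeek-V3.2-Exp": 163_840,
    "DeepSeek-R1": 128_000,  # assumption: "128K" read conservatively as 128,000
}


def fits(checkpoint: str, prompt_tokens: int, max_output_tokens: int) -> bool:
    """True when prompt plus reserved output fits the checkpoint's window."""
    return prompt_tokens + max_output_tokens <= CHECKPOINT_WINDOWS[checkpoint]


# A 150K-token prompt with 8K tokens reserved for the answer.
for name in CHECKPOINT_WINDOWS:
    print(name, fits(name, 150_000, 8_000))
```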

DeepSeek-V3.2-Exp introduces DeepSeek Sparse Attention (DSA), a sparse attention mechanism for long-context training and inference efficiency, per its Hugging Face model card. Sparse attention is what makes a 160K-token window practical on a 685B-parameter MoE without the quadratic cost of dense attention at that length.

DeepSeek-R1 is a Mixture-of-Experts model with 671B total parameters and 37B activated per token. Only the activated experts run on each forward pass, so inference cost tracks the 37B active count rather than the 671B total. Both checkpoints are MIT-licensed, which permits commercial use and self-hosting.
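The active-parameter share is easy to work out from the two figures above. This is plain arithmetic on the numbers in this section, not a measurement of inference cost.

```python
TOTAL_PARAMS_B = 671   # DeepSeek-R1 total parameters, in billions
ACTIVE_PARAMS_B = 37   # parameters activated per token, in billions

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"{active_fraction:.1%} of parameters run per token")
```

Per-token compute follows that small fraction, but a self-hosted deployment still has to keep all 671B parameters available, so memory requirements follow the total count.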

For a broader treatment of how context length affects cost and quality, see LLM context windows.

Benchmarks

Published scores place DeepSeek-V3.2-Exp in the top open tier for coding and reasoning. The table reports model-card figures. Treat cross-model comparisons cautiously, since harness and prompt differences move scores by several points.

| Benchmark | DeepSeek-V3.2-Exp | DeepSeek-R1 |
| --- | --- | --- |
| SWE-bench Verified | 67.8 | 49.2 |
| LiveCodeBench | 74.1 | n/a |
| AIME (2025 / 2024) | 89.3 (2025) | 79.8 (2024) |
| MMLU-Pro / MMLU | 85.0 (Pro) | 90.8 (MMLU) |
| Codeforces rating | 2121 | 2029 |
| GPQA-Diamond | 79.9 | n/a |

The SWE-bench Verified jump from 49.2 (R1) to 67.8 (V3.2-Exp) is the headline for coding agents: an 18.6-point gain on real GitHub issue resolution between the January 2025 reasoning checkpoint and the November 2025 release. DeepSeek-R1 still posts strong pure-math numbers, with 97.3 on MATH-500 and 79.8 on AIME 2024.

For how these scores stack against other open models, see the best open-source coding model in 2026.

OpenAI-Compatible Calls

The DeepSeek API works as a drop-in backend for the OpenAI SDK. Change base_url to https://api.deepseek.com, set your DeepSeek key, and pass deepseek-v4-flash or deepseek-v4-pro as the model. Streaming, function calling, and the message array work as they do against OpenAI.

Call the DeepSeek API with the OpenAI Python SDK

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="deepseek-v4-flash",  # or "deepseek-v4-pro"
    messages=[
        {"role": "system", "content": "You are a senior Python engineer."},
        {"role": "user", "content": "Write a retry decorator with exponential backoff."},
    ],
)

print(resp.choices[0].message.content)

The same shape works from cURL. Set the Authorization header to your key and POST the OpenAI-style body to the chat completions path.

Call the DeepSeek API with cURL

curl https://api.deepseek.com/chat/completions \
  -H "Authorization: Bearer YOUR_DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [
      {"role": "user", "content": "Explain MoE routing in two sentences."}
    ]
  }'
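Since streaming is said to work as it does against OpenAI, a streaming call can be sketched with the OpenAI Python SDK's `stream=True` option. The `stream_text` and `demo` helpers are hypothetical names, and the sketch assumes DeepSeek returns OpenAI-style chunks whose text lives in `choices[0].delta.content`.

```python
from typing import Iterator


def stream_text(client, prompt: str, model: str = "deepseek-v4-flash") -> Iterator[str]:
    """Yield text fragments from a streaming chat completion as they arrive."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # ask the server for incremental chunks
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


def demo() -> None:
    """Call with a real key to print the answer as it streams (network required)."""
    from openai import OpenAI  # imported here so the helper above has no hard dependency

    client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")
    for fragment in stream_text(client, "Explain MoE routing in two sentences."):
        print(fragment, end="", flush=True)
```

Passing the client in as an argument keeps `stream_text` usable against any OpenAI-compatible endpoint and easy to exercise with a stub in tests.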

Because the contract is OpenAI-shaped, a single OpenAI-compatible client can target DeepSeek, OpenAI, or Morph's router by swapping the base URL. Morph's router at api.morphllm.com follows the same format, so the call below changes only the host and model and adds automatic difficulty-based routing.

Same client, Morph router base URL (auto-routes to cheaper models)

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MORPH_API_KEY",
    base_url="https://api.morphllm.com/v1",  # OpenAI-compatible router
)

# The router classifies difficulty (~430ms) and routes easy calls
# to cheaper models, cutting cost 40-70% across the request mix.
resp = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Rename this variable for clarity."}],
)
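If you switch between the two endpoints often, the base URL and model can live in a small lookup table. The sketch below is one possible arrangement; the environment variable names and the `client_kwargs` helper are assumptions, while the base URLs and model names are the ones used in this section.

```python
import os

# Endpoint settings from this section; the key_env names are hypothetical.
PROVIDERS = {
    "deepseek": {
        "base_url": "https://api.deepseek.com",
        "key_env": "DEEPSEEK_API_KEY",
        "model": "deepseek-v4-flash",
    },
    "morph": {
        "base_url": "https://api.morphllm.com/v1",
        "key_env": "MORPH_API_KEY",
        "model": "auto",
    },
}


def client_kwargs(provider: str) -> dict:
    """Build the keyword arguments for OpenAI(...) for the chosen provider."""
    cfg = PROVIDERS[provider]
    return {
        "api_key": os.environ.get(cfg["key_env"], ""),
        "base_url": cfg["base_url"],
    }


print(client_kwargs("morph")["base_url"])
```

With the OpenAI SDK installed, `OpenAI(**client_kwargs("deepseek"))` would then construct the client, and `PROVIDERS["deepseek"]["model"]` supplies the matching model name.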

How DeepSeek Pricing Compares

DeepSeek's value is the combination of a 1M-token context and per-1M prices well under most hosted frontier APIs. The table below sets deepseek-v4-flash and deepseek-v4-pro against two other open-weight model APIs to anchor the range. Closed-model comparisons are omitted where verified numbers were not available.

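| Model | Input (per 1M) | Output (per 1M) | Context |
| --- | --- | --- | --- |
| deepseek-v4-flash | $0.14 | $0.28 | 1M |
| deepseek-v4-pro | $0.435 | $0.87 | 1M |
| MiniMax-M2 | $0.30 | $1.20 | ~196K |
| GLM-4.6 | $0.60 | $2.20 | 200K |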

On output price, deepseek-v4-flash at $0.28/1M undercuts MiniMax-M2 ($1.20/1M) by about 4.3x and GLM-4.6 ($2.20/1M) by about 7.9x, while carrying a larger 1M-token context. deepseek-v4-pro at $0.87/1M output still sits below both. The cache-hit input rate ($0.0028/1M on flash) widens the gap further on repeated-prefix workloads.
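The output-price multiples can be reproduced directly from the table. The snippet below is arithmetic on the listed prices only; it says nothing about quality or throughput.

```python
FLASH_OUTPUT = 0.28  # USD per 1M output tokens, deepseek-v4-flash

# Output prices (USD per 1M tokens) for the other rows in the comparison.
other_output_prices = {
    "deepseek-v4-pro": 0.87,
    "MiniMax-M2": 1.20,
    "GLM-4.6": 2.20,
}

ratios = {name: price / FLASH_OUTPUT for name, price in other_output_prices.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the flash output price")
```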

The tradeoff is operational, not headline price. A single-provider API means a single point of rate limits, region, and capacity. Routing across providers through one endpoint hedges that. See LLM cost optimization for the full set of levers (caching, routing, batching, model tiering).
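A client-side version of that hedge is an ordered fallback: try the primary provider and move to the next one on failure. The sketch below is generic and hypothetical; `first_success` takes zero-argument callables so it stays independent of any particular SDK, and real code would likely catch narrower exception types such as rate-limit or timeout errors.

```python
from typing import Callable, Iterable, Tuple


def first_success(calls: Iterable[Tuple[str, Callable[[], str]]]) -> Tuple[str, str]:
    """Run each (name, call) in order and return the first (name, result) that succeeds."""
    errors = []
    for name, call in calls:
        try:
            return name, call()
        except Exception as exc:  # broad on purpose for this sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def rate_limited() -> str:
    """Stand-in for a provider call that fails."""
    raise TimeoutError("simulated rate limit")


# Stub callables stand in for real API calls here.
provider, answer = first_success([
    ("primary", rate_limited),
    ("secondary", lambda: "ok from secondary"),
])
print(provider, answer)
```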

Morph measured: dsv4flash on Morph's fleet

Morph serves a DeepSeek-class model, dsv4flash, on its own production fleet alongside glm51-754b, qwen35-397b (~120 tok/s), minimax27-230b (~140 tok/s), and warp-grep-v2.1, all reachable through one OpenAI-compatible endpoint at api.morphllm.com. The router classifies prompt difficulty in ~430ms and sends easy calls to cheaper models, which is where the 40-70% savings come from.

Getting an API Key

Create a DeepSeek platform account, open the API keys section, and generate a key. Set it as the api_key in the OpenAI SDK or as an Authorization Bearer header, then point base_url at https://api.deepseek.com. Billing is usage-based at the per-1M-token prices above, metered on input (split by cache state) and output.
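Rather than pasting the key into source files, a common pattern is to read it from an environment variable. The sketch below assumes a variable named `DEEPSEEK_API_KEY`, which is a convention chosen for this example and not something DeepSeek mandates; `auth_headers` is a hypothetical helper that builds the Authorization Bearer header described above.

```python
import os


def auth_headers(env_var: str = "DEEPSEEK_API_KEY") -> dict:
    """Build request headers from a key stored in the environment."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"set {env_var} before calling the API")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

The same value can be passed as `api_key=os.environ["DEEPSEEK_API_KEY"]` when constructing the OpenAI SDK client.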

If you want one key that reaches many models, Morph issues a single key at morphllm.com that works against the OpenAI-compatible router at api.morphllm.com, including the DeepSeek-class dsv4flash. That removes per-provider key management when you call more than one model family.

Frequently Asked Questions

How much does the DeepSeek API cost?

deepseek-v4-flash costs $0.14/1M input on a cache miss, $0.0028/1M on a cache hit, and $0.28/1M output. deepseek-v4-pro costs $0.435/1M input (cache miss), $0.003625/1M (cache hit), and $0.87/1M output. Both have a 1M-token context. Prices are from DeepSeek's pricing page as of June 2026.

What is the difference between deepseek-chat and deepseek-reasoner?

deepseek-chat is the non-thinking (direct answer) mode and deepseek-reasoner is the thinking (chain-of-thought) mode of deepseek-v4-flash. Both are legacy aliases that are deprecated on 2026/07/24 at 15:59 UTC. After that you call deepseek-v4-flash and toggle thinking in the request.

What is the DeepSeek context window?

The deepseek-v4-flash and deepseek-v4-pro API models both expose a 1M-token context window. The open-weight checkpoints are smaller: DeepSeek-V3.2-Exp supports 163,840 tokens (160K) and DeepSeek-R1 supports 128K.

Is the DeepSeek API OpenAI-compatible?

Yes. Point the OpenAI SDK at base_url https://api.deepseek.com, set your DeepSeek key, and call client.chat.completions.create with model deepseek-v4-flash or deepseek-v4-pro. No code changes are needed beyond the API key, base URL, and model name.

DeepSeek V4 vs R1: which should I use?

deepseek-v4-flash and deepseek-v4-pro are the current API models with a 1M context and split cache pricing. DeepSeek-R1 (671B total, 37B activated, 128K context, 49.2 on SWE-bench Verified) is the earlier January 2025 reasoning checkpoint. V4 supersedes R1 for new builds; R1 weights stay available under MIT for self-hosting.

How do I get a DeepSeek API key?

Create a DeepSeek platform account, generate a key in the API keys section, set it as the api_key in the OpenAI SDK, and point base_url at https://api.deepseek.com. Billing is usage-based per the prices above.

Can I reach a DeepSeek-class model through one shared endpoint?

Yes. Morph's OpenAI-compatible router at api.morphllm.com serves many models through one endpoint, including dsv4flash, a DeepSeek-class model on Morph's fleet. The router classifies difficulty in ~430ms and routes easy calls to cheaper models for 40-70% savings.

One Key, Many Models, Auto-Routed

Morph runs an OpenAI-compatible router at api.morphllm.com that serves many models through one endpoint, including the DeepSeek-class dsv4flash. It classifies prompt difficulty in ~430ms and routes easy calls to cheaper models for 40-70% cost savings. Swap one base URL to get started.