DeepSeek V4: 1.6T MoE, 1M Context, $0.87/M Output. Architecture, Benchmarks, Pricing (2026)

TL;DR

Last updated June 2026.

DeepSeek V4 is an open-weight (MIT) mixture-of-experts model family from DeepSeek, released April 24, 2026 in two variants. V4-Pro: 1.6T total parameters, 49B active, $0.435/M input, $0.87/M output. V4-Flash: 284B total, 13B active, $0.14/M input, $0.28/M output. Both default to 1M-token context with 384K max output. DeepSeek-V4-Pro-Max scores 80.6% on SWE-bench Verified, the highest open-weights entry, tied with Gemini 3.1 Pro. Per output token, V4-Pro is 28.7x cheaper than Claude Opus 4.8 and 34.5x cheaper than GPT-5.5.

What it is

Two MoE models with 1M-token context. Attention: token-wise compression + DeepSeek Sparse Attention (DSA). Open weights on Hugging Face, MIT license per Lambda's launch analysis. API speaks both OpenAI ChatCompletions and Anthropic formats.

Why it matters

80.6% SWE-bench Verified at $0.87/M output. Opus 4.8 scores 88.6% at $25/M output. A dollar buys 1.15M output tokens from V4-Pro, 40K from Opus 4.8, 33K from GPT-5.5.

What Is DeepSeek V4?

DeepSeek V4 is an open-weight mixture-of-experts (MoE) model family from DeepSeek, released April 24, 2026 under the MIT license. It ships in two variants: V4-Pro (1.6T total parameters, 49B active per token) and V4-Flash (284B total, 13B active). Both default to a 1M-token context window with 384K max output, and the weights are on Hugging Face. The API speaks both the OpenAI ChatCompletions and Anthropic formats, so it drops into tools like Claude Code and OpenCode without a proxy.

$0.139 / $0.278

morph-dsv4flash input / output per 1M (16-bit bf16, no fp8 quant)

Most serverless hosts quantize V4 activations to fp8 to cut cost, which moves output away from the reference weights. Morph serves morph-dsv4flash (DeepSeek V4 Flash) at 16-bit bf16 with no fp8 activation quantization, so output matches the published weights, at $0.139/M input and $0.278/M output. See Morph Open Source Models and pricing.

Release Date and Models

DeepSeek released V4 on April 24, 2026 as a preview, announced on its API docs news page. Two models shipped the same day:

deepseek-v4-pro: 1.6T total parameters, 49B active per token.
deepseek-v4-flash: 284B total parameters, 13B active per token.

Both run with a 1M-token context window by default across all official DeepSeek services, support JSON output, tool calls, and thinking plus non-thinking modes, and expose 384K max output tokens. The legacy deepseek-chat and deepseek-reasoner endpoints now map to deepseek-v4-flash (non-thinking and thinking mode) and will be fully retired after July 24, 2026 15:59 UTC. The API supports both the OpenAI ChatCompletions format and the Anthropic API format, which is what makes the Claude Code setup below a three-line config.

Weights are on Hugging Face (deepseek-ai/DeepSeek-V4-Pro, deepseek-ai/DeepSeek-V4-Flash), released under the MIT license per Lambda's launch analysis.

V4 Pro vs V4 Flash: Specs and Pricing

1.6T / 49B

V4-Pro total / active params

284B / 13B

V4-Flash total / active params

1M / 384K

Context / max output (both)

DeepSeek V4 Pro vs Flash (official API, June 2026)

Specification	V4-Pro	V4-Flash
Total parameters	1.6 trillion	284 billion
Active parameters per token	49B	13B
Context window	1,000,000 tokens	1,000,000 tokens
Max output tokens	384K	384K
Input, cache miss / 1M tokens	$0.435	$0.14
Input, cache hit / 1M tokens	$0.003625	$0.0028
Output / 1M tokens	$0.87	$0.28
Concurrency limit	500	2500
Release date	April 24, 2026 (preview)	April 24, 2026 (preview)
Weights	Hugging Face, MIT	Hugging Face, MIT

The cache-hit input prices are the standout line items. A cache hit on V4-Pro costs $0.003625/M, 120x less than its cache-miss rate. Agentic coding loops that resend the same system prompt and file context on every turn see most of their input tokens land in cache, so effective per-session cost runs far below the list cache-miss rate.

Which to pick: Pro activates 3.8x more parameters per token and is the variant behind the 80.6% SWE-bench Verified score (as Pro-Max). Flash costs 3.1x less per output token and allows 5x the concurrency (2500 vs 500), which makes it the default for batch pipelines, evals, and high-volume extraction.

DeepSeek V4 variant matrix: Pro vs Pro-Max vs Flash

Three names circulate, which causes confusion. V4-Pro is the served 1.6T MoE. V4-Pro-Max is the benchmark configuration of Pro (the entry that posts the 80.6% SWE-bench Verified score); DeepSeek does not list a separate Pro-Max API price, so the output rate below is Pro's. V4-Flash is the smaller, cheaper variant.

DeepSeek V4 variant matrix (June 2026)

Variant	Total / active params	Context	SWE-bench Verified	DeepSWE	Output / 1M	Release
V4-Pro	1.6T / 49B	1M	see Pro-Max	8% (reported)	$0.87	Apr 24, 2026
V4-Pro-Max	1.6T / 49B	1M	80.6%	8% (reported)	$0.87 (Pro rate)	Apr 24, 2026
V4-Flash	284B / 13B	1M	no entry	no entry	$0.28	Apr 24, 2026

The DeepSWE figures (8% pass@1 for V4-Pro / Pro-Max) are reported on the DeepSWE tracker and reconciled against the SWE-bench Verified number in the next section.

DeepSeek V4 Architecture

Per DeepSeek's release notes, V4's attention combines token-wise compression with DSA (DeepSeek Sparse Attention). DeepSeek's own documentation is sparse on internals; the detail below comes from the model card and third-party technical analyses of the weights, and is marked as such.

Attention: token-wise compression + DSA

The official description: each layer compresses the KV cache token-wise, then applies DeepSeek Sparse Attention over the compressed representation. This is what makes a 1M-token default context economical to serve at $0.435/M input.

Third-party analyses (Lambda's launch breakdown and model-card readers) describe the mechanism as Compressed Sparse Attention: KV caches compressed 4x along the sequence dimension, with a lightning indexer selecting the top 1,024 compressed KV entries per query. Treat the 4x and top-1,024 figures as secondary-source; DeepSeek's news page does not publish them.

CSA and HCA: the hybrid that makes 1M context cheap

The Hugging Face DeepSeek V4 blog names the attention stack a hybrid of two mechanisms interleaved across layers, not a single "DSA" block (DSA is the V3.2-era term). The two are Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA).

CSA compresses KV entries 4x along the sequence dimension using softmax-gated pooling with a learned positional bias. A lightning indexer (run in FP4, a ReLU-scored multi-head dot product) selects the top-k compressed blocks per query, and a sliding-window branch handles the most recent uncompressed tokens.
HCA compresses KV entries 128x and drops sparse selection entirely: every query attends densely to every compressed block.

Per the HF blog, V4-Pro is a 61-layer stack where layers 0 and 1 are HCA and layers 2 through 60 alternate CSA and HCA. Both paths store most KV entries in FP8 and keep BF16 only for the RoPE dimensions. The compound effect: the KV cache lands at roughly 2% the size of 8-head grouped-query attention in bfloat16 (V4-Pro at about 10% of V3.2's KV memory, V4-Flash at about 7%), and single-token inference runs at 27% of V3.2's FLOPs for Pro and 10% for Flash. That is what makes a 1M-token default context economical at $0.435/M input.

DeepSeek V4 Pro architecture

V4-Pro is the 1.6T-total-parameter MoE with 49B active per token. Versus V3 (671B total, 37B active), that is 2.4x the total parameter count and 1.3x the active compute per token, with the context window expanded 8x from 128K to 1M. The sparse-attention stack is what keeps the larger model servable: DeepSeek prices Pro at $0.87/M output, 4x the output price of V4-Flash but a fraction of closed-model rates.

DeepSeek V4 Flash architecture

V4-Flash shares the attention design but at 284B total parameters with 13B active per token. The smaller expert pool is why DeepSeek can offer 2500 concurrent requests at $0.14/M input. It also inherits the same 1M context and 384K max output, so the variant choice is about quality per token, not context capability.

What DeepSeek has and has not published

Confirmed by DeepSeek's own release notes: MoE parameter counts (1.6T/49B and 284B/13B), token-wise compression + DSA attention, 1M default context, 384K max output, thinking and non-thinking modes, OpenAI and Anthropic API compatibility. Not published first-party as of June 18, 2026: the CSA compression ratios, indexer internals, optimizer details, and training token counts that circulate in third-party writeups. We cite those as secondary-source where used.

Benchmarks: SWE-bench Verified

The independently tracked number is DeepSeek-V4-Pro-Max at 80.6% on SWE-bench Verified (llm-stats tracker, June 2026). That is the highest open-weights entry, tied with Gemini 3.1 Pro, 0.1 points ahead of MiniMax M3 and 0.2 ahead of Qwen3.7 Max. Closed frontier models score higher: Claude Fable 5 leads at 95.0% (currently suspended, see note).

SWE-bench Verified, top 10 (llm-stats tracker, June 2026)

Model	Score	Output price / 1M tokens
Claude Fable 5	95.0%	$50.00
Claude Mythos Preview	93.9%	restricted access
Claude Opus 4.8	88.6%	$25.00
Claude Opus 4.7	87.6%	$25.00
Claude Opus 4.5	80.9%	$25.00
Claude Opus 4.6	80.8%	$25.00
DeepSeek-V4-Pro-Max	80.6%	$0.87
Gemini 3.1 Pro	80.6%	$12.00
MiniMax M3	80.5%	$1.20
Qwen3.7 Max	80.4%	$2.40 (Qwen3.5-Plus rate)

Read the price column against the score column. The 14.4-point gap between Fable 5 (95.0%) and V4-Pro-Max (80.6%) costs 57x more per output token. The 8-point gap to Opus 4.8 costs 28.7x more. Whether that trade is worth it depends entirely on whether your tasks live in the band those extra points unlock.

DeepSeek-reported launch numbers

At launch, DeepSeek reported V4-Pro-Max scoring 93.5 on LiveCodeBench Pass@1 and a 3206 Codeforces rating. These are vendor-run numbers from the April 2026 release coverage, not independent leaderboard entries; vendor scaffolds routinely score above standardized harnesses (see our SWE-bench Pro breakdown for how large that gap runs).

Does DeepSeek V4 Really Score 80.6%? Self-Reported vs Independent

The 80.6% SWE-bench Verified number is real, but it sits at the top of a harness with a loose verifier. On DeepSWE, a written-from-scratch, contamination-free benchmark, V4-Pro scores 8% pass@1 versus GPT-5.5 at 70% and Opus 4.7 at 54%. The same ordering holds on SWE-bench Pro. Read all three together before trusting any one.

DeepSeek V4-Pro across three harnesses vs frontier (June 2026)

Harness	V4-Pro score	GPT-5.5	Opus 4.7	What it measures
SWE-bench Verified	80.6% (Pro-Max)	not on tracker	87.6%	Patch passes held-out tests; loose verifier
SWE-bench Pro	76.2%	82.6%	82.0%	Harder repos; verifier ~24% false negatives
DeepSWE	8% (reported)	70.0%	54.0%	Written-from-scratch, 91 repos, 5 langs, no contamination

How to reconcile the spread. The yage.ai DeepSWE audit sampled tasks across both benchmarks and an external LLM judge, and found SWE-bench Pro's verifier rejects about 24% of functionally correct solutions and accepts about 8.5% of incorrect ones, while DeepSWE's verifier runs at 1.1% false negatives and 0.3% false positives. A tighter verifier and contamination-free tasks make DeepSWE harder to inflate.

That said, V4-Pro's 8% DeepSWE result is anomalously low relative to peers and to its own 76.2% on SWE-bench Pro. The audit's read is that the gap reflects a real long-horizon-agent capability difference rather than a verifier artifact, because V4-Pro ranks lowest on both independent harnesses. The SWE-bench Pro figures above are the ones reported in the yage.ai audit, not Scale's SEAL board (which still has no V4 entry); treat the Opus 4.7 and GPT-5.5 numbers as reported by the DeepSWE and audit trackers, not independently re-run by us.

DeepSeek V4 Flash and SWE-bench: What Exists

No independently run SWE-bench score for deepseek-v4-flash has been published as of June 18, 2026. Specifically:

Scale's SEAL SWE-bench Pro leaderboards (public and private sets) list no DeepSeek V4 entry of any variant. The top open-weights entry there is qwen3-coder-480b-a35b at 38.7% on the public set.
The llm-stats SWE-bench Verified tracker lists DeepSeek-V4-Pro-Max (80.6%) but no Flash entry.

If you see a Flash SWE-bench number quoted, check whether it is a vendor-run scaffold result or a community harness; neither currently appears on the two trackers above. For agentic coding where benchmark evidence exists, the published data points at Pro, not Flash.

Cost Math vs Opus 4.8, GPT-5.5, Gemini 3.1 Pro

Reddit and X discussion of V4 settled on a "17x cheaper" shorthand. The exact ratios depend on which token type you compare. At list prices:

V4-Pro list-price ratios vs frontier models (June 2026)

Model	Input / 1M	Output / 1M	Output tokens per $1	Output cost vs V4-Pro
DeepSeek V4-Flash	$0.14	$0.28	3.57M	0.32x
DeepSeek V4-Pro	$0.435	$0.87	1.15M	1x
MiniMax M3 (≤512K)	$0.30	$1.20	833K	1.4x
Gemini 3.1 Pro (≤200K)	$2.00	$12.00	83K	13.8x
GPT-5.4	$2.50	$15.00	67K	17.2x
Claude Sonnet 4.6	$3.00	$15.00	67K	17.2x
Claude Opus 4.8	$5.00	$25.00	40K	28.7x
GPT-5.5	$5.00	$30.00	33K	34.5x
Claude Fable 5	$10.00	$50.00	20K	57.5x

So "17x cheaper" is accurate against GPT-5.4 and Sonnet 4.6 output. Against Opus 4.8 it is 28.7x on output and 11.5x on input. Against Fable 5 it is 57.5x on output. Cached input widens every gap: a V4-Pro cache hit costs $0.003625/M vs $0.50/M for an Opus 4.8 cache hit, 138x apart.

A concrete daily workload, 20 requests of 50K input + 10K output (1M input, 200K output per day), assuming zero cache hits:

V4-Flash: $0.20/day, about $6/month
V4-Pro: $0.61/day, about $18/month
Opus 4.8: $10.00/day, about $300/month
GPT-5.5: $11.00/day, about $330/month
Claude Fable 5: $20.00/day, about $600/month

Full Anthropic-side rates, including cache and batch multipliers, are in our Anthropic API pricing guide.

Where to Run It: First-Party API vs OpenRouter

The first-party API (api.deepseek.com) is the reference deployment: $0.435/$0.87 for Pro, $0.14/$0.28 for Flash, with the cache-hit discounts and the 500/2500 concurrency limits above. OpenRouter lists deepseek/deepseek-v4-pro at the same $0.435/M input and $0.87/M output with 1M context, routing across hosts (DeepInfra and NVIDIA also serve V4 endpoints; per-host rates vary by routing tier, so check the live OpenRouter provider list before committing volume).

Reasons to pick first-party: documented cache-hit pricing at $0.003625/M (Pro) and the Anthropic-format endpoint. Reasons to pick OpenRouter: one key across models, provider failover, and easy A/B against MiniMax M3 or Qwen3.5 at the prices in the table above. Because the weights are open, you can also self-host; Flash at 284B total is the realistic target, Pro at 1.6T is multi-node territory.

Output fidelity is where serverless hosts diverge. Most serverless providers quantize activations to fp8 to cut serving cost, which moves output away from the reference weights. Morph Open Source Models serves DeepSeek with 16-bit (bf16) activations and does not quantize activations to fp8, so output matches the published weights. For coding agents specifically, Morph adds codegen-tuned speculative decoding plus custom low-level inference kernels built for code generation, which makes it the fastest and highest-quality option for codegen. morph-dsv4flash (DeepSeek V4 Flash) runs at $0.139/M input and $0.278/M output; see pricing for the full list.

Use DeepSeek V4 in Claude Code and OpenCode

Claude Code

DeepSeek's API speaks the Anthropic format natively, so Claude Code needs only environment variables, no proxy:

export ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic"
export ANTHROPIC_AUTH_TOKEN="sk-your-deepseek-key"
export ANTHROPIC_MODEL="deepseek-v4-pro"
export ANTHROPIC_SMALL_FAST_MODEL="deepseek-v4-flash"
claude

Pointing the small-fast model at deepseek-v4-flash keeps background tasks on the $0.28/M-output tier. For routing multiple providers behind one endpoint instead, see Claude Code with LiteLLM.

OpenCode

OpenCode ships a DeepSeek provider. Run opencode auth login, select DeepSeek, paste your API key, then set the model in opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "model": "deepseek/deepseek-v4-pro"
}

Codex CLI

Codex's custom model_providers config only supports wire_api = "responses", and DeepSeek's API exposes ChatCompletions and Anthropic formats, not the Responses API. To drive V4 from Codex, route through a Responses-compatible proxy or use OpenRouter-backed tooling; for direct use, Claude Code and OpenCode are the paths that work without translation.

What Changed from DeepSeek V3

DeepSeek V3 vs V4-Pro

Dimension	DeepSeek V3	DeepSeek V4-Pro
Total parameters	671B	1.6T (2.4x larger)
Active parameters	37B per token	49B per token
Context window	128K tokens	1M tokens (8x larger)
Attention	Multi-head Latent Attention (MLA)	Token-wise compression + DSA
Max output	8K tokens	384K tokens
API formats	OpenAI-compatible	OpenAI + Anthropic formats
Endpoints	deepseek-chat / deepseek-reasoner	deepseek-v4-pro / deepseek-v4-flash

The structural shift is the attention stack: replacing MLA with token-wise compression plus DSA is what moves the default context from 128K to 1M without 1M-context pricing. The legacy V3-era endpoints survive only as aliases onto deepseek-v4-flash until July 24, 2026.

Limitations

Known limitations

Preview status: the April 24, 2026 release is labeled a preview. Specs and pricing held through June, but GA may change behavior.
8-point gap to the frontier on SWE-bench Verified: V4-Pro-Max scores 80.6% vs Opus 4.8's 88.6% and Fable 5's 95.0%. For tasks where those points matter, the cheap model retries its way into costing you time instead of money.
No independent SWE-bench Pro entry: Scale's SEAL leaderboard has no DeepSeek V4 result, so agentic performance under a standardized harness is unverified.
Sparse first-party architecture docs: compression ratios and indexer internals circulate only in third-party analyses.
Self-hosting Pro is heavy: 1.6T total parameters means multi-node inference even quantized. Flash at 284B is the practical self-host target.

Frequently Asked Questions

When was DeepSeek V4 released?

April 24, 2026, as a preview, with V4-Pro and V4-Flash shipping the same day. The legacy deepseek-chat and deepseek-reasoner endpoints now alias to deepseek-v4-flash and retire after July 24, 2026 15:59 UTC.

What is the DeepSeek V4 architecture?

A mixture-of-experts transformer with token-wise compression plus DeepSeek Sparse Attention (DSA). Pro: 1.6T total / 49B active. Flash: 284B total / 13B active. Both: 1M context, 384K max output. Third-party analyses add 4x KV compression and a top-1,024 lightning indexer, which DeepSeek has not published first-party.

What does DeepSeek V4 cost?

Official API: Flash $0.14/M input (miss), $0.0028/M (cache hit), $0.28/M output. Pro $0.435/M input (miss), $0.003625/M (cache hit), $0.87/M output. OpenRouter lists Pro at the same $0.435/$0.87.

What is deepseek-v4-flash's SWE-bench score?

None published independently as of June 18, 2026. Scale SEAL and llm-stats both lack a Flash entry. The only independent V4 number is Pro-Max at 80.6% SWE-bench Verified.

Does DeepSeek V4 really score 80.6% on SWE-bench?

On SWE-bench Verified, yes: V4-Pro-Max posts 80.6% (llm-stats), but that harness has a loose verifier. On DeepSWE, a contamination-free written-from-scratch benchmark, V4-Pro scores 8% pass@1 versus GPT-5.5 at 70% and Opus 4.7 at 54%. The audit reads the gap as a real capability difference, not a measurement artifact.

How does V4-Pro-Max compare to Claude on SWE-bench Verified?

V4-Pro-Max 80.6% vs Opus 4.6 80.8%, Opus 4.8 88.6%, Fable 5 95.0% (llm-stats, June 2026). Per output token V4-Pro costs $0.87 vs $25 for Opus 4.8 and $50 for Fable 5.

Is V4 open source?

Open weights on Hugging Face under MIT (per Lambda's launch analysis; the HF model cards require accepting access terms). Download, run, fine-tune.

Can I use V4 in Claude Code?

Yes. Set ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic, ANTHROPIC_AUTH_TOKEN to your DeepSeek key, and ANTHROPIC_MODEL=deepseek-v4-pro. The full snippet is in the setup section above.

Use WarpGrep with DeepSeek V4 for Better Code Search Context

WarpGrep is an agentic code search tool that works as an MCP server. Connect it to any DeepSeek-powered agent for high-precision codebase context, so V4's 1M-token window gets filled with the right code, not noise. Free for 100k requests, then $1 per 1M.

Try WarpGrep Free

See How It Works

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

DeepSeek V4: 1.6T MoE, 1M Context, $0.87/M Output (2026 Guide)