DeepSeek V4: Architecture, Benchmarks, and API Guide (2026)

DeepSeek V4 launched April 24, 2026 with two models: V4-Pro (1.6T params, 49B active) and V4-Flash (284B, 13B active). Both support 1M-token context. V4-Pro-Max scores 80.6% SWE-bench Verified and 93.5 LiveCodeBench. Full architecture breakdown, benchmarks, and pricing.

April 24, 2026

TL;DR

DeepSeek V4 launched April 24, 2026 with two models. V4-Pro is a 1.6T-parameter MoE with 49B active per token. V4-Flash is a 284B MoE with 13B active. Both support 1M-token context and ship under the MIT license. V4-Pro-Max scores 80.6% on SWE-bench Verified and 93.5 on LiveCodeBench, the highest LiveCodeBench score among the models compared in this article. V4-Flash costs $0.14/M input tokens. V4-Pro costs $1.74/M input tokens.

What it is

Two MoE models (Pro and Flash) with 1M-token context, hybrid CSA+HCA attention that cuts single-token inference FLOPs to 27% and KV cache to 10% of V3.2 at 1M-token context, the Muon optimizer, an MIT license, and weights on Hugging Face.

Why it matters

V4-Pro-Max nearly matches Claude Opus 4.6 on SWE-bench Verified (80.6% vs 80.8%) and leads the compared models on LiveCodeBench (93.5). V4-Pro costs $3.48/M output tokens vs Claude's $75/M. V4-Flash is even cheaper at $0.28/M output. Both ship with open weights under MIT.

Key Specs at a Glance

V4-Pro: 1.6T total parameters (MoE), 49B active parameters per token, 1M-token context window

V4-Flash: 284B total parameters (MoE), 13B active parameters per token, 1M-token context window

Specification | V4-Pro | V4-Flash
Total parameters | 1.6 trillion | 284 billion
Active parameters | 49B per token | 13B per token
Context window | 1,000,000 tokens | 1,000,000 tokens
Training data | 33T tokens | 33T tokens
Attention mechanism | Hybrid CSA + HCA | Hybrid CSA + HCA
Optimizer | Muon (AdamW for embeddings) | Muon (AdamW for embeddings)
Input price / 1M tokens | $1.74 | $0.14
Output price / 1M tokens | $3.48 | $0.28
License | MIT | MIT
Weights | Hugging Face | Hugging Face

Architecture: Hybrid Attention and Muon

V4 is not a simple scale-up of V3. The attention mechanism, optimizer, and training pipeline all changed. The result: V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to V3.2 in the 1M-token context setting.

1. Compressed Sparse Attention (CSA)

CSA compresses KV caches along the sequence dimension at a 4x compression rate, then applies sparse attention. A lightning indexer selects the top 1,024 most relevant compressed KV entries for each query. A sliding window of 128 tokens provides local context for each layer. This gives the model detailed, selective access to the most relevant parts of long contexts without the O(n²) cost of full attention.
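
The selection step can be pictured with a toy sketch. The article does not say how entries are compressed or how the lightning indexer scores them, so mean pooling and a dot-product score are stand-in assumptions here, and all function names are hypothetical. The toy sizes are tiny so the result is easy to check by hand.

```python
from typing import List, Sequence

Vector = Sequence[float]


def compress_keys(keys: Sequence[Vector], rate: int = 4) -> List[List[float]]:
    """Collapse each run of `rate` key vectors into one entry.

    Mean pooling is an assumption; the article only states the 4x rate.
    """
    compressed = []
    for start in range(0, len(keys), rate):
        group = keys[start:start + rate]
        dim = len(group[0])
        compressed.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return compressed


def select_top_k(query: Vector, compressed: Sequence[Vector], k: int = 1024) -> List[int]:
    """Return the indices of the k highest-scoring compressed entries.

    A plain dot product stands in for the lightning indexer (assumption).
    """
    scores = [sum(q * c for q, c in zip(query, entry)) for entry in compressed]
    ranked = sorted(range(len(compressed)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sliding_window(position: int, window: int = 128) -> List[int]:
    """Indices of the uncompressed tokens in the local window ending at `position`."""
    return list(range(max(0, position - window + 1), position + 1))


# Toy input: 16 keys of dimension 2; only tokens 8..11 align with the query.
keys = [[1.0, 0.0] if 8 <= i < 12 else [0.0, 1.0] for i in range(16)]
query = [1.0, 0.0]

compressed = compress_keys(keys, rate=4)           # 4 compressed entries
selected = select_top_k(query, compressed, k=1)    # [2]: the block covering tokens 8..11
local = sliding_window(position=15, window=4)      # [12, 13, 14, 15]
```

The query attends only to the selected compressed entries plus the local window, which is why the cost per query stays bounded as the context grows.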

2. Heavily Compressed Attention (HCA)

HCA applies a much more aggressive 128x compression rate but then performs dense attention over the compressed representation. This gives the model a cheap, global view of distant tokens in every layer. Where CSA is selective and detailed, HCA is broad and approximate. CSA and HCA layers are interleaved through the network, so the model alternates between focused retrieval and wide-angle context awareness.
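
To make the two compression rates concrete, the following sketch counts how many KV entries a single query attends to in each layer type. It is back-of-the-envelope arithmetic using only the figures quoted above (4x and 128x compression, top-1,024 selection, 128-token window); the function names are hypothetical, and the count assumes the sliding window applies only to CSA layers, as described in the CSA section.

```python
import math


def csa_entries_per_query(context_len: int, rate: int = 4,
                          top_k: int = 1024, window: int = 128) -> int:
    """Entries a CSA layer attends to: selected compressed entries plus local window."""
    compressed = math.ceil(context_len / rate)
    return min(top_k, compressed) + min(window, context_len)


def hca_entries_per_query(context_len: int, rate: int = 128) -> int:
    """Entries an HCA layer attends to: every compressed entry (dense attention)."""
    return math.ceil(context_len / rate)


for n in (128_000, 1_000_000):
    print(n, csa_entries_per_query(n), hca_entries_per_query(n))
# 128000 1152 1000
# 1000000 1152 7813
```

Under these assumptions a CSA layer looks at a fixed 1,152 entries regardless of context length, while an HCA layer looks at about 7,813 entries at 1M tokens, compared with 1,000,000 for uncompressed full attention.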

CSA: Selective and detailed

4x KV compression, top-1024 entry selection per query, 128-token sliding window. Precise retrieval of the most relevant context without quadratic cost.

HCA: Broad and cheap

128x KV compression with dense attention. Gives every layer a global view of the full context at minimal cost. Captures distant dependencies that CSA's selection might miss.

3. Manifold-Constrained Hyper-Connections (mHC)

mHC upgrades residual connections for numerical stability in deep stacks. Standard residual connections can suffer from signal amplification or collapse at trillion-parameter scale. mHC constrains the mixing matrices to the Birkhoff polytope, the set of doubly stochastic matrices (non-negative entries, with every row and every column summing to 1), using the Sinkhorn-Knopp algorithm, which preserves signal magnitude through the network.
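
The classic Sinkhorn-Knopp iteration is short enough to show in full. This is a generic sketch of the algorithm on a small positive matrix, not DeepSeek's implementation; how mHC parameterizes its mixing matrices and how many iterations it runs are not stated in the article, so the iteration count and the example matrix are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def sinkhorn_knopp(matrix: Matrix, iterations: int = 100) -> Matrix:
    """Alternately normalize rows and columns of a strictly positive square matrix.

    The result approaches a doubly stochastic matrix, i.e. a point in the
    Birkhoff polytope.
    """
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(iterations):
        # Row step: scale each row so it sums to 1.
        for i in range(n):
            row_sum = sum(m[i])
            m[i] = [x / row_sum for x in m[i]]
        # Column step: scale each column so it sums to 1.
        for j in range(n):
            col_sum = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= col_sum
    return m


# Arbitrary positive example matrix (illustrative values only).
mixing = sinkhorn_knopp([[1.0, 2.0, 3.0], [0.5, 0.1, 2.0], [1.5, 1.0, 0.2]])
row_sums = [sum(row) for row in mixing]
col_sums = [sum(row[j] for row in mixing) for j in range(3)]
```

After the loop, every entry of `row_sums` and `col_sums` is 1 up to floating-point error, so multiplying by `mixing` takes a weighted average of its inputs rather than scaling them up or down.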

4. Muon Optimizer

V4 switches from AdamW to the Muon optimizer for most parameters. DeepSeek reports faster convergence and more stable training at trillion-parameter scale. AdamW is retained for embeddings, the prediction head, and RMSNorm weights. Peak learning rate: 2.0e-4 for Pro, with cosine decay.
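
The optimizer split and the schedule can be sketched as follows. Only the peak learning rate, the cosine shape, and the list of AdamW-managed parameter types come from the text above; the parameter-name markers, the absence of warmup, the zero floor, and the total step count are all assumptions made for illustration.

```python
import math

PEAK_LR = 2.0e-4  # peak learning rate the article reports for V4-Pro


def cosine_lr(step: int, total_steps: int,
              peak: float = PEAK_LR, floor: float = 0.0) -> float:
    """Cosine decay from `peak` at step 0 to `floor` at `total_steps` (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


def optimizer_for(param_name: str) -> str:
    """Route a parameter to AdamW or Muon by name.

    The substrings below are a hypothetical naming scheme for embeddings,
    the prediction head, and RMSNorm weights.
    """
    adamw_markers = ("embed", "lm_head", "norm")
    return "adamw" if any(marker in param_name for marker in adamw_markers) else "muon"


total = 1_000  # hypothetical number of training steps
schedule = [cosine_lr(s, total) for s in (0, total // 2, total)]
# schedule is approximately [2.0e-4, 1.0e-4, 0.0]
```

For example, `optimizer_for("layers.0.attn.q_proj.weight")` returns `"muon"`, while `optimizer_for("layers.0.input_norm.weight")` returns `"adamw"`.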

5. FP4 Quantization-Aware Training

FP4 quantization-aware training was applied to MoE expert weights and the indexer QK path during pre-training. This reduces memory requirements and enables more efficient inference without the quality loss that comes from post-training quantization.
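
The core trick in quantization-aware training is to run the forward pass through a quantize-then-dequantize step so the model learns weights that survive rounding. The sketch below shows only that rounding step. It assumes the E2M1 4-bit float layout and a single per-block scale, neither of which the article specifies, and it omits gradient handling entirely; the function name is hypothetical.

```python
import math
from typing import List, Sequence

# Magnitudes representable in the E2M1 4-bit float layout (assumed format).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize_fp4(weights: Sequence[float]) -> List[float]:
    """Round each weight to the nearest FP4 value, then map it back to full precision.

    One scale per block maps the largest magnitude onto the top grid value
    (an assumed scaling scheme).
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / FP4_GRID[-1]
    quantized = []
    for w in weights:
        nearest = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        quantized.append(math.copysign(nearest * scale, w))
    return quantized


block = [6.0, -3.0, 0.4, 1.2]
rounded = fake_quantize_fp4(block)  # [6.0, -3.0, 0.5, 1.0]
```

The difference between `block` and `rounded` is the rounding error the network sees during training, which is what lets it adapt before deployment instead of absorbing the error afterwards.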

CSA + HCA

Hybrid attention that cuts FLOPs to 27% and KV cache to 10% of V3.2. CSA selects top-1024 entries; HCA provides cheap global context.

mHC

Stable trillion-scale training via constrained mixing matrices on the Birkhoff Polytope. Prevents signal explosion in deep networks.

Muon + FP4 QAT

Muon optimizer for faster convergence. FP4 quantization-aware training on expert weights for efficient inference without quality loss.

Benchmark Performance

V4-Pro-Max is V4-Pro with extended reasoning tokens. It achieves the highest LiveCodeBench and Codeforces scores of any model and comes within 0.2 points of Claude Opus 4.6 on SWE-bench Verified.

Benchmark | V4-Pro-Max | Claude Opus 4.6 | GPT-5.4 xHigh | Gemini 3.1 Pro
SWE-bench Verified | 80.6% | 80.8% | - | -
LiveCodeBench Pass@1 | 93.5 | 88.8 | - | 91.7
Codeforces Rating | 3206 | - | 3168 | 3052

Benchmark | V4-Pro-Max | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro
MMLU-Pro | 87.5% | - | - | 91.0%
GPQA Diamond | 90.1% | - | - | 94.3%
HLE | 37.7% | 40.0% | - | -
HMMT 2026 | 95.2% | 96.2% | - | -
Putnam 2025 | 120/120 | - | - | -

V4-Pro-Max leads on coding benchmarks (LiveCodeBench, Codeforces) and achieves near-parity with Claude Opus 4.6 on SWE-bench Verified. Claude leads on HLE by 2.3 points (40.0% vs 37.7%) and on HMMT 2026 math by 1.0 point (96.2% vs 95.2%). Gemini 3.1 Pro leads on MMLU-Pro by 3.5 points and on GPQA Diamond by 4.2 points. The gap between these models is narrowing: on the percentage-scored benchmarks above, every reported gap is under five points.

For Coding: How Good Is It?

V4-Pro-Max achieves the highest LiveCodeBench Pass@1 score of any model (93.5) and a Codeforces rating of 3206, ahead of GPT-5.4 xHigh (3168) and Gemini 3.1 Pro (3052). On SWE-bench Verified, it scores 80.6%, trailing Claude Opus 4.6 by 0.2 points.

LiveCodeBench Pass@1: 93.5 (highest)
SWE-bench Verified: 80.6%
Codeforces rating: 3206 (highest)

Long-Context Code Tasks

The CSA+HCA attention mechanism is designed for long-context inference. At 1M tokens, V4-Pro uses 27% of the FLOPs and 10% of the KV cache compared to V3.2. CSA selects the 1,024 most relevant compressed KV entries per query, so the model focuses compute on the parts of the codebase that matter for each step. HCA maintains a compressed global view of the full context in every layer.

Cost Per Coding Task

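V4-Pro-Max at $3.48/M output tokens delivers SWE-bench Verified performance within 0.2 points of Claude Opus 4.6 at $75/M output tokens. That is roughly a 21.6x lower price per output token at near-identical coding benchmark performance. The saving per task will be smaller than the per-token ratio if Max mode's extended reasoning tokens are billed as output. For teams running thousands of agentic coding tasks per day, this changes what is economically feasible.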

V4-Flash at $0.28/M output is even cheaper. It approaches V4-Pro quality on general coding tasks with a 2-3 point gap, though it falls further behind on agentic coding (7-10 points on SWE-Pro and Terminal-Bench).

What Changed from DeepSeek V3

Dimension | DeepSeek V3 | DeepSeek V4-Pro
Total parameters | 671B | 1.6T (2.4x larger)
Active parameters | 37B per token | 49B per token
Context window | 128K tokens | 1M tokens (8x larger)
Attention mechanism | Standard MLA | Hybrid CSA + HCA
Residual connections | Standard | mHC (manifold-constrained)
Optimizer | AdamW | Muon (AdamW for embeddings)
Training data | 14.8T tokens | 33T tokens
Inference FLOPs (1M ctx) | Baseline (V3.2) | 27% of V3.2
KV cache (1M ctx) | Baseline (V3.2) | 10% of V3.2
License | Modified OpenRAIL | MIT

The efficiency gains are the most significant change. Despite having 2.4x the total parameters of V3 and activating about 1.3x as many per token, V4-Pro requires less compute and memory per inference step at long context lengths than V3.2. The CSA+HCA attention mechanism is the primary driver: by compressing and selecting KV entries rather than attending to all of them, V4-Pro's per-token attention cost grows far more slowly with context length than full attention does.

API Pricing

Model | Input (per 1M tokens) | Output (per 1M tokens) | Context
DeepSeek V4-Flash | $0.14 | $0.28 | 1M tokens
DeepSeek V4-Pro | $1.74 | $3.48 | 1M tokens
Claude Opus 4.6 | $15 | $75 | 1M tokens
GPT-5.4 | ~$15 | ~$60 | 128K tokens
Gemini 3.1 Pro | ~$3.50 | ~$10.50 | 1M tokens

V4-Pro-Max comes within 0.2 points of Claude Opus 4.6 on SWE-bench Verified at about 1/21 of the output token price ($3.48 vs $75). V4-Flash is about 107x cheaper than Claude on input tokens ($0.14 vs $15) and about 268x cheaper on output tokens ($0.28 vs $75), and it approaches V4-Pro quality on most tasks with a 2-3 point gap.
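
The price ratios can be recomputed directly from the table. A minimal sketch, using the listed prices as-is; the dictionary keys and the function name are hypothetical.

```python
# USD per 1M tokens, copied from the pricing table above.
PRICES = {
    "v4_flash": {"input": 0.14, "output": 0.28},
    "v4_pro": {"input": 1.74, "output": 3.48},
    "claude_opus_4_6": {"input": 15.0, "output": 75.0},
}


def price_ratio(expensive: str, cheap: str, kind: str) -> float:
    """How many times more `expensive` costs than `cheap` for `kind` tokens."""
    return PRICES[expensive][kind] / PRICES[cheap][kind]


claude_vs_pro_output = price_ratio("claude_opus_4_6", "v4_pro", "output")      # about 21.6
claude_vs_flash_input = price_ratio("claude_opus_4_6", "v4_flash", "input")    # about 107.1
claude_vs_flash_output = price_ratio("claude_opus_4_6", "v4_flash", "output")  # about 267.9
pro_vs_flash_input = price_ratio("v4_pro", "v4_flash", "input")                # about 12.4
```

These are per-token price ratios only; the cost of a whole task also depends on how many tokens each model consumes to finish it.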

For a typical coding session (50K input, 10K output per request, 20 requests/day):

  • V4-Flash: ~$0.20/day ($6/month)
  • V4-Pro: ~$2.44/day ($73/month)
  • Claude Opus 4.6: ~$30/day ($900/month)
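
The daily figures follow from the stated workload: 20 requests of 50K input and 10K output tokens come to 1M input and 0.2M output tokens per day. A small sketch of that arithmetic, assuming a 30-day month; the function name is hypothetical.

```python
def daily_cost(input_price: float, output_price: float,
               input_tokens: int = 50_000, output_tokens: int = 10_000,
               requests: int = 20) -> float:
    """Daily spend in USD, given prices in USD per 1M tokens."""
    millions_in = input_tokens * requests / 1_000_000
    millions_out = output_tokens * requests / 1_000_000
    return millions_in * input_price + millions_out * output_price


flash = daily_cost(0.14, 0.28)    # about 0.196
pro = daily_cost(1.74, 3.48)      # about 2.436
claude = daily_cost(15.0, 75.0)   # about 30.0

for name, cost in (("V4-Flash", flash), ("V4-Pro", pro), ("Claude Opus 4.6", claude)):
    print(f"{name}: ${cost:.2f}/day, ${cost * 30:.0f}/month")
```

Changing `input_tokens`, `output_tokens`, or `requests` shows how quickly the gap widens for heavier agentic workloads.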

V4-Pro vs V4-Flash: Which to Use

V4-Flash approaches V4-Pro quality on general tasks with a 2-3 point benchmark gap. The gap widens to 7-10 points on agentic coding tasks (SWE-Pro, Terminal-Bench) that require multi-step reasoning and tool use.

Use Case | Recommendation | Why
General coding tasks | V4-Flash | 2-3 point gap at 12x lower cost
Agentic coding / SWE-bench-style tasks | V4-Pro or V4-Pro-Max | 7-10 point gap on agentic benchmarks
Cost-sensitive batch processing | V4-Flash | $0.14/M input is among the cheapest frontier-tier options
Maximum coding accuracy | V4-Pro-Max | 93.5 LiveCodeBench, 80.6% SWE-bench Verified
Long-context retrieval | Either | Both support 1M tokens with the same CSA+HCA mechanism

Limitations

  • Preview release: Both models are marked as preview. Performance may change in the general availability release.
  • Reasoning benchmarks: V4-Pro-Max trails Gemini 3.1 Pro on MMLU-Pro (87.5% vs 91.0%) and GPQA Diamond (90.1% vs 94.3%). It trails Claude Opus 4.6 on HLE (37.7% vs 40.0%).
  • Agentic gap for Flash: V4-Flash falls 7-10 points behind V4-Pro on agentic coding benchmarks (SWE-Pro, Terminal-Bench). The cheap model is not a drop-in replacement for the expensive one on complex tasks.
  • Self-hosting requirements: V4-Pro at 1.6T total parameters requires significant hardware for self-hosting even with quantization. V4-Flash at 284B is more accessible.

Frequently Asked Questions

What models are in the DeepSeek V4 release?

Two models: V4-Pro (1.6T total, 49B active) and V4-Flash (284B total, 13B active). Both support 1M-token context and are MIT-licensed. V4-Pro-Max is V4-Pro with extended reasoning tokens for higher benchmark scores.

How does V4-Pro-Max compare to Claude Opus 4.6?

On SWE-bench Verified: V4-Pro-Max 80.6%, Claude Opus 4.6 80.8% (a 0.2 point gap). On LiveCodeBench: V4-Pro-Max 93.5, Claude Opus 4.6 Max 88.8. On HLE: Claude leads 40.0% vs 37.7%. On HMMT 2026 math: Claude leads 96.2% vs 95.2%. V4-Pro-Max costs $3.48/M output vs Claude's $75/M output.

What changed architecturally from V3?

Standard attention replaced with hybrid CSA+HCA. Muon optimizer replaced AdamW for most parameters. FP4 quantization-aware training applied to expert weights. mHC added for training stability. Training data increased from 14.8T to 33T tokens. The result: 27% of inference FLOPs and 10% of KV cache vs V3.2 at 1M context.

What does V4-Flash cost?

$0.14/M input tokens and $0.28/M output tokens. It approaches V4-Pro quality on general tasks (2-3 point gap) but falls further behind on agentic coding (7-10 point gap on SWE-Pro and Terminal-Bench).

Is V4 open source?

Yes. Both models are released under the MIT license with weights on Hugging Face. No commercial restrictions.

What is the context window?

1 million tokens for both V4-Pro and V4-Flash. The CSA+HCA hybrid attention mechanism makes this practical: V4-Pro uses only 10% of the KV cache that V3.2 would need at 1M context.

