DeepSeek V4: Architecture, Benchmarks, and API Guide (2026)

DeepSeek V4 launches in March 2026: a 1T-parameter MoE model with Engram conditional memory, a 1M-token context window, native multimodal input, and pre-release benchmark claims of 80-85% on SWE-bench Verified and 90% on HumanEval. Full spec breakdown, API pricing, and comparison to V3.

March 3, 2026

TL;DR

DeepSeek V4 launches this week. It is a 1-trillion-parameter MoE model with a 1M-token context window, three new architectural techniques, and native multimodal support. Pre-release claims put it at 80-85% on SWE-bench Verified and 90% on HumanEval. API pricing is expected around $0.14/M input tokens, roughly 20-50x cheaper than Western frontier models. Those numbers come from internal DeepSeek benchmarks only; independent evaluations will follow.

What it is

A 1T-parameter mixture-of-experts LLM with native multimodal input (text, image, video), 1M-token context, Engram conditional memory, and three architectural improvements over V3. Open-weight license expected.

Why it matters

If the benchmark claims hold under independent testing, V4 would match or beat the current SWE-bench record (Claude Opus 4.5 at 80.9%) at 20-50x lower cost. That changes the economics of AI-assisted software development.

Pre-release note

DeepSeek V4 is releasing this week (March 3, 2026). Benchmark numbers in this guide come from DeepSeek's internal testing and pre-release leaks unless explicitly noted otherwise. Independent evaluations are not yet available. Treat specific scores as directional until third-party results are published.

Key Specs at a Glance

  • 1T total parameters (MoE)
  • 32B active parameters per token
  • 1M-token context window
  • 90% on HumanEval (pre-release claim)
  • 80-85% on SWE-bench Verified (pre-release claim)
  • $0.14 expected input cost per 1M tokens
| Specification | Value | Notes |
|---|---|---|
| Total parameters | ~1 trillion | MoE architecture |
| Active parameters | ~32 billion | Per token, lower than V3's 37B |
| Context window | 1,000,000 tokens | 8x larger than V3's 128K |
| Architecture | MoE + Engram + mHC | Three novel innovations over V3 |
| Multimodal | Native (text, image, video) | Integrated from pre-training |
| HumanEval (claimed) | ~90% | Internal benchmark, unverified |
| SWE-bench Verified (claimed) | 80-85% | Internal benchmark, unverified |
| Expected API input price | $0.14 / 1M tokens | Projected, not confirmed |
| Hardware optimization | Huawei Ascend, Cambricon | First frontier model optimized for Chinese chips |
| License | Open-weight (expected) | Consistent with V3/V3.2 practice |

Architecture: Three Innovations

V4 is not a simple scale-up of V3. Three architectural changes distinguish it: Engram conditional memory, Manifold-Constrained Hyper-Connections (mHC), and DeepSeek Sparse Attention. Each solves a specific problem that appeared as models scaled beyond 671B parameters.

1. Engram Conditional Memory

Engram separates static knowledge retrieval from dynamic neural reasoning. Named after the neuroscience term for a memory trace, it uses a hash-based lookup table stored in DRAM rather than GPU VRAM. When the model encounters static patterns, such as syntax rules, entity names, or library function signatures, Engram retrieves them in O(1) time rather than running them through attention layers.

The problem Engram solves: standard transformers waste GPU compute reconstructing simple factual patterns on every forward pass. Engram offloads 20-25% of sparse parameters to this lookup system, freeing compute for actual reasoning. A 27B test model with Engram showed 3-5 point benchmark improvements across knowledge, reasoning, and coding tasks. Needle-in-a-Haystack accuracy jumped from 84.2% to 97%.

Standard transformer memory

All knowledge stored in learned weights. Every forward pass must reconstruct static facts through attention. O(n) complexity scales with context length. GPU VRAM handles everything.

Engram memory (V4)

Static knowledge offloaded to hash-based DRAM lookup. O(1) retrieval regardless of context length. GPU reserved for reasoning. Benchmark improvements of 3-5 points in testing.
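The preprint does not spell out the production implementation, so the following is only a toy sketch of the idea: hash a static token pattern into a key for a host-memory table, and treat a miss as a signal to fall back to the regular attention path. All names here (`engram_key`, `memory_table`, the 8-wide embeddings) are illustrative, not DeepSeek's.

```python
import hashlib

import numpy as np

DIM = 8  # toy embedding width

# Hypothetical DRAM-side table: hash of a static token pattern -> cached embedding.
# In a real deployment this would live in host memory, not GPU VRAM.
memory_table = {}

def engram_key(tokens):
    """Hash a token n-gram into a fixed-size key; cost is O(1) in context length."""
    return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

def engram_write(tokens, embedding):
    memory_table[engram_key(tokens)] = embedding

def engram_read(tokens):
    """Return the cached embedding for a static pattern, or None to fall back
    to the (expensive) attention path."""
    return memory_table.get(engram_key(tokens))

# Cache a "static" pattern once, e.g. a library function signature.
engram_write(["numpy", "ndarray"], np.ones(DIM))

hit = engram_read(["numpy", "ndarray"])     # O(1) lookup, independent of context size
miss = engram_read(["novel", "reasoning"])  # not cached -> route through attention
```

The point of the sketch is the split itself: lookups cost the same whether the context holds one file or an entire codebase, which is why Engram pairs naturally with the 1M-token window.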

2. Manifold-Constrained Hyper-Connections (mHC)

Training a trillion-parameter model is unstable. Standard residual connections maintain training stability by preserving identity mapping through layers. Hyper-Connections extended this with multiple parallel information streams, but broke the identity mapping guarantee, causing catastrophic signal amplification at scale.

mHC fixes this by constraining the mixing matrices to the Birkhoff Polytope, a mathematical manifold of doubly stochastic matrices. The Sinkhorn-Knopp algorithm enforces the constraint. Doubly stochastic matrices preserve signal magnitude, so residual streams neither explode nor collapse. The overhead is only 6-7% extra training compute.
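DeepSeek has not published the exact mHC parameterization, but the Sinkhorn-Knopp step the text names is standard: alternately normalize rows and columns until the matrix is (approximately) doubly stochastic. A minimal NumPy sketch:

```python
import numpy as np

def sinkhorn_knopp(m, iters=50):
    """Project a positive matrix toward the Birkhoff polytope (doubly
    stochastic matrices) by alternately normalizing rows and columns."""
    m = np.asarray(m, dtype=float)
    for _ in range(iters):
        m = m / m.sum(axis=1, keepdims=True)  # rows sum to 1
        m = m / m.sum(axis=0, keepdims=True)  # columns sum to 1
    return m

rng = np.random.default_rng(0)
raw = rng.uniform(0.1, 1.0, size=(4, 4))  # unconstrained mixing weights
mix = sinkhorn_knopp(raw)

# Rows and columns now each sum to ~1, so mixing residual streams with `mix`
# preserves total signal magnitude: streams neither explode nor collapse.
```

Because every row and column of a doubly stochastic matrix sums to one, the mixed residual streams are convex combinations of the originals, which is the magnitude-preservation property the text describes.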

3. DeepSeek Sparse Attention

With a 1M-token context window, standard full attention is computationally prohibitive. DeepSeek Sparse Attention is a custom attention mechanism designed for long-context efficiency. Combined with the Engram O(1) memory system, V4 can handle 1M-token contexts without quadratic attention cost.
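The exact V4 attention mechanism has not been published; a causal sliding-window mask is one common sub-quadratic pattern and illustrates the scaling argument, nothing more. Each token attends to only the previous `window` tokens, so cost grows as O(n·window) rather than O(n²):

```python
import numpy as np

def sliding_window_mask(n, window=4):
    """Boolean attention mask: token i attends only to the previous `window`
    tokens (causal local attention)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(n=16, window=4)
dense_cost = 16 * 16               # entries a full attention matrix computes
sparse_cost = int(mask.sum())      # entries this sparse pattern computes
```

At n = 16 the gap is modest; at n = 1,000,000 the quadratic term is what makes full attention prohibitive, which is the problem any sparse scheme like V4's is built to avoid.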

Engram

O(1) static knowledge lookup. Offloads 20-25% of sparse parameters to DRAM hash tables. 3-5 point benchmark gains and 97% Needle-in-a-Haystack accuracy.

mHC

Stable trillion-scale training via constrained mixing matrices on the Birkhoff Polytope. Only 6-7% training overhead vs unconstrained hyper-connections.

Sparse Attention

Custom attention for 1M-token contexts. Works with Engram to eliminate quadratic scaling on long-context tasks like full-codebase reasoning.

Benchmark Performance

Benchmark caveat

All V4 scores below come from pre-release internal DeepSeek benchmarks or leaks. No independent evaluation exists yet. Scores for V3.2, Claude Opus 4.6, and GPT-5.3 Codex are from published results. Compare carefully.

| Benchmark | DeepSeek V4 (claimed) | Claude Opus 4.6 | GPT-5.3 Codex | DeepSeek V3.2 |
|---|---|---|---|---|
| SWE-bench Verified | 80-85% | 80.8% | 77.3% | ~65% |
| HumanEval | ~90% | ~88% | ~85% | ~80% |
| Context window | 1M tokens | 1M tokens (beta) | 128K tokens | 128K tokens |
| Active params | 32B | N/A (dense) | N/A (dense) | 37B |
| Input price / 1M tokens | ~$0.14 (projected) | ~$15 | ~$15 | $0.27 |

The SWE-bench Verified claim is the one to watch. Claude Opus 4.5 was the first model to crack 80% on that benchmark. If V4 tops the current record at $0.14/M input tokens, it changes the cost structure for agentic coding workloads by an order of magnitude.

For Coding: How Good Is It?

DeepSeek V4 was built with coding as a primary target. According to Reuters and The Information, internal benchmarks show V4 outperforming both Claude and GPT series specifically on extremely long code prompts, which is where the 1M-token context and Engram memory most directly apply.

Long-Context Code Tasks

Most coding benchmarks test short, self-contained problems. V4's architecture specifically targets long-context software engineering: understanding large codebases, tracing dependencies across files, maintaining coherence over multi-step refactors. The Engram memory system means the model can hold static facts about libraries and syntax in O(1) lookup while applying attention to the actual reasoning problem.

SWE-bench Verified Score

SWE-bench Verified tests real GitHub issues: the model must read a real codebase, understand the bug, write a patch, and pass the existing test suite. Claude Opus 4.5 holds the current record at 80.9%. V4's pre-release claim of 80-85% would either match that or set a new record. Cost per task is estimated at $0.03 vs $0.72 for Claude, a 24x cost reduction if the performance claims hold.

  • 80-85% on SWE-bench Verified (pre-release claim)
  • 90% on HumanEval (pre-release claim)
  • 24x lower cost per task vs Claude (estimated)
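Per-task cost is just tokens multiplied by price. The sketch below uses assumed token counts (150K in, 25K out, purely illustrative) with the projected V4 prices and Claude Opus 4.6's published prices; note that at identical token counts the price gap is far larger than 24x, so the article's 24x estimate evidently assumes different per-model token usage per task.

```python
def task_cost(in_tokens, out_tokens, in_price, out_price):
    """Dollar cost of one task; prices are quoted per 1M tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# Hypothetical agentic task: 150K input tokens, 25K output tokens.
v4 = task_cost(150_000, 25_000, 0.14, 0.28)      # projected V4 prices
claude = task_cost(150_000, 25_000, 15.0, 75.0)  # Claude Opus 4.6 prices

ratio = claude / v4  # price gap at identical token usage
```

Even under conservative assumptions, the gap is the order-of-magnitude shift the article describes.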

Hardware and Speed

V4 is optimized for Huawei Ascend and Cambricon chips rather than Nvidia H100/H200. The Ascend 910C delivers roughly 60% of H100's peak FP16 performance, but 1.8x better performance-per-watt. Because V4 activates only 32B parameters per token despite a 1T total, inference is cheaper than a dense model of equivalent capability.

What Changed from DeepSeek V3

| Dimension | DeepSeek V3 | DeepSeek V4 |
|---|---|---|
| Total parameters | 671B | ~1T (50% larger) |
| Active parameters | 37B per token | ~32B per token (cheaper inference) |
| Context window | 128K tokens | 1M tokens (8x larger) |
| Memory system | Standard attention weights | Engram O(1) conditional memory |
| Residual connections | Standard residual connections | mHC (manifold-constrained) |
| Multimodal | Text only | Native text, image, video |
| Experts per token | ~8 experts | 16 experts (pre-release report) |
| Hardware target | Nvidia H800 | Huawei Ascend, Cambricon |
| Training cost | $5.6M (2.788M GPU hours) | Not disclosed |

The active parameter drop from 37B to 32B is counterintuitive: a bigger total model activates fewer parameters per token. This is intentional. Engram offloads static knowledge to hash lookups, so the neural network handles a narrower set of reasoning tasks. The result should be faster inference at lower cost despite the larger total model size.
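The "fewer active parameters" arithmetic follows from top-k expert routing: only the selected experts' weights run per token. A minimal sketch, assuming a hypothetical pool of 256 routed experts (the true expert count has not been disclosed):

```python
import numpy as np

def route_topk(gate_logits, k=16):
    """Pick the top-k experts for one token and renormalize their gate
    weights, mirroring the '16 experts per token' reported for V4."""
    topk = np.argsort(gate_logits)[-k:]  # indices of the k highest-scoring experts
    weights = np.exp(gate_logits[topk])
    weights /= weights.sum()             # softmax over the selected experts only
    return topk, weights

rng = np.random.default_rng(1)
logits = rng.normal(size=256)            # assumed expert count, illustrative
experts, weights = route_topk(logits, k=16)

# Only these 16 expert FFNs execute for this token; the other 240 stay idle.
# That is why active parameters (~32B) sit far below the ~1T total.
```

Inference cost tracks the active slice, not the total, which is how a 1T model can be cheaper to serve than V3's 671B.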

API Pricing

V4 pricing has not been officially announced. Based on V3's pricing trajectory and pre-release reports, projected pricing is $0.14/M input tokens and $0.28/M output tokens, with cached input at $0.07/M tokens.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context |
|---|---|---|---|
| DeepSeek V4 (projected) | ~$0.14 | ~$0.28 | 1M tokens |
| DeepSeek V3.2 | $0.27 | $1.10 | 128K tokens |
| Claude Opus 4.6 | ~$15 | ~$75 | 1M tokens (beta) |
| GPT-5.3 Codex | ~$15 | ~$60 | 128K tokens |
| Gemini 3 Pro | ~$3.50 | ~$10.50 | 1M tokens |

The pricing advantage comes from MoE architecture (only 32B parameters active per token), Engram's offloading of static retrieval to DRAM, and DeepSeek's use of Huawei Ascend chips, which cost less per inference hour than Nvidia A100/H100 clusters. This is not dumping. It is a structural cost difference.
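DeepSeek's existing API is OpenAI-compatible, so a V4 call will likely look like the sketch below. The model id `deepseek-v4` is a placeholder until launch docs confirm the real name; this is not an official snippet.

```python
def build_v4_request(prompt, system="You are a coding assistant."):
    """Assemble an OpenAI-compatible chat payload for the DeepSeek endpoint."""
    return {
        "model": "deepseek-v4",  # placeholder id, unconfirmed until launch
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    }

payload = build_v4_request("Explain this stack trace: ...")

# Send with any OpenAI-compatible client pointed at DeepSeek's base URL, e.g.:
#   client = OpenAI(api_key=KEY, base_url="https://api.deepseek.com")
#   client.chat.completions.create(**payload)
```

If V4 follows V3's launch pattern, existing integrations should need only a model-string change, which is part of why the pricing gap matters in practice.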

V4 Lite

Pre-release reports mention a V4 Lite variant with 200B parameters and a 1M-token context window. This would be the production workhorse for cost-sensitive applications, while the full 1T version handles high-complexity tasks. Pricing for V4 Lite has not been disclosed.

Community Reaction

Developer communities reacted with the same combination of excitement and skepticism that met DeepSeek V3 and R1 on release.

What Developers Are Saying

The r/LocalLLaMA thread on V4 scored 308 points within hours. The dominant sentiment: excitement about the 1M-token context and Engram architecture, skepticism about unverified internal benchmarks. A recurring comment: "DeepSeek is the disruption we need" — pointing to open weights as the differentiator from OpenAI and Anthropic's closed ecosystem shift.

Critics on r/LocalLLaMA and r/Singularity raised the standard objections: DeepSeek's reasoning models waste compute on simple tasks, and internal benchmarks do not reflect real-world messiness. Both observations apply to every vendor's pre-release numbers.

Hardware Community Focus

Tom's Hardware and the hardware community focused on the Huawei Ascend optimization. This is the first frontier model optimized for non-Nvidia hardware at launch. V4 runs best on Ascend and Cambricon chips. Nvidia Blackwell compatibility exists but is secondary to the Chinese chip stack. This is a deliberate supply chain choice that reduces V4's dependency on U.S. export-controlled hardware.

On HN

The Hacker News discussion centered on the Engram architecture paper. Commenters with ML backgrounds flagged the Needle-in-a-Haystack jump from 84.2% to 97% as the most technically credible claim in the pre-release materials. Engram solves a real problem and has a published paper (arxiv:2601.07372). The benchmark numbers are less verifiable before release, but the architecture is sound.

Limitations

Known limitations and open questions

  • Benchmarks unverified: All V4 performance numbers come from DeepSeek internal testing. Independent evaluations from LMSYS, ARC, or third-party researchers have not been published.
  • Hardware availability: V4 is optimized for Huawei Ascend and Cambricon. Running it on Nvidia hardware at launch may not achieve the same performance or cost profile as reported.
  • Multimodal quality unknown: V4 is the first DeepSeek model with native video and image generation. No benchmarks for image or video quality have been released.
  • Export controls: U.S. export controls limit access to DeepSeek's API in some jurisdictions. Self-hosting avoids API restrictions but requires significant GPU resources for a 1T model.
  • V4 Lite details unclear: The 200B Lite variant has been mentioned in reports but not formally announced. Pricing and availability are unconfirmed.
  • Release date slipped before: The mid-February and late-February windows both passed without release. March 2026 has the strongest signal yet (Financial Times confirmed), but treat with appropriate uncertainty.

Frequently Asked Questions

When is DeepSeek V4 releasing?

The first week of March 2026, per Financial Times reporting on February 27. The release is timed to China's Two Sessions parliamentary meetings beginning March 4. Prior windows (mid-February, late February) passed without release.

How many parameters does DeepSeek V4 have?

Approximately 1 trillion total, with roughly 32 billion active per token. V4 uses mixture-of-experts routing with 16 expert pathways. The total is 50% larger than V3 (671B), but active parameters per token are lower (32B vs 37B), which keeps inference cost down.

What is Engram memory?

Engram is a conditional memory module that replaces attention-based retrieval for static knowledge. It uses hash-based O(1) lookups in DRAM instead of GPU VRAM. A published paper (arxiv:2601.07372) shows 3-5 point benchmark improvements and Needle-in-a-Haystack accuracy jumping from 84.2% to 97% on a 27B test model.

What are V4's SWE-bench scores?

Pre-release internal benchmarks claim 80-85% on SWE-bench Verified. Claude Opus 4.5 currently holds the record at 80.9%. These numbers are from DeepSeek's own testing only. Wait for independent verification before making infrastructure decisions based on them.

What is V4's API pricing?

Projected: $0.14/M input tokens, $0.28/M output tokens, $0.07/M cached input. Official pricing will be on DeepSeek's API docs at launch. Even if final prices come in higher, V4 will be significantly cheaper than Claude Opus 4.6 or GPT-5.3.

Is V4 open source?

Expected to be open-weight under a modified OpenRAIL-style license, consistent with V3 and V3.2. Code repository typically goes MIT. Model weights allow commercial use with some restrictions. Exact license text will be on the Hugging Face repo at release.

How does V4 differ from V3?

Three major changes: context window expanded 8x to 1M tokens, Engram memory added, and mHC stabilizes trillion-scale training. Plus native multimodal support (V3 was text-only) and optimization for Huawei Ascend chips instead of Nvidia H800. Active parameters per token actually dropped from 37B to 32B despite the larger total.

Can I use V4 for long codebase tasks?

The 1M-token context and Engram architecture are specifically designed for this. V4 can hold an entire large codebase in context while using O(1) lookup for static patterns like library APIs and syntax. According to sources cited by Reuters and The Information, V4 outperforms competitors specifically on long-form code prompts.

Related Articles

Use WarpGrep with DeepSeek V4 for Better Code Search Context

WarpGrep is an agentic code search tool that works as an MCP server. Connect it to any DeepSeek-powered agent for high-precision codebase context, so V4's 1M-token window gets filled with the right code, not noise.
