TL;DR
DeepSeek V4 launches this week. It is a 1-trillion-parameter MoE model with a 1M-token context window, three new architectural techniques, and native multimodal support. Pre-release claims put it at 80-85% SWE-bench Verified and 90% HumanEval. API pricing is expected around $0.14/M input tokens, roughly 20-50x cheaper than Western frontier models. The numbers are from internal DeepSeek benchmarks only. Independent evaluations will follow.
What it is
A 1T-parameter mixture-of-experts LLM with native multimodal input (text, image, video), 1M-token context, Engram conditional memory, and three architectural improvements over V3. Open-weight license expected.
Why it matters
If the benchmark claims hold under independent testing, V4 would match or beat the current SWE-bench record (Claude Opus 4.5 at 80.9%) at 20-50x lower cost. That changes the economics of AI-assisted software development.
Pre-release note
DeepSeek V4 is releasing this week (March 3, 2026). Benchmark numbers in this guide come from DeepSeek's internal testing and pre-release leaks unless explicitly noted otherwise. Independent evaluations are not yet available. Treat specific scores as directional until third-party results are published.
Key Specs at a Glance
| Specification | Value | Notes |
|---|---|---|
| Total parameters | ~1 trillion | MoE architecture |
| Active parameters | ~32 billion | Per token, lower than V3's 37B |
| Context window | 1,000,000 tokens | 8x larger than V3's 128K |
| Architecture | MoE + Engram + mHC | Three novel innovations over V3 |
| Multimodal | Native (text, image, video) | Integrated from pre-training |
| HumanEval (claimed) | ~90% | Internal benchmark, unverified |
| SWE-bench Verified (claimed) | 80-85% | Internal benchmark, unverified |
| Expected API input price | $0.14 / 1M tokens | Projected, not confirmed |
| Hardware optimization | Huawei Ascend, Cambricon | First frontier model optimized for Chinese chips |
| License | Open-weight (expected) | Consistent with V3/V3.2 practice |
Architecture: Three Innovations
V4 is not a simple scale-up of V3. Three architectural changes distinguish it: Engram conditional memory, Manifold-Constrained Hyper-Connections (mHC), and DeepSeek Sparse Attention. Each solves a specific problem that appeared as models scaled beyond 671B parameters.
1. Engram Conditional Memory
Engram separates static knowledge retrieval from dynamic neural reasoning. Named after the neuroscience term for a memory trace, it uses a hash-based lookup table stored in DRAM rather than GPU VRAM. When the model encounters static patterns such as syntax rules, entity names, or library function signatures, Engram retrieves them in O(1) time instead of reconstructing them through attention layers.
The problem Engram solves: standard transformers waste GPU compute reconstructing simple factual patterns on every forward pass. Engram offloads 20-25% of sparse parameters to this lookup system, freeing compute for actual reasoning. A 27B test model with Engram showed 3-5 point benchmark improvements across knowledge, reasoning, and coding tasks. Needle-in-a-Haystack accuracy jumped from 84.2% to 97%.
Standard transformer memory
All knowledge stored in learned weights. Every forward pass must reconstruct static facts through attention. O(n) complexity scales with context length. GPU VRAM handles everything.
Engram memory (V4)
Static knowledge offloaded to hash-based DRAM lookup. O(1) retrieval regardless of context length. GPU reserved for reasoning. Benchmark improvements of 3-5 points in testing.
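The split above can be sketched in a few lines. This is a minimal illustration of the conditional-memory idea, not DeepSeek's actual API: the class name, methods, and the string keys are all hypothetical, and a plain Python dict stands in for the DRAM-resident hash table.

```python
import numpy as np

class ConditionalMemory:
    """Toy Engram-style store: static facts live in a hash table."""

    def __init__(self):
        self.table = {}  # dict standing in for the DRAM-resident store

    def write(self, pattern, embedding):
        self.table[hash(pattern)] = embedding

    def lookup(self, pattern):
        # O(1) regardless of context length, unlike attention's O(n)
        return self.table.get(hash(pattern))

mem = ConditionalMemory()
mem.write("numpy.linalg.svd", np.ones(4))      # a "static fact": an API name

hit = mem.lookup("numpy.linalg.svd")           # served from the table
miss = mem.lookup("novel multi-step request")  # None -> route to neural layers
```

A lookup miss is the interesting case: anything not in the table falls through to the full attention path, which is why the table only ever holds static, frequently reconstructed patterns.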
2. Manifold-Constrained Hyper-Connections (mHC)
Training a trillion-parameter model is unstable. Standard residual connections maintain training stability by preserving identity mapping through layers. Hyper-Connections extended this with multiple parallel information streams, but broke the identity mapping guarantee, causing catastrophic signal amplification at scale.
mHC fixes this by constraining the mixing matrices to the Birkhoff Polytope, the manifold of doubly stochastic matrices: non-negative matrices whose rows and columns each sum to 1. The Sinkhorn-Knopp algorithm enforces the constraint. Because doubly stochastic mixing preserves signal magnitude, residual streams neither explode nor collapse. The overhead is only 6-7% extra training compute.
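The magnitude-preservation property is easy to verify numerically. The sketch below is a minimal illustration of Sinkhorn-Knopp, not DeepSeek's training code: alternating row and column normalization drives a positive matrix toward the Birkhoff Polytope, and mixing residual streams with the result leaves their total magnitude unchanged.

```python
import numpy as np

def sinkhorn_knopp(M, iters=100):
    """Project a positive matrix toward the Birkhoff Polytope by
    alternately normalizing rows and columns."""
    M = np.asarray(M, dtype=float)
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)  # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # columns sum to 1
    return M

rng = np.random.default_rng(0)
W = sinkhorn_knopp(rng.uniform(0.1, 1.0, size=(4, 4)))

streams = rng.normal(size=(4, 8))  # 4 parallel residual streams
mixed = W @ streams                # mixing neither amplifies nor collapses
```

Since every column of `W` sums to 1, the column sums of `mixed` equal those of `streams` exactly: that is the identity-mapping-style guarantee unconstrained hyper-connections lost.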
3. DeepSeek Sparse Attention
With a 1M-token context window, full attention is computationally prohibitive: its cost grows quadratically with sequence length. DeepSeek Sparse Attention restricts each query to a selected subset of relevant tokens rather than attending over the entire context. Combined with the Engram O(1) memory system, this lets V4 handle 1M-token contexts without the quadratic attention cost.
Engram
O(1) static knowledge lookup. Offloads 20-25% of sparse parameters to DRAM hash tables. 3-5 point benchmark gains and 97% Needle-in-a-Haystack accuracy.
mHC
Stable trillion-scale training via constrained mixing matrices on the Birkhoff Polytope. Only 6-7% training overhead vs unconstrained hyper-connections.
Sparse Attention
Custom attention for 1M-token contexts. Works with Engram to eliminate quadratic scaling on long-context tasks like full-codebase reasoning.
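To make the quadratic-vs-sparse trade-off concrete, here is a generic causal sliding-window mask, one of the simplest sparse-attention patterns. It illustrates the scaling argument only; it is not DeepSeek's actual kernel or token-selection scheme.

```python
import numpy as np

def sliding_window_mask(n, window):
    """Each token attends to at most `window` earlier tokens plus itself,
    so work grows as O(n * window) instead of O(n^2)."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - window):i + 1] = True  # causal local window
    return mask

n, window = 1000, 64
mask = sliding_window_mask(n, window)
computed = int(mask.sum())     # attention entries actually evaluated
fraction = computed / (n * n)  # ~6% of full attention at n=1000
```

At n=1,000 the sparse pattern evaluates about 6% of the full attention matrix; at n=1,000,000 the gap is four orders of magnitude, which is why some form of sparsity is mandatory at that scale.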
Benchmark Performance
Benchmark caveat
All V4 scores below come from pre-release internal DeepSeek benchmarks or leaks. No independent evaluation exists yet. Scores for V3.2, Claude Opus 4.6, and GPT-5.3 Codex are from published results. Compare carefully.
| Benchmark | DeepSeek V4 (claimed) | Claude Opus 4.6 | GPT-5.3 Codex | DeepSeek V3.2 |
|---|---|---|---|---|
| SWE-bench Verified | 80-85% | 80.8% | 77.3% | ~65% |
| HumanEval | ~90% | ~88% | ~85% | ~80% |
| Context window | 1M tokens | 1M tokens (beta) | 128K tokens | 128K tokens |
| Active params | 32B | N/A (dense) | N/A (dense) | 37B |
| Input price / 1M tokens | ~$0.14 (projected) | ~$15 | ~$15 | $0.27 |
The SWE-bench Verified claim is the one to watch. Claude Opus 4.5 was the first model to crack 80% on that benchmark, and its 80.9% remains the record. If V4 exceeds that figure at $0.14/M input tokens, it changes the cost structure for agentic coding workloads by an order of magnitude.
For Coding: How Good Is It?
DeepSeek V4 was built with coding as a primary target. According to Reuters and The Information, internal benchmarks show V4 outperforming both the Claude and GPT series specifically on extremely long code prompts, which is where the 1M-token context and Engram memory apply most directly.
Long-Context Code Tasks
Most coding benchmarks test short, self-contained problems. V4's architecture specifically targets long-context software engineering: understanding large codebases, tracing dependencies across files, maintaining coherence over multi-step refactors. The Engram memory system means the model can hold static facts about libraries and syntax in O(1) lookup while applying attention to the actual reasoning problem.
SWE-bench Verified Score
SWE-bench Verified tests real GitHub issues: the model must read a real codebase, understand the bug, write a patch, and pass the existing test suite. Claude Opus 4.5 holds the current record at 80.9%. V4's pre-release claim of 80-85% would either match that or set a new record. Cost per task is estimated at $0.03 vs $0.72 for Claude, a 24x cost reduction if the performance claims hold.
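The per-task economics above are simple to restate as arithmetic. Both figures are estimates cited in this guide, not measured prices:

```python
# Back-of-envelope cost-per-task comparison (estimated figures from above)
deepseek_per_task = 0.03  # USD per SWE-bench-style task, estimated
claude_per_task = 0.72    # USD per task, estimated

ratio = claude_per_task / deepseek_per_task
print(f"{ratio:.0f}x cheaper per task")  # 24x
```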
Hardware and Speed
V4 is optimized for Huawei Ascend and Cambricon chips rather than Nvidia H100/H200. The Ascend 910C delivers roughly 60% of H100's peak FP16 performance, but 1.8x better performance-per-watt. Because V4 activates only 32B parameters per token despite a 1T total, inference is cheaper than a dense model of equivalent capability.
What Changed from DeepSeek V3
| Dimension | DeepSeek V3 | DeepSeek V4 |
|---|---|---|
| Total parameters | 671B | ~1T (50% larger) |
| Active parameters | 37B per token | ~32B per token (cheaper inference) |
| Context window | 128K tokens | 1M tokens (8x larger) |
| Memory system | Standard attention weights | Engram O(1) conditional memory |
| Residual connections | Standard hyper-connections | mHC (manifold-constrained) |
| Multimodal | Text only | Native text, image, video |
| Experts per token | ~8 experts | 16 experts (pre-release report) |
| Hardware target | Nvidia H800 | Huawei Ascend, Cambricon |
| Training cost | $5.6M (2.788M GPU hours) | Not disclosed |
The active parameter drop from 37B to 32B is counterintuitive: a bigger total model activates fewer parameters per token. This is intentional. Engram offloads static knowledge to hash lookups, so the neural network handles a narrower set of reasoning tasks. The result should be faster inference at lower cost despite the larger total model size.
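A toy top-k router shows why total and active parameter counts diverge in an MoE model: only the selected experts' weights are touched per token. The 16-experts-per-token figure matches the pre-release report; the 64 total experts and all function names here are arbitrary illustrative values, not V4's actual configuration.

```python
import numpy as np

def route(router_logits, k):
    """Pick the k highest-scoring experts and normalize their gate weights."""
    topk = np.argsort(router_logits)[-k:]  # indices of selected experts
    w = np.exp(router_logits[topk])
    return topk, w / w.sum()               # softmax over the selected k

rng = np.random.default_rng(0)
num_experts, k = 64, 16
selected, gates = route(rng.normal(size=num_experts), k)

active_fraction = k / num_experts  # only 25% of expert weights fire per token
```

Grow `num_experts` and the total parameter count grows with it, while the per-token compute, set by `k`, stays fixed.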
API Pricing
V4 pricing has not been officially announced. Based on V3's pricing trajectory and pre-release reports, projected pricing is $0.14/M input tokens and $0.28/M output tokens, with cached input at $0.07/M tokens.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context |
|---|---|---|---|
| DeepSeek V4 (projected) | ~$0.14 | ~$0.28 | 1M tokens |
| DeepSeek V3.2 | $0.27 | $1.10 | 128K tokens |
| Claude Opus 4.6 | ~$15 | ~$75 | 1M tokens (beta) |
| GPT-5.3 Codex | ~$15 | ~$60 | 128K tokens |
| Gemini 3 Pro | ~$3.50 | ~$10.50 | 1M tokens |
The pricing advantage comes from MoE architecture (only 32B parameters active per token), Engram's offloading of static retrieval to DRAM, and DeepSeek's use of Huawei Ascend chips, which cost less per inference hour than Nvidia A100/H100 clusters. This is not dumping. It is a structural cost difference.
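Applying the table's rates to a concrete request makes the gap tangible. The workload below (100K tokens in, 10K out, small enough to fit every model's context window) is an arbitrary example; V4's price is projected, the others are the published rates quoted above.

```python
# Rough cost of one agentic request at the per-million-token rates above
prices = {  # (input $/M tokens, output $/M tokens)
    "DeepSeek V4 (projected)": (0.14, 0.28),
    "Claude Opus 4.6": (15.0, 75.0),
    "GPT-5.3 Codex": (15.0, 60.0),
    "Gemini 3 Pro": (3.50, 10.50),
}
tokens_in, tokens_out = 100_000, 10_000
costs = {m: tokens_in / 1e6 * p_in + tokens_out / 1e6 * p_out
         for m, (p_in, p_out) in prices.items()}
# DeepSeek ~$0.017 vs Claude ~$2.25 for the same request: over 100x apart
```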
V4 Lite
Pre-release reports mention a V4 Lite variant with 200B parameters and a 1M-token context window. This would be the production workhorse for cost-sensitive applications, while the full 1T version handles high-complexity tasks. Pricing for V4 Lite has not been disclosed.
Community Reaction
Developer communities reacted with the same combination of excitement and skepticism that met DeepSeek V3 and R1 on release.
What Developers Are Saying
The r/LocalLLaMA thread on V4 scored 308 points within hours. The dominant sentiment: excitement about the 1M-token context and Engram architecture, skepticism about unverified internal benchmarks. A recurring comment: "DeepSeek is the disruption we need" — pointing to open weights as the differentiator from OpenAI and Anthropic's closed ecosystem shift.
Critics on r/LocalLLaMA and r/Singularity raised the standard objections: DeepSeek's reasoning models waste compute on simple tasks, and internal benchmarks do not reflect real-world messiness. Both observations apply to every vendor's pre-release numbers.
Hardware Community Focus
Tom's Hardware and the hardware community focused on the Huawei Ascend optimization. This is the first frontier model optimized for non-Nvidia hardware at launch. V4 runs best on Ascend and Cambricon chips. Nvidia Blackwell compatibility exists but is secondary to the Chinese chip stack. This is a deliberate supply chain choice that reduces V4's dependency on U.S. export-controlled hardware.
On HN
The Hacker News discussion centered on the Engram architecture paper. Commenters with ML backgrounds flagged the Needle-in-a-Haystack jump from 84.2% to 97% as the most technically credible claim in the pre-release materials. Engram solves a real problem and has a published paper (arxiv:2601.07372). The benchmark numbers are less verifiable before release, but the architecture is sound.
Limitations
Known limitations and open questions
- Benchmarks unverified: All V4 performance numbers come from DeepSeek internal testing. Independent evaluations from LMSYS, ARC, or third-party researchers have not been published.
- Hardware availability: V4 is optimized for Huawei Ascend and Cambricon. Running it on Nvidia hardware at launch may not achieve the same performance or cost profile as reported.
- Multimodal quality unknown: V4 is DeepSeek's first model with native image and video support. No benchmarks for multimodal quality have been released.
- Export controls: U.S. export controls limit access to DeepSeek's API in some jurisdictions. Self-hosting avoids API restrictions but requires significant GPU resources for a 1T model.
- V4 Lite details unclear: The 200B Lite variant has been mentioned in reports but not formally announced. Pricing and availability are unconfirmed.
- Release date slipped before: The mid-February and late-February windows both passed without release. March 2026 has the strongest signal yet (Financial Times confirmed), but treat with appropriate uncertainty.
Frequently Asked Questions
When is DeepSeek V4 releasing?
The first week of March 2026, per Financial Times reporting on February 27. The release is timed to China's Two Sessions parliamentary meetings beginning March 4. Prior windows (mid-February, late February) passed without release.
How many parameters does DeepSeek V4 have?
Approximately 1 trillion total, with roughly 32 billion active per token. V4 uses mixture-of-experts routing with 16 expert pathways. The total is 50% larger than V3 (671B), but active parameters per token are lower (32B vs 37B), which keeps inference cost down.
What is Engram memory?
Engram is a conditional memory module that replaces attention-based retrieval for static knowledge. It uses hash-based O(1) lookups in DRAM instead of GPU VRAM. A published paper (arxiv:2601.07372) shows 3-5 point benchmark improvements and Needle-in-a-Haystack accuracy jumping from 84.2% to 97% on a 27B test model.
What are V4's SWE-bench scores?
Pre-release internal benchmarks claim 80-85% on SWE-bench Verified. Claude Opus 4.5 currently holds the record at 80.9%. These numbers are from DeepSeek's own testing only. Wait for independent verification before making infrastructure decisions based on them.
What is V4's API pricing?
Projected: $0.14/M input tokens, $0.28/M output tokens, $0.07/M cached input. Official pricing will be on DeepSeek's API docs at launch. Even if final prices come in higher, V4 will be significantly cheaper than Claude Opus 4.6 or GPT-5.3.
Is V4 open source?
Expected to be open-weight under a modified OpenRAIL-style license, consistent with V3 and V3.2. The code repository is typically MIT-licensed, and the model weights allow commercial use with some restrictions. Exact license text will be on the Hugging Face repo at release.
How does V4 differ from V3?
Three major changes: context window expanded 8x to 1M tokens, Engram memory added, and mHC stabilizes trillion-scale training. Plus native multimodal support (V3 was text-only) and optimization for Huawei Ascend chips instead of Nvidia H800. Active parameters per token actually dropped from 37B to 32B despite the larger total.
Can I use V4 for long codebase tasks?
The 1M-token context and Engram architecture are specifically designed for this. V4 can hold an entire large codebase in context while using O(1) lookup for static patterns like library APIs and syntax. According to sources cited by Reuters and The Information, V4 outperforms competitors specifically on long-form code prompts.
Related Articles
Use WarpGrep with DeepSeek V4 for Better Code Search Context
WarpGrep is an agentic code search tool that works as an MCP server. Connect it to any DeepSeek-powered agent for high-precision codebase context, so V4's 1M-token window gets filled with the right code, not noise.
Sources
- TechNode: DeepSeek plans V4 multimodal model release this week (March 2, 2026)
- arxiv:2601.07372 — Conditional Memory via Scalable Lookup (Engram paper)
- GitHub: deepseek-ai/Engram
- Hugging Face Papers: mHC — Manifold-Constrained Hyper-Connections
- Tom's Hardware: DeepSeek touts memory breakthrough with Engram
- Decrypt: Insiders say DeepSeek V4 will beat Claude and ChatGPT at coding
- Tom's Hardware: DeepSeek research on Huawei Ascend 910C vs H100
- WaveSpeedAI: DeepSeek V4 — everything we know
- NxCode: DeepSeek V4 — 1T-parameter model guide
- Introl: DeepSeek V4 trillion-parameter architecture
- Pandaily: DeepSeek V4 multimodal model with native image, video, text generation