Qwen 3.5 vs DeepSeek V4: China's Top Open-Weight Models Compared (2026)

Qwen 3.5 is live with 397B MoE params and 83.6 LiveCodeBench. DeepSeek V4 targets 1T params with leaked 80%+ SWE-bench scores. We compare architecture, benchmarks, pricing, and self-hosting for both.

March 2, 2026

Qwen 3.5 and DeepSeek V4 are the two most important open-weight models coming out of China in 2026. Qwen 3.5 shipped on February 16 with a 397B MoE architecture that activates just 17B parameters per token. DeepSeek V4, expected in early March, scales to 1 trillion parameters with 32B active.

Both target coding dominance. Both are Apache 2.0 licensed. Both undercut Western frontier models on price by 10x or more. The question for developers: which one should you actually use?

This comparison uses confirmed numbers for Qwen 3.5 and leaked/projected specs for DeepSeek V4. We mark unverified claims clearly throughout.

TL;DR

  • Available now: Qwen 3.5 is live with verified benchmarks. DeepSeek V4 is expected in the first week of March 2026.
  • Architecture: Qwen 3.5 uses 397B params / 17B active (MoE). DeepSeek V4 uses ~1T params / 32B active (MoE + Engram memory).
  • Coding: Qwen 3.5 scores 83.6 LiveCodeBench, 76.4 SWE-bench. DeepSeek V4 claims 80%+ SWE-bench, 90% HumanEval (unverified).
  • Context: Qwen 3.5 supports 262K tokens (open-weight) or 1M (hosted Plus). DeepSeek V4 targets 1M tokens natively.
  • Pricing: Qwen 3.5 API input runs $0.10/M tokens (Flash) to $0.11-$1.20/M (Plus, tiered by context). DeepSeek V4 is projected at $0.27/$1.10 per M tokens (input/output), consistent with DeepSeek's history of aggressive pricing.
  • Self-hosting: Both run on consumer GPUs with quantization: Qwen 3.5 on a single RTX 4090, DeepSeek V4 on dual RTX 4090s or a single RTX 5090.

Quick Comparison Table

| Specification | Qwen 3.5 (397B-A17B) | DeepSeek V4 |
| --- | --- | --- |
| Status | Released Feb 16, 2026 | Expected early March 2026 |
| Total Parameters | 397B | ~1T |
| Active Parameters | 17B | ~32B |
| Architecture | MoE + Gated DeltaNet | MoE + Engram + mHC |
| Context Window | 262K (open) / 1M (Plus) | 1M tokens |
| MMLU | 88.5 | TBD |
| SWE-bench Verified | 76.4 | 80%+ (leaked) |
| LiveCodeBench v6 | 83.6 | TBD |
| HumanEval | 99.0 | ~90% (leaked) |
| Languages | 201 | TBD |
| Multimodal | Yes (text, image, video) | Yes (expected) |
| License | Apache 2.0 | Apache 2.0 (expected) |
| API Input Price | $0.10/M tokens (Flash) | $0.27/M tokens (est.) |
| API Output Price | $0.40/M tokens (Flash) | $1.10/M tokens (est.) |
| Self-Host Minimum | 1x RTX 4090 (INT4) | 2x RTX 4090 (est.) |

Architecture: Two Different Approaches to Sparse Efficiency

Both models use Mixture-of-Experts to keep inference costs low despite massive total parameter counts. The similarities end there.

Qwen 3.5: Gated DeltaNet + Sparse MoE

Qwen 3.5 combines Gated Delta Networks with sparse MoE routing. The 397B model activates only 17B parameters per forward pass, a 95% reduction in activation memory compared to a dense model of equivalent capacity. The architecture uses a hybrid attention mechanism (Gated DeltaNet combined with Gated Attention layers) that reduces compute overhead for long-context processing.
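To make the sparsity concrete, here is a minimal sketch of top-k expert routing, the mechanism behind "17B active out of 397B." All dimensions, the expert count, and the top-k value are illustrative, not Qwen 3.5's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy top-k MoE layer; sizes are made up, not Qwen 3.5's."""
    def __init__(self, d_model=1024, n_experts=64, top_k=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        # Pick the top-k experts per token; only those experts ever run.
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

x = torch.randn(8, 1024)
y = SparseMoELayer()(x)  # 4 of 64 experts fire per token: ~6% of expert params active
```

The total parameter count buys capacity; the router decides how little of it each token pays for at inference time.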

  • 397B total parameters
  • 17B active per token
  • 250K vocabulary size

The expanded 250K vocabulary (up from 152K in Qwen 3) delivers 10-60% token savings for non-English text. Training used a native FP8 pipeline, and the model achieves 8.6-19x faster decoding than Qwen3-Max on long-context tasks.

DeepSeek V4: Engram Memory + mHC + Sparse Attention

DeepSeek V4 introduces three architectural innovations on top of its trillion-parameter MoE backbone.

  • ~1T total parameters
  • ~32B active per token
  • 1M native context window

Engram Conditional Memory is the standout innovation. It provides O(1) hash-based knowledge retrieval using system DRAM (not GPU VRAM), improving Needle-in-a-Haystack accuracy from 84.2% to 97%. For coding, this means the model can maintain a persistent understanding of project structure, naming conventions, and cross-file relationships.
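No technical report exists yet, so the details are speculative, but the leaked description (a hash-indexed table living in system DRAM, consulted in O(1) per lookup) maps onto a simple pattern, sketched below with made-up names, sizes, and hash function:

```python
import torch

# Loose sketch of an Engram-style conditional memory. The table lives in
# system RAM as a plain CPU tensor; only the retrieved rows move to the GPU.
# All sizes and the hash are illustrative, not DeepSeek's design.
N_SLOTS, D = 1_000_000, 256          # a real deployment would be far larger
memory_table = torch.randn(N_SLOTS, D)

def engram_lookup(token_ids: torch.Tensor) -> torch.Tensor:
    slots = (token_ids * 2654435761) % N_SLOTS   # O(1) hash per key
    rows = memory_table[slots]                   # gather happens in DRAM
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return rows.to(device)                       # small transfer per step

keys = torch.randint(0, 250_000, (16,))          # e.g. recent token ids
conditioning = engram_lookup(keys)               # (16, 256) memory vectors
```

The point of the design: VRAM holds the active weights, while the bulk knowledge store scales with cheap system RAM instead.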

Manifold-Constrained Hyper-Connections (mHC) solve a fundamental training stability problem at trillion-parameter scale. They constrain signal amplification to 1.6x (vs 3000x unconstrained), adding only 6.7% training overhead for a 4x residual stream expansion. Benchmark improvements from mHC alone: BBH +7.2, DROP +5.7, GSM8K +6.1, MMLU +5.2.
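How the constraint works mechanically has not been published. One simple way to bound a mixing matrix's worst-case gain is to rescale it by its spectral norm; the snippet below illustrates that constraint idea only, not DeepSeek's actual method:

```python
import torch

def constrain_gain(W: torch.Tensor, max_gain: float = 1.6) -> torch.Tensor:
    # Largest singular value = worst-case amplification of any input vector.
    gain = torch.linalg.matrix_norm(W, ord=2)
    return W * (max_gain / gain).clamp(max=1.0)  # rescale only if too hot

W = torch.randn(4, 4)  # mixing across a 4x-expanded residual stream
print(torch.linalg.matrix_norm(W, ord=2).item())                  # often ~3-4
print(torch.linalg.matrix_norm(constrain_gain(W), ord=2).item())  # <= 1.6
```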

DeepSeek Sparse Attention (DSA) enables the 1M token context window while cutting compute overhead by approximately 50% vs standard attention. It achieves roughly linear scaling instead of quadratic, which is what makes million-token contexts practical at inference time.
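The core move in sparse attention is letting each query attend to a small selected subset of keys rather than all of them. The toy below keeps the top-k keys per query; a real DSA-style indexer picks keys cheaply without scoring every pair, which is where the near-linear scaling comes from. This is the shape of the idea, not DeepSeek's implementation:

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_keep=64):
    # Toy version: score everything, keep only the top k_keep keys per query.
    # (A production sparse-attention indexer avoids the full score matrix.)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5   # (n_q, n_k)
    top_scores, top_idx = scores.topk(k_keep, dim=-1)
    attn = F.softmax(top_scores, dim=-1)                      # (n_q, k_keep)
    return torch.einsum("qk,qkd->qd", attn, v[top_idx])       # mix chosen values

q, k, v = torch.randn(128, 64), torch.randn(4096, 64), torch.randn(4096, 64)
out = topk_sparse_attention(q, k, v)   # each query mixes 64 values, not 4096
```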

Architecture Bottom Line

Qwen 3.5 is the more efficient model per active parameter: 17B active vs 32B for V4. DeepSeek V4 is the more ambitious model architecturally, with three novel components targeting repository-scale code understanding. If you care about running on minimal hardware, Qwen 3.5 wins. If you care about pushing the boundary on what a model can reason about across a full codebase, DeepSeek V4 is the one to watch.

Benchmark Breakdown

Important caveat: Qwen 3.5 benchmarks are from Alibaba's official release and have been partially validated by independent evaluations. DeepSeek V4 numbers are from leaked internal testing and remain unverified. Treat the V4 column accordingly.

General capability:

| Benchmark | Qwen 3.5 (397B) | DeepSeek V4 (est.) | Reference |
| --- | --- | --- | --- |
| MMLU | 88.5 | TBD | GPT-5.2: ~89 |
| MMLU-Pro | 87.8 | TBD | Claude Opus 4.5: ~88 |
| GPQA Diamond | 88.4 | TBD | Gemini 3 Pro: ~87 |
| AIME 2026 | 91.3 | TBD | Claude Opus 4.5: ~86 |
| BrowseComp | 78.6 | TBD | GPT-5: ~71 |

Coding and agents:

| Benchmark | Qwen 3.5 (397B) | DeepSeek V4 (est.) | Reference |
| --- | --- | --- | --- |
| SWE-bench Verified | 76.4 | 80%+ (leaked) | Claude Opus 4.5: 80.9 |
| LiveCodeBench v6 | 83.6 | TBD | GPT-5.2: ~81 |
| HumanEval | 99.0 | ~90% (leaked) | Claude: ~88 |
| BFCL v4 (Tool Use) | 72.9 | TBD | GPT-5 mini: 55.5 |
| Terminal-Bench 2 | 52.5 | TBD | Claude Code: ~68 |

What the Numbers Mean

Qwen 3.5's 83.6 on LiveCodeBench v6 is currently the best score from any open-weight model. Its 91.3 on AIME 2026 suggests genuinely strong mathematical reasoning. The 76.4 SWE-bench Verified score trails Claude Opus 4.5 (80.9%) but beats most other frontier models.

The BrowseComp score (78.6) is particularly notable. Qwen 3.5 uses an aggressive context-folding strategy that outperforms every US frontier model on web browsing tasks. The BFCL v4 tool-use score (72.9) makes it one of the strongest open-source options for building AI agents with function-calling.

DeepSeek V4's leaked SWE-bench claim of 80%+ would put it in direct competition with Claude Opus 4.5. If confirmed, it would be the first open-weight model to match a Western frontier model on the most respected coding benchmark. But "leaked" is doing a lot of heavy lifting in that sentence.

Benchmark Caveat

Qwen 3.5's benchmarks are self-reported by Alibaba. DeepSeek V4's are leaked from internal testing. Neither set has undergone fully independent third-party evaluation at the time of writing. Run your own evals on your specific use case before making infrastructure decisions.

Coding Performance: Practical Differences

Benchmarks tell one story. Real coding performance tells another. Here is what we know about how each model handles actual development work.

Qwen 3.5: Strong on Standard Tasks, Craters on Master-Level

Qwen 3.5 performs well on standard and expert-level coding tasks. It supports three inference modes: "Auto" (adaptive thinking with tool use), "Thinking" (deep reasoning), and "Fast" (instant responses without chain-of-thought). It works with OpenClaw, Claude Code, Cline, and Alibaba's Qwen Code.
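Mode selection happens per request. A minimal sketch against Alibaba's OpenAI-compatible endpoint, assuming Qwen 3.5 keeps the `enable_thinking`-style vendor extension that Qwen 3 used; the model id and the exact parameter name for 3.5's three modes are assumptions to verify against the current docs:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    api_key="YOUR_DASHSCOPE_KEY",
)

resp = client.chat.completions.create(
    model="qwen3.5-plus",  # illustrative model id; check the model list
    messages=[{"role": "user", "content": "Refactor this function..."}],
    # Qwen 3 exposed thinking as a vendor extension; assumed to carry over:
    extra_body={"enable_thinking": True},
)
print(resp.choices[0].message.content)
```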

The catch: independent testing shows Qwen 3.5's Elo drops from ~1550 on expert tasks to 1194 on "Master-level" challenges that require complex multi-file coordination. When a task demands understanding relationships across many files simultaneously, Qwen 3.5 struggles more than its headline numbers suggest.

  • 83.6 LiveCodeBench v6
  • 76.4 SWE-bench Verified
  • ~1550 expert-level Elo

The 122B-A10B medium variant is worth noting: it scores 72.2 on BFCL-V4, outperforming GPT-5 mini (55.5) by 30% on tool use. For agentic coding tasks that rely heavily on function calling, the medium model may be the better price-performance choice.

DeepSeek V4: Designed for Repository-Scale Understanding

DeepSeek V4's architecture is specifically optimized for the multi-file coordination problem where Qwen 3.5 stumbles. The Engram memory enables O(1) retrieval of project-level patterns, and the 1M native context window can ingest an entire medium-sized codebase in a single pass.

Early reports from testers describe repository-level understanding: diagnosing bugs that span multiple files, understanding import-export relationships across dozens of modules, and performing autonomous refactoring with awareness of downstream effects.

The claimed 1.8x inference speedup over V3 (which already offered competitive throughput) suggests DeepSeek is optimizing not just accuracy but practical development velocity.

Coding Verdict

For single-file and standard coding tasks, Qwen 3.5 is excellent and available today. For repository-scale refactoring and complex multi-file tasks, DeepSeek V4's architecture looks purpose-built. But V4's coding claims remain unverified. If you need a production-ready coding model right now, Qwen 3.5 is the only real option.

Open Source and Self-Hosting

Both models are open-weight under Apache 2.0. Both can run on consumer hardware with quantization. The practical self-hosting experience differs significantly.

Qwen 3.5 Self-Hosting

| Configuration | Hardware | Performance |
| --- | --- | --- |
| Minimum (INT4) | 1x RTX 4090 (24GB) + 64GB RAM | Functional, slower |
| Recommended | 2x A100 80GB + 128GB RAM | Production-ready |
| High Performance | 8x H100 + 256GB RAM | ~45 tok/s throughput |
| Mac (3-bit) | 192GB unified memory | Functional via llama.cpp |
| Mac (4-bit) | 256GB unified memory | Better quality |

The full 397B model is ~807GB on disk at full precision. With INT4 quantization, it fits on a single 24GB GPU with system RAM offloading. The MoE architecture helps: only 17B parameters are active per forward pass, so you do not need to keep all 397B in fast memory simultaneously.

The smaller variants are far more accessible. The 27B dense model runs comfortably on a single RTX 4090. The 35B-A3B MoE activates just 3B parameters per token, making it viable on laptops with 16GB+ VRAM.
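Serving the smaller variants is a few lines with standard tooling. A sketch using vLLM's offline API; the Hub repo id is a guess at Qwen's naming scheme, so check the actual release names (quantized community builds typically follow within days):

```python
from vllm import LLM, SamplingParams

# Hypothetical repo id; look up the real Qwen 3.5 release names on Hugging Face.
llm = LLM(model="Qwen/Qwen3.5-27B-Instruct", dtype="auto", max_model_len=32768)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(
    ["Write a Python function that parses RFC 3339 timestamps."], params
)
print(outputs[0].outputs[0].text)
```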

DeepSeek V4 Self-Hosting (Projected)

| Configuration | Hardware | Performance |
| --- | --- | --- |
| Consumer Minimum | 2x RTX 4090 or 1x RTX 5090 | ~30-35 tok/s (est.) |
| Consumer (Batch) | Dual RTX 4090 + 64GB RAM | ~550 tok/s at batch size 4 |
| Production | 4-8x A100/H100 | Full throughput |

DeepSeek V4's trillion parameters make the raw model larger, but the 32B active parameter count and Engram memory offloading to system DRAM keep VRAM requirements manageable. The Engram module can offload 100B+ parameter embedding tables with less than 3% throughput penalty, which is critical for consumer deployments.

DeepSeek's track record matters here. V3 shipped with excellent quantization support and community tooling from day one. If V4 follows the same pattern, expect Ollama, vLLM, and llama.cpp support within days of release.

Self-Hosting Verdict

Qwen 3.5 has the edge for minimal hardware deployments: its 17B active parameters vs DeepSeek V4's 32B means lower minimum VRAM requirements. Qwen 3.5 also has a full range of smaller variants (27B, 35B-A3B, 122B-A10B) for teams that need to match model size to hardware. DeepSeek V4's Engram offloading is clever, but at 1T total parameters, you need more storage and RAM even with aggressive quantization.

API Pricing

Both models aggressively undercut Western frontier models. The pricing gap between Chinese and US models is now 10-100x depending on the tier.

| Model | Input ($/M tokens) | Output ($/M tokens) | Notes |
| --- | --- | --- | --- |
| Qwen 3.5-Flash | $0.10 | $0.40 | Best value tier, 1M context |
| Qwen 3.5-Plus | $0.11-$1.20 | $0.40-$4.80 | Tiered by context length |
| DeepSeek V4 (est.) | $0.27 | $1.10 | Based on leaked pricing |
| DeepSeek V3.2 | $0.14 | $0.55 | Current generation (live) |
| GPT-5.2 | $1.75 | $14.00 | Reference: Western frontier |
| Claude Opus 4.5 | $5.00 | $25.00 | Reference: Western frontier |

Qwen 3.5-Flash at $0.10 input / $0.40 output is the cheapest way to access frontier-adjacent intelligence through an API. It costs roughly 1/13th of Claude Sonnet 4.6 for comparable tasks. The Plus tier is more expensive at higher context lengths but still 10x cheaper than Western alternatives.

DeepSeek V4's projected pricing ($0.27/$1.10) follows DeepSeek's pattern of being slightly more expensive than Qwen but still dramatically cheaper than OpenAI or Anthropic. DeepSeek historically offers 20-50x cheaper pricing than OpenAI on comparable models.

Cost for Typical Coding Workflows

A medium coding session (50K input tokens, 10K output tokens per request, 20 requests/day) costs roughly the following; the sketch after the list reproduces the arithmetic:

  • Qwen 3.5-Flash: ~$0.18/day ($5.40/month)
  • DeepSeek V4 (est.): ~$0.49/day ($14.70/month)
  • Claude Opus 4.5: ~$10/day ($300/month)
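A minimal back-of-envelope script using the prices from the pricing table above; adjust the traffic profile to your own workload:

```python
# Per-million-token prices (input, output), from the pricing table above.
PRICES = {
    "qwen3.5-flash":   (0.10, 0.40),
    "deepseek-v4-est": (0.27, 1.10),
    "claude-opus-4.5": (5.00, 25.00),
}

def daily_cost(model, in_tok, out_tok, requests):
    p_in, p_out = PRICES[model]
    return requests * (in_tok * p_in + out_tok * p_out) / 1e6

for m in PRICES:
    cost = daily_cost(m, in_tok=50_000, out_tok=10_000, requests=20)
    print(f"{m}: ${cost:.2f}/day (${cost * 30:.2f}/month)")
```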

For teams processing thousands of coding requests daily, the 50-100x cost advantage of Chinese open-weight models is not incremental. It changes what is economically feasible.

When to Use Which

| Your Situation | Pick This | Why |
| --- | --- | --- |
| Need a model today | Qwen 3.5 | Released and available. DeepSeek V4 is not. |
| Budget-constrained API use | Qwen 3.5-Flash | $0.10/M input is the cheapest frontier-tier option |
| Repository-scale coding | Wait for DeepSeek V4 | 1M context + Engram memory designed for full-codebase tasks |
| Single-file coding tasks | Qwen 3.5 | 83.6 LiveCodeBench, strong on standard tasks |
| Consumer GPU self-hosting | Qwen 3.5 (27B or 35B) | Smallest active params, runs on a single GPU easily |
| Maximum benchmark scores | DeepSeek V4 (if claims hold) | 80%+ SWE-bench would match Claude Opus |
| Multilingual needs | Qwen 3.5 | 201 languages, 10-60% token savings on non-English |
| Agentic tool use | Qwen 3.5 | 72.9 BFCL v4, 78.6 BrowseComp, verified scores |
| Long-context retrieval | DeepSeek V4 | Engram boosts NIAH from 84% to 97% at 1M tokens |
| Multimodal (text + image + video) | Qwen 3.5 | Native multimodal, shipping now. V4 multimodal is unconfirmed. |

The Pragmatic Answer

Use Qwen 3.5 today. Evaluate DeepSeek V4 when it ships. The realistic scenario for most teams is running both, picking the right model per task type. Qwen 3.5 for quick iterations, tool-use-heavy agents, and budget-sensitive workloads. DeepSeek V4 (once available) for deep codebase reasoning and repository-scale refactoring.

Both models work with the same tooling ecosystem: Ollama, vLLM, OpenAI-compatible APIs, and every major coding agent. Switching between them is a config change, not a migration.
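Concretely, a per-task model switch can be a dictionary entry. The endpoints below are the providers' current OpenAI-compatible base URLs; the model ids (particularly for the unreleased V4) are placeholders:

```python
from openai import OpenAI

BACKENDS = {
    "qwen": {"base_url": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
             "model": "qwen3.5-flash"},           # illustrative model id
    "deepseek": {"base_url": "https://api.deepseek.com/v1",
                 "model": "deepseek-chat"},        # V4 id TBD at release
}

def ask(backend: str, prompt: str) -> str:
    cfg = BACKENDS[backend]
    client = OpenAI(base_url=cfg["base_url"], api_key="YOUR_KEY")
    resp = client.chat.completions.create(
        model=cfg["model"], messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

print(ask("qwen", "Summarize this diff: ..."))  # swap backends per task type
```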

Frequently Asked Questions

Is DeepSeek V4 released yet?

As of March 2, 2026, no. The Financial Times reported it will arrive in the first week of March 2026, timed around China's annual Two Sessions parliamentary meetings starting March 4. TechNode reports that DeepSeek plans to release V4 "this week." All V4 benchmark numbers in this article come from leaked internal testing and are not independently verified.

Can I run these models on consumer hardware?

Yes. Qwen 3.5-397B runs on a single RTX 4090 at INT4 quantization with 64GB system RAM. The smaller Qwen 3.5-27B runs on any 24GB GPU without tricks. DeepSeek V4 reportedly targets dual RTX 4090s or a single RTX 5090 as its consumer-tier hardware requirement. Both models benefit from MoE architectures where only a fraction of total parameters are active per token.

Which model is better for coding?

Qwen 3.5 scores 83.6 on LiveCodeBench v6 and 76.4 on SWE-bench Verified, both confirmed. DeepSeek V4 claims 80%+ SWE-bench and 90% HumanEval from leaks. Qwen 3.5 is the safer pick today. DeepSeek V4 may surpass it for repository-level tasks once released, but those claims need independent verification.

How do they compare to GPT-5 and Claude Opus?

Qwen 3.5 claims to outperform GPT-5.2 and Claude Opus 4.5 on 80% of evaluated benchmarks. DeepSeek V4 targets SWE-bench parity with Claude Opus 4.5 (80.9%). The key differentiator is not capability but cost: Qwen 3.5 Flash costs $0.10/M input tokens vs $5.00 for Claude Opus, a 50x gap. Even if the Chinese models trail slightly on absolute benchmarks, the cost difference makes them practical for workloads that would be prohibitively expensive on Western APIs.

Which is better for building AI agents?

Qwen 3.5. Its 72.9 BFCL v4 score (function calling) and 78.6 BrowseComp (web browsing) are the highest among open-weight models. The 122B-A10B variant outperforms GPT-5 mini on tool use by 30%. DeepSeek V4 has not published agentic benchmarks yet.

Apply Code Edits from Any Model at 10,500+ tok/sec

Morph Fast Apply works with Qwen 3.5, DeepSeek, Claude, GPT, or any model. It handles the last mile of code editing: taking AI-generated diffs and applying them correctly to your files.