Total Parameters
671B
Context Length
128K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT
Release Date
10 Jan 2026
Knowledge Cutoff
May 2025
Active Parameters per Token
37.0B
Number of Experts
257
Active Experts
9
Attention Structure
DeepSeek Sparse Attention (DSA)
Hidden Dimension Size
7168
Number of Layers
61
Attention Heads
128
Key-Value Heads
1
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
Rotary Position Embedding (RoPE)
DeepSeek-V3.2 represents an evolution in the deployment of large-scale Mixture-of-Experts (MoE) architectures, specifically optimized for agentic workflows and advanced reasoning tasks. The model utilizes 671 billion total parameters, but maintains a highly efficient inference profile by activating only 37 billion parameters for any given token. This sparse activation strategy allows the model to achieve the representational capacity of a trillion-parameter class model while maintaining the computational overhead and latency characteristic of much smaller dense architectures. The training objective incorporates a Multi-Token Prediction (MTP) strategy, which densifies training signals and improves the model's ability to plan subsequent outputs in complex sequences.
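The sparse-activation idea described above can be sketched in a toy form: a router scores every expert, but only the top-k highest-scoring experts are actually evaluated for a given token, so compute scales with k rather than with the total expert count. The dimensions, expert count, and softmax gating below are illustrative stand-ins, not the model's real 256-expert configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (the real model uses 7168-dim hidden states and 256 routed experts).
d_model, n_experts, top_k = 16, 8, 2

# Each expert is a small feed-forward map; here just a single weight matrix.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route one token to its top-k experts and mix their outputs."""
    scores = x @ router                        # affinity score per expert
    top = np.argsort(scores)[-top_k:]          # only k experts are evaluated
    weights = np.exp(scores[top])
    weights /= weights.sum()                   # normalized gate weights
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.standard_normal(d_model))
```

Only 2 of the 8 toy experts run per token here; scaled up, this is how 671B total parameters yield roughly 37B parameters of compute per token.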
The architectural foundation of DeepSeek-V3.2 is built upon DeepSeek Sparse Attention (DSA), a technical advancement over the previous Multi-head Latent Attention (MLA). DSA further optimizes memory utilization and throughput by employing a low-rank compression of Key-Value (KV) caches, effectively mitigating the memory bottlenecks typically encountered in long-context generation. The model also features an auxiliary-loss-free load balancing mechanism, which ensures high expert utilization without the performance trade-offs commonly associated with traditional load-balancing penalties. This is achieved through a dynamic bias adjustment that routes tokens based on real-time affinity scores across 256 routed experts and one shared expert.
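The auxiliary-loss-free balancing mechanism can be illustrated with a minimal sketch: each expert carries a bias that is added to its affinity score for routing decisions only, and a simple feedback rule (the sign-based update and its step size here are assumptions for illustration, not the published rule) nudges the bias of overloaded experts down and underloaded experts up, without any gradient-based balancing loss.

```python
import numpy as np

rng = np.random.default_rng(1)
n_experts, top_k, n_tokens, d = 8, 2, 512, 16
gamma = 0.01  # bias step size (hypothetical tuning constant)

router = rng.standard_normal((d, n_experts)) * 0.5
tokens = rng.standard_normal((n_tokens, d))
bias = np.zeros(n_experts)  # adjusted by feedback, never by a loss term

for _ in range(200):
    scores = tokens @ router
    # The bias shifts which experts are *selected*; the actual gate
    # weights would still come from the raw affinity scores.
    choices = np.argsort(scores + bias, axis=1)[:, -top_k:]
    # Count how many tokens each expert received this batch.
    load = np.bincount(choices.ravel(), minlength=n_experts)
    # Push overloaded experts' bias down, underloaded experts' up.
    bias -= gamma * np.sign(load - n_tokens * top_k / n_experts)
```

Because the bias never enters the gate weights, balancing does not distort the expert outputs themselves, which is the trade-off the auxiliary-loss-free design avoids.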
Functionally, DeepSeek-V3.2 is designed to serve as a high-performance foundation for autonomous agents and complex problem-solving environments. It integrates a 'thinking' mode directly into tool-use scenarios, allowing for multi-step reasoning before executing external function calls. With a context window of 163,840 tokens and a training corpus comprising 14.8 trillion high-quality tokens, the model is suited for enterprise-grade applications requiring deep mathematical reasoning, competitive programming proficiency, and reliable multilingual generation. The release is governed by the MIT license, permitting broad use across both academic research and commercial production environments.
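A tool-use request to such a model would typically follow the OpenAI-style function-calling schema. The sketch below only constructs a request body; the model identifier, tool name, and endpoint conventions are illustrative assumptions, not taken from official DeepSeek documentation.

```python
import json

# Hypothetical request body for an OpenAI-compatible chat endpoint.
payload = {
    "model": "deepseek-chat",  # assumed model identifier
    "messages": [
        {"role": "user", "content": "What is 37 * 61?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "calculator",  # hypothetical tool
                "description": "Evaluate an arithmetic expression.",
                "parameters": {
                    "type": "object",
                    "properties": {"expression": {"type": "string"}},
                    "required": ["expression"],
                },
            },
        }
    ],
}

body = json.dumps(payload)
```

In a 'thinking'-mode exchange, the model would reason over the question before emitting a structured tool call against a schema like this, then incorporate the tool's result into its final answer.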
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model comprising 671B parameters with 37B activated per token. Its architecture incorporates Multi-head Latent Attention and DeepSeekMoE for efficient inference and training. Innovations include an auxiliary-loss-free load balancing strategy and a multi-token prediction objective, trained on 14.8T tokens.
| Benchmark | Score | Rank |
|---|---|---|
| Coding (LiveBench Coding) | 0.76 | 12 |
| Web Development (WebDev Arena) | 1419 | 13 |
| Agentic Coding (LiveBench Agentic) | 0.47 | 14 |
| Graduate-Level QA (GPQA) | 0.80 | 17 |
| Reasoning (LiveBench Reasoning) | 0.44 | 28 |
| Data Analysis (LiveBench Data Analysis) | 0.67 | 33 |
| Mathematics (LiveBench Mathematics) | 0.64 | 35 |
Overall Rank
#48
Coding Rank
#9