| Specification | Value |
|---|---|
| Total Parameters | 671B |
| Active Parameters (per token) | 37B |
| Context Length | 128K |
| Modality | Text |
| Architecture | Mixture of Experts (MoE) |
| License | MIT |
| Release Date | 10 Jan 2026 |
| Knowledge Cutoff | Jul 2024 |
| Number of Experts | 256 |
| Active Experts | 8 |
| Attention Structure | Multi-Head Latent Attention (MLA) |
| Hidden Dimension Size | 7168 |
| Number of Layers | 61 |
| Attention Heads | 128 |
| Key-Value Heads | 1 |
| Activation Function | SwiGLU |
| Normalization | RMS Normalization |
| Position Embedding | Rotary Position Embedding (RoPE) |
DeepSeek-V3.2 Thinking is the reasoning-specialized variant of the V3.2 series, combining large-scale reinforcement learning with a Mixture-of-Experts (MoE) architecture. It is built to prioritize logical consistency and systematic problem-solving through an explicit chain-of-thought (CoT) process, generating a detailed reasoning trace before producing its final response. The model is optimized for mathematics, algorithmic programming, and multi-step agentic workflows. Because the reasoning trace is visible, complex outputs can be verified more reliably, and the model can be used in sophisticated tool-integration scenarios.
The model uses a sparse MoE framework with 671 billion total parameters, of which 37 billion are activated per token, so per-token compute scales with the active rather than the total parameter count. The main architectural change in this version is DeepSeek Sparse Attention (DSA), which reduces the computational complexity of attention from quadratic in sequence length to nearly linear. DSA is instantiated under Multi-Head Latent Attention (MLA) and lets the model process long-context sequences with substantially lower memory and compute overhead. For reinforcement learning, the model uses Group Relative Policy Optimization (GRPO), which stabilizes training by using group-based baselines instead of a separate critic network.
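The group-based baseline that GRPO uses in place of a critic can be illustrated with a minimal sketch. It assumes the commonly described formulation in which several responses are sampled for the same prompt and each reward is normalized by the group's mean and standard deviation. The function name, the example rewards, and the use of the population standard deviation are illustrative assumptions, not details taken from the DeepSeek training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per sampled response to the same
    prompt. The group mean acts as the baseline, so no critic is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when all rewards in the group are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt
# (1.0 = judged correct, 0.0 = judged incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above their group's average receive a positive advantage and those below receive a negative one, which is the signal the policy update then uses.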
DeepSeek-V3.2 Thinking is designed for reasoning-heavy applications, including scientific research, debugging intricate software logic, and executing autonomous agentic tasks. It supports a 128K context window and introduces a "thinking with tools" capability, which lets the model interleave reasoning with API calls. Multi-Token Prediction (MTP) is used during training to strengthen the model's internal representations, leading to faster convergence and more robust performance on reasoning-heavy benchmarks. Released under the MIT license, the model provides an open-weight foundation for researchers and developers who want to deploy frontier-class reasoning capabilities in local or enterprise environments.
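For local deployment, a first-order question is how much memory the weights alone occupy. The sketch below is a rough back-of-the-envelope estimate, assuming the 671B total parameter count stated above and a fixed number of bytes per parameter for each precision. The precision labels and byte widths are generic assumptions, and the estimate ignores the KV cache, activations, and any per-block quantization metadata.

```python
TOTAL_PARAMS = 671e9  # total parameter count stated in the description

# Assumed storage cost per parameter for a few common precisions.
BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "fp8/int8": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gb(num_params, bytes_per_param):
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(TOTAL_PARAMS, width)
    for name, width in BYTES_PER_PARAM.items()
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:,.1f} GB for weights only")
```

All experts must be resident in memory even though only a fraction is active per token, so the footprint follows the total parameter count rather than the active one.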
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model comprising 671B parameters, with 37B activated per token. Its architecture incorporates Multi-Head Latent Attention (MLA) and DeepSeekMoE for efficient inference and training. Innovations include an auxiliary-loss-free load balancing strategy and a multi-token prediction objective. The model was trained on 14.8T tokens.
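To illustrate how only part of an MoE model is activated per token, the sketch below routes one token to the top 8 of 256 experts, matching the expert counts listed in the specifications. It is a generic top-k gating example with a softmax over the selected scores. The random router scores and the normalization choice are assumptions for illustration and do not reproduce DeepSeekMoE's actual gating function or its load-balancing mechanism.

```python
import math
import random

NUM_EXPERTS = 256    # from the specification list
ACTIVE_EXPERTS = 8   # experts selected per token


def route_token(scores, k=ACTIVE_EXPERTS):
    """Select the k highest-scoring experts and normalize their gate weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (subtract max for stability).
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return {expert: e / total for expert, e in zip(top, exps)}


# Hypothetical router scores for a single token.
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
gates = route_token(scores)

# Share of parameters touched per token, using the stated 37B and 671B.
active_fraction = 37 / 671

print(f"experts used: {sorted(gates)}")
print(f"active parameter share: {active_fraction:.1%}")
```

The token's output would be the gate-weighted sum of the selected experts' outputs, so compute per token depends on the 8 chosen experts rather than all 256.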
Rank
#18
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| Professional Knowledge | MMLU Pro | 0.85 | 🥇 1 |
| Mathematics | LiveBench Mathematics | 0.85 | ⭐ 6 |
| Data Analysis | LiveBench Data Analysis | 0.73 | 8 |
| Reasoning | LiveBench Reasoning | 0.77 | 10 |
| Graduate-Level QA | GPQA | 0.82 | 11 |
| Web Development | WebDev Arena | 1420 | 12 |
| Agentic Coding | LiveBench Agentic | 0.40 | 18 |
| Coding | LiveBench Coding | 0.65 | 39 |
Overall Rank
#18
Coding Rank
#56