| Property | Value |
|---|---|
| Total Parameters | 671B |
| Context Length | 128K |
| Modality | Text |
| Architecture | Mixture of Experts (MoE) |
| License | MIT |
| Release Date | 10 Jan 2026 |
| Knowledge Cutoff | May 2025 |

Attention

| Property | Value |
|---|---|
| Attention Structure | DeepSeek Sparse Attention |
| Attention Heads | 128 |
| Key-Value Heads | 1 |
| Attention Head Dimension | - |
| Position Embedding | Rotary Position Embedding (RoPE) |
| RoPE Theta | 10,000 |
| Sliding Window Attention | No |
| Sliding Window Size | - |
| Normalization | RMS Normalization |
| Activation Function | SwiGLU |

Dimensions

| Property | Value |
|---|---|
| Hidden Dimension Size | 7,168 |
| Number of Layers | 61 |
| FFN Intermediate Size (Dense) | 18,432 |
| Multi-Token Prediction Heads | 1 |

Tokenizer

| Property | Value |
|---|---|
| Vocabulary Size | 129,280 |

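Mixture of Experts

| Property | Value |
|---|---|
| Active Parameters per Token | 37.0B |
| Number of Experts | 257 (256 routed + 1 shared) |
| Active Experts | 9 (8 routed + 1 shared) |
| Shared Experts | 1 |
| FFN Intermediate Size (per Expert) | 2,048 |
| Dense Layers Before MoE | 3 |
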
DeepSeek-V3.2 is a large-scale Mixture-of-Experts (MoE) model optimized for agentic workflows and advanced reasoning tasks. It has 671 billion total parameters but activates only 37 billion for any given token. This sparse activation gives the model the capacity of a 671B-parameter network while keeping per-token compute closer to that of a much smaller dense model. The training objective incorporates Multi-Token Prediction (MTP), which densifies the training signal and encourages the model to plan ahead for subsequent tokens.
DeepSeek-V3.2 introduces DeepSeek Sparse Attention (DSA), which builds on the Multi-head Latent Attention (MLA) used in earlier V3 models. MLA reduces memory use through low-rank compression of the Key-Value (KV) cache; DSA adds a selection step so that each query attends to only a subset of earlier tokens, which lowers the cost of long-context generation. The model also uses an auxiliary-loss-free load-balancing mechanism that keeps expert utilization high without the quality penalty usually associated with auxiliary balancing losses. A per-expert bias, adjusted dynamically during training, is added to the token-to-expert affinity scores when selecting among the 256 routed experts; one shared expert processes every token.
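The bias-based load balancing described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not DeepSeek's implementation: the function names `route`, `update_bias`, and `simulate`, the step size `gamma`, the toy sizes, and the artificial skew toward expert 0 are all hypothetical. The sketch assumes the bias influences only which experts are selected, while the gating weights are computed from the raw affinity scores.

```python
import random

def route(scores, bias, k):
    """Select k experts by (score + bias); weight them by raw scores only."""
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}

def update_bias(bias, load, gamma):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)
        elif n < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias

def simulate(num_experts=8, k=2, tokens=512, steps=200, gamma=0.01, seed=0):
    rng = random.Random(seed)
    # Hypothetical skew: expert 0 receives systematically higher affinity.
    skew = [0.3] + [0.0] * (num_experts - 1)
    bias = [0.0] * num_experts
    history = []
    for _ in range(steps):
        load = [0] * num_experts
        for _ in range(tokens):
            scores = [rng.random() + 0.01 + skew[e] for e in range(num_experts)]
            for e in route(scores, bias, k):
                load[e] += 1
        history.append(load)
        bias = update_bias(bias, load, gamma)
    return history, bias

history, bias = simulate()
print("load per expert, first step:", history[0])
print("load per expert, last step: ", history[-1])
```

In this toy setup the favored expert starts out overloaded, and its bias drifts downward until its share of tokens moves toward the mean, without any extra loss term.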
Functionally, DeepSeek-V3.2 is designed as a foundation for autonomous agents and complex problem-solving environments. It integrates a 'thinking' mode directly into tool-use scenarios, allowing multi-step reasoning before external function calls are executed. With a 128K-token context window and a training corpus of 14.8 trillion tokens, the model is suited to applications that require mathematical reasoning, competitive programming, and multilingual generation. The release is governed by the MIT license, which permits use in both academic research and commercial production.
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model comprising 671B parameters with 37B activated per token. Its architecture incorporates Multi-head Latent Attention and DeepSeekMoE for efficient inference and training. Innovations include an auxiliary-loss-free load balancing strategy and a multi-token prediction objective, trained on 14.8T tokens.
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| Coding | Aider Coding | 0.70 | 11 |
| Coding | LiveBench Coding | 0.76 | 19 |
| Agentic Coding | LiveBench Agentic | 0.47 | 24 |
| Professional Knowledge | MMLU Pro | 0.83 | 27 |
| Graduate-Level QA | GPQA | 0.799 | 29 |
| Reasoning | LiveBench Reasoning | 0.44 | 46 |
| Mathematics | LiveBench Mathematics | 0.64 | 47 |
| Web Development | WebDev Arena | 1330 | 48 |
| Data Analysis | LiveBench Data Analysis | 0.45 | 51 |

Overall Rank: #85
Coding Rank: #35
Total Score: 80 / 100
DeepSeek-V3.2 exhibits a high level of technical transparency, particularly regarding its complex Mixture-of-Experts architecture and training compute efficiency. The model's use of a permissive MIT license and detailed disclosure of active vs. total parameters sets a strong industry standard for open-weights models. However, like many frontier models, it maintains significant opacity regarding the specific composition and sourcing of its massive 14.8 trillion token training corpus.
Architectural Provenance
DeepSeek-V3.2 is extensively documented through a technical report and model cards. It explicitly builds on the DeepSeek-V3/V3.1-Terminus architecture, utilizing a Mixture-of-Experts (MoE) framework with 671B total and 37B active parameters. Key architectural innovations like DeepSeek Sparse Attention (DSA) and Multi-head Latent Attention (MLA) are detailed with mathematical formulations and diagrams in the official paper. The training methodology, including the use of Multi-Token Prediction (MTP) and an auxiliary-loss-free load balancing mechanism, is publicly described.
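As a rough illustration of the general idea behind sparse attention (score earlier tokens cheaply, keep only the top-k, then run ordinary softmax attention over that subset), consider the sketch below. It is a generic toy example and not the DSA formulation from the official paper: the function names `select_top_k` and `sparse_attend`, the ReLU dot-product scorer, and the reuse of the query and keys as indexer inputs are assumptions made for brevity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_top_k(index_query, index_keys, k):
    """Rank earlier tokens with a cheap scorer (ReLU of a dot product), keep k."""
    scores = [max(0.0, dot(index_query, key)) for key in index_keys]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])  # restore original token order

def sparse_attend(query, keys, values, selected):
    """Softmax attention restricted to the selected token positions."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w / norm * values[i][d] for w, i in zip(weights, selected))
            for d in range(dim)]

# Toy data: four earlier tokens with 2-d keys and 1-d values.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
query = [1.0, 0.0]

# For simplicity the indexer reuses the attention query and keys here;
# a real system would typically use a separate, smaller scorer.
selected = select_top_k(query, keys, k=2)
output = sparse_attend(query, keys, values, selected)
print("selected positions:", selected)
print("attention output:", output)
```

The cost of the attention step then scales with k rather than with the full context length, which is the property that makes this family of methods attractive for long inputs.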
Dataset Composition
While the total token count (14.8 trillion) and general categories (web pages, e-books, code, math) are disclosed, the specific proportions of the dataset are not provided in detail. The documentation mentions 'enhancing the ratio' of math and code but lacks a granular percentage breakdown. Information on the '128K long context extension data' is described as 'aligned' with previous versions but lacks specific source disclosure. The model uses a 'Large-Scale Agentic Task Synthesis Pipeline' for post-training, which is documented as a methodology, but the underlying raw data sources remain largely opaque.
Tokenizer Integrity
The model uses the same tokenizer as DeepSeek-V3, which is publicly accessible via Hugging Face (LlamaTokenizerFast). The vocabulary size and special tokens (e.g., for tool use and reasoning blocks) are clearly defined in the tokenizer_config.json. Documentation explicitly notes the stability of the tokenizer across versions V3 to V3.2, and the 128K context window alignment is verified through public configuration files.
Parameter Density
DeepSeek provides exemplary transparency regarding its MoE architecture. It clearly distinguishes between total parameters (671B) and active parameters (37B). The breakdown of experts (256 routed experts + 1 shared expert) is explicitly stated. This level of detail prevents the common 'parameter inflation' seen in other MoE models and allows for accurate computational modeling by third parties.
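The disclosed totals are enough for a back-of-the-envelope compute model. The sketch below derives the active fraction and a per-token forward-pass estimate from the 671B and 37B figures. The rule of thumb of roughly 2 FLOPs per active parameter per token is a common approximation and an assumption here, not a figure from the DeepSeek documentation; it ignores the attention cost that grows with context length.

```python
TOTAL_PARAMS = 671e9   # total parameters, as disclosed
ACTIVE_PARAMS = 37e9   # parameters activated per token, as disclosed

def active_fraction(total, active):
    """Share of the model's parameters used for a single token."""
    return active / total

def forward_flops_per_token(active_params):
    """Approximate forward-pass FLOPs per token.

    Assumption: about 2 FLOPs (one multiply, one add) per active weight.
    """
    return 2 * active_params

fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
flops = forward_flops_per_token(ACTIVE_PARAMS)

print(f"active fraction: {fraction:.1%}")             # 5.5%
print(f"forward pass: ~{flops / 1e9:.0f} GFLOPs per token")  # ~74
```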
Training Compute
The technical report gives specific compute figures: training ran on 2,048 NVIDIA H800 GPUs for approximately two months, for a total of 2.788 million GPU hours and an estimated cost of ~$5.576M at assumed rental rates. There is no formal third-party carbon audit and no direct energy measurement, but the reported compute intensity (180K GPU hours per trillion tokens) gives a basis for independent environmental impact estimates.
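The figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated in this section plus the 14.8 trillion token count given elsewhere on this page; the variable names are illustrative, and the wall-clock estimate assumes all 2,048 GPUs ran continuously for the whole period.

```python
GPU_HOURS_TOTAL = 2.788e6        # total GPU hours, as reported
NUM_GPUS = 2048                  # H800 GPUs, as reported
COST_USD = 5.576e6               # estimated cost, as reported
TOKENS_TRILLIONS = 14.8          # training corpus size, as reported
GPU_HOURS_PER_TRILLION = 180e3   # compute intensity, as reported

# Rental rate implied by the cost estimate.
implied_rate = COST_USD / GPU_HOURS_TOTAL

# Wall-clock time if every GPU were busy for the entire run (assumption).
wall_clock_days = GPU_HOURS_TOTAL / NUM_GPUS / 24

# GPU hours accounted for by the per-trillion-token figure alone.
token_based_hours = GPU_HOURS_PER_TRILLION * TOKENS_TRILLIONS
unexplained_hours = GPU_HOURS_TOTAL - token_based_hours

print(f"implied rate: ${implied_rate:.2f} per GPU hour")
print(f"wall-clock time: {wall_clock_days:.1f} days")
print(f"hours from per-token figure: {token_based_hours:,.0f}")
print(f"remaining hours: {unexplained_hours:,.0f}")
```

The implied rate works out to about $2 per GPU hour, and the wall-clock estimate of roughly 57 days is consistent with "approximately two months". The per-trillion-token figure accounts for most but not all of the total, so the remainder presumably covers stages outside the main token-based accounting.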
Benchmark Reproducibility
DeepSeek discloses scores across a wide array of standard benchmarks (MMLU-Pro, AIME 2025, SWE-Bench) and provides some evaluation scripts and chat templates on GitHub. However, the full evaluation pipeline and exact few-shot prompts for all benchmarks are not as comprehensively documented as the architecture. There is a noted discrepancy between internal math scores (~89% AIME) and some independent estimates, though the model's performance is generally verifiable through public leaderboards like LMSYS.
Identity Consistency
The model consistently identifies itself as a DeepSeek-developed AI and maintains version awareness (V3.2). It is transparent about its 'thinking' vs 'non-thinking' modes and the specific limitations of variants like 'Speciale' (which lacks tool-calling). There are no documented instances of the model claiming to be a competitor's product or denying its nature as an AI.
License Clarity
The model is released under the MIT License, which is one of the most permissive and transparent open-source licenses available. The license terms are clearly stated on GitHub and Hugging Face, explicitly permitting commercial use, modification, and redistribution without the restrictive 'acceptable use' clauses often found in other 'open' weights models.
Hardware Footprint
Hardware requirements are well-documented for various deployment scenarios. Official and community guides (vLLM, SGLang) provide specific VRAM requirements for FP16 (~1.5TB) and 4-bit quantization (~350-400GB). The impact of context length on memory scaling is addressed, and the documentation provides clear guidance on multi-GPU configurations (e.g., 8x or 16x 80GB GPUs) required for efficient inference.
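The quoted VRAM figures can be sanity-checked against a weights-only lower bound. The sketch below is a minimal estimate, assuming 671B parameters stored at a uniform bit width; the helper name `weight_memory_gb` is hypothetical. It deliberately excludes the KV cache, activations, and runtime overhead, which is why real deployments need more than these numbers.

```python
TOTAL_PARAMS = 671e9  # total parameters, as disclosed

def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

estimates = {
    name: weight_memory_gb(TOTAL_PARAMS, bits)
    for name, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]
}

for name, gb in estimates.items():
    print(f"{name}: {gb:,.1f} GB for weights only")
```

The FP16 lower bound of about 1,342 GB and the 4-bit lower bound of about 336 GB sit just under the ~1.5TB and ~350-400GB figures quoted above, which leaves the expected headroom for cache and runtime overhead.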
Versioning Drift
DeepSeek maintains a clear versioning history (V3 -> V3.1 -> V3.2) with associated technical updates for each. While it lacks a formal, real-time 'drift' dashboard, the release of specific checkpoints (e.g., 0324, 0528) and the use of semantic-style versioning allow developers to track changes. The transition from experimental (Exp) to stable releases is documented, though detailed changelogs for minor weight updates could be more granular.