| Specification | Value |
|---|---|
| Total Parameters | 671B |
| Context Length | 128K |
| Modality | Text |
| Architecture | Mixture of Experts (MoE) |
| License | MIT |
| Release Date | 10 Jan 2026 |
| Training Data Cutoff | May 2025 |
| Active Parameters (per token) | 37.0B |
| Number of Experts | 257 |
| Active Experts | 9 |
| Attention Structure | Multi-Head Attention |
| Hidden Dimension | 7168 |
| Layers | 61 |
| Attention Heads | 128 |
| Key-Value Heads | 1 |
| Activation Function | SwiGLU |
| Normalization | RMS Normalization |
| Position Embedding | Absolute Position Embedding |
DeepSeek-V3.2 represents an evolution in the deployment of large-scale Mixture-of-Experts (MoE) architectures, specifically optimized for agentic workflows and advanced reasoning tasks. The model utilizes 671 billion total parameters, but maintains a highly efficient inference profile by activating only 37 billion parameters for any given token. This sparse activation strategy allows the model to achieve the representational capacity of a trillion-parameter class model while maintaining the computational overhead and latency characteristic of much smaller dense architectures. The training objective incorporates a Multi-Token Prediction (MTP) strategy, which densifies training signals and improves the model's ability to plan subsequent outputs in complex sequences.
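The sparse-activation arithmetic above can be sketched with a toy top-k MoE router. The expert count, dimensions, and weights below are illustrative stand-ins, not DeepSeek's actual configuration; the point is that only `top_k` of `n_experts` expert weight matrices are touched per token:

```python
# Toy sketch of top-k sparse MoE routing (illustrative sizes, not DeepSeek's).
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d = 16, 2, 8            # toy stand-ins for 257 experts / 9 active
router_w = rng.normal(size=(d, n_experts))
expert_w = rng.normal(size=(n_experts, d, d)) / np.sqrt(d)

def moe_forward(x):
    """Route each token to its top-k experts; only those experts run."""
    scores = x @ router_w                              # (tokens, n_experts) affinities
    gates = np.exp(scores - scores.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)              # softmax gate weights
    topk = np.argsort(gates, axis=-1)[:, -top_k:]      # chosen expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in topk[t]:                              # only top-k experts execute
            out[t] += gates[t, e] * (x[t] @ expert_w[e])
    return out, topk

x = rng.normal(size=(4, d))                            # 4 tokens
y, chosen = moe_forward(x)
print(f"experts run per token: {top_k}/{n_experts}")
```

At the real model's scale the same selection step is what keeps per-token compute near 37B parameters despite a 671B total.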
The architectural foundation of DeepSeek-V3.2 is built upon DeepSeek Sparse Attention (DSA), a technical advancement over the previous Multi-head Latent Attention (MLA). DSA further optimizes memory utilization and throughput by employing a low-rank compression of Key-Value (KV) caches, effectively mitigating the memory bottlenecks typically encountered in long-context generation. The model also features an auxiliary-loss-free load balancing mechanism, which ensures high expert utilization without the performance trade-offs commonly associated with traditional load-balancing penalties. This is achieved through a dynamic bias adjustment that routes tokens based on real-time affinity scores across 256 routed experts and one shared expert.
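The bias-adjusted routing described above can be sketched as follows. In the spirit of the auxiliary-loss-free approach, a per-expert bias influences which experts are *selected* (not the gate values) and is nudged after each batch so overloaded experts become less likely to be picked; the step size, expert counts, and score distribution here are illustrative assumptions:

```python
# Hedged sketch of auxiliary-loss-free load balancing: bias shifts top-k
# selection only, and is updated from observed expert load (toy setup).
import numpy as np

rng = np.random.default_rng(1)
n_experts, top_k, tokens = 8, 2, 1000
bias = np.zeros(n_experts)
gamma = 0.01                                   # bias update speed (assumption)

def route(scores, bias):
    # bias changes which experts win the top-k race, not the gate weights
    return np.argsort(scores + bias, axis=-1)[:, -top_k:]

for step in range(300):
    scores = rng.normal(size=(tokens, n_experts))
    scores[:, 0] += 2.0                        # expert 0 is naturally over-picked
    load = np.bincount(route(scores, bias).ravel(), minlength=n_experts)
    target = tokens * top_k / n_experts        # ideal uniform load
    bias -= gamma * np.sign(load - target)     # penalize overloaded, boost underloaded

# Evaluate on a fresh skewed batch, with and without the learned bias.
scores = rng.normal(size=(tokens, n_experts))
scores[:, 0] += 2.0
load_unbalanced = np.bincount(route(scores, np.zeros(n_experts)).ravel(), minlength=n_experts)
load_balanced = np.bincount(route(scores, bias).ravel(), minlength=n_experts)
```

Without the bias, the skewed expert captures most of the routing; with it, loads settle near the uniform target, with no auxiliary loss term perturbing the main objective.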
Functionally, DeepSeek-V3.2 is designed to serve as a high-performance foundation for autonomous agents and complex problem-solving environments. It integrates a 'thinking' mode directly into tool-use scenarios, allowing for multi-step reasoning before executing external function calls. With a context window of 163,840 tokens and a training corpus comprising 14.8 trillion high-quality tokens, the model is suited for enterprise-grade applications requiring deep mathematical reasoning, competitive programming proficiency, and reliable multilingual generation. The release is governed by the MIT license, permitting broad use across both academic research and commercial production environments.
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model comprising 671B total parameters with 37B activated per token. Its architecture incorporates Multi-head Latent Attention and DeepSeekMoE for efficient inference and training, and its innovations include an auxiliary-loss-free load-balancing strategy and a multi-token prediction training objective. The model was pre-trained on 14.8T tokens.
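A minimal numpy sketch of what a multi-token prediction (MTP) objective computes: alongside the usual next-token loss, an extra head is also trained to predict the token two steps ahead, densifying the training signal. The depth count, the loss weight `lam`, and the random logits are illustrative assumptions; the real MTP modules are sequential transformer blocks, not shown here:

```python
# Toy MTP objective: depth d predicts token t+d; losses are combined.
import numpy as np

rng = np.random.default_rng(2)
seq_len, vocab = 10, 32
tokens = rng.integers(0, vocab, size=seq_len)

def cross_entropy(logits, targets):
    logits = logits - logits.max(-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

# one logit tensor per prediction depth (stand-ins for model outputs)
logits = {d: rng.normal(size=(seq_len, vocab)) for d in (1, 2)}

losses = {}
for d in (1, 2):
    preds = logits[d][: seq_len - d]   # positions that have a target d steps ahead
    targets = tokens[d:]
    losses[d] = cross_entropy(preds, targets)

lam = 0.3                               # weight on the extra-depth loss (assumption)
total_loss = losses[1] + lam * losses[2]
```

Only the depth-1 head is needed at inference time; the deeper heads exist to give each training position more supervision.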
Ranking
#48
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| Coding | LiveBench Coding | 0.76 | 12 |
| Web Development | WebDev Arena | 1419 | 13 |
| Agentic Coding | LiveBench Agentic | 0.47 | 14 |
| Graduate-Level QA | GPQA | 0.8 | 17 |
| Reasoning | LiveBench Reasoning | 0.44 | 28 |
| Data Analysis | LiveBench Data Analysis | 0.67 | 33 |
| Mathematics | LiveBench Mathematics | 0.64 | 35 |