DeepSeek-V3.2: Specifications and GPU VRAM Requirements

DeepSeek-V3.2

Open source: Open weights
Total parameters: 671B
Context length: 128K
Modality: Text
Architecture: Mixture of Experts (MoE)
License: MIT
Release date: 10 Jan 2026
Training data cutoff: May 2025

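Technical Specifications

Active parameters per token: 37.0B
Number of experts: 257 (256 routed, 1 shared)
Active experts:
Attention structure: Multi-Head Attention
Hidden dimension size: 7168
Number of layers:
Attention heads: 128
Key-value heads:
Activation function: SwiGLU
Normalization: RMS Normalization
Position embedding: Absolute Position Embedding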

DeepSeek-V3.2

DeepSeek-V3.2 is a large-scale Mixture-of-Experts (MoE) model optimized for agentic workflows and advanced reasoning tasks. It has 671 billion total parameters but activates only 37 billion of them for any given token. This sparse activation gives the model the representational capacity of its full 671B parameters while keeping per-token compute and latency closer to those of a much smaller dense model. Training uses a Multi-Token Prediction (MTP) objective, which densifies the training signal and improves the model's ability to plan several tokens ahead.
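To make the sparse-activation figures concrete, the short Python sketch below computes the share of weights used per token from the two parameter counts quoted above. The "2 FLOPs per active weight" figure is a common rule of thumb for a forward pass, assumed here for illustration; it is not a number taken from this page.

```python
# Parameter counts quoted in the description above.
TOTAL_PARAMS = 671e9
ACTIVE_PARAMS = 37e9


def active_fraction(total: float, active: float) -> float:
    """Share of the weights that take part in one token's forward pass."""
    return active / total


def approx_forward_flops_per_token(active: float) -> float:
    """Rule-of-thumb estimate (assumption): about 2 FLOPs per active weight."""
    return 2.0 * active


frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"active fraction: {frac:.1%}")  # active fraction: 5.5%
print(f"approx forward FLOPs per token: "
      f"{approx_forward_flops_per_token(ACTIVE_PARAMS):.2e}")  # 7.40e+10
```

By this estimate roughly one weight in eighteen is touched per token, which is why per-token compute resembles that of a 37B dense model even though all 671B parameters still have to be stored.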

DeepSeek-V3.2 is built on DeepSeek Sparse Attention (DSA), which extends the earlier Multi-head Latent Attention (MLA). MLA reduces memory use by compressing the Key-Value (KV) cache into a low-rank latent representation; DSA adds sparse attention on top of it, so that each query attends to a selected subset of earlier tokens, which eases the memory and throughput bottlenecks of long-context generation. The model also uses an auxiliary-loss-free load-balancing mechanism, which keeps expert utilization high without the quality penalty that an auxiliary balancing loss can introduce. It does this by adding a dynamically adjusted per-expert bias to the affinity scores used to route each token among the 256 routed experts; the one shared expert processes every token.
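The bias-based balancing idea can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not DeepSeek's implementation: the expert count (8), top-k (2), step size, and the randomly generated affinity scores are all hypothetical, and the sketch assumes the bias influences only which experts are selected while the gate weights are computed from the unbiased scores.

```python
import random


def route(scores, bias, k):
    """Select the top-k experts by (score + bias); weight them by raw score."""
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, load, step):
    """Nudge bias down for overloaded experts and up for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


def simulate(num_experts=8, k=2, tokens_per_batch=256, batches=200,
             step=0.01, seed=0):
    """Return the per-expert load of the first and the last batch."""
    rng = random.Random(seed)
    # Hypothetical skew: low-index experts get systematically higher scores.
    skew = [0.5 - 0.05 * e for e in range(num_experts)]
    bias = [0.0] * num_experts
    first_load = last_load = None
    for _ in range(batches):
        load = [0] * num_experts
        for _ in range(tokens_per_batch):
            scores = [rng.random() + skew[e] for e in range(num_experts)]
            for e in route(scores, bias, k):
                load[e] += 1
        if first_load is None:
            first_load = load
        last_load = load
        bias = update_bias(bias, load, step)  # no auxiliary loss involved
    return first_load, last_load


first, last = simulate()
print("load spread, first batch:", max(first) - min(first))
print("load spread, last batch: ", max(last) - min(last))
```

Because the correction is applied to the routing decision rather than added to the training loss, balancing does not compete with the language-modeling objective, which is the motivation the description gives for avoiding load-balancing penalties.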

Functionally, DeepSeek-V3.2 is designed to serve as a high-performance foundation for autonomous agents and complex problem-solving environments. It integrates a 'thinking' mode directly into tool-use scenarios, allowing for multi-step reasoning before executing external function calls. With a context window of 163,840 tokens and a training corpus comprising 14.8 trillion high-quality tokens, the model is suited for enterprise-grade applications requiring deep mathematical reasoning, competitive programming proficiency, and reliable multilingual generation. The release is governed by the MIT license, permitting broad use across both academic research and commercial production environments.

About DeepSeek-V3

DeepSeek-V3 is a Mixture-of-Experts (MoE) language model comprising 671B parameters, of which 37B are activated per token. Its architecture incorporates Multi-head Latent Attention and DeepSeekMoE for efficient inference and training. Innovations include an auxiliary-loss-free load balancing strategy and a multi-token prediction objective. The model was trained on 14.8T tokens.

Evaluation Benchmarks

Rank: #48

Category	Benchmark	Score	Rank
Coding	LiveBench Coding	0.76	12
Web Development	WebDev Arena	1419	13
Agentic Coding	LiveBench Agentic	0.47	14
Graduate-Level QA	GPQA	0.8	17
Reasoning	LiveBench Reasoning	0.44	28
Data Analysis	LiveBench Data Analysis	0.67	33
Mathematics	LiveBench Mathematics	0.64	35

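GPU Requirements

Required VRAM depends on the quantization method selected for the model weights and on the context size, which the calculator on the original page varies from 1,024 tokens up to 125k tokens.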
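As a rough guide to how quantization drives the VRAM figure, the Python sketch below estimates the memory needed for the weights alone at a few precisions. The precision names and bit widths are illustrative assumptions, not options listed on this page, and the estimate leaves out the KV cache, activations, and runtime overhead, all of which grow with context size.

```python
TOTAL_PARAMS = 671e9  # total parameter count from the spec sheet

# Assumed storage cost per weight, in bits, for some common precisions.
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "int4": 4}


def weight_memory_gb(num_params: float, bits: int) -> float:
    """Memory for the weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_memory_gb(TOTAL_PARAMS, bits):,.1f} GB")
# fp16: 1,342.0 GB
# fp8: 671.0 GB
# int4: 335.5 GB
```

The estimate uses the total parameter count rather than the 37B active per token, on the assumption that every expert is kept resident in GPU memory, since any of them may be selected for the next token.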

Resources

Official documentation, release notes, paper, model weights download, source code
