Total Parameters
671B
Context Length
128K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT License
Release Date
21 Aug 2025
Knowledge Cutoff
-
Active Parameters per Token
37.0B
Number of Experts
257
Active Experts
8
Attention Structure
Multi-head Latent Attention (MLA)
Hidden Dimension
7168
Number of Layers
61
Attention Heads
-
Key-Value Heads
-
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
RoPE
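The activation, normalization, and position-embedding entries above describe standard pre-norm Transformer building blocks. Below is a minimal NumPy sketch of just the RMS Normalization and SwiGLU feed-forward pieces, for illustration only: the hidden size of 7168 and the 61-layer depth come from the spec, while the toy dimensions, weight scales, and intermediate width in the demo are assumptions, not the model's actual configuration.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMS Normalization: scale features by their root-mean-square,
    # then apply a learned per-channel gain (no mean subtraction, no bias).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: SiLU(x @ W_gate) gates (x @ W_up),
    # then the gated result is projected back to the hidden size.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))   # SiLU (swish) activation
    return (silu * (x @ w_up)) @ w_down

# Toy dimensions for a quick check; the real model uses hidden size 7168
# across 61 layers (its intermediate FFN width is not listed in the spec above).
hidden, intermediate = 64, 160
rng = np.random.default_rng(0)
x = rng.standard_normal((2, hidden))
y = swiglu_ffn(
    rms_norm(x, np.ones(hidden)),
    rng.standard_normal((hidden, intermediate)) * 0.05,
    rng.standard_normal((hidden, intermediate)) * 0.05,
    rng.standard_normal((intermediate, hidden)) * 0.05,
)
print(y.shape)  # (2, 64)
```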
VRAM Requirements by Quantization Method and Context Size
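As a rough way to ballpark these requirements, the sketch below estimates weight memory from the parameter count and the bits per weight of a given quantization, plus a naive key/value-cache term derived from the layer count, hidden dimension, and context length listed above. Multi-head Latent Attention compresses the KV cache far below this naive figure, so treat the cache term as an upper bound; the function name and defaults are illustrative assumptions, not measured values.

```python
def estimate_vram_gb(
    total_params_b=671.0,   # total parameters, in billions (from the spec)
    weight_bits=4,          # bits per weight for the chosen quantization
    context_len=128_000,    # tokens of context to cache
    n_layers=61,            # layers, from the spec
    hidden_dim=7168,        # hidden dimension, from the spec
    kv_bits=16,             # bits per KV-cache element
):
    # Weights: parameter count * bits per weight, converted to gigabytes.
    weight_gb = total_params_b * 1e9 * weight_bits / 8 / 1e9
    # Naive KV cache: 2 (K and V) * layers * hidden dim * context * bytes per element.
    # MLA stores a compressed latent instead, so the real cache is much smaller.
    kv_gb = 2 * n_layers * hidden_dim * context_len * (kv_bits / 8) / 1e9
    return weight_gb, kv_gb

for bits in (4, 8, 16):
    w, kv = estimate_vram_gb(weight_bits=bits)
    print(f"{bits}-bit weights: ~{w:.0f} GB weights + ~{kv:.0f} GB naive KV cache")
```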
A hybrid model that supports both "thinking" and "non-thinking" modes for chat, reasoning, and coding. It is a Mixture-of-Experts (MoE) model with a 128K context length and an architecture designed for efficient inference.
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model comprising 671B total parameters, of which 37B are activated per token. Its architecture incorporates Multi-head Latent Attention (MLA) and DeepSeekMoE for efficient inference and training. Innovations include an auxiliary-loss-free load-balancing strategy and a multi-token prediction training objective; the model was pre-trained on 14.8T tokens.
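To make the expert numbers above concrete (257 experts with 8 active per token, so only about 37B of the 671B parameters run per forward pass), here is a minimal top-k routing sketch. It illustrates generic softmax-gated top-k selection over 256 routed experts plus one always-active shared expert; it is not DeepSeek's actual router, which additionally applies a per-expert bias for its auxiliary-loss-free load balancing.

```python
import numpy as np

def route_tokens(hidden_states, router_weights, top_k=8):
    """Generic top-k MoE routing: pick the top_k routed experts per token
    and normalize their gate scores. Shapes: hidden_states (tokens, d),
    router_weights (d, n_routed_experts)."""
    logits = hidden_states @ router_weights                     # (tokens, experts)
    # Indices of the top_k experts per token (order within the k is irrelevant).
    top_idx = np.argpartition(logits, -top_k, axis=-1)[:, -top_k:]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Softmax over only the selected experts to get mixing weights.
    exp = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates = exp / exp.sum(axis=-1, keepdims=True)
    return top_idx, gates

# Toy setup: 256 routed experts + 1 shared expert => 257 total, 8 routed active per token.
d, n_routed = 32, 256
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, d))
router = rng.standard_normal((d, n_routed))
idx, gates = route_tokens(tokens, router, top_k=8)

# Each token's output would be shared_expert(x) + sum_k gates[k] * routed_expert[idx[k]](x),
# so only a small fraction of the experts (and parameters) runs per token.
print(idx.shape, gates.shape)   # (4, 8) (4, 8)
print(gates.sum(axis=-1))       # each row sums to 1.0
```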
Rankings apply to local LLMs.
Rank
#3
| Benchmark | Score | Rank |
|---|---|---|
| MMLU (General Knowledge) | 0.94 | 🥇 1 |
| Aider Coding (Coding) | 0.76 | 🥈 2 |
| MMLU Pro (Professional Knowledge) | 0.85 | 🥈 2 |
| GPQA (Graduate-Level QA) | 0.80 | 🥈 2 |