
DeepSeek-R1 671B

Total Parameters: 671B
Context Length: 131,072 tokens
Modality: Text
Architecture: Mixture of Experts (MoE)
License: MIT License
Release Date: 27 Dec 2024
Knowledge Cutoff: -

Technical Specifications

Active Parameters: 37.0B
Number of Experts: 64
Active Experts: 6
Attention Structure: Multi-head Latent Attention (MLA)
Hidden Dimension Size: 2048
Number of Layers: 61
Attention Heads: 128
Key-Value Heads: 128
Activation Function: -
Normalization: -
Positional Embedding: RoPE

System Requirements

VRAM requirements by quantization method and context size

DeepSeek-R1 671B

DeepSeek-R1 represents a class of advanced reasoning models developed by DeepSeek, designed to facilitate complex computational tasks and logical inference. It is built upon a Mixture-of-Experts (MoE) architecture, featuring a total of 671 billion parameters, with approximately 37 billion parameters actively engaged during each inference pass. This architecture, inherited from the DeepSeek-V3 base model, incorporates Multi-head Latent Attention (MLA) to reduce attention memory costs over long contexts and includes an auxiliary-loss-free strategy for effective load balancing during training. The model further leverages Multi-Token Prediction (MTP) to enhance predictive accuracy and expedite output generation.
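To make the "active parameters" figure concrete, the sketch below implements a generic top-k routed MoE layer using the expert counts from the spec table (64 experts, 6 active per token) and the listed hidden size. The gating, expert shapes, and absence of a shared expert are simplifications for illustration, not DeepSeek-R1's actual routing or its auxiliary-loss-free balancing scheme.

```python
import torch
import torch.nn as nn

# Generic top-k routed MoE layer (illustrative only; not DeepSeek-R1's code).
# With 64 experts and 6 active per token, only a small fraction of the expert
# weights participate in any single forward pass.
class TinyMoELayer(nn.Module):
    def __init__(self, d_model=2048, n_experts=64, top_k=6):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.SiLU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        gate = torch.softmax(self.router(x), dim=-1)                # (tokens, experts)
        weights, expert_idx = torch.topk(gate, self.top_k, dim=-1)  # keep 6 of 64
        weights = weights / weights.sum(dim=-1, keepdim=True)       # renormalize gates
        out = torch.zeros_like(x)
        for t in range(x.size(0)):            # each token is processed by only its
            for slot in range(self.top_k):    # top_k experts; the rest stay idle
                e = int(expert_idx[t, slot])
                out[t] += weights[t, slot] * self.experts[e](x[t])
        return out

if __name__ == "__main__":
    layer = TinyMoELayer()
    tokens = torch.randn(4, 2048)
    print(layer(tokens).shape)  # torch.Size([4, 2048])
```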

The training methodology for DeepSeek-R1 emphasizes reinforcement learning (RL) to cultivate sophisticated reasoning capabilities. Initially, a precursor, DeepSeek-R1-Zero, demonstrated emergent reasoning behaviors such as self-verification and the generation of multi-step chain-of-thought (CoT) sequences through large-scale RL without preliminary supervised fine-tuning (SFT). DeepSeek-R1 refines this approach by integrating a small amount of 'cold-start' data prior to the RL stages, which addresses challenges observed in DeepSeek-R1-Zero, such as repetitive outputs and language mixing, thereby enhancing model stability and overall reasoning performance. The training pipeline for DeepSeek-R1 specifically incorporates two RL stages focused on discovering improved reasoning patterns and aligning with human preferences, alongside two SFT stages that initialize the model's reasoning and non-reasoning capabilities.
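For readers who want the stage ordering at a glance, the snippet below restates that pipeline as plain data; the stage names are informal paraphrases of the description above, and the actual datasets, reward signals, and hyperparameters are not part of this card.

```python
# Informal restatement of the four-stage DeepSeek-R1 training pipeline
# described above; names are paraphrases, not official stage identifiers.
TRAINING_PIPELINE = [
    ("sft_cold_start",   "small amount of curated cold-start data before RL"),
    ("rl_reasoning",     "large-scale RL to discover improved reasoning patterns"),
    ("sft_capabilities", "SFT seeding reasoning and non-reasoning capabilities"),
    ("rl_alignment",     "RL aligning outputs with human preferences"),
]

for step, (name, purpose) in enumerate(TRAINING_PIPELINE, start=1):
    print(f"{step}. {name}: {purpose}")
```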

DeepSeek-R1 is engineered to excel in domains requiring analytical thought, including high-level mathematics, programming, and scientific inquiry. Its design supports a large context length, enabling processing of extended inputs. To broaden accessibility and deployment options, DeepSeek has also released several distilled versions of DeepSeek-R1, ranging from 1.5 billion to 70 billion parameters. These smaller models are designed to retain a significant portion of the reasoning capacity of the full model, making them suitable for environments with more constrained computational resources.
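As a quick illustration of running one of these distilled checkpoints, the sketch below loads a distilled variant with Hugging Face Transformers. The repository ID, prompt, and generation settings are assumptions for illustration; verify the exact model ID and its license terms before use.

```python
# Minimal sketch: querying a distilled DeepSeek-R1 checkpoint via Transformers.
# The repo ID below is assumed from DeepSeek's published distilled releases;
# confirm the exact ID on Hugging Face before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = "Prove that the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```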

About DeepSeek-R1

DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.


Evaluation Benchmarks

Rankings apply to local LLMs.

Overall rank: #1

Benchmark score rankings

Category / Benchmark | Score | Rank
– | 0.76 | 🥇 1
– | 0.91 | 🥇 1
Agentic Coding / LiveBench Agentic | 0.27 | 🥇 1
– | 0.85 | 🥇 1
– | 0.72 | 🥇 1
Web Development / WebDev Arena | 1407.45 | 🥇 1
Professional Knowledge / MMLU Pro | 0.85 | 🥇 1
Graduate-Level QA / GPQA | 0.81 | 🥇 1
– | 0.96 | 🥈 2
– | 0.96 | 🥈 2
General Knowledge / MMLU | 0.81 | 🥉 3
– | 0.52 | 4
– | 0.77 | 5

Rankings

Overall rank: #1 🥇
Coding rank: #2 🥈

GPU Requirements

Required VRAM and recommended GPUs depend on the chosen weight quantization method and context size (1K to 128K tokens); see the full calculator for per-configuration estimates.
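For a rough sense of scale before using the calculator, the sketch below applies the common weight-memory rule of thumb (parameter count times bytes per parameter) to the 671B total; it ignores the KV cache, activations, and runtime overhead, so actual requirements, especially at long context lengths, are higher.

```python
# Back-of-envelope weight-only memory for DeepSeek-R1 671B at common precisions.
# Ignores KV cache, activations, and framework overhead, so real VRAM needs
# are higher, particularly at long context lengths.
TOTAL_PARAMS = 671e9  # total parameter count from the model card

BYTES_PER_PARAM = {
    "FP16/BF16": 2.0,
    "FP8": 1.0,
    "INT4": 0.5,
}

for precision, nbytes in BYTES_PER_PARAM.items():
    gib = TOTAL_PARAMS * nbytes / 1024**3
    print(f"{precision:>9}: ~{gib:,.0f} GiB for weights alone")
```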