DeepSeek-R1 671B: Specifications and GPU VRAM Requirements

DeepSeek-R1 671B

Open source: Open weights
Total parameters: 671B
Context length: 131,072 tokens
Modality: Text
Architecture: Mixture of Experts (MoE)
License: MIT License
Release date: 27 Dec 2024
Training data cutoff:

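Technical Specifications

Active parameters per token: 37.0B
Number of experts:
Active experts:
Attention structure: Multi-head Latent Attention (MLA)
Hidden dimension size: 2048
Number of layers:
Attention heads: 128
Key-value heads: 128
Activation function:
Normalization:
Position embedding: RoPE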

System Requirements

VRAM requirements for different quantization methods and context sizes

DeepSeek-R1 671B

DeepSeek-R1 is a reasoning model developed by DeepSeek for complex computational tasks and logical inference. It uses a Mixture-of-Experts (MoE) architecture with 671 billion total parameters, of which approximately 37 billion are activated for each token. The architecture is inherited from the DeepSeek-V3 base model and includes Multi-head Latent Attention (MLA), which compresses the key-value cache for efficient inference, and an auxiliary-loss-free strategy for balancing load across experts during training. The model also uses Multi-Token Prediction (MTP) to improve predictive accuracy and speed up output generation.
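
The gap between total and active parameters comes from sparse routing: a router scores the experts for each token and only the highest-scoring ones run. The following is a minimal sketch of top-k routing on a toy layer, not DeepSeek-R1's actual gating code. The expert count, `k`, the router scores, the softmax normalization, and the helper names `top_k_route` and `moe_forward` are all illustrative assumptions; only the 671B and 37B figures come from the text above.

```python
import math

def top_k_route(scores, k):
    """Select the k highest-scoring experts and softmax-normalize their weights."""
    # Indices of the k largest router scores (ties keep their original order).
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Subtract the peak score before exponentiating for numerical stability.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_scores, k):
    """Run only the selected experts and mix their outputs by routing weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_scores, k))

# Hypothetical toy layer: four "experts" that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_scores = [0.1, 2.0, -1.0, 2.0]  # assumed scores for a single token

routed = top_k_route(router_scores, k=2)
output = moe_forward(1.0, experts, router_scores, k=2)

# Share of the model that is active per token, using the figures quoted above.
ACTIVE_FRACTION = 37e9 / 671e9
print(f"experts used: {[i for i, _ in routed]}, active share: {ACTIVE_FRACTION:.1%}")
```

Because only the selected experts execute, per-token compute scales with the active parameters rather than with the total parameter count.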

DeepSeek-R1 is trained primarily with reinforcement learning (RL) to develop its reasoning capabilities. Its precursor, DeepSeek-R1-Zero, was trained with large-scale RL and no preliminary supervised fine-tuning (SFT), and it showed emergent reasoning behaviors such as self-verification and multi-step chain-of-thought (CoT) generation. DeepSeek-R1 adds a small amount of 'cold-start' data before the RL stages to address problems observed in DeepSeek-R1-Zero, such as repetitive outputs and language mixing, which improves stability and overall reasoning performance. The full pipeline has two RL stages, aimed at discovering improved reasoning patterns and aligning with human preferences, and two SFT stages that seed the model's reasoning and non-reasoning capabilities.

DeepSeek-R1 targets domains that require analytical reasoning, including advanced mathematics, programming, and scientific problem solving. Its 131,072-token context length allows it to process long inputs. DeepSeek has also released several distilled versions of DeepSeek-R1, ranging from 1.5 billion to 70 billion parameters. These smaller models are designed to retain much of the full model's reasoning ability while running in environments with more limited compute.

About DeepSeek-R1

DeepSeek-R1 is a model family developed for logical reasoning tasks. It uses a Mixture-of-Experts architecture for computational efficiency and scalability, together with Multi-head Latent Attention, and is trained with reinforcement learning; some variants also incorporate cold-start data.

Evaluation Benchmarks

Ranks are relative to other local LLMs.

Category	Benchmark	Score	Rank
Coding	LiveBench Coding	0.76	🥇 1
Reasoning	LiveBench Reasoning	0.91	🥇 1
Graduate-Level QA	GPQA	0.81	🥇 1
Agentic Coding	LiveBench Agentic	0.27	🥈 2
Coding	Aider Coding	0.73	🥈 2
StackEval	ProLLM Stack Eval	0.96	🥈 2
Web Development	WebDev Arena	1392.62	🥈 2
General Knowledge	MMLU	0.81	🥈 2
Mathematics	LiveBench Mathematics	0.85	🥉 3
Data Analysis	LiveBench Data Analysis	0.72	🥉 3
QA Assistant	ProLLM QA Assistant	0.96	🥉 3
StackUnseen	ProLLM Stack Unseen	0.52	5
Summarization	ProLLM Summarization	0.77	7
Professional Knowledge	MMLU Pro	0.85	8
Refactoring	Aider Refactoring	0.33	17

Coding rank: #1 🥇

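GPU Requirements

Required VRAM depends on the quantization method chosen for the model weights and on the context size, which ranges from 1,024 tokens to 128k.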
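
As a rough guide to how quantization drives the memory footprint, the sketch below estimates the VRAM needed for the weights alone as parameters × bits per weight ÷ 8. It assumes all 671B parameters stay resident in GPU memory (in an MoE model every expert typically has to be loaded, even though only about 37B parameters are active per token). It ignores the KV cache, activations, and runtime overhead, so real requirements will be higher and will grow with context size. The function name and the list of bit widths are illustrative assumptions.

```python
GIB = 2**30  # bytes per GiB

def weight_vram_gib(total_params, bits_per_weight):
    """Approximate memory for model weights only, in GiB."""
    return total_params * bits_per_weight / 8 / GIB

TOTAL_PARAMS = 671e9  # total parameter count quoted for DeepSeek-R1 671B

# Assumed bit widths for common weight formats; real quantization schemes
# add small per-block overheads that this estimate does not model.
estimates = {
    name: weight_vram_gib(TOTAL_PARAMS, bits)
    for name, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]
}

for name, gib in estimates.items():
    print(f"{name:>6}: ~{gib:,.0f} GiB for weights")
```

Halving the bits per weight halves the weight footprint, which is why the choice of quantization method dominates the requirement at short context sizes.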

