Total Parameters
671B
Context Length
131,072 tokens (128K)
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT License
Release Date
20 Jan 2025
Knowledge Cutoff
-
Active Parameters
37.0B
Number of Experts
256 routed + 1 shared
Active Experts
8
Attention Structure
Multi-head Latent Attention (MLA)
Hidden Dimension Size
7168
Number of Layers
61
Attention Heads
128
Key-Value Heads
128
Activation Function
-
Normalization
-
Position Embedding
RoPE (Rotary Position Embedding)
DeepSeek-R1 is an advanced reasoning model developed by DeepSeek, designed for complex computational tasks and logical inference. It is built on a Mixture-of-Experts (MoE) architecture with 671 billion total parameters, of which approximately 37 billion are active for each token during inference. The architecture, inherited from the DeepSeek-V3 base model, incorporates Multi-head Latent Attention (MLA), which compresses the key-value cache for efficient long-context inference, and an auxiliary-loss-free strategy for balancing load across experts during training. The model also leverages Multi-Token Prediction (MTP) to enhance predictive accuracy and expedite output generation.
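The top-k expert routing behind the total-vs-active parameter gap can be sketched as follows. The expert count, dimensions, and random weights here are illustrative toys, not DeepSeek-R1's actual configuration; the point is only that each token runs through the k highest-scoring experts, so most parameters sit idle on any given pass.

```python
import numpy as np

# Toy top-k MoE routing sketch (hypothetical sizes, random weights).
rng = np.random.default_rng(0)

n_experts, k = 8, 2          # assumed: 8 experts, 2 active per token
d_model = 16
x = rng.standard_normal(d_model)                 # one token's hidden state
router = rng.standard_normal((n_experts, d_model))

scores = router @ x                              # one router logit per expert
top_k = np.argsort(scores)[-k:]                  # indices of the k best experts
weights = np.exp(scores[top_k])
weights /= weights.sum()                         # softmax over selected experts

# Each selected expert is a small FFN; here just a random matrix.
experts = rng.standard_normal((n_experts, d_model, d_model))
y = sum(w * (experts[i] @ x) for w, i in zip(weights, top_k))

print(sorted(top_k.tolist()))
```

Only `experts[i]` for the selected indices participate in the forward pass, which is why the active parameter count is a small fraction of the total.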
The training methodology for DeepSeek-R1 emphasizes reinforcement learning (RL) to cultivate sophisticated reasoning capabilities. Initially, a precursor, DeepSeek-R1-Zero, demonstrated emergent reasoning behaviors such as self-verification and the generation of multi-step chain-of-thought (CoT) sequences through large-scale RL without preliminary supervised fine-tuning (SFT). DeepSeek-R1 refines this approach by integrating a small amount of 'cold-start' data prior to the RL stages, which addresses challenges observed in DeepSeek-R1-Zero, such as repetitive outputs and language mixing, thereby enhancing model stability and overall reasoning performance. The training pipeline for DeepSeek-R1 specifically incorporates two RL stages focused on discovering improved reasoning patterns and aligning with human preferences, alongside two SFT stages that initialize the model's reasoning and non-reasoning capabilities.
DeepSeek-R1 is engineered to excel in domains requiring analytical thought, including high-level mathematics, programming, and scientific inquiry. Its design supports a large context length, enabling processing of extended inputs. To broaden accessibility and deployment options, DeepSeek has also released several distilled versions of DeepSeek-R1, ranging from 1.5 billion to 70 billion parameters. These smaller models are designed to retain a significant portion of the reasoning capacity of the full model, making them suitable for environments with more constrained computational resources.
DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.
| Benchmark | Score | Rank |
|---|---|---|
| ProLLM QA Assistant | 0.96 | 🥉 3 |
| Aider Coding | 0.57 | 4 |
| ProLLM StackEval | 0.96 | 6 |
| GPQA (Graduate-Level QA) | 0.81 | 13 |
| WebDev Arena | 1398 | 19 |
Overall Rank
#54
Coding Rank
#42
Total Score
76 / 100
DeepSeek-R1 sets a high standard for transparency in architectural disclosure and licensing, providing a permissive MIT license and clear MoE parameter counts. While it excels in technical documentation of its training pipeline and hardware requirements, it remains relatively opaque regarding the specific sources and composition of its massive pre-training dataset. The model's commitment to open weights and detailed technical reports significantly aids reproducibility, though more granular data provenance and evaluation code would be required for a perfect score.
Architectural Provenance
DeepSeek-R1 provides high transparency regarding its architecture, explicitly stating it is built upon the DeepSeek-V3-Base model. The technical report and GitHub documentation detail the Mixture-of-Experts (MoE) structure, the use of Multi-head Latent Attention (MLA), and the auxiliary-loss-free load balancing strategy. The training methodology is thoroughly documented, describing the transition from DeepSeek-R1-Zero (pure RL) to DeepSeek-R1 (cold-start SFT + RL). While the base model's pre-training is well-documented in the preceding V3 paper, the R1-specific modifications are clearly delineated.
Dataset Composition
While the model documentation mentions the use of 14.8 trillion tokens for the underlying V3 base model and specific 'cold-start' data (thousands of reasoning samples) for R1, it lacks a detailed public breakdown of the dataset composition by percentage or specific source names. The filtering and cleaning methodologies are described in general terms ('high-quality', 'carefully curated') without providing a comprehensive breakdown of web, code, and book proportions. The 800k samples used for distillation are mentioned, but the original pre-training data remains largely opaque beyond general categories.
Tokenizer Integrity
The tokenizer is publicly accessible via the official GitHub repository and Hugging Face. It uses a Byte-Pair Encoding (BPE) approach with a stated vocabulary size of approximately 129,280 tokens (though some configuration files show 151,665 to match embedding sizes). The tokenizer's alignment with the claimed multilingual and technical (code/math) capabilities is verifiable through the provided code and model files. Documentation exists for tokenization behavior, including the handling of special tokens for reasoning traces (<think> tags).
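A minimal sketch of how a consumer of the model might separate the `<think>`-wrapped reasoning trace from the final answer. The regex-based parsing below is an assumption for illustration, not an official DeepSeek API; it relies only on the documented convention that the chain-of-thought is enclosed in `<think>...</think>` tags.

```python
import re

def split_reasoning(completion: str):
    """Split an R1-style completion into (reasoning, answer).

    Assumes the chain-of-thought is wrapped in <think>...</think>;
    everything after the closing tag is treated as the final answer.
    This parsing convention is a sketch, not an official interface.
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>2 + 2 is 4 by basic arithmetic.</think>The answer is 4."
)
print(answer)  # -> The answer is 4.
```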
Parameter Density
DeepSeek-R1 is exemplary in its disclosure of parameter density for an MoE architecture. It explicitly states a total of 671 billion parameters with 37 billion active parameters per token. The architectural breakdown is further detailed in the technical report, clarifying the distribution of parameters between the dense layers and the MoE experts. This level of detail prevents the common 'parameter inflation' marketing trap seen in other sparse models.
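The disclosed counts make the sparsity easy to verify with back-of-envelope arithmetic; both figures below come straight from the published parameter disclosure.

```python
# Sparsity check from the disclosed parameter counts.
total_params = 671e9    # total parameters (disclosed)
active_params = 37e9    # active parameters per token (disclosed)

active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of parameters active per token")  # ~5.5%
```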
Training Compute
The technical report provides specific details on training compute, stating the use of 2,048 NVIDIA H800 GPUs over a period of approximately two months for the R1 training phase. It also references the 2.788 million H800 GPU hours used for the V3 base model. While it provides hardware specifications and duration, it lacks a direct, official calculation of the total carbon footprint or a granular breakdown of energy consumption, though these can be estimated from the provided hardware and time data.
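The missing energy figures can be roughly estimated from the disclosed hardware. The 2,048-GPU count comes from the report; the 60-day duration (from "approximately two months") and the ~700 W per-GPU draw are assumptions, so this is an upper-bound sketch, not an official figure.

```python
# Rough energy estimate from disclosed hardware and duration.
gpus = 2048                      # disclosed H800 count
days = 60                        # "approximately two months" (assumed 60 days)
gpu_hours_r1 = gpus * days * 24  # upper-bound GPU-hours for the R1 phase

tdp_kw = 0.7                     # assumed per-GPU board power in kW
energy_mwh = gpu_hours_r1 * tdp_kw / 1000

print(f"{gpu_hours_r1:,} GPU-hours, ~{energy_mwh:,.0f} MWh at full power")
```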
Benchmark Reproducibility
DeepSeek provides a comprehensive list of benchmark results (AIME, MATH-500, MMLU, etc.) and specifies the evaluation settings (temperature 0.6, top-p 0.95, 64 samples for pass@1). However, the full evaluation code and the exact prompts for every benchmark are not fully centralized in a single reproducible repository, and some third-party audits have noted difficulties in matching the exact reported scores without further clarification on prompt formatting (e.g., the impact of system prompts vs. user prompts).
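The stated protocol (estimating pass@1 by averaging correctness over many samples per problem, 64 in the report) can be sketched as follows; the toy results below are made up for illustration.

```python
# Sketch of the reported pass@1 protocol: sample n completions per
# problem and average the per-problem fraction of correct samples.

def pass_at_1(correct_flags_per_problem):
    """pass@1 averaged over problems; each entry is a list of 0/1
    flags for that problem's sampled completions."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags_per_problem]
    return sum(per_problem) / len(per_problem)

# Two toy problems, 4 samples each (the report uses 64 samples).
results = [[1, 1, 0, 1], [0, 0, 1, 0]]
print(pass_at_1(results))  # -> 0.5
```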
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as DeepSeek-R1 and maintaining awareness of its versioning. It is transparent about its nature as a reasoning model and its reliance on chain-of-thought processing. There are no significant reports of the model claiming to be a competitor's product or denying its AI nature in official deployments. The distinction between the 'Zero' and standard R1 variants is clearly maintained in its self-identification.
License Clarity
DeepSeek-R1 is released under the MIT License, which is one of the most transparent and permissive open-source licenses available. The license terms are explicitly stated in the GitHub repository and Hugging Face model cards, clearly allowing for commercial use, modification, and redistribution. There are no conflicting 'open-weight' custom licenses or hidden commercial restrictions that override the MIT terms for the 671B model.
Hardware Footprint
Hardware requirements are well-documented for various deployment scenarios. The documentation specifies that the full 671B model requires significant VRAM (~1.3TB in FP16, reduced with quantization) and recommends multi-GPU setups (e.g., 16x A100 80GB). Quantization tradeoffs are discussed by the community and supported by official model formats (GGUF, etc.), though official documentation could be more detailed regarding the specific accuracy-loss curves for different quantization levels (4-bit vs 8-bit) on the full 671B variant.
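The ~1.3 TB FP16 figure follows directly from the parameter count. A quick weights-only sketch at common precisions (KV cache and activations are extra, and real quantized formats carry some overhead this ignores):

```python
# Back-of-envelope VRAM for the full 671B model's weights alone.
params = 671e9
bytes_per_param = {"fp16": 2, "int8": 1, "int4": 0.5}

for name, b in bytes_per_param.items():
    tb = params * b / 1e12
    print(f"{name}: ~{tb:.2f} TB of weights")
```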
Versioning Drift
DeepSeek maintains a versioning system (e.g., the 0528 update) and provides changelogs for major releases. However, the frequency of silent updates to the API-hosted versions has been a point of concern for some users, and the documentation for 'drift'—specifically how alignment or safety updates affect reasoning performance over time—is not as comprehensive as the initial architectural disclosures. Semantic versioning is used, but the granularity of change documentation for minor weights updates is limited.