Total Parameters
671B
Context Length
131,072 tokens (128K)
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT License
Release Date
20 Jan 2025
Knowledge Cutoff
-
Active Parameters (per token)
37.0B
Number of Experts
256 routed + 1 shared (per MoE layer)
Active Experts
8 routed + 1 shared (per token)
Attention Structure
Multi-head Latent Attention (MLA)
Hidden Dimension Size
7168 (MoE expert intermediate size: 2048)
Number of Layers
61
Attention Heads
128
Key-Value Heads
128
Activation Function
-
Normalization
-
Position Embedding
Rotary Position Embedding (RoPE)
VRAM requirements for different quantization methods and context sizes
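The interactive calculator is not reproduced here, but a rough figure can be derived from the parameter count alone: because every expert must be resident in memory, weight storage scales with the full 671B parameters, not the ~37B active per token. The sketch below is a back-of-envelope approximation only; the bits-per-weight values, the per-token KV-cache allowance, and the overhead term are assumptions, not the calculator's exact formula.

```python
# Back-of-envelope VRAM estimate for serving DeepSeek-R1 (671B total parameters).
# Illustrative sketch only: the bits-per-weight table, KV-cache allowance, and
# overhead term are assumptions, not the page calculator's exact method.

TOTAL_PARAMS = 671e9  # all experts must be resident, so weights scale with 671B

# Approximate bits per weight for common quantization formats (assumed values;
# block formats carry a small amount of scale/metadata overhead).
BITS_PER_WEIGHT = {
    "FP16/BF16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
}

def estimate_vram_gb(quant: str, context_tokens: int = 1024,
                     kv_bytes_per_token: float = 70e3,
                     overhead_gb: float = 10.0) -> float:
    """Weights + rough per-token KV-cache allowance + fixed runtime overhead.

    kv_bytes_per_token is a crude placeholder; MLA compresses the KV cache,
    so the true figure varies by inference engine and cache precision.
    """
    weight_gb = TOTAL_PARAMS * BITS_PER_WEIGHT[quant] / 8 / 1e9
    kv_gb = context_tokens * kv_bytes_per_token / 1e9
    return weight_gb + kv_gb + overhead_gb

if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant:>10}: ~{estimate_vram_gb(quant, 1024):,.0f} GB")
```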
DeepSeek-R1 represents a class of advanced reasoning models developed by DeepSeek, designed for complex computational tasks and logical inference. It is built upon a Mixture-of-Experts (MoE) architecture, featuring a total of 671 billion parameters, of which approximately 37 billion are actively engaged for each token during inference. This architecture, inherited from the DeepSeek-V3 base model, incorporates Multi-head Latent Attention (MLA), which compresses the key-value cache for efficient inference over long inputs, and includes an auxiliary-loss-free strategy for effective load balancing during training. The model further leverages Multi-Token Prediction (MTP) to enhance predictive accuracy and expedite output generation.
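To make the sparse-activation idea concrete, the following PyTorch sketch shows a generic top-k routed MoE layer with one shared expert. The dimensions, expert count, and softmax gating are simplified placeholders rather than DeepSeek's actual implementation, which additionally uses fine-grained expert segmentation and the auxiliary-loss-free load-balancing scheme mentioned above.

```python
# Minimal top-k routed MoE layer (illustrative sketch, not DeepSeek's implementation).
# Only the k selected experts run for each token, which is why the active parameter
# count is far smaller than the total parameter count.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    def __init__(self, d_model: int = 1024, d_ff: int = 2048,
                 n_routed: int = 16, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_routed)
        )
        # A shared expert that every token passes through, as in DeepSeek-style MoE blocks.
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: [num_tokens, d_model]
        weights = F.softmax(self.router(x), dim=-1)
        topk_w, topk_idx = weights.topk(self.k, dim=-1)        # each token picks k experts
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (topk_idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue                                       # expert sees no tokens this step
            routed[rows] += topk_w[rows, slots, None] * expert(x[rows])
        # The shared expert always runs; routed experts run only on their assigned tokens.
        return self.shared(x) + routed

tokens = torch.randn(8, 1024)
print(SimpleMoELayer()(tokens).shape)   # torch.Size([8, 1024])
```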
The training methodology for DeepSeek-R1 emphasizes reinforcement learning (RL) to cultivate sophisticated reasoning capabilities. Initially, a precursor, DeepSeek-R1-Zero, demonstrated emergent reasoning behaviors such as self-verification and the generation of multi-step chain-of-thought (CoT) sequences through large-scale RL without preliminary supervised fine-tuning (SFT). DeepSeek-R1 refines this approach by integrating a small amount of 'cold-start' data prior to the RL stages, which addresses challenges observed in DeepSeek-R1-Zero, such as repetitive outputs and language mixing, thereby enhancing model stability and overall reasoning performance. The training pipeline for DeepSeek-R1 specifically incorporates two RL stages focused on discovering improved reasoning patterns and aligning with human preferences, alongside two SFT stages that initialize the model's reasoning and non-reasoning capabilities.
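For the reasoning-oriented RL stages, DeepSeek reports relying largely on rule-based rewards: an accuracy check on the final answer and a format check that the chain of thought is enclosed in <think> tags. The sketch below illustrates that idea only; the scoring weights and the answer-extraction pattern are illustrative assumptions, not the published recipe.

```python
# Simplified rule-based reward in the spirit of DeepSeek-R1-Zero's RL setup:
# an accuracy term (does the extracted answer match the reference?) plus a
# format term (is the reasoning wrapped in <think>...</think> tags?).
# The weights and the boxed-answer extraction regex are illustrative assumptions.
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"\\boxed\{([^}]*)\}")   # assume math answers are boxed

def format_reward(completion: str) -> float:
    """1.0 if the completion contains exactly one well-formed <think> block."""
    return 1.0 if len(THINK_RE.findall(completion)) == 1 else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the final boxed answer matches the reference string exactly."""
    matches = ANSWER_RE.findall(completion)
    return 1.0 if matches and matches[-1].strip() == reference.strip() else 0.0

def total_reward(completion: str, reference: str,
                 w_acc: float = 1.0, w_fmt: float = 0.5) -> float:
    return w_acc * accuracy_reward(completion, reference) + w_fmt * format_reward(completion)

sample = "<think>2 + 2 is 4.</think> The answer is \\boxed{4}."
print(total_reward(sample, "4"))   # 1.5
```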
DeepSeek-R1 is engineered to excel in domains requiring analytical thought, including high-level mathematics, programming, and scientific inquiry. Its design supports a 128K-token context window, enabling processing of extended inputs. To broaden accessibility and deployment options, DeepSeek has also released several distilled versions of DeepSeek-R1, ranging from 1.5 billion to 70 billion parameters. These smaller models are designed to retain a significant portion of the reasoning capacity of the full model, making them suitable for environments with more constrained computational resources; a minimal local-deployment sketch follows.
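The distilled checkpoints can be loaded with standard Hugging Face transformers calls, as sketched below. The repository name refers to the published 1.5B distilled variant; the sampling settings and hardware assumptions (enough memory for the model in half precision, accelerate installed for device_map) are illustrative and should be adjusted per environment.

```python
# Minimal sketch: run a distilled DeepSeek-R1 variant locally with transformers.
# Assumes the Hugging Face checkpoint below and enough GPU/CPU memory for a
# 1.5B-parameter model; sampling settings are illustrative, not official defaults.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map needs accelerate
)

messages = [{"role": "user", "content": "What is 17 * 24? Think step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The model emits its chain of thought inside <think>...</think> before the answer.
outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```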
DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.
Ranking is for Local LLMs.
Rank
#1
| Benchmark | Score | Rank |
|---|---|---|
| Coding (LiveBench Coding) | 0.76 | 🥇 1 |
| Reasoning (LiveBench Reasoning) | 0.91 | 🥇 1 |
| Agentic Coding (LiveBench Agentic) | 0.27 | 🥇 1 |
| Mathematics (LiveBench Mathematics) | 0.85 | 🥇 1 |
| Data Analysis (LiveBench Data Analysis) | 0.72 | 🥇 1 |
| Web Development (WebDev Arena) | 1407.45 | 🥇 1 |
| Professional Knowledge (MMLU Pro) | 0.85 | 🥇 1 |
| Graduate-Level QA (GPQA) | 0.81 | 🥇 1 |
| StackEval (ProLLM Stack Eval) | 0.96 | 🥈 2 |
| QA Assistant (ProLLM QA Assistant) | 0.96 | 🥈 2 |
| General Knowledge (MMLU) | 0.81 | 🥉 3 |
| StackUnseen (ProLLM Stack Unseen) | 0.52 | 4 |
| Summarization (ProLLM Summarization) | 0.77 | 5 |
Overall Rank
#1 🥇
Coding Rank
#2 🥈