Total Parameters
671B
Context Length
131,072 tokens (128K)
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT License
Release Date
20 Jan 2025
Knowledge Cutoff
-
Active Parameters
37.0B
Number of Experts
256 routed + 1 shared
Active Experts
8
Attention Structure
Multi-head Latent Attention (MLA)
Hidden Dimension Size
7168
Number of Layers
61
Attention Heads
128
Key-Value Heads
128
Activation Function
-
Normalization
-
Position Embedding
RoPE (Rotary Position Embedding)
DeepSeek-R1 is an advanced reasoning model developed by DeepSeek, designed for complex computational tasks and logical inference. It is built on a Mixture-of-Experts (MoE) architecture with 671 billion total parameters, of which approximately 37 billion are active for each token during inference. The architecture, inherited from the DeepSeek-V3 base model, incorporates Multi-head Latent Attention (MLA), which compresses the key-value cache for efficient long-context inference, and an auxiliary-loss-free strategy for balancing load across experts during training. The model also leverages Multi-Token Prediction (MTP) to enhance predictive accuracy and expedite output generation.
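The top-k expert routing behind the total-vs-active parameter gap can be sketched as follows. The expert count, dimensions, and random weights here are illustrative toys, not DeepSeek-R1's actual configuration; the point is only that each token runs through the k highest-scoring experts, so most parameters sit idle on any given pass.

```python
import numpy as np

# Toy top-k MoE routing sketch (hypothetical sizes, random weights).
rng = np.random.default_rng(0)

n_experts, k = 8, 2          # assumed: 8 experts, 2 active per token
d_model = 16
x = rng.standard_normal(d_model)                 # one token's hidden state
router = rng.standard_normal((n_experts, d_model))

scores = router @ x                              # one router logit per expert
top_k = np.argsort(scores)[-k:]                  # indices of the k best experts
weights = np.exp(scores[top_k])
weights /= weights.sum()                         # softmax over selected experts

# Each selected expert is a small FFN; here just a random matrix.
experts = rng.standard_normal((n_experts, d_model, d_model))
y = sum(w * (experts[i] @ x) for w, i in zip(weights, top_k))

print(sorted(top_k.tolist()))
```

Only `experts[i]` for the selected indices participate in the forward pass, which is why the active parameter count is a small fraction of the total.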
The training methodology for DeepSeek-R1 emphasizes reinforcement learning (RL) to cultivate sophisticated reasoning capabilities. Initially, a precursor, DeepSeek-R1-Zero, demonstrated emergent reasoning behaviors such as self-verification and the generation of multi-step chain-of-thought (CoT) sequences through large-scale RL without preliminary supervised fine-tuning (SFT). DeepSeek-R1 refines this approach by integrating a small amount of 'cold-start' data prior to the RL stages, which addresses challenges observed in DeepSeek-R1-Zero, such as repetitive outputs and language mixing, thereby enhancing model stability and overall reasoning performance. The training pipeline for DeepSeek-R1 specifically incorporates two RL stages focused on discovering improved reasoning patterns and aligning with human preferences, alongside two SFT stages that initialize the model's reasoning and non-reasoning capabilities.
DeepSeek-R1 is engineered to excel in domains requiring analytical thought, including high-level mathematics, programming, and scientific inquiry. Its design supports a large context length, enabling processing of extended inputs. To broaden accessibility and deployment options, DeepSeek has also released several distilled versions of DeepSeek-R1, ranging from 1.5 billion to 70 billion parameters. These smaller models are designed to retain a significant portion of the reasoning capacity of the full model, making them suitable for environments with more constrained computational resources.
DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.
| Benchmark | Score | Rank |
|---|---|---|
| ProLLM QA Assistant | 0.96 | 🥉 3 |
| Aider Coding | 0.57 | 4 |
| ProLLM StackEval | 0.96 | 6 |
| GPQA (Graduate-Level QA) | 0.81 | 13 |
| WebDev Arena | 1398 | 19 |
Overall Rank
#54
Coding Rank
#42
Total Score
76 / 100
DeepSeek-R1 sets a high standard for transparency in architectural disclosure and licensing, providing a permissive MIT license and clear MoE parameter counts. While it excels in technical documentation of its training pipeline and hardware requirements, it remains relatively opaque regarding the specific sources and composition of its massive pre-training dataset. The model's commitment to open weights and detailed technical reports significantly aids reproducibility, though more granular data provenance and evaluation code would be required for a perfect score.
Architectural Provenance
DeepSeek-R1 provides high transparency regarding its architecture, explicitly stating it is built upon the DeepSeek-V3-Base model. The technical report and GitHub documentation detail the Mixture-of-Experts (MoE) structure, the use of Multi-head Latent Attention (MLA), and the auxiliary-loss-free load balancing strategy. The training methodology is thoroughly documented, describing the transition from DeepSeek-R1-Zero (pure RL) to DeepSeek-R1 (cold-start SFT + RL). While the base model's pre-training is well-documented in the preceding V3 paper, the R1-specific modifications are clearly delineated.
Dataset Composition
While the model documentation mentions the use of 14.8 trillion tokens for the underlying V3 base model and specific 'cold-start' data (thousands of reasoning samples) for R1, it lacks a detailed public breakdown of the dataset composition by percentage or specific source names. The filtering and cleaning methodologies are described in general terms ('high-quality', 'carefully curated') without providing a comprehensive breakdown of web, code, and book proportions. The 800k samples used for distillation are mentioned, but the original pre-training data remains largely opaque beyond general categories.
Tokenizer Integrity
The tokenizer is publicly accessible via the official GitHub repository and Hugging Face. It uses a Byte-Pair Encoding (BPE) approach with a stated vocabulary size of approximately 129,280 tokens (though some configuration files show 151,665 to match embedding sizes). The tokenizer's alignment with the claimed multilingual and technical (code/math) capabilities is verifiable through the provided code and model files. Documentation exists for tokenization behavior, including the handling of special tokens for reasoning traces (<think> tags).
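A minimal sketch of how a consumer of the model might separate the `<think>`-wrapped reasoning trace from the final answer. The regex-based parsing below is an assumption for illustration, not an official DeepSeek API; it relies only on the documented convention that the chain-of-thought is enclosed in `<think>...</think>` tags.

```python
import re

def split_reasoning(completion: str):
    """Split an R1-style completion into (reasoning, answer).

    Assumes the chain-of-thought is wrapped in <think>...</think>;
    everything after the closing tag is treated as the final answer.
    This parsing convention is a sketch, not an official interface.
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>2 + 2 is 4 by basic arithmetic.</think>The answer is 4."
)
print(answer)  # -> The answer is 4.
```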
Parameter Density
DeepSeek-R1 is exemplary in its disclosure of parameter density for an MoE architecture. It explicitly states a total of 671 billion parameters with 37 billion active parameters per token. The architectural breakdown is further detailed in the technical report, clarifying the distribution of parameters between the dense layers and the MoE experts. This level of detail prevents the common 'parameter inflation' marketing trap seen in other sparse models.
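The disclosed counts make the sparsity easy to verify with back-of-envelope arithmetic; both figures below come straight from the published parameter disclosure.

```python
# Sparsity check from the disclosed parameter counts.
total_params = 671e9    # total parameters (disclosed)
active_params = 37e9    # active parameters per token (disclosed)

active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of parameters active per token")  # ~5.5%
```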
Training Compute
The technical report provides specific details on training compute, stating the use of 2,048 NVIDIA H800 GPUs over a period of approximately two months for the R1 training phase. It also references the 2.788 million H800 GPU hours used for the V3 base model. While it provides hardware specifications and duration, it lacks a direct, official calculation of the total carbon footprint or a granular breakdown of energy consumption, though these can be estimated from the provided hardware and time data.
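The missing energy figures can be roughly estimated from the disclosed hardware. The 2,048-GPU count comes from the report; the 60-day duration (from "approximately two months") and the ~700 W per-GPU draw are assumptions, so this is an upper-bound sketch, not an official figure.

```python
# Rough energy estimate from disclosed hardware and duration.
gpus = 2048                      # disclosed H800 count
days = 60                        # "approximately two months" (assumed 60 days)
gpu_hours_r1 = gpus * days * 24  # upper-bound GPU-hours for the R1 phase

tdp_kw = 0.7                     # assumed per-GPU board power in kW
energy_mwh = gpu_hours_r1 * tdp_kw / 1000

print(f"{gpu_hours_r1:,} GPU-hours, ~{energy_mwh:,.0f} MWh at full power")
```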
Benchmark Reproducibility
DeepSeek provides a comprehensive list of benchmark results (AIME, MATH-500, MMLU, etc.) and specifies the evaluation settings (temperature 0.6, top-p 0.95, 64 samples for pass@1). However, the full evaluation code and the exact prompts for every benchmark are not fully centralized in a single reproducible repository, and some third-party audits have noted difficulties in matching the exact reported scores without further clarification on prompt formatting (e.g., the impact of system prompts vs. user prompts).
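The stated protocol (estimating pass@1 by averaging correctness over many samples per problem, 64 in the report) can be sketched as follows; the toy results below are made up for illustration.

```python
# Sketch of the reported pass@1 protocol: sample n completions per
# problem and average the per-problem fraction of correct samples.

def pass_at_1(correct_flags_per_problem):
    """pass@1 averaged over problems; each entry is a list of 0/1
    flags for that problem's sampled completions."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags_per_problem]
    return sum(per_problem) / len(per_problem)

# Two toy problems, 4 samples each (the report uses 64 samples).
results = [[1, 1, 0, 1], [0, 0, 1, 0]]
print(pass_at_1(results))  # -> 0.5
```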
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as DeepSeek-R1 and maintaining awareness of its versioning. It is transparent about its nature as a reasoning model and its reliance on chain-of-thought processing. There are no significant reports of the model claiming to be a competitor's product or denying its AI nature in official deployments. The distinction between the 'Zero' and standard R1 variants is clearly maintained in its self-identification.
License Clarity
DeepSeek-R1 is released under the MIT License, which is one of the most transparent and permissive open-source licenses available. The license terms are explicitly stated in the GitHub repository and Hugging Face model cards, clearly allowing for commercial use, modification, and redistribution. There are no conflicting 'open-weight' custom licenses or hidden commercial restrictions that override the MIT terms for the 671B model.
Hardware Footprint
Hardware requirements are well-documented for various deployment scenarios. The documentation specifies that the full 671B model requires significant VRAM (~1.3TB in FP16, reduced with quantization) and recommends multi-GPU setups (e.g., 16x A100 80GB). Quantization tradeoffs are discussed by the community and supported by official model formats (GGUF, etc.), though official documentation could be more detailed regarding the specific accuracy-loss curves for different quantization levels (4-bit vs 8-bit) on the full 671B variant.
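The ~1.3 TB FP16 figure follows directly from the parameter count. A quick weights-only sketch at common precisions (KV cache and activations are extra, and real quantized formats carry some overhead this ignores):

```python
# Back-of-envelope VRAM for the full 671B model's weights alone.
params = 671e9
bytes_per_param = {"fp16": 2, "int8": 1, "int4": 0.5}

for name, b in bytes_per_param.items():
    tb = params * b / 1e12
    print(f"{name}: ~{tb:.2f} TB of weights")
```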
Versioning Drift
DeepSeek maintains a versioning system (e.g., the 0528 update) and provides changelogs for major releases. However, the frequency of silent updates to the API-hosted versions has been a point of concern for some users, and the documentation for 'drift'—specifically how alignment or safety updates affect reasoning performance over time—is not as comprehensive as the initial architectural disclosures. Semantic versioning is used, but the granularity of change documentation for minor weights updates is limited.