
DeepSeek-R1 671B

Total Parameters

671B

Context Length

131,072 tokens (128K)

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT License

Release Date

20 Jan 2025

Knowledge Cutoff

-

Technical Specifications

Active Parameters (per token)

37.0B

Number of Experts

256 routed + 1 shared

Active Experts

8 (routed, per token)

Attention Structure

Multi-head Latent Attention (MLA)

Hidden Dimension Size

7168

Number of Layers

61

Attention Heads

128

Key-Value Heads

128

Activation Function

-

Normalization

-

Position Embedding

RoPE (Rotary Position Embedding)

DeepSeek-R1 671B

DeepSeek-R1 is an advanced reasoning model developed by DeepSeek, designed for complex computational tasks and logical inference. It is built on a Mixture-of-Experts (MoE) architecture with 671 billion total parameters, of which approximately 37 billion are active for each token during inference. The architecture, inherited from the DeepSeek-V3 base model, uses Multi-head Latent Attention (MLA) to compress the key-value cache for efficient inference and an auxiliary-loss-free strategy for effective expert load balancing during training. The model also leverages Multi-Token Prediction (MTP) to enhance predictive accuracy and expedite output generation.
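As a rough illustration of how an MoE layer activates only a subset of its parameters per token, here is a minimal top-k routing sketch in Python. This is generic softmax gating, not DeepSeek's code: the actual DeepSeek-V3 router uses its own gating scheme with auxiliary-loss-free load balancing, so treat this as a schematic only.

```python
import math

def top_k_route(gate_logits, k):
    """Select the top-k experts for one token from router logits.

    Returns (indices, weights), where weights are the softmax
    probabilities of the chosen experts, renormalized to sum to 1.
    """
    # Numerically stable softmax over all expert logits.
    m = max(gate_logits)
    exp = [math.exp(g - m) for g in gate_logits]
    total = sum(exp)
    probs = [e / total for e in exp]
    # Keep only the k highest-probability experts; the token's output
    # is then a weighted sum of just those experts' outputs.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]

# Toy example: 4 experts, route each token to the top 2.
indices, weights = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
```

Because only the selected experts run, per-token compute scales with the active parameter count rather than the total.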

The training methodology for DeepSeek-R1 emphasizes reinforcement learning (RL) to cultivate sophisticated reasoning capabilities. Initially, a precursor, DeepSeek-R1-Zero, demonstrated emergent reasoning behaviors such as self-verification and the generation of multi-step chain-of-thought (CoT) sequences through large-scale RL without preliminary supervised fine-tuning (SFT). DeepSeek-R1 refines this approach by integrating a small amount of 'cold-start' data prior to the RL stages, which addresses challenges observed in DeepSeek-R1-Zero, such as repetitive outputs and language mixing, thereby enhancing model stability and overall reasoning performance. The training pipeline for DeepSeek-R1 specifically incorporates two RL stages focused on discovering improved reasoning patterns and aligning with human preferences, alongside two SFT stages that initialize the model's reasoning and non-reasoning capabilities.

DeepSeek-R1 is engineered to excel in domains requiring analytical thought, including high-level mathematics, programming, and scientific inquiry. Its design supports a large context length, enabling processing of extended inputs. To broaden accessibility and deployment options, DeepSeek has also released several distilled versions of DeepSeek-R1, ranging from 1.5 billion to 70 billion parameters. These smaller models are designed to retain a significant portion of the reasoning capacity of the full model, making them suitable for environments with more constrained computational resources.

About DeepSeek-R1

DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.



Evaluation Benchmarks

Rank

#54

Benchmark | Score | Rank
– | 0.96 | #3
– | 0.57 | #4
– | 0.96 | #6
GPQA (Graduate-Level QA) | 0.81 | #13
WebDev Arena | 1398 | #19

Rankings

Overall Rank

#54

Coding Rank

#42

Model Transparency

Total Score

B+

76 / 100

DeepSeek-R1 671B Transparency Report


Audit Note

DeepSeek-R1 sets a high standard for transparency in architectural disclosure and licensing, providing a permissive MIT license and clear MoE parameter counts. While it excels in technical documentation of its training pipeline and hardware requirements, it remains relatively opaque regarding the specific sources and composition of its massive pre-training dataset. The model's commitment to open weights and detailed technical reports significantly aids reproducibility, though more granular data provenance and evaluation code would be required for a perfect score.

Upstream

20.5 / 30

Architectural Provenance

8.0 / 10

DeepSeek-R1 provides high transparency regarding its architecture, explicitly stating it is built upon the DeepSeek-V3-Base model. The technical report and GitHub documentation detail the Mixture-of-Experts (MoE) structure, the use of Multi-head Latent Attention (MLA), and the auxiliary-loss-free load balancing strategy. The training methodology is thoroughly documented, describing the transition from DeepSeek-R1-Zero (pure RL) to DeepSeek-R1 (cold-start SFT + RL). While the base model's pre-training is well-documented in the preceding V3 paper, the R1-specific modifications are clearly delineated.

Dataset Composition

4.0 / 10

While the model documentation mentions the use of 14.8 trillion tokens for the underlying V3 base model and specific 'cold-start' data (thousands of reasoning samples) for R1, it lacks a detailed public breakdown of the dataset composition by percentage or specific source names. The filtering and cleaning methodologies are described in general terms ('high-quality', 'carefully curated') without providing a comprehensive breakdown of web, code, and book proportions. The 800k samples used for distillation are mentioned, but the original pre-training data remains largely opaque beyond general categories.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via the official GitHub repository and Hugging Face. It uses a Byte-Pair Encoding (BPE) approach with a stated vocabulary size of approximately 129,280 tokens (though some configuration files show 151,665 to match embedding sizes). The tokenizer's alignment with the claimed multilingual and technical (code/math) capabilities is verifiable through the provided code and model files. Documentation exists for tokenization behavior, including the handling of special tokens for reasoning traces (<think> tags).
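Assuming the reasoning trace appears in the decoded output as literal `<think>`/`</think>` markers, as the documentation describes, a minimal sketch of separating the chain of thought from the final answer might look like this (the function name and example text are illustrative, not part of DeepSeek's tooling):

```python
import re

def split_reasoning(text):
    """Split a model response into (reasoning, answer) parts,
    assuming the chain of thought is wrapped in <think>...</think>."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

r, a = split_reasoning("<think>2+2 is 4.</think>The answer is 4.")
```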

Model

31.5 / 40

Parameter Density

9.0 / 10

DeepSeek-R1 is exemplary in its disclosure of parameter density for an MoE architecture. It explicitly states a total of 671 billion parameters with 37 billion active parameters per token. The architectural breakdown is further detailed in the technical report, clarifying the distribution of parameters between the dense layers and the MoE experts. This level of detail prevents the common 'parameter inflation' marketing trap seen in other sparse models.
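The disclosed figures imply a small activation ratio, which a quick sanity check makes concrete:

```python
total_params = 671e9   # disclosed total parameter count
active_params = 37e9   # disclosed active parameters per token
ratio = active_params / total_params
# ratio ≈ 0.055: only about 5.5% of the weights
# participate in any single forward pass
```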

Training Compute

7.5 / 10

The technical report provides specific details on training compute, stating the use of 2,048 NVIDIA H800 GPUs over a period of approximately two months for the R1 training phase. It also references the 2.788 million H800 GPU hours used for the V3 base model. While it provides hardware specifications and duration, it lacks a direct, official calculation of the total carbon footprint or a granular breakdown of energy consumption, though these can be estimated from the provided hardware and time data.
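The hardware and duration figures can be cross-checked with simple arithmetic; since "approximately two months" is a rough figure, this is only a consistency check, not an official calculation:

```python
gpus = 2048                    # H800 GPUs, per the technical report
reported_gpu_hours = 2.788e6   # GPU-hour figure cited for the V3 base model
days = reported_gpu_hours / (gpus * 24)
# days ≈ 56.7, i.e. roughly two months of wall-clock time,
# consistent with the stated cluster size and duration
```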

Benchmark Reproducibility

6.0 / 10

DeepSeek provides a comprehensive list of benchmark results (AIME, MATH-500, MMLU, etc.) and specifies the evaluation settings (temperature 0.6, top-p 0.95, 64 samples for pass@1). However, the full evaluation code and the exact prompts for every benchmark are not fully centralized in a single reproducible repository, and some third-party audits have noted difficulties in matching the exact reported scores without further clarification on prompt formatting (e.g., the impact of system prompts vs. user prompts).
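With 64 samples per problem, pass@1 is typically estimated by averaging correctness across the samples. The standard unbiased pass@k estimator from the code-generation evaluation literature, which reduces to mean accuracy at k=1, is a reasonable reading of the reported setup, though DeepSeek's exact evaluation script is not published:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws
    from n generated samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 64 samples of which 16 are correct: at k=1 this is
# simply the mean accuracy, 16/64 = 0.25.
p1 = pass_at_k(64, 16, 1)
```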

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, correctly identifying itself as DeepSeek-R1 and maintaining awareness of its versioning. It is transparent about its nature as a reasoning model and its reliance on chain-of-thought processing. There are no significant reports of the model claiming to be a competitor's product or denying its AI nature in official deployments. The distinction between the 'Zero' and standard R1 variants is clearly maintained in its self-identification.

Downstream

23.5 / 30

License Clarity

10.0 / 10

DeepSeek-R1 is released under the MIT License, which is one of the most transparent and permissive open-source licenses available. The license terms are explicitly stated in the GitHub repository and Hugging Face model cards, clearly allowing for commercial use, modification, and redistribution. There are no conflicting 'open-weight' custom licenses or hidden commercial restrictions that override the MIT terms for the 671B model.

Hardware Footprint

7.0 / 10

Hardware requirements are well-documented for various deployment scenarios. The documentation specifies that the full 671B model requires significant VRAM (~1.3TB in FP16, reduced with quantization) and recommends multi-GPU setups (e.g., 16x A100 80GB). Quantization tradeoffs are discussed by the community and supported by official model formats (GGUF, etc.), though official documentation could be more detailed regarding the specific accuracy-loss curves for different quantization levels (4-bit vs 8-bit) on the full 671B variant.
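The ~1.3 TB FP16 figure follows directly from the parameter count: MoE memory scales with the total 671B parameters even though per-token compute touches only the 37B active ones. A back-of-envelope helper (weights only; KV cache, activations, and framework overhead come on top):

```python
def weight_vram_gb(params: float, bytes_per_param: float) -> float:
    """Rough VRAM needed just to store the model weights."""
    return params * bytes_per_param / 1e9

fp16_gb = weight_vram_gb(671e9, 2.0)   # ≈ 1342 GB, matching the ~1.3 TB figure
int4_gb = weight_vram_gb(671e9, 0.5)   # ≈ 336 GB with 4-bit quantization
```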

Versioning Drift

6.5 / 10

DeepSeek maintains a versioning system (e.g., the 0528 update) and provides changelogs for major releases. However, the frequency of silent updates to the API-hosted versions has been a point of concern for some users, and the documentation of drift (specifically, how alignment or safety updates affect reasoning performance over time) is not as comprehensive as the initial architectural disclosures. Semantic versioning is used, but change documentation for minor weight updates is limited.
