
DeepSeek-R1 671B

Total Parameters

671B

Context Length

131,072 tokens (128K)

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT License

Release Date

27 Dec 2024

Knowledge Cutoff

-

Technical Specifications

Active Parameters

37.0B

Number of Experts

64

Active Experts

6

Attention Structure

Multi-head Latent Attention (MLA)

Hidden Dimension Size

2048

Number of Layers

61

Attention Heads

128

Key-Value Heads

128

Activation Function

-

Normalization

-

Position Embedding

RoPE (Rotary Position Embedding)
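
The position-embedding entry above refers to rotary position embeddings (RoPE). As a point of reference, the sketch below shows the standard RoPE rotation applied to query or key vectors; the tensor sizes are toy values, and the decoupled-RoPE variant DeepSeek applies inside MLA is not reproduced here.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply standard rotary position embeddings to x of shape
    (seq_len, n_heads, head_dim). Minimal sketch only."""
    seq_len, n_heads, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(pos, inv_freq)      # (seq_len, half)
    cos = angles.cos()[:, None, :]           # broadcast over heads
    sin = angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 8, 64)   # toy sizes, not the model's real dimensions
print(rope(q).shape)         # torch.Size([16, 8, 64])
```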

System Requirements

VRAM requirements for different quantization methods and context sizes
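
No VRAM table is reproduced on this page, but a first-order estimate is straightforward: weight memory is roughly the parameter count times the bytes per weight, plus memory for activations and a KV cache that grows with context length. The sketch below is a rough, weights-only calculation under an assumed 10% overhead margin; it is not a substitute for measured serving requirements.

```python
def estimate_vram_gb(total_params_b: float, bits_per_weight: float,
                     overhead_fraction: float = 0.10) -> float:
    """Back-of-envelope weight memory: parameters x bytes per weight, plus a
    rough overhead margin (the 10% default is an assumption, not a measurement)."""
    weight_gb = total_params_b * (bits_per_weight / 8)  # billions of params -> GB
    return weight_gb * (1 + overhead_fraction)

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: ~{estimate_vram_gb(671, bits):.0f} GB")
# FP16: ~1476 GB, FP8: ~738 GB, INT4: ~369 GB (weights only, rough estimate)
```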

DeepSeek-R1 671B

DeepSeek-R1 represents a class of advanced reasoning models developed by DeepSeek, designed for complex computational tasks and logical inference. It is built on a Mixture-of-Experts (MoE) architecture with a total of 671 billion parameters, of which approximately 37 billion are activated during each inference pass. This architecture, inherited from the DeepSeek-V3 base model, incorporates Multi-head Latent Attention (MLA), which compresses the key-value cache for efficient inference over long inputs, and uses an auxiliary-loss-free strategy for load balancing across experts during training. The model further leverages Multi-Token Prediction (MTP) to enhance predictive accuracy and expedite output generation.
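
To make the sparse-activation idea concrete, here is a minimal top-k Mixture-of-Experts layer: a gating network scores every expert for each token, and only the top-k experts actually run, so compute per token scales with k rather than with the total expert count. The sizes below are toy values; real DeepSeek-V3/R1 routing adds shared experts, finer-grained gating, and the auxiliary-loss-free balancing mentioned above, none of which are reproduced here.

```python
import torch
import torch.nn.functional as F

class TopKMoE(torch.nn.Module):
    """Toy top-k Mixture-of-Experts layer (softmax gating, no shared experts)."""

    def __init__(self, dim: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = torch.nn.Linear(dim, n_experts, bias=False)
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(dim, 4 * dim),
                torch.nn.GELU(),
                torch.nn.Linear(4 * dim, dim),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        scores = F.softmax(self.gate(x), dim=-1)          # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)    # keep the k best experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                       # dispatch tokens expert by expert
            for e in idx[:, k].unique():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out

layer = TopKMoE()
print(layer(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```

In the full model, the gate and experts sit inside every MoE transformer layer, so per-token compute tracks the roughly 37 billion active parameters rather than the 671 billion total.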

The training methodology for DeepSeek-R1 emphasizes reinforcement learning (RL) to cultivate sophisticated reasoning capabilities. Initially, a precursor, DeepSeek-R1-Zero, demonstrated emergent reasoning behaviors such as self-verification and the generation of multi-step chain-of-thought (CoT) sequences through large-scale RL without preliminary supervised fine-tuning (SFT). DeepSeek-R1 refines this approach by integrating a small amount of 'cold-start' data prior to the RL stages, which addresses challenges observed in DeepSeek-R1-Zero, such as repetitive outputs and language mixing, thereby enhancing model stability and overall reasoning performance. The training pipeline for DeepSeek-R1 specifically incorporates two RL stages focused on discovering improved reasoning patterns and aligning with human preferences, alongside two SFT stages that initialize the model's reasoning and non-reasoning capabilities.
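
The staged pipeline described above can be summarized schematically. The stage names and data descriptions below are paraphrased from this summary rather than taken from the DeepSeek-R1 report, so treat them as an illustrative ordering only.

```python
# Schematic ordering of the DeepSeek-R1 training pipeline as described above.
# Labels are paraphrased, not official identifiers from the DeepSeek-R1 report.
PIPELINE = [
    ("SFT #1 (cold start)", "small curated dataset to stabilize early RL"),
    ("RL #1 (reasoning)",   "large-scale RL to discover improved reasoning patterns"),
    ("SFT #2",              "reasoning and non-reasoning data curated after the first RL stage"),
    ("RL #2 (alignment)",   "RL aligned with human preferences"),
]

for i, (stage, purpose) in enumerate(PIPELINE, start=1):
    print(f"Stage {i}: {stage} -> {purpose}")
```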

DeepSeek-R1 is engineered to excel in domains requiring analytical thought, including high-level mathematics, programming, and scientific inquiry. Its design supports a 131,072-token (128K) context window, enabling processing of extended inputs. To broaden accessibility and deployment options, DeepSeek has also released several distilled versions of DeepSeek-R1, ranging from 1.5 billion to 70 billion parameters. These smaller models are designed to retain a significant portion of the reasoning capacity of the full model, making them suitable for environments with more constrained computational resources.
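
For the distilled checkpoints, a standard Hugging Face transformers workflow is usually sufficient on a single GPU. The sketch below assumes a repository ID of the form deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B; check the DeepSeek organization on Hugging Face for the exact names, licenses, and recommended generation settings.

```python
# Minimal sketch of running a distilled checkpoint with Hugging Face transformers.
# The repository ID below is assumed for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = "Prove that the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```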

About DeepSeek-R1

DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.



Evaluation Benchmarks

Ranking is for Local LLMs.

Rank: #1

Benchmark | Score | Rank
- | 0.76 | 🥇 1
- | 0.91 | 🥇 1
LiveBench Agentic (Agentic Coding) | 0.27 | 🥇 1
- | 0.85 | 🥇 1
- | 0.72 | 🥇 1
WebDev Arena (Web Development) | 1407.45 | 🥇 1
MMLU Pro (Professional Knowledge) | 0.85 | 🥇 1
GPQA (Graduate-Level QA) | 0.81 | 🥇 1
- | 0.96 | 🥈 2
- | 0.96 | 🥈 2
MMLU (General Knowledge) | 0.81 | 🥉 3
- | 0.52 | 4
- | 0.77 | 5

Rankings

Overall Rank

#1 🥇

Coding Rank

#2 🥈
