
DeepSeek-R1 32B

Parameters

32B

Context Length

131,072 tokens

Modality

Text

Architecture

Dense

License

MIT License

Release Date

27 Dec 2024

Knowledge Cutoff

Jul 2024

Technical Specifications

Attention Structure

Multi-Head Attention

Hidden Dimension Size

8192

Number of Layers

60

Attention Heads

96

Key-Value Heads

96

Activation Function

Swish

Normalization

RMS Normalization

Position Embedding

RoPE (Rotary Position Embedding)
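
The position-embedding entry above refers to rotary position embeddings (RoPE), which rotate pairs of query/key feature dimensions by a position-dependent angle. The sketch below is illustrative only; the head dimension and base frequency are generic defaults, not values taken from this model's configuration.

```python
# Illustrative RoPE sketch: rotate pairs of feature dimensions by a position-dependent angle.
# head_dim and base are generic defaults, not values from this model's config.
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq_len, head_dim)."""
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # Per-pair rotation frequencies and per-position angles.
    inv_freq = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Standard "rotate-half" formulation used by most open implementations.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8, 128)   # 8 positions, head_dim = 128 (illustrative)
q_rotated = rope(q)
```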


DeepSeek-R1 32B

The DeepSeek-R1-Distill-Qwen-32B model is a distilled variant of DeepSeek-R1 engineered for advanced reasoning tasks. It transfers the reasoning capabilities of the larger DeepSeek-R1 model into a more efficient 32-billion-parameter architecture. The model is built on the Qwen2.5 series base model and fine-tuned on 800,000 curated reasoning samples generated by the original DeepSeek-R1, enabling it to perform complex problem-solving at a reduced parameter count suitable for broader deployment.
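
Conceptually, this distillation is plain supervised fine-tuning on teacher-generated reasoning traces. The sketch below illustrates the idea with a small stand-in model and a single placeholder sample; the model ID, prompt, and trace are illustrative assumptions, not DeepSeek's actual training pipeline or data.

```python
# Illustrative sketch of reasoning distillation as supervised fine-tuning (SFT):
# the student is trained to reproduce reasoning traces generated by the teacher (DeepSeek-R1).
# A small stand-in student model is used so the snippet runs on modest hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_id = "Qwen/Qwen2.5-0.5B"  # placeholder; the real distill starts from a Qwen2.5 32B base
tokenizer = AutoTokenizer.from_pretrained(student_id)
student = AutoModelForCausalLM.from_pretrained(student_id)

# One placeholder (prompt, teacher trace) pair standing in for the ~800k curated samples.
prompt = "Prove that the sum of two even integers is even.\n"
teacher_trace = "<think>Write a = 2m and b = 2n, so a + b = 2(m + n).</think> The sum is even."

# Standard causal-LM loss over prompt + trace (labels are the input ids themselves;
# real pipelines often mask the prompt tokens out of the loss).
batch = tokenizer(prompt + teacher_trace, return_tensors="pt")
loss = student(**batch, labels=batch["input_ids"]).loss

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
loss.backward()
optimizer.step()  # one SFT step; a real run loops over the full curated dataset
```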

From an architectural standpoint, DeepSeek-R1-Distill-Qwen-32B is a dense transformer model. It incorporates the RoPE (Rotary Position Embedding) mechanism for handling sequence position information and utilizes FlashAttention-2 for optimized attention computation, enhancing efficiency and throughput. The model is designed with a context length of up to 131,072 tokens, allowing for processing and generation of extended sequences crucial for detailed analytical tasks. This architectural design prioritizes effective reasoning and generation while maintaining a manageable computational footprint.
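
As an illustration of the points above, the following sketch loads the model with Hugging Face Transformers and requests the FlashAttention-2 backend. The repository name is the commonly published Hugging Face model ID; the dtype, device mapping, and generation settings are illustrative assumptions rather than values from this page.

```python
# Hedged example: load DeepSeek-R1-Distill-Qwen-32B with Transformers and FlashAttention-2.
# Requires a recent transformers release, the flash-attn package, and sufficient GPU memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # optimized attention backend mentioned above
    device_map="auto",
)

messages = [{"role": "user", "content": "How many prime numbers are below 50?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```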

The model's primary use cases include complex problem-solving, advanced mathematical reasoning, and robust coding performance across multiple programming languages. It is compatible with popular deployment frameworks such as vLLM and SGLang, facilitating its integration into various applications and research initiatives. The DeepSeek-R1-Distill-Qwen-32B model is released under the MIT License, which supports commercial use and permits modifications and derivative works, including further distillation. This licensing approach promotes open research and widespread adoption within the machine learning community.
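
Since the page notes vLLM compatibility, here is a minimal offline-inference sketch with vLLM. The tensor-parallel degree and reduced context window are deployment-specific assumptions, and the sampling settings (temperature 0.6, top-p 0.95) follow values commonly recommended for R1-style reasoning models rather than anything stated on this page.

```python
# Hedged vLLM sketch: offline batched generation with DeepSeek-R1-Distill-Qwen-32B.
# tensor_parallel_size and max_model_len are deployment-specific assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    tensor_parallel_size=2,   # split the 32B weights across two GPUs
    max_model_len=32768,      # shorter than the 131,072-token maximum to save memory
)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)

outputs = llm.generate(
    ["Write a Python function that returns the n-th Fibonacci number."],
    params,
)
print(outputs[0].outputs[0].text)
```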

About DeepSeek-R1

DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.



Evaluation Benchmarks

Rankings are relative to other local LLMs.

Rank

#33

Benchmark                            Score   Rank
?                                    0.44    14
LiveBench Agentic (Agentic Coding)   0.05    14
?                                    0.60    15
?                                    0.47    20
?                                    0.47    24

Rankings

Overall Rank

#33

Coding Rank

#27
