Parameters
32B
Context Length
131,072 tokens
Modality
Text
Architecture
Dense
License
MIT License
Release Date
20 Jan 2025
Knowledge Cutoff
Jul 2024
Attention Structure
Multi-Head Attention
Hidden Dimension Size
8192
Number of Layers
60
Attention Heads
96
Key-Value Heads
96
Activation Function
Swish
Normalization
RMS Normalization
Position Embedding
RoPE
VRAM requirements for different quantization methods and context sizes
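The interactive calculator itself is not reproduced here, but its output can be approximated by hand. The sketch below is a rough back-of-the-envelope estimate, not the page's calculator: it takes the layer count and hidden size from the spec table above, counts only weights plus KV cache, and ignores framework overhead, activation memory, and quantization metadata.

```python
# Rough VRAM estimate (weights + KV cache) for a dense 32B model.
# Back-of-the-envelope sketch, not the page's calculator: uses the layer count
# and hidden size from the spec table above and ignores framework overhead,
# activation memory, and quantization block metadata.

GIB = 1024 ** 3


def estimate_vram_gib(
    n_params: float = 32e9,   # total parameter count
    n_layers: int = 60,       # "Number of Layers" from the spec table
    hidden_size: int = 8192,  # "Hidden Dimension Size" from the spec table
    context_len: int = 1024,  # tokens held in the KV cache
    weight_bits: int = 4,     # 16 for FP16/BF16, 8 for 8-bit, 4 for 4-bit quantization
    kv_cache_bits: int = 16,  # the KV cache is usually kept in FP16/BF16
) -> float:
    """Approximate VRAM footprint in GiB: quantized weights plus KV cache."""
    weight_bytes = n_params * weight_bits / 8
    # With standard multi-head attention, K and V each store one hidden-size
    # vector per layer per token; grouped-query attention would shrink this
    # term by the ratio of KV heads to query heads.
    kv_bytes = 2 * n_layers * hidden_size * context_len * kv_cache_bits / 8
    return (weight_bytes + kv_bytes) / GIB


if __name__ == "__main__":
    for bits, label in [(16, "FP16"), (8, "8-bit"), (4, "4-bit")]:
        for ctx in (1024, 32768, 131072):
            gib = estimate_vram_gib(weight_bits=bits, context_len=ctx)
            print(f"{label:>5} weights @ {ctx:>6} tokens: ~{gib:.1f} GiB")
```

Under these assumptions the 4-bit weights alone come to roughly 15 GiB, while the KV cache grows linearly with context length and dominates near the full 131,072-token window.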
DeepSeek-R1-Distill-Qwen-32B is a 32-billion-parameter model engineered for advanced reasoning. It is a distilled version of the much larger DeepSeek-R1: built on the Qwen2.5-32B base model and fine-tuned on roughly 800,000 curated reasoning samples generated by the original DeepSeek-R1, it retains much of the teacher's complex problem-solving ability at a parameter count suited to broader deployment.
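As a concrete, heavily simplified illustration of what that fine-tuning stage looks like, the sketch below runs a standard supervised fine-tuning step on teacher-generated reasoning traces. It is not DeepSeek's training code: the tiny stand-in model `sshleifer/tiny-gpt2` and the two toy records are placeholders, and the real pipeline trains Qwen2.5-32B on roughly 800k curated samples.

```python
# Minimal sketch of distillation by supervised fine-tuning: train a student
# causal LM on (prompt, reasoning trace) pairs written by a stronger teacher.
# "sshleifer/tiny-gpt2" and the two toy records are placeholders, not
# DeepSeek's actual student model, data, or training pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_id = "sshleifer/tiny-gpt2"  # placeholder student; the real student is Qwen2.5-32B
tokenizer = AutoTokenizer.from_pretrained(student_id)
model = AutoModelForCausalLM.from_pretrained(student_id)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Each record pairs a prompt with a teacher-written chain of thought and answer.
teacher_samples = [
    {"prompt": "What is 17 * 24?",
     "response": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think> 408"},
    {"prompt": "Is 91 prime?",
     "response": "<think>91 = 7 * 13, so it has divisors besides 1 and itself.</think> No."},
]

model.train()
for sample in teacher_samples:
    text = sample["prompt"] + "\n" + sample["response"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Standard causal-LM objective: predict each next token of the teacher trace.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"loss = {outputs.loss.item():.3f}")
```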
Architecturally, DeepSeek-R1-Distill-Qwen-32B is a dense transformer. It uses Rotary Position Embedding (RoPE) to encode token positions and supports FlashAttention-2 for faster, more memory-efficient attention computation. The context window extends to 131,072 tokens, enough for long documents and extended chains of reasoning, and the design keeps the computational footprint manageable for detailed analytical tasks.
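A minimal loading sketch, assuming the public Hugging Face checkpoint `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`, enough GPU memory for the bf16 weights, and an installed `flash-attn` package; the prompt and sampling settings are illustrative only.

```python
# Minimal sketch: load the checkpoint with FlashAttention-2 enabled and run a
# short reasoning prompt. Assumes enough GPU memory for the bf16 weights and an
# installed flash-attn package (drop attn_implementation to use the default kernel).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",  # shard across available GPUs
)

messages = [{"role": "user", "content": "How many prime numbers are there below 30?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```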
Primary use cases include complex problem-solving, advanced mathematical reasoning, and coding across multiple programming languages. The model is supported by popular serving frameworks such as vLLM and SGLang, which simplifies integration into applications and research pipelines. It is released under the MIT License, which permits commercial use, modification, and derivative works, including further distillation, encouraging open research and broad adoption.
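For example, a minimal vLLM sketch for batched offline inference (the tensor-parallel degree and context cap below are illustrative and should be sized to the GPUs actually available):

```python
# Sketch: batched offline inference with vLLM. The tensor-parallel degree and
# context cap are illustrative; size them to the GPUs actually available.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    tensor_parallel_size=2,  # split the 32B weights across two GPUs
    max_model_len=32768,     # cap context below the full 131,072 to limit KV-cache memory
)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

prompts = ["Prove that the square root of 2 is irrational."]
for result in llm.generate(prompts, params):
    print(result.outputs[0].text)
```

The same checkpoint can also be exposed as an OpenAI-compatible endpoint with `vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`; SGLang provides an equivalent server launcher.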
DeepSeek-R1 is a model family developed for logical reasoning tasks. The flagship R1 models use a Mixture-of-Experts architecture with Multi-Head Latent Attention and are trained largely through reinforcement learning, with some variants incorporating cold-start supervised data; the distilled variants, including this 32B model, inherit the dense architectures of their Qwen and Llama base models.
Rankings are for local LLMs.
| Benchmark | Score | Rank |
| --- | --- | --- |
| LiveBench Reasoning | 0.44 | 14 |
| LiveBench Agentic Coding | 0.05 | 14 |
| LiveBench Mathematics | 0.60 | 15 |
| LiveBench Coding | 0.47 | 20 |
| LiveBench Data Analysis | 0.47 | 24 |
Overall Rank
#33
Coding Rank
#27