| Attribute | Value |
|---|---|
| Parameters | 32B |
| Context Length | 131,072 tokens |
| Modality | Text |
| Architecture | Dense |
| License | MIT License |
| Release Date | 27 Dec 2024 |
| Knowledge Cutoff | Jul 2024 |
| Attention Structure | Multi-Head Attention |
| Hidden Dimension Size | 8192 |
| Number of Layers | 60 |
| Attention Heads | 96 |
| Key-Value Heads | 96 |
| Activation Function | Swish |
| Normalization | RMS Normalization |
| Position Embedding | RoPE |
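These figures can be cross-checked against the published configuration. The sketch below, assuming the Hugging Face repository id `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B` and the standard Qwen2-style attribute names, reads the relevant fields with `transformers`; the values in the downloaded `config.json` are authoritative.

```python
from transformers import AutoConfig

# Assumed repository id; attribute names follow the Qwen2 config convention.
config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")

print(config.hidden_size)              # hidden dimension size
print(config.num_hidden_layers)        # number of transformer layers
print(config.num_attention_heads)      # query heads
print(config.num_key_value_heads)      # key/value heads
print(config.max_position_embeddings)  # maximum context length
print(config.rope_theta)               # RoPE base frequency
```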
VRAM Requirements for Different Quantization Methods and Context Sizes
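A detailed VRAM table is not reproduced here, but a rough estimate can be derived from the figures above. The sketch below is a back-of-the-envelope calculation, assuming the listed hidden size and layer count and a plain multi-head KV cache; real memory use also depends on the runtime's activation buffers, overhead, and any KV-cache quantization, so treat the outputs as approximations only.

```python
def estimate_vram_gib(
    n_params: float = 32e9,       # parameter count from the spec table
    weight_bits: int = 16,        # e.g. 16 (FP16), 8 (Q8), 4 (Q4)
    n_layers: int = 60,           # layer count from the spec table
    hidden_size: int = 8192,      # hidden dimension from the spec table
    context_len: int = 131_072,   # tokens kept in the KV cache
    kv_bits: int = 16,            # KV-cache precision
) -> float:
    """Very rough VRAM estimate: weights + KV cache, ignoring activations and overhead."""
    weight_bytes = n_params * weight_bits / 8
    # One K and one V vector of size hidden_size per layer per token
    # (assumes key/value heads span the full hidden dimension, as in plain MHA).
    kv_bytes = 2 * n_layers * hidden_size * context_len * kv_bits / 8
    return (weight_bytes + kv_bytes) / 1024**3

print(f"FP16 weights, full 131,072-token context: {estimate_vram_gib():.1f} GiB")
print(f"4-bit weights, 8,192-token context: {estimate_vram_gib(weight_bits=4, context_len=8_192):.1f} GiB")
```

At the full 131,072-token window the KV cache dominates the total, which is why shorter contexts or quantized KV caches matter as much as weight quantization for local deployment.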
The DeepSeek-R1-Distill-Qwen-32B model represents a significant contribution to the field of large language models, specifically engineered for advanced reasoning tasks. This model is a distilled version that leverages the sophisticated reasoning capabilities of the larger DeepSeek-R1 model, transferring them into a more efficient 32-billion parameter architecture. It is built upon the Qwen2.5 series base model and fine-tuned using 800,000 curated reasoning samples generated by the original DeepSeek-R1, enabling it to perform complex problem-solving with a reduced parameter count suitable for broader deployment.
From an architectural standpoint, DeepSeek-R1-Distill-Qwen-32B is a dense transformer model. It incorporates the RoPE (Rotary Position Embedding) mechanism for handling sequence position information and utilizes FlashAttention-2 for optimized attention computation, enhancing efficiency and throughput. The model is designed with a context length of up to 131,072 tokens, allowing for processing and generation of extended sequences crucial for detailed analytical tasks. This architectural design prioritizes effective reasoning and generation while maintaining a manageable computational footprint.
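To make the position-embedding choice concrete, here is a minimal sketch of rotary position embeddings in the common "rotate-half" form. It illustrates the general RoPE formulation rather than the model's exact implementation; the head dimension and base frequency below are placeholder values.

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (seq_len, head_dim).

    Channel i in the first half is paired with channel i + head_dim/2, and each
    pair is rotated by an angle position * base**(-2i/head_dim).
    """
    seq_len, head_dim = x.shape
    half = head_dim // 2
    inv_freq = base ** (-np.arange(half) * 2.0 / head_dim)    # (half,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation of each channel pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Placeholder query tensor: 8 positions, head dimension 128
q = np.random.randn(8, 128)
print(rope(q).shape)  # (8, 128)
```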
The model's primary use cases include complex problem-solving, advanced mathematical reasoning, and robust coding performance across multiple programming languages. It is compatible with popular deployment frameworks such as vLLM and SGLang, facilitating its integration into various applications and research initiatives. The DeepSeek-R1-Distill-Qwen-32B model is released under the MIT License, which supports commercial use and permits modifications and derivative works, including further distillation. This licensing approach promotes open research and widespread adoption within the machine learning community.
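As a concrete deployment illustration, the sketch below loads the model through vLLM's offline `LLM` API. The repository id, sampling values (the temperature of 0.6 and top-p of 0.95 often suggested for R1-style models), and GPU count are assumptions rather than fixed requirements; a 32B model typically needs multiple GPUs or quantized weights.

```python
from vllm import LLM, SamplingParams

# Assumed Hugging Face repo id; adjust tensor_parallel_size to the GPUs available.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    tensor_parallel_size=2,
    max_model_len=32768,  # cap the context to limit KV-cache memory
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)

outputs = llm.generate(
    ["Solve step by step: what is the sum of the first 100 positive integers?"],
    params,
)
print(outputs[0].outputs[0].text)
```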
DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.
Rankings apply to local LLMs.

Rank: #33
| Benchmark | Score | Rank |
|---|---|---|
| LiveBench Reasoning | 0.44 | 14 |
| LiveBench Agentic Coding | 0.05 | 14 |
| LiveBench Mathematics | 0.60 | 15 |
| LiveBench Coding | 0.47 | 20 |
| LiveBench Data Analysis | 0.47 | 24 |