Parameters: 8B
Context Length: 64K
Modality: Text
Architecture: Dense
License: MIT License
Release Date: 27 Dec 2024
Knowledge Cutoff: -
Attention Structure: Multi-Layer Attention
Hidden Dimension Size: 4096
Number of Layers: 40
Attention Heads: 64
Key-Value Heads: 64
Activation Function: -
Normalization: -
Position Embedding: RoPE
VRAM Requirements for Different Quantization Methods and Context Sizes
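A rough estimate can be derived from the figures in the specification above. The sketch below is a back-of-the-envelope estimator, not an official sizing tool: the quantization bit-widths and context lengths are illustrative, and activation memory plus framework overhead are ignored, so treat the results as approximate lower bounds.

```python
# Back-of-the-envelope VRAM estimator for an 8B dense model, using the layer,
# head, and hidden-size figures from the specification above. Activation memory
# and runtime overhead are ignored, so the results are rough lower bounds.
PARAMS = 8e9                  # 8B parameters
LAYERS = 40                   # number of layers (from the spec)
KV_HEADS = 64                 # key-value heads (from the spec)
HEAD_DIM = 4096 // 64         # hidden size / attention heads

BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def estimate_vram_gb(quant: str, context_tokens: int, kv_bytes: float = 2.0) -> float:
    """Approximate VRAM in GiB: model weights plus the KV cache for one sequence."""
    weights = PARAMS * BYTES_PER_PARAM[quant]
    # KV cache: keys and values for every layer and KV head across the context.
    kv_cache = 2 * LAYERS * KV_HEADS * HEAD_DIM * context_tokens * kv_bytes
    return (weights + kv_cache) / 1024**3

for quant in ("FP16", "INT8", "INT4"):
    for ctx in (8_192, 65_536):
        print(f"{quant:>4} @ {ctx:6d} tokens ≈ {estimate_vram_gb(quant, ctx):5.1f} GB")
```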
DeepSeek-R1 is a family of models developed with a focus on enhancing reasoning capabilities in large language models. The foundational DeepSeek-R1-Zero model was trained through large-scale reinforcement learning (RL) without an initial supervised fine-tuning (SFT) phase, demonstrating an emergent capacity for complex reasoning. Building on this, the DeepSeek-R1 model refines these capabilities by incorporating multi-stage training and cold-start data prior to the RL phase, addressing initial challenges with output readability and coherence.
The 8B variant, exemplified by DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-0528-Qwen3-8B, is aimed at efficient model deployment. These are dense models produced through a distillation process: smaller, open-source base models from the Llama and Qwen series are fine-tuned on high-quality reasoning data generated by the larger DeepSeek-R1 model. The objective of this distillation is to transfer the sophisticated reasoning patterns of the larger model into a more compact form, enabling the 8B variant to run in environments with constrained computational resources while maintaining strong performance in domains requiring intricate logical inference.
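As an illustration of this distillation-by-fine-tuning idea, the sketch below runs a standard supervised fine-tuning loop over teacher-generated reasoning traces with Hugging Face transformers. It is a minimal sketch under stated assumptions, not DeepSeek's actual pipeline: the student base-model ID, the JSONL file of prompt/response traces, and all hyperparameters are placeholders chosen for illustration.

```python
# Minimal sketch of distillation-style SFT: a small base model is fine-tuned on
# reasoning traces produced by a larger teacher model. Model ID, data file, and
# hyperparameters are illustrative placeholders, not the official recipe.
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "meta-llama/Llama-3.1-8B"        # hypothetical student base model
DATA_FILE = "teacher_reasoning_traces.jsonl"   # one {"prompt": ..., "response": ...} per line

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

def to_text(example):
    # Concatenate the prompt and the teacher-generated reasoning into one sequence.
    return {"text": example["prompt"] + "\n" + example["response"] + tokenizer.eos_token}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=4096)

dataset = (
    load_dataset("json", data_files=DATA_FILE, split="train")
    .map(to_text)
    .map(tokenize, remove_columns=["prompt", "response", "text"])
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="r1-distill-8b-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=2,
        learning_rate=1e-5,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```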
The DeepSeek-R1-0528 update, applied to the 8B distilled model, further refines its reasoning and inference capabilities through computational enhancements and algorithmic optimizations in the post-training phase. This iteration demonstrates improved depth of thought, reduced instances of hallucination, and enhanced support for function calling. The DeepSeek-R1 8B models are applicable across various technical use cases, including advanced AI research, automated code generation, mathematical problem-solving, and general natural language processing tasks that demand robust logical deduction.
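For local experimentation with these use cases, a minimal inference sketch with Hugging Face transformers is shown below. The repository ID follows the model name given above with the deepseek-ai organization prefix (an assumption), and the sampling settings are illustrative rather than official recommendations.

```python
# Minimal sketch of running the distilled 8B model locally with transformers.
# The Hub ID and generation settings below are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # assumed Hub repository ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Prove that the sum of two even integers is even."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# R1-style models emit their chain of thought inside <think> ... </think> tags
# before the final answer, so a generous token budget is used here.
output = model.generate(inputs, max_new_tokens=1024, temperature=0.6, do_sample=True)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```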
DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.
Rankings apply to local LLMs.
No evaluation benchmarks are available for DeepSeek-R1 8B.