Parameters
7B
Context Length
131,072
Modality
Text
Architecture
Dense
License
Apache 2.0
Release Date
20 Jan 2025
Knowledge Cutoff
-
Attention Structure
Multi-Head Attention
Hidden Dimension Size
4096
Number of Layers
32
Attention Heads
64
Key-Value Heads
64
Activation Function
-
Normalization
RMS Normalization
Position Embedding
RoPE
VRAM Requirements for Different Quantization Methods and Context Sizes
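The detailed table is not reproduced here, but the sketch below shows how such estimates are commonly derived: quantized weight storage plus an FP16 KV cache that grows linearly with context length. This is a minimal back-of-the-envelope sketch, assuming the layer and head counts listed in the specification above; the head dimension and the 20% overhead allowance for activations and buffers are illustrative assumptions, not measured figures.

```python
# Rough VRAM estimate for a 7B dense model: quantized weights + FP16 KV cache.
# Layer/head counts follow the spec table above; head_dim and overhead are assumptions.

def estimate_vram_gb(
    params: float = 7e9,        # total parameter count
    bits_per_weight: int = 4,   # quantization level: 16, 8, 4, ...
    context_len: int = 32_768,  # tokens held in the KV cache
    n_layers: int = 32,         # from the spec table above
    n_kv_heads: int = 64,       # from the spec table above
    head_dim: int = 64,         # assumption: hidden size 4096 / 64 heads
    kv_bytes: int = 2,          # FP16 keys and values
    overhead: float = 1.2,      # assumed allowance for activations and buffers
) -> float:
    weight_bytes = params * bits_per_weight / 8
    # K and V each store context_len * n_kv_heads * head_dim elements per layer.
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_len
    return (weight_bytes + kv_cache_bytes) * overhead / 1024**3

for bits in (16, 8, 4):
    for ctx in (4_096, 32_768, 131_072):
        gb = estimate_vram_gb(bits_per_weight=bits, context_len=ctx)
        print(f"{bits:>2}-bit weights, {ctx:>7}-token context: ~{gb:.1f} GiB")
```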
DeepSeek-R1-Distill-Qwen-7B is a 7-billion-parameter dense language model from DeepSeek AI, derived through knowledge distillation from the larger DeepSeek-R1 system. Its primary design objective is robust reasoning, with a focus on mathematical reasoning, logical analysis, and code generation. Distillation lets the model capture much of the teacher's problem-solving ability in a far smaller, more computationally efficient form, making it suitable for deployment where resource constraints demand a compact footprint without a significant loss in reasoning performance.
The architectural foundation of DeepSeek-R1-Distill-Qwen-7B is the Qwen2.5-Math-7B model. Training centers on transferring the reasoning behavior of the DeepSeek-R1 teacher: the distilled model is fine-tuned on roughly 800,000 curated samples generated by the higher-capacity DeepSeek-R1, split into about 600,000 reasoning-focused examples and 200,000 non-reasoning examples. Because it is a fine-tune of Qwen2.5-Math-7B, the model keeps that base model's standard multi-head attention (rather than the Multi-Head Latent Attention used in the DeepSeek-R1 teacher) and uses Rotary Position Embeddings (RoPE) for positional encoding, with context-extension techniques such as YaRN used to scale its operational context.
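Since RoPE underpins both the base positional encoding and YaRN-style context extension, the following is a minimal numpy sketch of the rotation it applies to query and key vectors. The rotate-half pairing follows the common GPT-NeoX-style convention; the head dimension and the 10,000 base are illustrative defaults, not values confirmed for this specific checkpoint.

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10_000.0) -> np.ndarray:
    """Apply Rotary Position Embeddings to x: (seq_len, head_dim) vectors of one head."""
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per pair of dimensions (assumed base of 10,000).
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq_len), inv_freq)       # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(8, 64)   # 8 positions, illustrative head_dim of 64
print(rope(q).shape)         # (8, 64)
```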
In terms of practical application, DeepSeek-R1-Distill-Qwen-7B is configured to support extended contextual understanding, processing input sequences up to 131,072 tokens. This expanded context window enhances its capacity for handling complex, multi-step problems that necessitate a broad understanding of the input. The model is positioned for use in a variety of technical applications requiring analytical precision, including automated theorem proving, complex algorithmic problem-solving, and advanced programming assistance. Its compact design, coupled with its specialized reasoning aptitude, makes it a viable candidate for integration into systems requiring localized inference or deployment on consumer-grade hardware.
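As an illustration of the local-inference use case described above, the sketch below loads the checkpoint with Hugging Face Transformers and runs a single chat-style generation. The repository id matches the published checkpoint name; the sampling settings and prompt are illustrative assumptions, not officially recommended values.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Illustrative reasoning prompt; the model emits its chain of thought before the answer.
messages = [{"role": "user", "content": "Prove that the sum of two even integers is even."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```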
DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.
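For readers unfamiliar with the Mixture-of-Experts idea mentioned above, the sketch below shows top-k expert routing in its simplest form: a router scores experts per token, and only the top-scoring experts are evaluated. The expert count, gating, and toy feed-forward shapes are illustrative and do not reflect DeepSeek-R1's actual configuration.

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """x: (hidden,) token vector; gate_w: (hidden, n_experts);
    experts: list of (W, b) toy feed-forward weights, one per expert."""
    logits = x @ gate_w                           # router score for each expert
    top = np.argsort(logits)[-top_k:]             # indices of the top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        W, b = experts[idx]
        out += w * np.tanh(x @ W + b)             # weighted sum of the selected experts
    return out

hidden, n_experts = 16, 8
rng = np.random.default_rng(0)
x = rng.normal(size=hidden)
gate_w = rng.normal(size=(hidden, n_experts))
experts = [(rng.normal(size=(hidden, hidden)), np.zeros(hidden)) for _ in range(n_experts)]
print(moe_layer(x, gate_w, experts).shape)        # (16,)
```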
Rankings apply to local LLMs.
No evaluation benchmarks are available for DeepSeek-R1 7B.