DeepSeek-R1-Distill-Llama-70B
Parameters
70B
Context Length
32,768 tokens (32K)
Modality
Text
Architecture
Dense
License
MIT License
Release Date
27 Dec 2024
Knowledge Cutoff
-
Attention Structure
Multi-Head Latent Attention (MLA)
Hidden Dimension Size
8192
Number of Layers
80
Attention Heads
112
Key-Value Heads
112
Activation Function
-
Normalization
-
Position Embedding
RoPE
VRAM Requirements by Quantization Method and Context Size
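As a rough guide to the figures this heading refers to, the sketch below estimates total VRAM as quantized weight storage plus the KV cache for a chosen context length. This is a common rule of thumb rather than a measurement: the layer count, hidden size, and head counts come from the card above, the `kv_ratio` knob is an assumption to cover grouped or latent KV schemes, and real deployments also pay for activations, quantization metadata, and runtime overhead.

```python
# Back-of-the-envelope VRAM estimate: quantized weights + KV cache.
# Not a measurement; actual usage depends on the inference runtime.

def estimate_vram_gib(
    n_params: float = 70e9,      # parameter count (70B, from the card above)
    weight_bits: int = 4,        # e.g. 16 (FP16/BF16), 8 (INT8), 4 (Q4)
    n_layers: int = 80,          # layer count from the card above
    hidden_size: int = 8192,     # hidden dimension from the card above
    kv_ratio: float = 1.0,       # n_kv_heads / n_heads (1.0 when KV heads == attention heads)
    context_len: int = 32_768,   # tokens held in the KV cache
    kv_bits: int = 16,           # KV cache precision
) -> float:
    weight_bytes = n_params * weight_bits / 8
    # K and V, per layer, per token: roughly 2 * hidden_size * kv_ratio elements.
    kv_bytes = 2 * n_layers * hidden_size * kv_ratio * context_len * kv_bits / 8
    return (weight_bytes + kv_bytes) / 1024**3


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights, 32K context: ~{estimate_vram_gib(weight_bits=bits):.0f} GiB")
```

Lowering the KV cache precision or the context length shrinks only the second term; weight quantization dominates at short contexts.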
DeepSeek-R1 is a family of advanced large language models developed by DeepSeek, designed with a primary focus on enhancing reasoning capabilities. The DeepSeek-R1-Distill-Llama-70B variant is a product of knowledge distillation, leveraging the reasoning strengths of the larger DeepSeek-R1 model and transferring them to a Llama-3.3-70B-Instruct base architecture. This distillation process aims to create a highly capable model that maintains the efficiency and operational characteristics of its base while inheriting sophisticated reasoning patterns.
Architecturally, DeepSeek-R1-Distill-Llama-70B is a dense transformer model, distinguishing it from the Mixture-of-Experts (MoE) architecture of the original DeepSeek-R1. It employs a Multi-Head Latent Attention (MLA) mechanism with 112 attention heads for processing input sequences. The model integrates Rotary Position Embeddings (RoPE) to encode positional information and uses Flash Attention for computational efficiency. This configuration enables the model to handle substantial context lengths in support of complex problem-solving.
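To make the RoPE mention concrete, below is a minimal NumPy sketch of rotary position embeddings in the "rotate-half" style used by Llama-family implementations. It is a generic illustration of the technique, not code extracted from this model, and the sequence length and head dimension in the example are arbitrary.

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to x of shape (seq_len, head_dim).

    Channel i is paired with channel i + head_dim//2 (the "rotate-half"
    convention); each pair is rotated by an angle that grows with the token
    position and decays with the channel index, so relative offsets between
    tokens show up directly in the query/key dot product.
    """
    seq_len, head_dim = x.shape
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))   # (half,) rotation frequencies
    angles = np.outer(np.arange(seq_len), inv_freq)       # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Example: rotate a random "query" for an 8-token sequence with head_dim 128.
q = np.random.randn(8, 128).astype(np.float32)
print(rope(q).shape)  # (8, 128)
```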
This model is engineered for general text generation, code generation, and sophisticated problem-solving in domains requiring logical inference and multi-step reasoning. Its design prioritizes efficient deployment, making it suitable for applications where computational resources are a consideration, including consumer-grade hardware. DeepSeek-R1-Distill-Llama-70B is particularly adept at tasks demanding structured thought, such as mathematical problem-solving and coherent code generation, extending its utility across technical and research applications.
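For deployment, a minimal inference sketch with Hugging Face transformers is shown below. The Hub repository id and generation settings are assumptions following common conventions for this model family, and running the unquantized 70B checkpoint requires multiple high-memory GPUs or a quantized build in practice.

```python
# Minimal inference sketch with Hugging Face transformers (repository id assumed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"  # assumed Hub repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",            # shard across available GPUs
)

messages = [{"role": "user", "content": "Solve step by step: what is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```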
DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.
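To illustrate why a Mixture-of-Experts layer is compute-efficient, the sketch below routes each token to the top-k of E experts, so only a fraction of the layer's parameters is active per token. This is a generic top-k routing illustration, not DeepSeek-R1's actual router.

```python
# Generic top-k MoE routing sketch: only k of E experts run per token.
import numpy as np

def moe_layer(x, expert_weights, gate_weights, k=2):
    """x: (d,) token vector; expert_weights: (E, d, d); gate_weights: (E, d)."""
    scores = gate_weights @ x                       # (E,) router logits
    top = np.argsort(scores)[-k:]                   # indices of the k best experts
    gates = np.exp(scores[top] - scores[top].max())
    gates /= gates.sum()                            # softmax over the selected experts
    # Only the k selected experts are evaluated; the rest are skipped entirely.
    return sum(g * (expert_weights[e] @ x) for g, e in zip(gates, top))

d, E = 16, 8
x = np.random.randn(d)
out = moe_layer(x, np.random.randn(E, d, d), np.random.randn(E, d))
print(out.shape)  # (16,)
```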
Rankings apply to local LLMs.
Rank
#24
| Benchmark | Score | Rank |
|---|---|---|
| Reasoning: LiveBench Reasoning | 0.60 | 11 |
| Data Analysis: LiveBench Data Analysis | 0.61 | 12 |
| Agentic Coding: LiveBench Agentic | 0.07 | 13 |
| Mathematics: LiveBench Mathematics | 0.59 | 16 |
| Coding: LiveBench Coding | 0.47 | 21 |