Parameters
70B
Context Length
32,768 tokens (32K)
Modality
Text
Architecture
Dense
License
MIT License
Release Date
20 Jan 2025
Knowledge Cutoff
-
Attention Structure
Grouped-Query Attention (GQA)
Hidden Dimension Size
8192
Number of Layers
80
Attention Heads
64
Key-Value Heads
8
Activation Function
-
Normalization
-
Position Embedding
RoPE
VRAM requirements for different quantization methods and context sizes
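The interactive calculator itself is not reproduced here. As a rough rule of thumb, resident VRAM is approximately the quantized weight size plus the key-value cache for the chosen context length. The sketch below illustrates that arithmetic; the parameter count, bytes-per-weight figures, and the 8-key-value-head / 128-dimension head configuration are approximations assumed from the Llama 3.3 base, not values taken from this page.

```python
# Rough VRAM estimate: quantized weights + FP16 KV cache.
# All constants below are approximations for illustration, not official figures.

PARAMS = 70.6e9        # approximate parameter count of the 70B model
N_LAYERS = 80          # number of transformer layers (from the spec sheet)
N_KV_HEADS = 8         # assumed grouped-query KV heads of the Llama 3.3 base
HEAD_DIM = 128         # assumed head dimension (hidden size 8192 / 64 query heads)

# Approximate bytes stored per weight for a few common quantization schemes.
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8_0": 1.06, "Q4_K_M": 0.59}

def estimate_vram_gib(quant: str, context_tokens: int) -> float:
    """Return an approximate VRAM footprint in GiB, ignoring runtime overhead."""
    weights = PARAMS * BYTES_PER_WEIGHT[quant]
    # KV cache: 2 tensors (K and V) per layer, FP16 (2 bytes) per element.
    kv_cache = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * 2 * context_tokens
    return (weights + kv_cache) / 1024**3

if __name__ == "__main__":
    for quant in BYTES_PER_WEIGHT:
        for ctx in (1_024, 32_768):
            print(f"{quant:>7} @ {ctx:>6} tokens ≈ {estimate_vram_gib(quant, ctx):6.1f} GiB")
```

Actual usage will be somewhat higher once activation buffers and the inference framework's own overhead are included.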
DeepSeek-R1 is a family of advanced large language models developed by DeepSeek, designed with a primary focus on enhancing reasoning capabilities. The DeepSeek-R1-Distill-Llama-70B variant is a product of knowledge distillation, leveraging the reasoning strengths of the larger DeepSeek-R1 model and transferring them to a Llama-3.3-70B-Instruct base architecture. This distillation process aims to create a highly capable model that maintains the efficiency and operational characteristics of its base while inheriting sophisticated reasoning patterns.
Architecturally, DeepSeek-R1-Distill-Llama-70B is a dense transformer, distinguishing it from the Mixture-of-Experts (MoE) architecture of the original DeepSeek-R1. It inherits the grouped-query attention (GQA) of its Llama base, with 64 query heads sharing 8 key-value heads, rather than the Multi-head Latent Attention (MLA) used by the original DeepSeek-R1. The model integrates Rotary Position Embeddings (RoPE) to encode positional information and supports Flash Attention for optimized computational efficiency. This configuration enables the model to process substantial context lengths, supporting complex problem-solving.
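These architectural details can be checked directly against the published configuration. The minimal sketch below uses the Hugging Face transformers library and assumes the checkpoint is available under the model id deepseek-ai/DeepSeek-R1-Distill-Llama-70B; it fetches only the configuration file, not the 70B weights.

```python
# Inspect the model configuration without downloading the weights.
# Assumes the Hugging Face model id "deepseek-ai/DeepSeek-R1-Distill-Llama-70B".
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-70B")

print("architecture    :", cfg.architectures)           # dense Llama-style decoder
print("hidden size     :", cfg.hidden_size)
print("layers          :", cfg.num_hidden_layers)
print("query heads     :", cfg.num_attention_heads)
print("key/value heads :", cfg.num_key_value_heads)     # fewer than query heads => GQA
print("max positions   :", cfg.max_position_embeddings)
print("rope theta      :", cfg.rope_theta)              # RoPE base frequency
```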
This model is engineered for general text generation, code generation, and sophisticated problem-solving in domains that require logical inference and multi-step reasoning. Its design prioritizes efficient deployment, making it suitable for applications where computational resources are a constraint, including consumer-grade hardware. DeepSeek-R1-Distill-Llama-70B is particularly adept at tasks demanding structured thought, such as mathematical problem-solving and generating coherent code, extending its utility across technical and research applications.
DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.
Rankings below are relative to other local (open-weight) LLMs.
| Benchmark | Score | Rank |
| --- | --- | --- |
| LiveBench Reasoning | 0.60 | 11 |
| LiveBench Data Analysis | 0.61 | 12 |
| LiveBench Agentic Coding | 0.07 | 13 |
| LiveBench Mathematics | 0.59 | 16 |
| LiveBench Coding | 0.47 | 21 |
Overall Rank
#24
Coding Rank
#28