Parameters
3B
Context Length
33K
Modality
Text
Architecture
Dense
License
Llama 3.2 Community License
Release Date
27 Dec 2024
Knowledge Cutoff
-
VRAM requirements for different quantization methods and context sizes
1,024 tokens
Consumer
1x RTX 4090
24GB VRAM
Datacenter
1x NVIDIA A100
80GB VRAM
Apple Silicon
1x Apple M3 Max
128GB VRAM
32,768 tokens
Consumer
2x RTX 4090
24GB VRAM
Datacenter
1x NVIDIA A100
80GB VRAM
Apple Silicon
1x Apple M3 Max
128GB VRAM
No evaluation benchmarks for DeepSeek-R1 3B available.
Overall Rank
-
Coding Rank
-
DeepSeek-R1 3B is a compact, dense language model variant developed through a distillation process from the larger DeepSeek-R1 architecture. This model is specifically built upon the Llama 3.2-3B foundational architecture, aiming to retain robust reasoning capabilities while significantly reducing computational resource requirements. Its design integrates a specialized chat templating system, ensuring compatibility with Llama 3 formatting, alongside custom tokenization to facilitate structured output and enhanced reasoning pathways.
The development methodology for DeepSeek-R1 3B incorporates several technical optimizations crucial for efficient training and inference. These include the application of LoRA (Low-Rank Adaptation) for fine-tuning, leveraging Flash Attention for accelerated self-attention computations, and utilizing gradient checkpointing to manage memory consumption during training. This architectural synthesis enables the model to process information with efficiency, making it suitable for deployment in environments where computational resources are a constraint.
The primary use cases for DeepSeek-R1 3B center on applications that demand structured reasoning and general language understanding, such as mathematical problem-solving or comparative analysis tasks. Its distilled nature allows it to deliver performance suitable for practical applications requiring a balance of reasoning fidelity and operational efficiency.
Attention
Attention Structure
Multi-Layer Attention
Attention Heads
48
Key-Value Heads
48
Attention Head Dimension
-
Position Embedding
ROPE
RoPE Theta
10,000
Sliding Window Attention
Yes
Sliding Window Size
4,096
Normalization
RMS Normalization
Activation Function
SwigLU
Dimensions
Hidden Dimension Size
3,072
Number of Layers
32
FFN Intermediate Size (Dense)
18,944
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
152,064
DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.
APX AI
Online