Parameters
3B
Context Length
32,768 tokens
Modality
Text
Architecture
Dense
License
Llama 3.2 Community License
Release Date
27 Dec 2024
Knowledge Cutoff
-
Attention Structure
Grouped-Query Attention
Hidden Dimension Size
3072
Number of Layers
28
Attention Heads
24
Key-Value Heads
8
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
RoPE
DeepSeek-R1 3B is a compact, dense language model produced by distilling the much larger DeepSeek-R1 into the Llama 3.2-3B base architecture. The aim is to retain strong reasoning capability while sharply reducing computational resource requirements. The model ships with a chat template compatible with Llama 3 formatting, alongside custom special tokens that delimit its reasoning traces from its final answers.
The development methodology for DeepSeek-R1 3B incorporates several technical optimizations for efficient training and inference: LoRA (Low-Rank Adaptation) for fine-tuning, Flash Attention for accelerated self-attention computation, and gradient checkpointing to reduce memory consumption during training. Together, these make the model suitable for deployment in environments where computational resources are constrained.
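The core idea behind LoRA can be sketched in plain Python: a frozen weight matrix W is augmented by a low-rank update (alpha / r) * B @ A, and only A and B are trained. The toy dimensions below are for illustration only; a real setup would target the attention projections of the 3B model through a library such as peft.

```python
# Minimal LoRA sketch: effective weight = W + (alpha / r) * B @ A,
# where A is (r x d_in) and B is (d_out x r), with W kept frozen.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(x, W, A, B, alpha, r):
    """Apply the LoRA-adapted weight to a single input row x."""
    scale = alpha / r
    delta = matmul(B, A)              # (d_out x d_in) low-rank update
    W_eff = [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]
    return [sum(w * xi for w, xi in zip(w_row, x)) for w_row in W_eff]

# Frozen 2x2 identity weight, rank-1 adapters.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]                      # r x d_in = 1 x 2
B = [[0.5], [0.5]]                    # d_out x r = 2 x 1
print(lora_forward([2.0, 3.0], W, A, B, alpha=2, r=1))  # [7.0, 8.0]
```

Because r is much smaller than the weight dimensions, the trainable parameter count drops by orders of magnitude relative to full fine-tuning.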
The primary use cases for DeepSeek-R1 3B center on applications that demand structured reasoning and general language understanding, such as mathematical problem-solving or comparative analysis tasks. Its distilled nature allows it to deliver performance suitable for practical applications requiring a balance of reasoning fidelity and operational efficiency.
DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.
No evaluation benchmarks are available for DeepSeek-R1 3B.
Overall Rank
-
Coding Rank
-
Total Score
69
/ 100
DeepSeek-R1 3B exhibits high transparency regarding its architectural origins and parameter density, benefiting from the extensive documentation of its teacher model. However, it suffers from significant gaps in training compute disclosure and the lack of a publicly accessible training dataset. While the model is highly verifiable through third-party benchmarks and open weights, the reliance on a complex licensing structure for its base architecture introduces some ambiguity for commercial users.
Architectural Provenance
The model's base architecture is explicitly identified as Llama-3.2-3B. The DeepSeek-R1 technical report and associated GitHub documentation provide a clear methodology for the distillation process, which involves supervised fine-tuning (SFT) on 800,000 reasoning traces generated by the larger DeepSeek-R1 (671B) teacher model. While the base model is a standard transformer, the specific fine-tuning hyperparameters (learning rate 2e-5, batch size 16, 1 epoch) are publicly documented in the model card and repository.
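Taken at face value, the documented hyperparameters imply a fairly short training run; a quick sanity check, assuming each of the 800,000 samples is seen exactly once:

```python
# Back-of-the-envelope optimizer-step count from the documented
# hyperparameters (800k samples, batch size 16, 1 epoch).
samples = 800_000
batch_size = 16
epochs = 1

steps = samples // batch_size * epochs
print(steps)  # 50000 optimizer steps for the full distillation run
```

This is consistent with the characterization of the distillation as a lightweight SFT pass rather than a full pretraining effort.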
Dataset Composition
The training data is described as a curated set of 800,000 samples consisting of reasoning traces (Chain-of-Thought) and non-reasoning data generated by DeepSeek-R1. While the quantity and general nature of the data are disclosed, the specific breakdown of sources (e.g., percentage of math vs. code vs. general text) is not provided in detail. Furthermore, the actual 800k dataset has not been fully released for public inspection, though the methodology for its creation via rejection sampling and verification is documented in the R1 paper.
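The rejection-sampling-and-verification procedure documented in the R1 paper can be caricatured as: sample several candidate traces per prompt from the teacher, then keep only those whose final answer passes a verifier. The generator and verifier below are toy stand-ins, not DeepSeek's actual pipeline.

```python
import random

def generate_candidates(prompt, n):
    """Stand-in for teacher-model sampling: returns (trace, answer) pairs."""
    return [(f"reasoning for {prompt!r} (sample {i})", random.choice([41, 42]))
            for i in range(n)]

def verify(prompt, answer):
    """Stand-in verifier: checks the answer against a known ground truth."""
    ground_truth = {"what is 6 * 7?": 42}
    return ground_truth.get(prompt) == answer

def rejection_sample(prompts, n=4):
    """Keep only traces whose final answer the verifier accepts."""
    dataset = []
    for prompt in prompts:
        for trace, answer in generate_candidates(prompt, n):
            if verify(prompt, answer):
                dataset.append({"prompt": prompt, "trace": trace, "answer": answer})
    return dataset

kept = rejection_sample(["what is 6 * 7?"])
print(f"{len(kept)} of 4 candidates kept")
```

The surviving traces form the SFT corpus; everything the verifier rejects is discarded, which is why the curated set's exact composition depends heavily on the (undisclosed) verifier details.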
Tokenizer Integrity
The model uses the standard Llama 3 tokenizer with a vocabulary size of 128,256 tokens. The tokenizer is publicly accessible via the Hugging Face repository and is fully compatible with existing Llama 3 infrastructure. Documentation specifies the use of custom special tokens (e.g., <think> and </think>) to facilitate the reasoning process, and these are correctly integrated into the provided chat templates.
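Downstream code typically needs to separate the `<think>` reasoning block from the final answer. A minimal parser is sketched below; the tag names match the model card, but the sample completion is invented.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(completion):
    """Split a completion into (reasoning, final_answer).

    Assumes at most one <think>...</think> block, as produced by the
    R1-style chat template; returns an empty reasoning string if absent.
    """
    match = THINK_RE.search(completion)
    if match is None:
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer

sample = "<think>3 * 14 = 42</think>The answer is 42."
print(split_reasoning(sample))  # ('3 * 14 = 42', 'The answer is 42.')
```

Because the special tokens are registered in the tokenizer rather than spelled out as plain text, serving stacks that detokenize before post-processing should confirm the tags survive round-tripping.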
Parameter Density
The model is a dense architecture with 3.21 billion total parameters. Unlike Mixture-of-Experts (MoE) models where active parameters vary, this model's density is consistent and clearly stated. The architectural breakdown follows the standard Llama 3.2-3B configuration, and the impact of the distillation process on these parameters is well-understood within the context of SFT.
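The 3.21B figure can be reproduced from the public Llama 3.2-3B configuration; the dimensions below (hidden size 3072, 28 layers, 24 query / 8 KV heads, intermediate size 8192, 128,256-token vocabulary, tied embeddings) are assumed from Meta's published config rather than measured here.

```python
# Reconstruct the dense parameter count from the Llama 3.2-3B configuration.
hidden = 3072
layers = 28
heads = 24
kv_heads = 8
head_dim = hidden // heads            # 128
intermediate = 8192
vocab = 128_256

attn = hidden * hidden * 2 + hidden * kv_heads * head_dim * 2  # q, o + k, v
mlp = hidden * intermediate * 3                                # gate, up, down
norms = 2 * hidden                                             # pre-attn, pre-mlp
per_layer = attn + mlp + norms

# Tied input/output embeddings are counted once, plus the final norm.
total = layers * per_layer + vocab * hidden + hidden
print(f"{total / 1e9:.2f}B parameters")  # 3.21B
```

Since SFT-based distillation only updates existing weights, this count is unchanged from the base model.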
Training Compute
While the training compute for the teacher model (DeepSeek-R1) is extensively documented (2.788M H800 GPU hours), the specific compute resources used for the distillation of this 3B variant are not explicitly stated. Estimates suggest the cost is negligible compared to the teacher model, but verifiable hardware hours, carbon footprint, and specific cluster configurations for this variant are missing from official documentation.
Benchmark Reproducibility
DeepSeek provides comprehensive benchmark results (AIME, MATH-500, GPQA) in their technical report and GitHub. However, the evaluation code for the distilled variants is not as thoroughly documented as the main R1 model. While third-party verification on the Open LLM Leaderboard is available, the exact prompts and few-shot examples used for the distilled 3B variant specifically are less transparent than the main model's evaluation suite.
Identity Consistency
The model consistently identifies itself as a distilled version of DeepSeek-R1 based on the Llama architecture. It adheres to the expected reasoning format using <think> tags and does not exhibit identity confusion with other models like GPT-4. Versioning is clear, and the model's behavior aligns with its stated purpose as a reasoning-focused distilled variant.
License Clarity
The model is governed by the Llama 3.2 Community License, as it is a derivative of Meta's Llama 3.2-3B. While DeepSeek's own code is MIT licensed, the weights are subject to Meta's restrictive terms (e.g., the 700M monthly active user limit). There is some potential for confusion between the MIT license of the R1 repository and the Llama community license of the distilled weights, though the model card explicitly clarifies the latter.
Hardware Footprint
Hardware requirements are well-documented by both the provider and the community. VRAM requirements for FP16 (~7GB) and various quantization levels (4-bit ~2.5GB) are clearly established. Documentation includes guidance on using tools like vLLM and llama.cpp, and the scaling of memory with context length (up to 128k) is publicly known.
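The quoted VRAM figures are consistent with a simple weights-only estimate; the gap between these numbers and the quoted ~7 GB / ~2.5 GB totals is accounted for by the KV cache, activations, and quantization metadata, which this sketch deliberately ignores.

```python
def weights_gib(params_b, bits_per_param):
    """Memory for the weights alone, in GiB, ignoring KV cache and activations."""
    return params_b * 1e9 * bits_per_param / 8 / 2**30

PARAMS_B = 3.21                       # dense parameter count in billions

fp16 = weights_gib(PARAMS_B, 16)      # ~6.0 GiB of weights
int4 = weights_gib(PARAMS_B, 4)       # ~1.5 GiB of weights
print(f"FP16 weights: {fp16:.1f} GiB, 4-bit weights: {int4:.1f} GiB")
```

Memory for the KV cache grows linearly with context length on top of this baseline, which is why long-context serving dominates the footprint at the upper end of the supported window.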
Versioning Drift
The model follows a basic versioning scheme, but a detailed changelog or history of iterations for the 3B variant specifically is lacking. While the main DeepSeek-R1 model has seen updates (e.g., the 0528 version), the distilled variants are often released as static checkpoints without a clear roadmap for tracking or addressing behavioral drift over time.