
DeepSeek-R1 3B

Parameters

3B

Context Length

32,768 tokens

Modality

Text

Architecture

Dense

License

Llama 3.2 Community License

Release Date

27 Dec 2024

Knowledge Cutoff

-

Technical Specifications

Attention Structure

Grouped-Query Attention (GQA)

Hidden Dimension Size

3072

Number of Layers

28

Attention Heads

24

Key-Value Heads

8

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

RoPE (Rotary Position Embedding)

DeepSeek-R1 3B

DeepSeek-R1 3B is a compact, dense language model variant developed through a distillation process from the larger DeepSeek-R1 architecture. This model is specifically built upon the Llama 3.2-3B foundational architecture, aiming to retain robust reasoning capabilities while significantly reducing computational resource requirements. Its design integrates a specialized chat templating system, ensuring compatibility with Llama 3 formatting, alongside custom tokenization to facilitate structured output and enhanced reasoning pathways.

The development methodology for DeepSeek-R1 3B incorporates several technical optimizations crucial for efficient training and inference: LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, Flash Attention for accelerated self-attention computation, and gradient checkpointing to bound memory consumption during training. Together, these optimizations keep training memory and inference latency low, making the model suitable for deployment in environments where computational resources are constrained.
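To illustrate why LoRA matters at this scale, the parameter savings for a single adapted projection can be computed directly. This is a sketch only: the dimensions assume the Llama 3.2-3B hidden size of 3072, and the rank and choice of adapted matrix are illustrative, not DeepSeek's documented settings.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one LoRA-adapted weight matrix.

    LoRA freezes the base weight W (d_out x d_in) and learns a low-rank
    update B @ A, where A is (r x d_in) and B is (d_out x r), so only
    r * (d_in + d_out) parameters receive gradients.
    """
    return r * (d_in + d_out)

full = 3072 * 3072                               # full fine-tuning of one projection
lora = lora_trainable_params(3072, 3072, r=8)    # hypothetical rank-8 adapter
print(f"{lora} trainable vs {full} full ({lora / full:.2%})")
```

With these illustrative numbers, a rank-8 adapter trains well under one percent of the parameters of a single attention projection, which is the source of LoRA's memory savings during fine-tuning.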

The primary use cases for DeepSeek-R1 3B center on applications that demand structured reasoning and general language understanding, such as mathematical problem-solving or comparative analysis tasks. Its distilled nature allows it to deliver performance suitable for practical applications requiring a balance of reasoning fidelity and operational efficiency.

About DeepSeek-R1

DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.



Evaluation Benchmarks

No evaluation benchmarks for DeepSeek-R1 3B available.

Rankings

Overall Rank

-

Coding Rank

-

Model Transparency

Total Score

B

69 / 100

DeepSeek-R1 3B Transparency Report


Audit Note

DeepSeek-R1 3B exhibits high transparency regarding its architectural origins and parameter density, benefiting from the extensive documentation of its teacher model. However, it suffers from significant gaps in training compute disclosure and the lack of a publicly accessible training dataset. While the model is highly verifiable through third-party benchmarks and open weights, the reliance on a complex licensing structure for its base architecture introduces some ambiguity for commercial users.

Upstream

22.0 / 30

Architectural Provenance

8.0 / 10

The model's base architecture is explicitly identified as Llama-3.2-3B. The DeepSeek-R1 technical report and associated GitHub documentation provide a clear methodology for the distillation process: supervised fine-tuning (SFT) on 800,000 reasoning traces generated by the larger DeepSeek-R1 (671B) teacher model. The base model is a standard transformer, and the specific fine-tuning hyperparameters (learning rate 2e-5, batch size 16, 1 epoch) are publicly documented in the model card and repository.

Dataset Composition

5.0 / 10

The training data is described as a curated set of 800,000 samples consisting of reasoning traces (Chain-of-Thought) and non-reasoning data generated by DeepSeek-R1. While the quantity and general nature of the data are disclosed, the specific breakdown of sources (e.g., percentage of math vs. code vs. general text) is not provided in detail. Furthermore, the actual 800k dataset has not been fully released for public inspection, though the methodology for its creation via rejection sampling and verification is documented in the R1 paper.
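Although the 800k dataset itself has not been released, the documented description implies each record pairs a prompt with a response that embeds a chain of thought. A hypothetical JSONL record might look like the following; the field names and content are illustrative only, not taken from the actual dataset.

```python
import json

# Hypothetical shape of one distillation SFT record; the real 800k dataset
# is unreleased, so the field names and content here are illustrative only.
sample = {
    "prompt": "What is 17 * 24?",
    "response": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think>"
                "The answer is 408.",
}

line = json.dumps(sample)        # one line of a JSONL training file
restored = json.loads(line)
assert restored == sample        # round-trips losslessly
```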

Tokenizer Integrity

9.0 / 10

The model uses the standard Llama 3 tokenizer with a vocabulary size of 128,256 tokens. The tokenizer is publicly accessible via the Hugging Face repository and is fully compatible with existing Llama 3 infrastructure. Documentation specifies the use of custom special tokens (e.g., <think> and </think>) to facilitate the reasoning process, and these are correctly integrated into the provided chat templates.
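Because the <think> and </think> tokens delimit the reasoning span, downstream code can separate the chain of thought from the final answer with a simple parse. This is a minimal sketch; real outputs may omit the tags, which the fallback branch handles.

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split model output into (reasoning, answer) using <think> tags.

    Assumes at most one <think>...</think> block, per the DeepSeek-R1
    output convention; returns empty reasoning if no tags are present.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

reasoning, answer = split_reasoning(
    "<think>2 + 2 = 4</think>The answer is 4."
)
# reasoning == "2 + 2 = 4", answer == "The answer is 4."
```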

Model

27.0 / 40

Parameter Density

9.0 / 10

The model is a dense architecture with 3.21 billion total parameters. Unlike Mixture-of-Experts (MoE) models where active parameters vary, this model's density is consistent and clearly stated. The architectural breakdown follows the standard Llama 3.2-3B configuration, and the impact of the distillation process on these parameters is well-understood within the context of SFT.

Training Compute

3.0 / 10

While the training compute for the teacher model (DeepSeek-R1) is extensively documented (2.788M H800 GPU hours), the specific compute resources used for the distillation of this 3B variant are not explicitly stated. Estimates suggest the cost is negligible compared to the teacher model, but verifiable hardware hours, carbon footprint, and specific cluster configurations for this variant are missing from official documentation.

Benchmark Reproducibility

6.0 / 10

DeepSeek provides comprehensive benchmark results (AIME, MATH-500, GPQA) in their technical report and GitHub. However, the evaluation code for the distilled variants is not as thoroughly documented as the main R1 model. While third-party verification on the Open LLM Leaderboard is available, the exact prompts and few-shot examples used for the distilled 3B variant specifically are less transparent than the main model's evaluation suite.

Identity Consistency

9.0 / 10

The model consistently identifies itself as a distilled version of DeepSeek-R1 based on the Llama architecture. It adheres to the expected reasoning format using <think> tags and does not exhibit identity confusion with other models like GPT-4. Versioning is clear, and the model's behavior aligns with its stated purpose as a reasoning-focused distilled variant.

Downstream

20.0 / 30

License Clarity

7.0 / 10

The model is governed by the Llama 3.2 Community License, as it is a derivative of Meta's Llama 3.2-3B. While DeepSeek's own code is MIT licensed, the weights are subject to Meta's restrictive terms (e.g., the 700M monthly active user limit). There is some potential for confusion between the MIT license of the R1 repository and the Llama community license of the distilled weights, though the model card explicitly clarifies the latter.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented by both the provider and the community. VRAM requirements for FP16 (~7GB) and various quantization levels (4-bit ~2.5GB) are clearly established. Documentation includes guidance on using tools like vLLM and llama.cpp, and the scaling of memory with context length (up to 128k) is publicly known.
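The figures above can be reproduced with a back-of-the-envelope estimate covering weights plus KV cache. This is a sketch, not an official calculator: it assumes FP16, standard Llama 3.2-3B attention dimensions (28 layers, 8 KV heads, head dimension 128), and a hypothetical ~20% runtime overhead factor.

```python
def estimate_vram_gib(params_b: float = 3.21, bytes_per_param: int = 2,
                      n_layers: int = 28, n_kv_heads: int = 8,
                      head_dim: int = 128, context_len: int = 32768,
                      batch: int = 1, overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GiB: model weights plus KV cache."""
    weights = params_b * 1e9 * bytes_per_param
    # KV cache stores K and V per layer: 2 * layers * kv_heads * head_dim
    # values per token, each bytes_per_param bytes wide.
    kv_cache = (2 * n_layers * n_kv_heads * head_dim
                * context_len * batch * bytes_per_param)
    return (weights + kv_cache) * overhead / 1024**3

# At short context, FP16 weights dominate and the estimate lands near
# the documented ~7 GB figure; longer contexts add KV-cache growth.
print(f"{estimate_vram_gib(context_len=1024):.1f} GiB at 1k context")
```

Swapping `bytes_per_param` approximates quantized deployments (e.g. 0.5 for 4-bit weights), which is why the 4-bit footprint drops toward the ~2.5 GB figure cited above.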

Versioning Drift

5.0 / 10

The model follows a basic versioning scheme, but a detailed changelog or history of iterations for the 3B variant specifically is lacking. While the main DeepSeek-R1 model has seen updates (e.g., the 0528 version), the distilled variants are often released as static checkpoints without a clear roadmap for tracking or addressing behavioral drift over time.
