DeepSeek-R1 1.5B

Parameters

1.5B

Context Length

131K

Modality

Text

Architecture

Dense

License

MIT

Release Date

20 Jan 2025

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

12

Key-Value Heads

2

Attention Head Dimension

128

Position Embedding

RoPE

RoPE Theta

10,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

1,536

Number of Layers

28

FFN Intermediate Size (Dense)

8,960

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

151,936

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE; Hidden: 1,536; Context: 131K; Vocab: 151.9k) → 28 layers × [RMSNorm (Pre-Attention) → Grouped-Query Attention (12 Q / 2 KV heads, head dim: 128) → residual add → RMSNorm (Pre-FFN) → Feed-Forward Network (SwiGLU, intermediate: 8,960) → residual add] → Final RMSNorm → Output Logits

DeepSeek-R1 1.5B

DeepSeek-R1 is a family of reasoning-focused large language models developed by DeepSeek AI. DeepSeek-R1-Distill-Qwen-1.5B is the compact member of this family: it distills the reasoning capabilities of the larger DeepSeek-R1 model into a more parameter-efficient architecture by fine-tuning on reasoning data that DeepSeek-R1 generated. The goal is to provide language understanding and multi-step reasoning in a model small enough to deploy where compute is constrained.

DeepSeek-R1-Distill-Qwen-1.5B is a Transformer model built on the Qwen2.5-Math-1.5B base. It uses Rotary Position Embedding (RoPE) to encode token positions, the SwiGLU activation function in its feed-forward layers, and RMSNorm for stable training. Although the full DeepSeek-R1 model uses a Mixture-of-Experts (MoE) design, the 1.5B distilled variant is dense. Its attention mechanism is Grouped Query Attention (GQA), in which several query heads share each key and value projection; this shrinks the key-value cache and reduces memory bandwidth requirements during inference.
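To make the GQA memory argument concrete, the following is a minimal sketch that compares the key-value cache size per token under GQA with a hypothetical multi-head layout in which every query head has its own key-value head. The values of 12 query heads and 2 key-value heads are assumptions taken from the Qwen2.5-Math-1.5B configuration, and FP16 storage (2 bytes per value) is assumed; the function name is illustrative only.

```python
# Assumed configuration values; the 2 KV heads and FP16 cache are assumptions.
NUM_LAYERS = 28
NUM_QUERY_HEADS = 12
NUM_KV_HEADS = 2
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # FP16 / BF16


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache stored for one token across all layers."""
    # Factor 2: one key vector and one value vector per KV head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


gqa_bytes = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
mha_bytes = kv_cache_bytes_per_token(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, BYTES_PER_VALUE)

context_tokens = 131_072  # full context window
gib = 1024 ** 3

print(f"GQA: {gqa_bytes} bytes/token, {gqa_bytes * context_tokens / gib:.1f} GiB at full context")
print(f"MHA: {mha_bytes} bytes/token, {mha_bytes * context_tokens / gib:.1f} GiB at full context")
print(f"Reduction factor: {mha_bytes // gqa_bytes}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to key-value heads, which is why the saving matters most at long context lengths.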

The model targets tasks that require logical inference and step-by-step problem-solving, such as mathematical problem-solving, code comprehension, and general text-based reasoning. Its small parameter count makes it suitable for standard consumer-grade hardware or edge devices, so it can run locally without extensive computational infrastructure. This broadens access for researchers and developers who want to integrate reasoning capabilities into resource-sensitive applications.

About DeepSeek-R1

DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.

Evaluation Benchmarks

No evaluation benchmarks for DeepSeek-R1 1.5B available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B

65 / 100

DeepSeek-R1 1.5B Model Integrity Report

Audit Note

The model exhibits high transparency regarding its architectural foundations and licensing, providing clear paths for local deployment and commercial use. However, significant opacity remains concerning the specific composition of its training data and the precise compute resources utilized for its distillation. While technical specifications are robust, the model's occasional identity confusion and the lack of a fully reproducible evaluation suite for this specific variant are notable weaknesses.

Upstream

20.5 / 30

Architectural Provenance

8.0 / 10

The model's base architecture is explicitly identified as Qwen2.5-Math-1.5B. Documentation in the DeepSeek-R1 technical report and GitHub repository provides a detailed breakdown of the distillation process, including the use of 800,000 reasoning traces generated by the larger DeepSeek-R1 model. Technical specifications such as the use of RoPE, SwiGLU, RMSNorm, and Grouped Query Attention (GQA) are clearly documented. However, the exact hyperparameters for the distillation fine-tuning (e.g., learning rate schedules, specific weight initialization beyond the base model) are less granular than the pre-training details of the parent models.

Dataset Composition

4.0 / 10

While the model discloses the use of 800,000 reasoning samples for distillation, the specific source and composition of the underlying training data for the Qwen2.5-Math-1.5B base are only generally described as 'large-scale' or 'diverse.' There is no public breakdown of the data by category (e.g., % code, % web, % books) or specific disclosure of the filtering and cleaning methodologies used for the distillation set. The 'cold-start' data used for the teacher model is mentioned but not publicly accessible for verification.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly available via the Hugging Face repository and GitHub. It uses the Qwen2 tokenizer with a vocabulary of 151,643 regular tokens; the 151,936 figure listed under Technical Specifications is the padded size of the embedding table. The tokenization approach (byte-level BPE) is well-documented, and the tokenizer files are directly inspectable, allowing for verification of language support and token normalization. The alignment between the tokenizer and the claimed reasoning capabilities is consistent with the Qwen2.5-Math base.

Model

23.0 / 40

Parameter Density

9.0 / 10

The model's parameter count is precisely stated as 1.54B total, with 1.31B non-embedding parameters. Unlike the larger MoE variants in the DeepSeek-R1 family, this variant is explicitly documented as a dense architecture, meaning all parameters are active during inference. The architectural breakdown (28 layers, 12 attention heads) is clearly provided in the model card and technical documentation.
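The stated totals can be roughly reproduced from the architectural dimensions. The sketch below is a back-of-the-envelope count, not an official breakdown. It assumes a hidden size of 1,536, 2 key-value heads, biases on the query, key, and value projections only, and an output head tied to the input embedding, all taken from the Qwen2.5-Math-1.5B base configuration rather than from the text above.

```python
# Assumed dimensions (hidden size, KV heads, bias layout, and tied embeddings
# are assumptions based on the Qwen2.5-Math-1.5B base configuration).
HIDDEN = 1536
NUM_LAYERS = 28
Q_HEADS = 12
KV_HEADS = 2
HEAD_DIM = 128
FFN_INTERMEDIATE = 8960
VOCAB_ROWS = 151_936


def attention_params(hidden, q_heads, kv_heads, head_dim):
    q = hidden * q_heads * head_dim + q_heads * head_dim    # weight + bias
    k = hidden * kv_heads * head_dim + kv_heads * head_dim  # weight + bias
    v = k                                                   # same shape as k
    o = q_heads * head_dim * hidden                         # no bias assumed
    return q + k + v + o


def ffn_params(hidden, intermediate):
    # SwiGLU uses three matrices: gate, up, and down projections.
    return 3 * hidden * intermediate


def layer_params():
    norms = 2 * HIDDEN  # pre-attention and pre-FFN RMSNorm weights
    return (attention_params(HIDDEN, Q_HEADS, KV_HEADS, HEAD_DIM)
            + ffn_params(HIDDEN, FFN_INTERMEDIATE)
            + norms)


non_embedding = NUM_LAYERS * layer_params() + HIDDEN  # + final RMSNorm
embedding = VOCAB_ROWS * HIDDEN                       # counted once (tied head)
total = non_embedding + embedding

print(f"non-embedding: {non_embedding / 1e9:.2f}B")
print(f"total:         {total / 1e9:.2f}B")
```

Under these assumptions the count comes to about 1.31B non-embedding and 1.54B total parameters, matching the stated figures. If the output head is stored as a separate matrix instead of being tied, the total grows by one more embedding-sized block.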

Training Compute

3.0 / 10

Compute disclosure is primarily focused on the parent DeepSeek-V3 and R1 models (e.g., 2,048 H800 GPUs, 2.8M GPU hours). Specific compute metrics for the distillation of the 1.5B variant—such as the exact hardware hours, energy consumption, or carbon footprint for this specific fine-tuning run—are not explicitly provided. The documentation relies on the 'negligible cost' of distillation relative to the teacher model without providing verifiable data points for the 1.5B variant itself.

Benchmark Reproducibility

5.0 / 10

DeepSeek provides comprehensive benchmark results (AIME 2024, MATH-500, etc.) and specifies evaluation settings like temperature (0.6) and top-p (0.95). However, the exact evaluation code and the specific prompts/few-shot examples used for every benchmark are not fully public in a single reproducible repository. While some evaluation scripts are on GitHub, third-party verification has noted discrepancies in results, and the lack of a complete, 'one-click' reproduction suite for the distilled variants limits the score.
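For readers unfamiliar with the two sampling settings, here is a minimal, library-free sketch of how temperature scaling and top-p (nucleus) filtering shape the next-token distribution. It illustrates the general technique only and is not DeepSeek's evaluation code; the function name and the toy logits are hypothetical.

```python
import math


def nucleus_distribution(logits, temperature=0.6, top_p=0.95):
    """Return {token_index: probability} after temperature and top-p filtering."""
    # Temperature below 1 sharpens the distribution before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    norm = sum(exps)
    probs = [e / norm for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens so the result sums to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -5.0]  # hypothetical logits for four tokens
dist = nucleus_distribution(toy_logits)
print(dist)
```

A sampler would then draw the next token from the returned distribution, for example with `random.choices(list(dist), weights=list(dist.values()))`.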

Identity Consistency

6.0 / 10

The model generally identifies as a DeepSeek-developed assistant. However, there are documented instances in community testing where the model (and its parent) exhibits identity confusion, occasionally claiming to be developed by OpenAI or Anthropic. This is likely due to the presence of distilled data from those models in the training set. While it has version awareness in its system prompts, these 'hallucinated' identities represent a significant transparency gap in model self-identification.

Downstream

21.5 / 30

License Clarity

9.0 / 10

The model weights and code are released under the highly permissive MIT License, which is explicitly stated in the GitHub repository and Hugging Face model card. The license terms clearly allow for commercial use, modification, and further distillation. The relationship with the upstream Apache 2.0 license of the Qwen base is also noted, ensuring no legal ambiguity for downstream users.

Hardware Footprint

7.5 / 10

Hardware requirements are well-documented by both the provider and the community. VRAM requirements for FP16 (~4GB) and various quantization levels (e.g., 4-bit requiring ~2GB) are widely available. The model card specifies the 128K context length, and third-party tools like Ollama and Unsloth provide verified memory scaling data. The only gap is the lack of official, detailed accuracy-tradeoff curves for different quantization levels provided directly by DeepSeek.
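As a sanity check on these figures, the memory taken by the weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes the 1.54B total parameter count and idealized storage with no quantization metadata; the precision labels are illustrative.

```python
TOTAL_PARAMS = 1.54e9  # total parameter count stated for the model

# Bits per weight for a few common storage formats (idealized, no metadata).
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "4-bit": 4}


def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights only, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_memory_gb(TOTAL_PARAMS, bits):.2f} GB")
```

These weight-only estimates come out below the roughly 4GB and 2GB quoted above. The gap is plausibly runtime overhead such as the key-value cache, activations, and framework buffers, which this sketch does not model.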

Versioning Drift

5.0 / 10

DeepSeek maintains a basic changelog and uses date-based versioning (e.g., 0120, 0528) for its API and some model releases. However, the distilled weights on Hugging Face do not follow a strict semantic versioning system, and silent updates or 'weight drifting' without detailed delta-documentation have been reported. There is no formal deprecation policy or public roadmap for version transitions for the 1.5B variant.
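The category subtotals and the overall score in this report follow from simple addition of the sub-scores. A short check, using only the numbers listed above (the dictionary layout is just one convenient way to hold them):

```python
# Sub-scores copied from the report, each out of 10.
scores = {
    "Upstream": {
        "Architectural Provenance": 8.0,
        "Dataset Composition": 4.0,
        "Tokenizer Integrity": 8.5,
    },
    "Model": {
        "Parameter Density": 9.0,
        "Training Compute": 3.0,
        "Benchmark Reproducibility": 5.0,
        "Identity Consistency": 6.0,
    },
    "Downstream": {
        "License Clarity": 9.0,
        "Hardware Footprint": 7.5,
        "Versioning Drift": 5.0,
    },
}

subtotals = {category: sum(items.values()) for category, items in scores.items()}
total_score = sum(subtotals.values())

for category, subtotal in subtotals.items():
    print(f"{category}: {subtotal} / {10 * len(scores[category])}")
print(f"Total: {total_score} / 100")
```

The subtotals come to 20.5, 23.0, and 21.5, which sum to the reported 65 out of 100.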
