
DeepSeek-R1 8B

Parameters

8B

Context Length

64K

Modality

Text

Architecture

Dense

License

MIT License

Release Date

20 Jan 2025

Knowledge Cutoff

-

Technical Specifications

Attention Structure

Grouped-Query Attention (from the Llama-3.1-8B base)

Hidden Dimension Size

4096

Number of Layers

32

Attention Heads

32

Key-Value Heads

8

Activation Function

SwiGLU

Normalization

RMSNorm

Position Embedding

RoPE

DeepSeek-R1 8B

DeepSeek-R1 is a family of models developed to strengthen reasoning capabilities in large language models. The foundational DeepSeek-R1-Zero model was trained with large-scale reinforcement learning (RL) and no initial supervised fine-tuning (SFT) phase, demonstrating that complex reasoning can emerge from RL alone. Building on this, DeepSeek-R1 refines those capabilities by adding multi-stage training and cold-start data before the RL phase, addressing the readability and coherence issues of the RL-only model.

The 8B variant, exemplified by DeepSeek-R1-Distill-Llama-8B and the later DeepSeek-R1-0528-Qwen3-8B, targets efficient deployment. These are dense models produced by distillation: smaller open-source base models from the Llama and Qwen series are fine-tuned on high-quality reasoning data generated by the larger DeepSeek-R1 model. The goal of this distillation is to transfer the larger model's reasoning patterns into a compact form, so the 8B variant can run in environments with constrained computational resources while retaining strong performance on tasks that require intricate logical inference.
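A minimal sketch of how such a distillation sample might be assembled, pairing a prompt with the teacher's reasoning trace and final answer using R1-style `<think>` delimiters. The function name and field layout are illustrative assumptions, not DeepSeek's actual data schema:

```python
# Hypothetical sketch of packaging a teacher-generated reasoning trace
# into one SFT training string. Illustrative only; DeepSeek's real
# distillation data format is not publicly documented in this detail.

def build_sft_sample(prompt: str, reasoning: str, answer: str) -> str:
    """Concatenate prompt, <think>-delimited reasoning, and final answer."""
    return f"{prompt}\n<think>\n{reasoning}\n</think>\n{answer}"

sample = build_sft_sample(
    "What is 12 * 7?",
    "12 * 7 = (10 + 2) * 7 = 70 + 14 = 84.",
    "84",
)
```

A student model fine-tuned on strings like this learns to emit the reasoning trace before the answer, which is the behavior the distilled 8B variants exhibit.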

The DeepSeek-R1-0528 update, applied to the 8B distilled model, further refines its reasoning and inference capabilities through computational enhancements and algorithmic optimizations in the post-training phase. This iteration demonstrates improved depth of thought, reduced instances of hallucination, and enhanced support for function calling. The DeepSeek-R1 8B models are applicable across various technical use cases, including advanced AI research, automated code generation, mathematical problem-solving, and general natural language processing tasks that demand robust logical deduction.

About DeepSeek-R1

DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.



Evaluation Benchmarks

No evaluation benchmarks are available for DeepSeek-R1 8B.

Rankings

Overall Rank

-

Coding Rank

-

Model Transparency

Total Score

B

66 / 100

DeepSeek-R1 8B Transparency Report

Audit Note

DeepSeek-R1 8B exhibits strong transparency regarding its architecture and licensing, benefiting from its open-weight nature and the use of a well-known base model. However, it suffers from significant opacity in its training data composition and the specific compute resources utilized for its distillation. While hardware requirements and identity are clearly defined, the reproducibility of its reasoning benchmarks remains a point of skepticism due to limited disclosure of evaluation artifacts.

Upstream

20.0 / 30

Architectural Provenance

7.5 / 10

The model is explicitly identified as a distilled version of DeepSeek-R1 using the Llama-3.1-8B-Instruct base architecture. The training methodology is documented in the DeepSeek-R1 technical report, detailing a multi-stage pipeline involving cold-start data, reinforcement learning (GRPO), and rejection sampling. While the distillation process (SFT on 800k samples) is described, the specific architectural modifications beyond the standard Llama-3.1 dense transformer structure are minimal, though the integration of the reasoning 'thinking' phase is well-documented.

Dataset Composition

4.0 / 10

DeepSeek discloses that the 8B variant was fine-tuned on 800,000 samples generated by the larger DeepSeek-R1 model. However, the composition of the original pre-training data for the underlying Llama-3.1 base is inherited from Meta's disclosures, and the specific breakdown of the 800k reasoning samples (e.g., math vs. code vs. logic proportions) is not provided in detail. The 'cold-start' data used for the teacher model is described as 'thousands' of samples but lacks a public source or comprehensive breakdown.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly available via the Hugging Face repository and is based on the Llama-3.1 tokenizer with a vocabulary size of 128,256 tokens. Documentation confirms the use of Byte-Pair Encoding (BPE). While there were initial community reports of minor config.json mismatches regarding embedding sizes in some distilled variants, the functional tokenizer is verifiable and matches the claimed language support and reasoning token (<think>) integration.
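Because completions wrap the reasoning trace in `<think>` tags, downstream code typically separates the trace from the final answer before displaying or scoring it. A minimal sketch, assuming a single `</think>` close tag terminates the trace (helper name is ours):

```python
def split_reasoning(output: str) -> tuple[str, str]:
    """Split an R1-style completion into (reasoning, answer).

    If no </think> tag is present, treat the whole text as the answer.
    """
    tag = "</think>"
    head, sep, tail = output.partition(tag)
    if not sep:
        return "", output.strip()
    # Drop the opening <think> tag if present, keep the trace text.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()

reasoning, answer = split_reasoning(
    "<think>\nFirst, 2 + 2 = 4.\n</think>\nThe answer is 4."
)
```

Here `reasoning` holds the chain-of-thought text and `answer` the user-facing response.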

Model

26.0 / 40

Parameter Density

9.0 / 10

The model is a dense 8B parameter architecture, unlike its MoE teacher. The parameter count is clearly stated and consistent across official documentation, Hugging Face, and third-party platforms like Ollama. There is no ambiguity regarding active vs. total parameters as it does not use a sparse Mixture-of-Experts design.

Training Compute

3.0 / 10

While DeepSeek provides high-level compute figures for the teacher model (2,048 NVIDIA H800 GPUs for two months), specific compute metrics for the distillation of the 8B variant itself are vague. There is no detailed disclosure of the GPU hours, carbon footprint, or specific hardware used for the 800k-sample SFT phase of this specific 8B model.

Benchmark Reproducibility

5.0 / 10

Evaluation results for AIME, MATH-500, and LiveCodeBench are provided in the technical report. However, the exact evaluation code and full prompt sets used for the distilled variants are not as thoroughly documented as the main R1 model. Third-party verification has shown significant variance in results, and the reliance on synthetic data for distillation introduces complexities in verifying the 'cleanliness' of the evaluation process.

Identity Consistency

9.0 / 10

The model consistently identifies itself as a DeepSeek-distilled version of Llama. It maintains a clear versioning identity (e.g., the 0528 update) and does not exhibit the identity confusion common in some fine-tuned models that claim to be GPT-4 or other competitors. It is transparent about its nature as a reasoning-focused model.

Downstream

20.0 / 30

License Clarity

7.0 / 10

The model weights are released under the MIT License, which is highly permissive. However, because it is built on Llama-3.1-8B, users must also adhere to the Meta Llama 3.1 Community License Agreement. This dual-licensing creates a slight complexity for commercial users, although DeepSeek's own contributions are clearly MIT-licensed.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented by both the provider and the community. VRAM requirements for FP16 (~16-20GB) and various quantization levels (4-bit, 8-bit) are widely available. Documentation from partners like NVIDIA and community tools like Ollama provide clear guidance on running the model on consumer hardware.
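The FP16 figure can be reproduced with back-of-envelope arithmetic: weight memory is roughly parameter count times bytes per parameter, before KV cache and activation overhead. A hedged helper (decimal GB, weights only; function name is ours):

```python
def estimate_weight_vram_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Rough VRAM for model weights alone, in decimal GB.

    Excludes KV cache, activations, and framework overhead, which is
    why real-world FP16 usage for an 8B model lands nearer 16-20 GB.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

fp16_gb = estimate_weight_vram_gb(8, 16)   # 8B params at 16 bits
int4_gb = estimate_weight_vram_gb(8, 4)    # 4-bit quantized weights
```

This gives 16.0 GB for FP16 weights and 4.0 GB at 4-bit, consistent with the community figures cited above once runtime overhead is added.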

Versioning Drift

5.0 / 10

DeepSeek has released versioned updates (e.g., the 0528 update), but the changelogs are relatively high-level and lack granular detail on specific weight changes or performance drift across all benchmarks. While semantic versioning is partially applied, the history of changes between the initial January release and subsequent updates is not comprehensively archived in a public changelog.
