
DeepSeek-R1 8B

Parameters

8B

Context Length

64K

Modality

Text

Architecture

Dense

License

MIT License

Release Date

27 Dec 2024

Knowledge Cutoff

-

Technical Specifications

Attention Structure

Multi-Head Latent Attention

Hidden Dimension Size

4096

Number of Layers

40

Attention Heads

64

Key-Value Heads

64

Activation Function

-

Normalization

-

Position Embedding

RoPE

System Requirements

VRAM requirements for different quantization methods and context sizes

DeepSeek-R1 8B

DeepSeek-R1 is a family of models developed with a focus on enhancing reasoning capabilities in large language models. The foundational DeepSeek-R1-Zero model was trained with large-scale reinforcement learning (RL) without an initial supervised fine-tuning (SFT) phase, demonstrating an emergent capacity for complex reasoning. Building on this, the DeepSeek-R1 model refines these capabilities by incorporating multi-stage training and cold-start data prior to the RL phase, addressing the initial challenges with output readability and coherence.

The 8B variant, exemplified by DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-0528-Qwen3-8B, targets efficient model deployment. These models are dense architectures produced through a distillation process: smaller open-source base models from the Llama and Qwen series are fine-tuned on high-quality reasoning data generated by the larger DeepSeek-R1 model. The objective of this distillation is to transfer the sophisticated reasoning patterns of the larger model into a more compact form, enabling the 8B variant to run effectively in environments with constrained computational resources while maintaining strong performance on tasks that require intricate logical inference.
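As a minimal sketch, assuming the torch and transformers packages are installed and using the deepseek-ai/DeepSeek-R1-Distill-Llama-8B repository id on Hugging Face (the 0528 Qwen3 distill loads the same way), the distilled checkpoint behaves like any other dense causal language model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # GPU-friendly dtype; use float16 if bfloat16 is unsupported
    device_map="auto",           # place weights on the available GPU(s)
)

messages = [{"role": "user", "content": "What is 17 * 24? Think step by step."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.6)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))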

The DeepSeek-R1-0528 update, applied to the 8B distilled model, further refines its reasoning and inference capabilities through computational enhancements and algorithmic optimizations in the post-training phase. This iteration demonstrates improved depth of thought, reduced instances of hallucination, and enhanced support for function calling. The DeepSeek-R1 8B models are applicable across various technical use cases, including advanced AI research, automated code generation, mathematical problem-solving, and general natural language processing tasks that demand robust logical deduction.
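R1-series checkpoints, including the 0528 distill, conventionally emit their chain of thought between <think> and </think> tags before the final answer. A small sketch for separating the two, assuming that output format:

import re

def split_reasoning(completion: str) -> tuple[str, str]:
    # Returns (reasoning trace, final answer); assumes a single <think>...</think> block.
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        return "", completion.strip()  # no reasoning block found
    return match.group(1).strip(), completion[match.end():].strip()

reasoning, answer = split_reasoning(
    "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think>The answer is 408."
)
print(answer)  # -> The answer is 408.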

About DeepSeek-R1

DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.



Evaluation Benchmarks

Rankings are for Local LLMs.

No evaluation benchmarks are available for DeepSeek-R1 8B.

Rankings

Overall Rank

-

Coding Rank

-

GPU Requirements

The VRAM required and the recommended GPUs depend on the chosen weight quantization method and the context size (from 1k up to 63k tokens).
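As a rough, illustrative estimate rather than a substitute for the calculator, VRAM is approximately the weight memory (parameter count times bytes per parameter) plus a KV cache that grows linearly with context length, plus some runtime overhead. The sketch below plugs in the layer, head, and hidden-dimension figures from the specification table above; the byte widths and the overhead term are assumptions.

def estimate_vram_gb(params_b=8.0, bytes_per_weight=2.0,       # 2.0 = FP16/BF16, ~0.5 = 4-bit
                     n_layers=40, n_kv_heads=64, head_dim=64,  # head_dim = 4096 / 64 heads
                     context_tokens=1024, kv_bytes=2, overhead_gb=1.0):
    # Weights: parameter count (billions) * bytes per parameter.
    weights_gb = params_b * bytes_per_weight
    # KV cache: 2 tensors (K and V) per layer, per token, per KV head.
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * context_tokens * kv_bytes / 1e9
    return weights_gb + kv_gb + overhead_gb

print(f"FP16,  1k context:  ~{estimate_vram_gb():.1f} GB")
print(f"4-bit, 32k context: ~{estimate_vram_gb(bytes_per_weight=0.5, context_tokens=32_768):.1f} GB")

Actual usage varies with the inference runtime, quantization format, and batch size.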