ApX logoApX logo

DeepSeek-R1 3B

Parameters

3B

Context Length

33K

Modality

Text

Architecture

Dense

License

Llama 3.2 Community License

Release Date

27 Dec 2024

Knowledge Cutoff

-

System Requirements

VRAM requirements for different quantization methods and context sizes

1,024 tokens

8.65 GB VRAM

Consumer

1x RTX 4090

24GB VRAM

Datacenter

1x NVIDIA A100

80GB VRAM

Apple Silicon

1x Apple M3 Max

128GB VRAM

32,768 tokens

34.86 GB VRAM

Consumer

2x RTX 4090

24GB VRAM

Datacenter

1x NVIDIA A100

80GB VRAM

Apple Silicon

1x Apple M3 Max

128GB VRAM

Architecture Diagram

Input TokensToken EmbeddingPosition: RoPEHidden: 3.1k · Context: 33K · Vocab: 152.1kx 32 layersRMSNormPre-AttentionMulti-Layer Attention48Q / 48KV heads · SW: 4.1kHead dim: 64+RMSNormPre-FFNFeed-Forward NetworkSwiGLUIntermediate: 18.9k+Final RMSNormOutput Logits

Evaluation Benchmarks

No evaluation benchmarks for DeepSeek-R1 3B available.

Rankings

Overall Rank

-

Coding Rank

-

About DeepSeek-R1 3B

DeepSeek-R1 3B is a compact, dense language model variant developed through a distillation process from the larger DeepSeek-R1 architecture. This model is specifically built upon the Llama 3.2-3B foundational architecture, aiming to retain robust reasoning capabilities while significantly reducing computational resource requirements. Its design integrates a specialized chat templating system, ensuring compatibility with Llama 3 formatting, alongside custom tokenization to facilitate structured output and enhanced reasoning pathways.

The development methodology for DeepSeek-R1 3B incorporates several technical optimizations crucial for efficient training and inference. These include the application of LoRA (Low-Rank Adaptation) for fine-tuning, leveraging Flash Attention for accelerated self-attention computations, and utilizing gradient checkpointing to manage memory consumption during training. This architectural synthesis enables the model to process information with efficiency, making it suitable for deployment in environments where computational resources are a constraint.

The primary use cases for DeepSeek-R1 3B center on applications that demand structured reasoning and general language understanding, such as mathematical problem-solving or comparative analysis tasks. Its distilled nature allows it to deliver performance suitable for practical applications requiring a balance of reasoning fidelity and operational efficiency.

Technical Specifications

Attention

Attention Structure

Multi-Layer Attention

Attention Heads

48

Key-Value Heads

48

Attention Head Dimension

-

Position Embedding

ROPE

RoPE Theta

10,000

Sliding Window Attention

Yes

Sliding Window Size

4,096

Normalization

RMS Normalization

Activation Function

SwigLU

Dimensions

Hidden Dimension Size

3,072

Number of Layers

32

FFN Intermediate Size (Dense)

18,944

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

152,064

Model Integrity

Total Score

B

69 / 100

About DeepSeek-R1

DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.


Other DeepSeek-R1 Models