
DeepSeek-R1 7B

Parameters: 7B

Context Length: 131,072 tokens

Modality: Text

Architecture: Dense

License: Apache 2.0

Release Date: 27 Dec 2024

Knowledge Cutoff: -

Technical Specifications

Attention Structure: Multi-Head Latent Attention

Hidden Dimension Size: 4096

Number of Layers: 32

Attention Heads: 64

Key-Value Heads: 64

Activation Function: -

Normalization: RMSNorm (see the sketch below)

Position Embedding: RoPE
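
As a concrete illustration of the normalization entry above, here is a minimal RMSNorm sketch in PyTorch, assuming the 4096-dimensional hidden size from the table; the epsilon value is a typical default and is not taken from the model's published configuration.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm as used in Qwen/DeepSeek-style decoder blocks (sketch)."""
    def __init__(self, hidden_size: int = 4096, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square over the hidden dimension,
        # then apply a learned per-channel scale (no bias, no mean-centering).
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

# Example: one token embedding with the 4096-dim hidden size listed above.
x = torch.randn(1, 1, 4096)
print(RMSNorm()(x).shape)  # torch.Size([1, 1, 4096])
```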

System Requirements

VRAM requirements for different quantization methods and context sizes
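
As a rough guide to how such figures arise, the sketch below estimates VRAM as model weights plus KV cache plus a runtime allowance, using the layer and head counts from the specification table above. The quantization bit-widths, FP16 KV cache, and 15% overhead factor are illustrative assumptions, not the calculator's exact formula.

```python
def estimate_vram_gb(
    params_b: float = 7.0,        # model size in billions of parameters
    bits_per_weight: int = 4,     # e.g. 16 (FP16), 8 (Q8), 4 (Q4)
    context_tokens: int = 1024,   # prompt + generation budget
    n_layers: int = 32,           # from the specification table
    n_kv_heads: int = 64,         # from the specification table
    head_dim: int = 64,           # hidden size 4096 / 64 attention heads
    kv_bits: int = 16,            # KV cache commonly kept in FP16 (assumption)
    overhead: float = 1.15,       # rough allowance for activations and buffers
) -> float:
    """Back-of-the-envelope VRAM estimate: weights + KV cache + overhead."""
    weight_bytes = params_b * 1e9 * bits_per_weight / 8
    # KV cache: 2 tensors (K and V) per layer, per token, per KV head.
    kv_bytes = 2 * n_layers * context_tokens * n_kv_heads * head_dim * kv_bits / 8
    return (weight_bytes + kv_bytes) * overhead / 1024**3

for bits in (16, 8, 4):
    gb = estimate_vram_gb(bits_per_weight=bits, context_tokens=4096)
    print(f"{bits}-bit weights, 4k context: ~{gb:.1f} GiB")
```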

DeepSeek-R1 7B

DeepSeek-R1-Distill-Qwen-7B is a 7-billion-parameter language model from DeepSeek AI. It is a dense model obtained by knowledge distillation from the larger DeepSeek-R1 system, designed to deliver strong reasoning in domains such as mathematical reasoning, logical analysis, and code generation. Distillation lets the model capture much of the teacher's problem-solving ability in a far more computationally efficient form, making it suitable for deployments where resource constraints demand a smaller footprint without a significant loss in reasoning performance.
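
For local experimentation, a minimal inference sketch with Hugging Face transformers might look like the following, assuming the checkpoint is published as deepseek-ai/DeepSeek-R1-Distill-Qwen-7B and that your transformers version supports chat templates; the sampling settings are illustrative, not prescribed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # or torch.float16 on GPUs without bf16 support
    device_map="auto",
)

messages = [{"role": "user", "content": "Prove that the sum of two even integers is even."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```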

DeepSeek-R1-Distill-Qwen-7B is built on the Qwen2.5-Math-7B base model. Training focuses on transferring the reasoning behavior of the DeepSeek-R1 teacher, using a dataset of roughly 800,000 curated samples generated by the larger model and split into about 600,000 reasoning-focused examples and 200,000 non-reasoning examples. The model employs Multi-Head Latent Attention (MLA) and Rotary Position Embeddings (RoPE) for positional encoding, with context-extension techniques such as YaRN used to scale its working context.
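
To make the positional-encoding step concrete, the sketch below applies a plain rotary embedding to a query tensor. It uses the interleaved-pair convention with the common base of 10,000 and omits the YaRN scaling mentioned above, so it is a schematic of the mechanism rather than the exact Qwen/DeepSeek kernel.

```python
import torch

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (..., seq_len, head_dim)."""
    head_dim = x.shape[-1]
    # One rotation frequency per pair of channels.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = positions[:, None].float() * inv_freq[None, :]   # (seq_len, head_dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 8, 128)            # (batch, seq_len, head_dim)
q_rot = rope(q, torch.arange(8))
print(q_rot.shape)                    # torch.Size([1, 8, 128])
```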

In practice, DeepSeek-R1-Distill-Qwen-7B supports input sequences of up to 131,072 tokens. This extended context window helps it handle complex, multi-step problems that require a broad view of the input. The model is suited to technical applications that demand analytical precision, including automated theorem proving, algorithmic problem solving, and advanced programming assistance. Its compact size and specialized reasoning ability make it a strong candidate for local inference and deployment on consumer-grade hardware.
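
A simple pre-flight check against the 131,072-token window might look like this, reusing the (assumed) tokenizer checkpoint from the earlier sketch; the generation budget is an arbitrary example value.

```python
from transformers import AutoTokenizer

MAX_CONTEXT = 131_072
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

def fits_in_context(prompt: str, max_new_tokens: int = 2048) -> bool:
    """Check that the prompt plus a generation budget stays within the context window."""
    n_prompt = len(tokenizer(prompt).input_ids)
    return n_prompt + max_new_tokens <= MAX_CONTEXT

print(fits_in_context("Summarize the following codebase: ..."))  # True for short prompts
```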

About DeepSeek-R1

DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.
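
To illustrate the Mixture-of-Experts idea in general terms, the toy router below sends each token to its top-k experts and mixes their outputs with softmax weights. This is a generic sketch, not DeepSeek's fine-grained and shared-expert design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy top-k mixture-of-experts layer: route each token to k experts and
    combine their outputs with the router's softmax weights (illustration only)."""
    def __init__(self, dim: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, dim)
        scores = self.router(x)                             # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)          # pick k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```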



Evaluation Benchmarks

Rankings are for local LLMs.

No evaluation benchmarks are available for DeepSeek-R1 7B.

Rankings

Overall Rank: -

Coding Rank: -
