
DeepSeek-R1 32B

Parameters

32B

Context Length

131,072 tokens (128K)

Modality

Text

Architecture

Dense

License

MIT License

Release Date

20 Jan 2025

Knowledge Cutoff

Jul 2024

Technical Specifications

Attention Structure

Multi-Head Attention

Hidden Dimension Size

8192

Number of Layers

64

Attention Heads

96

Key-Value Heads

96

Activation Function

Swish

Normalization

RMS Normalization

Position Embedding

RoPE (Rotary Position Embedding)

DeepSeek-R1 32B

DeepSeek-R1-Distill-Qwen-32B is a distilled language model engineered for advanced reasoning tasks. It transfers the reasoning capabilities of the much larger DeepSeek-R1 teacher model into a more efficient 32-billion-parameter architecture. Built on a Qwen2.5-series base model, it was fine-tuned on 800,000 curated reasoning samples generated by the original DeepSeek-R1, enabling complex problem-solving at a parameter count suitable for broader deployment.

From an architectural standpoint, DeepSeek-R1-Distill-Qwen-32B is a dense transformer model. It incorporates the RoPE (Rotary Position Embedding) mechanism for handling sequence position information and utilizes FlashAttention-2 for optimized attention computation, enhancing efficiency and throughput. The model is designed with a context length of up to 131,072 tokens, allowing for processing and generation of extended sequences crucial for detailed analytical tasks. This architectural design prioritizes effective reasoning and generation while maintaining a manageable computational footprint.
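The rotary position embedding (RoPE) mentioned above can be sketched in a few lines. The following is an illustrative NumPy toy for a single attention head, using the split-half pairing convention common in GPT-NeoX-style implementations; it is not the model's actual code:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding (RoPE) to x of shape (seq_len, head_dim)."""
    _, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per dimension pair (split-half pairing).
    freqs = base ** (-2.0 * np.arange(half) / head_dim)
    angles = np.outer(positions, freqs)          # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1_i, x2_i) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Attention scores between rotated queries and keys depend only on the
# relative offset between positions, not on the absolute positions.
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 64))
k = rng.normal(size=(1, 64))
s_near = rope(q, np.array([3])) @ rope(k, np.array([7])).T
s_far = rope(q, np.array([1003])) @ rope(k, np.array([1007])).T
assert np.allclose(s_near, s_far)   # same offset (4) -> same score
```

This relative-position property is what lets RoPE-based models extend to long contexts such as the 131,072-token window described here.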

The model's primary use cases include complex problem-solving, advanced mathematical reasoning, and robust coding performance across multiple programming languages. It is compatible with popular deployment frameworks such as vLLM and SGLang, facilitating its integration into various applications and research initiatives. The DeepSeek-R1-Distill-Qwen-32B model is released under the MIT License, which supports commercial use and permits modifications and derivative works, including further distillation. This licensing approach promotes open research and widespread adoption within the machine learning community.

About DeepSeek-R1

DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.



Evaluation Benchmarks

No evaluation benchmarks are available for DeepSeek-R1 32B.

Rankings

Overall Rank

-

Coding Rank

-

Model Transparency

Total Score

B

67 / 100

DeepSeek-R1 32B Transparency Report


Audit Note

DeepSeek-R1-Distill-Qwen-32B demonstrates strong transparency regarding its architecture and licensing, providing a clear account of its distillation from a larger reasoning model. However, it remains opaque concerning the specific data sources used for pre-training and the precise compute resources dedicated to this variant. While the model's identity and hardware requirements are well-defined, improvements in versioning and benchmark reproducibility are needed to meet exemplary standards.

Upstream

19.5 / 30

Architectural Provenance

8.0 / 10

The model is explicitly identified as a distillation of DeepSeek-R1 into the Qwen2.5-32B base architecture. DeepSeek provides a detailed technical paper and GitHub repository outlining the multi-stage training pipeline, which includes cold-start data, large-scale reinforcement learning (RL) for the teacher model, and subsequent supervised fine-tuning (SFT) for the distilled variants. The architectural transition from the teacher's Mixture-of-Experts (MoE) to the student's dense transformer structure is well-documented.

Dataset Composition

4.5 / 10

While the model discloses the use of 800,000 curated reasoning samples generated by DeepSeek-R1 for distillation, the specific composition, sources, and filtering criteria of the original pre-training data for the Qwen2.5 base or the 'cold-start' data are not fully detailed. There is a lack of granular breakdown (e.g., percentages of code, web, or academic data) and no public access to the full training datasets, though the methodology for generating the reasoning traces is described.

Tokenizer Integrity

7.0 / 10

The model utilizes the Qwen2.5 tokenizer, which is publicly accessible with a known vocabulary size of 151,665 tokens. Documentation exists regarding slight configuration changes for the R1 series. However, there have been reported discrepancies between the 'config.json' embedding size (152,064) and the actual tokenizer vocabulary, which, while common in the industry, indicates minor documentation misalignment.
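A quick arithmetic check makes the discrepancy concrete. Interpreting the gap as padding of the embedding matrix to a hardware-friendly multiple (here 128) reflects common industry practice, not anything DeepSeek documents explicitly:

```python
# Figures quoted in the audit above.
tokenizer_len = 151_665   # tokens the tokenizer can actually produce
embedding_rows = 152_064  # rows in the model's embedding matrix ('config.json')

padding = embedding_rows - tokenizer_len
print(padding)  # 399 unused rows

# Padding the vocab dimension to a multiple of 128 is a common GPU
# throughput / tensor-parallelism optimization (an assumption here).
assert embedding_rows % 128 == 0
assert tokenizer_len % 128 != 0
```

The unused rows are harmless at inference time since the tokenizer never emits those ids, which is why the mismatch is a documentation issue rather than a correctness one.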

Model

26.0 / 40

Parameter Density

9.0 / 10

The model is clearly defined as a dense transformer with 32.5 billion total parameters. Unlike the MoE teacher model, all parameters are active during inference, and this distinction is explicitly stated in official documentation. The architectural specifications (64 layers, 8192 hidden dimension) are verifiable through the model's configuration files on Hugging Face.

Training Compute

3.0 / 10

DeepSeek provides high-level compute estimates for the primary DeepSeek-V3/R1 training (e.g., 2.78M GPU hours on H800 clusters), but specific compute resources, duration, and environmental impact data for the distillation of the 32B variant specifically are not disclosed. The $6 million development cost claim is a marketing figure rather than a technical compute breakdown.

Benchmark Reproducibility

5.0 / 10

DeepSeek provides comprehensive benchmark results (AIME, MATH-500, etc.) and specifies some evaluation parameters like temperature (0.6) and top-p (0.95). However, the evaluation code is modified from third-party sources (SkyThought), and independent researchers have noted significant sensitivity to prompt formatting and system instructions, making exact reproduction difficult without more standardized, versioned evaluation scripts.
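The quoted sampling settings (temperature 0.6, top-p 0.95) can be illustrated with a minimal nucleus-sampling sketch. The function and toy logits below are illustrative only, not DeepSeek's evaluation code:

```python
import numpy as np

def sample_top_p(logits, temperature=0.6, top_p=0.95, rng=None):
    """Temperature scaling followed by nucleus (top-p) sampling.

    Keeps the smallest set of tokens whose cumulative probability
    reaches top_p, renormalizes, and samples from that set.
    """
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]           # token ids, most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1  # tokens needed to reach top_p mass
    kept = order[:cutoff]
    return rng.choice(kept, p=probs[kept] / probs[kept].sum())

logits = np.array([4.0, 3.0, 1.0, -2.0, -5.0])
draws = [int(sample_top_p(logits, rng=np.random.default_rng(i))) for i in range(200)]
# With these toy logits the three lowest-probability tokens fall outside
# the 0.95 nucleus, so only tokens 0 and 1 are ever sampled.
```

Low temperature sharpens the distribution before the nucleus is formed, which is why reasoning evaluations at temperature 0.6 remain sensitive to prompt formatting: small logit shifts can move tokens in or out of the nucleus.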

Identity Consistency

9.0 / 10

The model consistently identifies as a member of the DeepSeek-R1 family and is transparent about its distilled nature and base architecture. It successfully differentiates its capabilities from the full 671B MoE model and maintains a coherent identity across official platforms and API responses.

Downstream

21.0 / 30

License Clarity

9.5 / 10

The model is released under the highly permissive MIT License, which is explicitly stated in the repository and model cards. This license clearly allows for commercial use, modification, and further distillation. The relationship between the student's MIT license and the base Qwen2.5 Apache 2.0 license is clearly navigated in the documentation.

Hardware Footprint

7.5 / 10

VRAM requirements are well-documented by both the provider and the community for various precisions (e.g., ~15GB for 4-bit, ~68GB for FP16). The impact of context length on memory is noted, and the model is widely supported by deployment frameworks like vLLM and Ollama, which provide additional hardware guidance.
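Those figures are consistent with a weights-only back-of-envelope estimate (parameter count taken from this card; real deployments add KV cache, activations, and runtime overhead on top):

```python
def weight_vram_gib(n_params, bits_per_param):
    """Rough VRAM for model weights alone, excluding KV cache and overhead."""
    return n_params * bits_per_param / 8 / 1024**3

N = 32.5e9  # total parameters reported for this model
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_vram_gib(N, bits):.0f} GiB")

# FP16 weights alone come out near 61 GiB; the ~68 GB figure quoted above
# additionally covers activations, KV cache, and framework overhead.
```

The 4-bit estimate lands at roughly 15 GiB, matching the community figure cited above; the gap between the weights-only FP16 number and the ~68 GB deployment figure is a useful reminder that context length, not just precision, drives total memory.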

Versioning Drift

4.0 / 10

While the model is hosted on Hugging Face with basic commit history, it lacks a formal semantic versioning system or a detailed public changelog for weight updates. Users have reported 'silent' updates or redirections on some API platforms, making it difficult to track behavioral drift or access specific historical checkpoints reliably.
