DeepSeek-R1 14B

Parameters

14B

Context Length

131,072 tokens

Modality

Text

Architecture

Dense

License

MIT License

Release Date

20 Jan 2025

Knowledge Cutoff

Jul 2024

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

40

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

RoPE

RoPE Theta

1,000,000

Sliding Window Attention

No

Sliding Window Size

131,072

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

5,120

Number of Layers

48

FFN Intermediate Size (Dense)

13,824

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

152,064

Architecture Diagram

Input Tokens → Token Embedding (position: RoPE; hidden: 5,120; context: 131,072; vocab: 152,064) → 48 × [Pre-Attention RMSNorm → Grouped-Query Attention (40 Q / 8 KV heads, head dim: 128) → residual add → Pre-FFN RMSNorm → Feed-Forward Network (SwiGLU, intermediate: 13,824) → residual add] → Final RMSNorm → Output Logits

DeepSeek-R1 14B

DeepSeek-R1-Distill-Qwen-14B is a dense large language model in the DeepSeek-R1 series, built for reasoning. It is distilled from the 671B-parameter DeepSeek-R1 (a Mixture-of-Experts model) and uses Qwen 2.5 14B as its base architecture. The distillation aims to transfer reasoning skills, particularly in mathematics and coding, from the larger DeepSeek-R1 into a smaller dense model that is cheaper to run.

DeepSeek-R1-Distill-Qwen-14B is a decoder-only transformer. It uses Rotary Position Embeddings (RoPE) for positional encoding, SwiGLU as the feed-forward activation function, and RMSNorm for normalization. The attention projections include a QKV bias, which is characteristic of the Qwen 2.5 series from which the model is derived. Unlike its larger DeepSeek-R1 progenitor, this variant is dense: every parameter is active for every token, rather than a routed subset of experts.
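The three components named above can be sketched in a few lines of plain Python. This is a minimal illustration on small lists, not the model's actual implementation: the function names are hypothetical, the half-split pairing in `rope_rotate` is one common layout and is assumed here, and the default `theta` is the RoPE Theta value from the specification table.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide the vector by its root-mean-square, then scale per channel."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation used for the gate branch of SwiGLU."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])


def rope_rotate(x, position, theta=1_000_000.0):
    """Rotate channel pairs (i, i + d/2) by angle position * theta^(-2i/d).

    The half-split pairing is an assumption of this sketch.
    """
    dim = len(x)
    half = dim // 2
    out = list(x)
    for i in range(half):
        angle = position * theta ** (-2.0 * i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    return out
```

Because `rope_rotate` applies a pure rotation to each channel pair, it leaves the vector's length unchanged and is the identity at position 0. This is why RoPE can encode position without altering the magnitude of queries and keys.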

The model supports a context length of up to 131,072 tokens, which allows long inputs to be processed in a single pass. It can be applied to text generation, data analysis, and code synthesis. Its heritage from DeepSeek-R1 makes it best suited to complex reasoning tasks such as mathematical problem-solving and programming. It supports both few-shot and zero-shot prompting and is suited to local deployment, where it can be integrated into applications via an API.

About DeepSeek-R1

DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.


Evaluation Benchmarks

No evaluation benchmarks for DeepSeek-R1 14B available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B

64 / 100

DeepSeek-R1 14B Model Integrity Report

Audit Note

The model is highly transparent about its architectural origins and licensing, with clear documentation of its relationship to its parent models. However, its training data provenance and specific compute resources are largely undisclosed. While the model's identity and technical specifications are well-defined, the lack of public access to the distillation datasets and detailed training metrics prevents a full transparency validation.

Upstream

17.0 / 30

Architectural Provenance

8.0 / 10

The model's lineage is explicitly documented as a distillation of the DeepSeek-R1 (671B) into the Qwen-2.5-14B base. The technical report details the transition from the Mixture-of-Experts architecture of the teacher model to the dense architecture of the student, including the use of specific components like RoPE, SwiGLU, and RMSNorm inherited from the Qwen-2.5 framework.

Dataset Composition

0.0 / 10

While the technical report mentions a distillation dataset of 800,000 samples, the actual data is not public. Furthermore, the foundational training data for the Qwen-2.5 base model remains undisclosed by its original developers, and the specific filtering or cleaning methodologies for the distillation corpus are described only in high-level categorical terms without verifiable samples.

Tokenizer Integrity

9.0 / 10

The model uses the Qwen-2.5 tokenizer, which is fully documented with a base vocabulary of 151,643 tokens; the 152,064 figure listed under Vocabulary Size is the size of the embedding matrix, which also covers special tokens and padding rows. The tokenizer configuration files, including the merge rules and vocabulary mappings, are publicly accessible via the Hugging Face repository, so tokenization behavior can be verified across supported languages.

Model

25.0 / 40

Parameter Density

9.0 / 10

The model is clearly defined as a 14.7 billion parameter dense architecture. Unlike its MoE progenitor, all parameters are active during inference. Detailed architectural configurations, including the number of layers (48), hidden dimensions (5120), and attention heads (40), are explicitly provided in the public configuration files.
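The stated parameter count can be sanity-checked from the configuration values. The sketch below is a rough estimate under stated assumptions, not an official accounting: layers, hidden size, and attention heads come from the paragraph above, and the FFN intermediate size and vocabulary size come from the specification table. The 8 key-value heads, the untied input and output embeddings, and the `DenseConfig` and `count_parameters` names are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenseConfig:
    hidden_size: int = 5120
    num_layers: int = 48
    num_heads: int = 40
    num_kv_heads: int = 8          # assumption: grouped-query attention with 8 KV heads
    intermediate_size: int = 13_824
    vocab_size: int = 152_064
    tied_embeddings: bool = False  # assumption: separate input and output embeddings


def count_parameters(cfg: DenseConfig) -> int:
    """Estimate the parameter count of a dense decoder with QKV bias and SwiGLU."""
    head_dim = cfg.hidden_size // cfg.num_heads
    kv_dim = cfg.num_kv_heads * head_dim

    q_and_o = 2 * cfg.hidden_size * cfg.hidden_size  # query and output projections
    k_and_v = 2 * cfg.hidden_size * kv_dim           # key and value projections
    qkv_bias = cfg.hidden_size + 2 * kv_dim          # bias on Q, K, V only
    attention = q_and_o + k_and_v + qkv_bias

    ffn = 3 * cfg.hidden_size * cfg.intermediate_size  # gate, up, down projections
    norms = 2 * cfg.hidden_size                        # two RMSNorm weights per layer
    per_layer = attention + ffn + norms

    embeddings = cfg.vocab_size * cfg.hidden_size
    if not cfg.tied_embeddings:
        embeddings *= 2

    final_norm = cfg.hidden_size
    return cfg.num_layers * per_layer + embeddings + final_norm


total = count_parameters(DenseConfig())
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the estimate lands at roughly 14.77 billion, in line with the 14.7 billion figure quoted above. The feed-forward projections account for most of each layer's parameters.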

Training Compute

3.0 / 10

The technical report provides general information about the hardware cluster used for the DeepSeek-R1 family (H100 GPUs), but it fails to disclose the specific compute hours, energy consumption, or carbon footprint associated with the distillation of the 14B variant specifically. The information provided is high-level and lacks granular resource accounting.

Benchmark Reproducibility

4.0 / 10

DeepSeek provides extensive benchmark results across standard sets like AIME, MATH, and MMLU. However, the specific evaluation scripts and the exact few-shot prompts used for the distilled variants are not as comprehensively documented as the main model, and third-party verification of the distillation-specific gains is still emerging.

Identity Consistency

9.0 / 10

The model consistently identifies itself as a distilled version of DeepSeek-R1 and correctly attributes its base architecture to Qwen. It maintains a clear distinction between its capabilities as a reasoning-optimized model and its identity as an AI developed by DeepSeek, with minimal confusion regarding its versioning.

Downstream

22.0 / 30

License Clarity

10.0 / 10

The model is released under the MIT License, which is one of the most permissive and transparent open-source licenses available. This license is clearly stated in the GitHub repository and on Hugging Face, explicitly allowing for commercial use, modification, and distribution without conflicting proprietary terms.

Hardware Footprint

7.0 / 10

Official documentation provides the model size and basic VRAM requirements for standard precision (BF16). While it lacks a comprehensive official matrix for quantization-specific performance trade-offs or context-length memory scaling, the community-driven deployment data for this specific 14B variant is extensive and verifiable.
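As a rough guide to how memory scales with quantization and context length, the footprint can be approximated as weight storage plus the key-value cache. This is a back-of-the-envelope sketch, not an official calculator: it ignores activations and framework overhead, the function names are hypothetical, and the layer count (48), key-value head count (8), and head dimension (128) are assumptions not confirmed by the official documentation referenced here.

```python
GIB = 2**30


def weight_bytes(num_params: int, weight_bits: int = 16) -> int:
    """Bytes needed to store the weights at the given precision (16 = BF16)."""
    return num_params * weight_bits // 8


def kv_cache_bytes(
    context_tokens: int,
    num_layers: int = 48,    # assumption
    num_kv_heads: int = 8,   # assumption
    head_dim: int = 128,     # assumption
    bytes_per_value: int = 2,  # BF16 cache entries
) -> int:
    """Bytes for cached keys and values: 2 tensors per layer per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


def estimate_vram_gib(num_params: int, context_tokens: int, weight_bits: int = 16) -> float:
    """Weights plus KV cache, in GiB; excludes activations and runtime overhead."""
    return (weight_bytes(num_params, weight_bits) + kv_cache_bytes(context_tokens)) / GIB


params = 14_700_000_000  # the 14.7 billion parameters stated in this report
for bits in (16, 8, 4):
    for ctx in (1_024, 131_072):
        print(f"{bits:>2}-bit weights, {ctx:>7,} tokens: {estimate_vram_gib(params, ctx, bits):6.1f} GiB")
```

Under these assumptions the cache costs 196,608 bytes per token, so a full 131,072-token context adds 24 GiB on top of the weights. At long context lengths the cache, not the weights, can dominate the memory budget for quantized deployments.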

Versioning Drift

5.0 / 10

The model follows a clear naming convention within the R1 family, but there is no formal semantic versioning system or public changelog for the distilled weights. Updates to the model family are frequent, but the specific 14B variant lacks a documented history of iterations or a clear deprecation path for previous checkpoints.
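The category and total scores in this report follow from the individual criteria by simple addition. The short check below reproduces that arithmetic from the figures listed above; the dictionary layout is only an illustrative choice.

```python
# Criterion scores as listed in the report, each out of 10.
scores = {
    "Upstream": {
        "Architectural Provenance": 8.0,
        "Dataset Composition": 0.0,
        "Tokenizer Integrity": 9.0,
    },
    "Model": {
        "Parameter Density": 9.0,
        "Training Compute": 3.0,
        "Benchmark Reproducibility": 4.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 10.0,
        "Hardware Footprint": 7.0,
        "Versioning Drift": 5.0,
    },
}

# Sum each category, then sum the categories for the overall score.
category_totals = {name: sum(items.values()) for name, items in scores.items()}
overall = sum(category_totals.values())
maximum = 10 * sum(len(items) for items in scores.values())

for name, subtotal in category_totals.items():
    print(f"{name}: {subtotal} / {10 * len(scores[name])}")
print(f"Total: {overall} / {maximum}")
```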

DeepSeek-R1 14B: Specifications and GPU VRAM Requirements