Total Parameters
671B
Context Length
131,072 (128K)
Modality
Text
Architecture
Mixture of Experts (MoE)
License
DeepSeek Model License
Release Date
27 Dec 2024
Knowledge Cutoff
-
Attention
Attention Structure
Multi-head Latent Attention (MLA)
Attention Heads
128
Key-Value Heads
128
Attention Head Dimension
-
Position Embedding
RoPE
RoPE Theta
10,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
Swish
Dimensions
Hidden Dimension Size
7,168
Number of Layers
61
FFN Intermediate Size (Dense)
18,432
Multi-Token Prediction Heads
1
Tokenizer
Vocabulary Size
129,280
Mixture of Experts
Active Parameters (per Token)
37.0B
Number of Experts
257 (256 routed + 1 shared)
Active Experts
9 (8 routed + 1 shared)
Shared Experts
1
FFN Intermediate Size (per Expert)
2,048
Dense Layers Before MoE
3
DeepSeek-V3 is a large-scale Mixture-of-Experts (MoE) language model, comprising a total of 671 billion parameters with 37 billion parameters activated per token during inference. This design prioritizes efficient inference and cost-effective training. The model was pre-trained on an extensive dataset of 14.8 trillion diverse and high-quality tokens. Subsequent training phases involved Supervised Fine-Tuning and Reinforcement Learning to further enhance its capabilities. DeepSeek-V3 represents an evolution in large language model design, building on previous architectural foundations while introducing novel advancements for efficiency.
The architectural core of DeepSeek-V3 integrates several innovations. It uses Multi-head Latent Attention (MLA), which compresses keys and values into a low-dimensional latent space and so reduces key-value cache memory during inference. The Mixture-of-Experts component, termed DeepSeekMoE, employs 256 routed experts and 1 shared expert; each token is routed to 8 of the routed experts and also passes through the shared expert. A notable innovation in this MoE architecture is an auxiliary-loss-free strategy for load balancing, which aims to distribute computational load across experts without the performance degradation typically associated with auxiliary loss functions. Additionally, DeepSeek-V3 incorporates a Multi-Token Prediction (MTP) training objective, which trains the model to predict multiple future tokens at each position; this densifies the training signal and is reported to improve overall model performance. Training also uses FP8 mixed precision, demonstrating its feasibility and effectiveness at an extremely large scale. The model employs Rotary Positional Embedding (RoPE) for positional information and RMSNorm for normalization within its layers.
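The routing described above can be illustrated with a toy sketch. It uses the expert counts from the text (256 routed experts, 8 selected per token), but everything else is an assumption made for illustration: the sigmoid affinities, the per-expert bias that influences selection only, the sign-based bias update, and the step size are one plausible reading of an auxiliary-loss-free scheme, not the confirmed DeepSeek-V3 implementation. The shared expert is always applied, so it does not appear in the routing step.

```python
import math
import random

NUM_ROUTED = 256  # routed experts (from the text)
TOP_K = 8         # routed experts selected per token (from the text)


def route(affinities, bias, top_k=TOP_K):
    """Select top_k experts by (affinity + bias); gate weights use affinity only."""
    ranked = sorted(range(len(affinities)),
                    key=lambda e: affinities[e] + bias[e],
                    reverse=True)
    chosen = ranked[:top_k]
    total = sum(affinities[e] for e in chosen)
    # Normalize the selected affinities so the gate weights sum to 1.
    return {e: affinities[e] / total for e in chosen}


def update_bias(bias, load, step=0.001):
    """Assumed update rule: lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


random.seed(0)
bias = [0.0] * NUM_ROUTED
load = [0] * NUM_ROUTED

for _ in range(512):  # a toy batch of 512 tokens
    # Hypothetical token-to-expert affinities: sigmoid of random logits.
    affinities = [1 / (1 + math.exp(-random.gauss(0, 1)))
                  for _ in range(NUM_ROUTED)]
    gates = route(affinities, bias)
    for expert in gates:
        load[expert] += 1

# One balancing step after the batch; no auxiliary loss term is involved.
bias = update_bias(bias, load)
```

Because the bias only changes which experts are selected and never enters the gate weights, balancing pressure in this sketch does not add a term to the training loss, which is the property the paragraph attributes to the auxiliary-loss-free strategy.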
DeepSeek-V3 is engineered to support a broad range of general language tasks, with capabilities in mathematical problem-solving, code development, and complex reasoning. It supports a context length of up to 128K tokens, which lets it handle long documents and long multi-turn conversations. Because only 37B of its 671B parameters are active per token, its per-token compute cost is well below that of a dense model of the same total size.
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model comprising 671B parameters with 37B activated per token, trained on 14.8T tokens. Its architecture incorporates Multi-head Latent Attention and DeepSeekMoE for efficient inference and training. Innovations include an auxiliary-loss-free load balancing strategy and a multi-token prediction objective.
Rank
#53
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| Refactoring | Aider Refactoring | 0.32 | 🥈 2 |
| StackEval | ProLLM Stack Eval | 0.976 | 4 |
| General Knowledge | MMLU | 0.885 | 6 |
| QA Assistant | ProLLM QA Assistant | 0.953 | 9 |
| Summarization | ProLLM Summarization | 0.806 | 12 |
| Coding | Aider Coding | 0.55 | 20 |
| StackUnseen | ProLLM Stack Unseen | 0.439 | 27 |
| Web Development | WebDev Arena | 1358 | 36 |
| Professional Knowledge | MMLU Pro | 0.74 | 47 |
Overall Rank
#53
Coding Rank
#79
Total Score
68 / 100
DeepSeek-V3 exhibits high transparency in its technical architecture and compute resources, providing a level of detail in its technical report that exceeds many proprietary competitors. Its primary transparency weaknesses lie in the lack of granular data provenance and the use of a custom, restrictive model license. While the model is highly verifiable through open weights, users should be mindful of rapid versioning cycles and the complexities of its multi-part licensing structure.
Architectural Provenance
DeepSeek-V3 provides exemplary architectural transparency through a detailed technical report and open-source implementation. The model explicitly documents its use of Multi-head Latent Attention (MLA) for inference efficiency and the DeepSeekMoE architecture. It provides specific details on its novel auxiliary-loss-free load balancing strategy and Multi-Token Prediction (MTP) objective. The transition from previous versions (V2) is clearly documented, and the model's 61-layer decoder-only transformer structure is fully specified in both the paper and the public GitHub repository.
Dataset Composition
While the total token count (14.8 trillion) and the general nature of the data (diverse, high-quality, multilingual) are disclosed, there is a lack of granular detail regarding the specific dataset proportions or sources. The documentation mentions 'web, code, and math' but does not provide a percentage breakdown or specific filtering/cleaning methodologies beyond general claims of curation. No sample data or specific source lists are publicly available, making the composition difficult to verify independently.
Tokenizer Integrity
The tokenizer is publicly accessible via Hugging Face and GitHub, with a confirmed vocabulary size of 129,280 tokens. It uses a byte-level Byte-Pair Encoding (BPE) approach similar to the Llama tokenizer but with custom modifications for multilingual support (English and Chinese). Documentation includes specific special tokens for tool calling and reasoning blocks, and the vocabulary is consistent across the V3 family, ensuring predictable behavior for developers.
Parameter Density
DeepSeek-V3 is transparent about its Mixture-of-Experts (MoE) structure, clearly stating a total of 671 billion parameters with 37 billion active parameters per token. The architectural breakdown (256 routed experts, 1 shared expert) is well-documented. However, it loses points because the Hugging Face checkpoint totals 685B parameters: 671B of main model weights plus 14B from the Multi-Token Prediction (MTP) module, which is used during training but is optional and detachable at inference. This leads to some minor ambiguity between 'total' and 'inference' parameter counts in marketing materials.
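As a quick sanity check on these figures, the sketch below computes what share of the parameters and of the experts is active for a single token. It uses only numbers quoted on this page; the variable names are illustrative.

```python
TOTAL_PARAMS_B = 671    # total parameters, in billions (from the text)
ACTIVE_PARAMS_B = 37    # parameters activated per token, in billions
ROUTED_EXPERTS = 256
SHARED_EXPERTS = 1
ROUTED_ACTIVE = 8       # routed experts selected per token

active_param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
experts_per_token = ROUTED_ACTIVE + SHARED_EXPERTS
active_expert_fraction = experts_per_token / (ROUTED_EXPERTS + SHARED_EXPERTS)

print(f"Active parameters: {active_param_fraction:.1%}")   # 5.5%
print(f"Active experts:    {active_expert_fraction:.1%}")  # 3.5%
```

The parameter share is higher than the expert share because attention, embeddings, and the dense layers run for every token regardless of routing.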
Training Compute
The technical report provides unusually specific details on training compute, citing 2.788 million H800 GPU hours for the full training run. It discloses the hardware used (2,048 NVIDIA H800 GPUs), the training duration (approximately two months), and even provides a cost estimate (~$5.58 million). While it does not provide a formal carbon footprint calculation in the primary report, the level of compute transparency is significantly higher than most industry peers.
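The disclosed figures can be cross-checked against each other with simple arithmetic. The sketch below assumes, for illustration only, that all 2,048 GPUs were busy for the entire run; it derives the implied wall-clock time and the implied price per GPU hour from the numbers quoted above.

```python
GPU_HOURS = 2.788e6   # total H800 GPU hours (from the text)
NUM_GPUS = 2048       # cluster size (from the text)
COST_USD = 5.58e6     # quoted cost estimate (from the text)

# Simplifying assumption: every GPU is in use for the whole run.
wall_clock_days = GPU_HOURS / NUM_GPUS / 24
implied_rate_usd_per_gpu_hour = COST_USD / GPU_HOURS

print(f"Wall-clock time: {wall_clock_days:.1f} days")                  # 56.7 days
print(f"Implied rate:    ${implied_rate_usd_per_gpu_hour:.2f}/GPU-hour")  # $2.00/GPU-hour
```

Roughly 57 days agrees with the stated duration of about two months, and the cost estimate corresponds to a rate of about $2 per GPU hour.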
Benchmark Reproducibility
DeepSeek provides a comprehensive list of benchmark results (MMLU, GSM8K, HumanEval, etc.) in its technical report and GitHub repository. However, it lacks a unified, one-click reproduction script for all claimed figures. While evaluation settings (e.g., few-shot counts) are mentioned, the exact prompts and internal evaluation pipelines are not fully open-sourced, making exact reproduction of scores challenging for independent auditors.
Identity Consistency
The model generally maintains a consistent identity as DeepSeek-V3 and correctly identifies its version and origin in most standard deployments. It is transparent about its nature as an AI and its MoE architecture. Some minor confusion has been noted in third-party agentic testing where the model occasionally struggles with self-awareness in complex scaffolds, but its core identity remains stable and verifiable through official API and weight metadata.
License Clarity
The licensing is split: the code is under the permissive MIT license, but the model weights are governed by a custom 'DeepSeek Model License.' While this license explicitly allows commercial use and derivative works, it includes 'Use-based restrictions' and 'Accountability' clauses that are more restrictive than standard Open Source Initiative (OSI) licenses. The terms are public but create a more complex legal landscape than a standard Apache 2.0 or MIT license.
Hardware Footprint
Hardware requirements are well-documented by both the provider and the community. The technical report discusses the use of FP8 mixed precision, and official guides specify VRAM requirements for various configurations (e.g., ~700GB for FP8 inference). Third-party documentation (e.g., vLLM, SGLang) provides detailed quantization trade-offs (INT4, GGUF) and multi-node requirements, though the provider's own documentation could be more centralized regarding consumer-grade hardware limits.
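A rough lower bound on memory can be estimated from the parameter count alone. The sketch below covers weight storage only and ignores the KV cache, activations, and runtime overhead, so real deployments need more; the bytes-per-parameter values are nominal figures for each format, and the helper name is hypothetical.

```python
TOTAL_PARAMS = 671e9  # total parameters (from the text)

# Nominal storage cost per parameter; real quantized formats add
# some overhead for scales and metadata.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weight_memory_gb(total_params, bytes_per_param):
    """Weights-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt}: ~{weight_memory_gb(TOTAL_PARAMS, nbytes):.0f} GB")
```

The FP8 estimate of about 671 GB for weights alone is in line with the ~700GB figure quoted above once runtime overhead is added.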
Versioning Drift
DeepSeek maintains a public changelog for its API and releases versioned weights (e.g., V3-0324, V3.1). However, the rapid release cycle and 'silent' updates to the hosted API (deepseek-chat) have led to reports of behavioral drift. While semantic versioning is partially used, the deprecation of older versions (like the original V3) happens quickly, sometimes leaving users with limited paths for long-term stability on a specific checkpoint.