DeepSeek-V3 671B

Total Parameters

671B

Context Length

131,072 tokens (128K)

Modality

Text

Architecture

Mixture of Experts (MoE)

License

DeepSeek Model License

Release Date

27 Dec 2024

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Multi-head Latent Attention (MLA)

Attention Heads

128

Key-Value Heads

128

Attention Head Dimension

-

Position Embedding

RoPE

RoPE Theta

10,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

Swish

Dimensions

Hidden Dimension Size

7,168

Number of Layers

61

FFN Intermediate Size (Dense)

18,432

Multi-Token Prediction Heads

1

Tokenizer

Vocabulary Size

129,280

Mixture of Experts

Active Parameters (per Token)

37.0B

Number of Experts

257

Active Experts

9

Shared Experts

1

FFN Intermediate Size (per Expert)

2,048

Dense Layers Before MoE

3

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE; Hidden: 7.2k · Context: 131.1k · Vocab: 129.3k) → 61 layers × [RMSNorm (Pre-Attention) → Multi-head Latent Attention (128 Q / 128 KV heads; Head dim: 56) → + (residual) → RMSNorm (Pre-FFN) → Sparse MoE FFN (9/257 experts; Swish; Intermediate: 2k) → + (residual)] → Final RMSNorm → Output Logits

DeepSeek-V3 671B

DeepSeek-V3 is a large-scale Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37 billion are activated per token during inference. Activating only a fraction of the parameters for each token is what keeps inference efficient and training cost-effective. The model was pre-trained on 14.8 trillion diverse, high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages. DeepSeek-V3 builds on the architecture of its predecessor, DeepSeek-V2, and adds new efficiency-oriented techniques.

The architectural core of DeepSeek-V3 integrates several innovations. Multi-head Latent Attention (MLA) compresses keys and values into a low-dimensional latent space, which reduces the memory consumed by the key-value cache during inference. The Mixture-of-Experts component, termed DeepSeekMoE, uses 256 routed experts and 1 shared expert; each token is routed to 8 of the routed experts and is always processed by the shared expert. Load is balanced across experts with an auxiliary-loss-free strategy, which avoids the performance degradation typically associated with auxiliary loss functions. DeepSeek-V3 also adopts a Multi-Token Prediction (MTP) training objective, which trains the model to predict multiple future tokens at once; this densifies the training signal and is observed to improve overall model performance. Training uses FP8 mixed precision, demonstrating its feasibility and effectiveness at an extremely large scale. The model uses Rotary Positional Embedding (RoPE) for positional information and RMSNorm for normalization within its layers.
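The routing described above (8 of 256 routed experts per token, with load balanced without an auxiliary loss) can be illustrated with a toy sketch. This is not DeepSeek's implementation: the sigmoid gating, the bias update rule, the step size, and the names `route_token` and `update_bias` are assumptions made for illustration. Only the expert counts come from the text.

```python
import math
import random

NUM_ROUTED = 256  # routed experts (from the text)
TOP_K = 8         # routed experts selected per token (from the text)


def route_token(affinities, bias, top_k=TOP_K):
    """Select top_k routed experts for one token.

    affinities: raw token-to-expert scores, one per routed expert.
    bias: per-expert balancing offsets, used for selection only.
    Returns {expert_index: gate_weight}; the gate weights sum to 1.
    """
    # Assumption: sigmoid gating over raw affinities.
    scores = [1.0 / (1.0 + math.exp(-a)) for a in affinities]
    # Selection uses biased scores, so under-loaded experts can be favoured.
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    # Gate weights use the unbiased scores, normalised over the chosen experts.
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


def update_bias(bias, load, step=0.001):
    """Lower the bias of over-loaded experts, raise it for under-loaded ones."""
    mean_load = sum(load) / len(load)
    return [b - step if n > mean_load else b + step if n < mean_load else b
            for b, n in zip(bias, load)]


random.seed(0)
bias = [0.0] * NUM_ROUTED
load = [0] * NUM_ROUTED
for _ in range(64):  # a toy batch of 64 tokens with random affinities
    gates = route_token([random.gauss(0, 1) for _ in range(NUM_ROUTED)], bias)
    for expert in gates:
        load[expert] += 1
bias = update_bias(bias, load)
# The shared expert is not routed: it would process every token in addition
# to the 8 routed experts selected here.
```

Because the bias only affects which experts are selected and not the gate weights, balancing in this sketch adds no extra term to the training loss.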

DeepSeek-V3 is built for a broad range of general language tasks, with capabilities in mathematical problem-solving, advanced code development, and complex reasoning. It supports a context length of up to 128K tokens, which lets it handle long documents and extended multi-turn conversations. Because only 37B of its 671B parameters are active per token, it suits applications that need the capacity of a very large model at a lower per-token compute cost.

About DeepSeek-V3

DeepSeek-V3 is a Mixture-of-Experts (MoE) language model comprising 671B parameters with 37B activated per token. Its architecture incorporates Multi-head Latent Attention and DeepSeekMoE for efficient inference and training. Innovations include an auxiliary-loss-free load balancing strategy and a multi-token prediction objective. The model was trained on 14.8T tokens.


Evaluation Benchmarks

Rank

#53

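| Category | Benchmark | Score | Rank |
|---|---|---|---|
| - | - | 0.32 | 🥈 2 |
| - | - | 0.976 | 4 |
| General Knowledge | MMLU | 0.885 | 6 |
| - | - | 0.953 | 9 |
| - | - | 0.806 | 12 |
| - | - | 0.55 | 20 |
| - | - | 0.439 | 27 |
| Web Development | WebDev Arena | 1358 | 36 |
| Professional Knowledge | MMLU Pro | 0.74 | 47 |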

Rankings

Overall Rank

#53

Coding Rank

#79

Model Integrity

Total Score

B

68 / 100

DeepSeek-V3 671B Model Integrity Report

Audit Note

DeepSeek-V3 exhibits high transparency in its technical architecture and compute resources, providing a level of detail in its technical report that exceeds many proprietary competitors. Its primary transparency weaknesses lie in the lack of granular data provenance and the use of a custom, restrictive model license. While the model is highly verifiable through open weights, users should be mindful of rapid versioning cycles and the complexities of its multi-part licensing structure.

Upstream

21.5 / 30

Architectural Provenance

9.0 / 10

DeepSeek-V3 provides exemplary architectural transparency through a detailed technical report and open-source implementation. The model explicitly documents its use of Multi-head Latent Attention (MLA) for inference efficiency and the DeepSeekMoE architecture. It provides specific details on its novel auxiliary-loss-free load balancing strategy and Multi-Token Prediction (MTP) objective. The transition from previous versions (V2) is clearly documented, and the model's 61-layer decoder-only transformer structure is fully specified in both the paper and the public GitHub repository.

Dataset Composition

4.0 / 10

While the total token count (14.8 trillion) and the general nature of the data (diverse, high-quality, multilingual) are disclosed, there is a lack of granular detail regarding the specific dataset proportions or sources. The documentation mentions 'web, code, and math' but does not provide a percentage breakdown or specific filtering/cleaning methodologies beyond general claims of curation. No sample data or specific source lists are publicly available, making the composition difficult to verify independently.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via Hugging Face and GitHub, with a confirmed vocabulary size of 129,280 tokens. It uses a byte-level Byte-Pair Encoding (BPE) approach similar to the Llama tokenizer but with custom modifications for multilingual support (English and Chinese). Documentation includes specific special tokens for tool calling and reasoning blocks, and the vocabulary is consistent across the V3 family, ensuring predictable behavior for developers.

Model

28.0 / 40

Parameter Density

7.0 / 10

DeepSeek-V3 is transparent about its Mixture-of-Experts (MoE) structure, clearly stating a total of 671 billion parameters with 37 billion active parameters per token. The architectural breakdown (256 routed experts, 1 shared expert) is well-documented. However, it loses points because the published checkpoint adds 14B parameters for the Multi-Token Prediction (MTP) module on top of the 671B main model, for 685B in total. The MTP module is used during training but is optional and detachable at inference, which creates minor ambiguity between 'total' and 'inference' parameter counts in marketing materials.
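As a rough sanity check on these figures, the expert parameter count can be estimated from the dimensions listed in the specifications. The sketch below assumes each expert is a gated FFN with three bias-free projection matrices (gate, up, down); that layout is an assumption, not something stated on this page. Attention, embedding, and dense-layer parameters are deliberately left out.

```python
# Values taken from the specification table above.
HIDDEN = 7_168         # hidden dimension size
EXPERT_FFN = 2_048     # FFN intermediate size per expert
LAYERS = 61            # number of layers
DENSE_LAYERS = 3       # dense layers before MoE
ROUTED, SHARED, TOP_K = 256, 1, 8

TOTAL_PARAMS = 671e9   # stated total
ACTIVE_PARAMS = 37e9   # stated active parameters per token

# Assumption: three projection matrices per expert (gate, up, down), no biases.
params_per_expert = 3 * HIDDEN * EXPERT_FFN
moe_layers = LAYERS - DENSE_LAYERS

# Every expert in every MoE layer counts toward the total ...
all_expert_params = moe_layers * (ROUTED + SHARED) * params_per_expert
# ... but only 8 routed experts plus the shared expert run for a given token.
active_expert_params = moe_layers * (TOP_K + SHARED) * params_per_expert

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"expert parameters (all):    {all_expert_params / 1e9:.1f}B")
print(f"expert parameters (active): {active_expert_params / 1e9:.1f}B")
print(f"active share of total:      {active_fraction:.1%}")
```

Under these assumptions the experts account for most of the 671B total, while only a small share of them runs for any single token, which is consistent with the large gap between the total and active counts.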

Training Compute

8.0 / 10

The technical report provides unusually specific details on training compute, citing 2.788 million H800 GPU hours for the full training run. It discloses the hardware used (2,048 NVIDIA H800 GPUs), the training duration (approximately two months), and even provides a cost estimate (~$5.58 million). While it does not provide a formal carbon footprint calculation in the primary report, the level of compute transparency is significantly higher than most industry peers.
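The figures in this paragraph can be cross-checked with simple arithmetic. The sketch below assumes a rental price of $2 per H800 GPU hour, which is not stated in this section, and treats all 2,048 GPUs as busy for the whole run, so the wall-clock figure is only a lower bound.

```python
GPU_HOURS = 2.788e6           # H800 GPU hours (from the text)
NUM_GPUS = 2_048              # from the text
RATE_USD_PER_GPU_HOUR = 2.0   # assumption: rental price per GPU hour

# Wall-clock time if every GPU were used for the entire run.
wall_clock_hours = GPU_HOURS / NUM_GPUS
wall_clock_days = wall_clock_hours / 24

# Estimated rental cost of the run.
cost_usd = GPU_HOURS * RATE_USD_PER_GPU_HOUR

print(f"wall-clock time: {wall_clock_days:.1f} days")
print(f"estimated cost:  ${cost_usd / 1e6:.3f}M")
```

The result lands at just under two months and roughly $5.58 million, matching the duration and cost quoted above.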

Benchmark Reproducibility

5.0 / 10

DeepSeek provides a comprehensive list of benchmark results (MMLU, GSM8K, HumanEval, etc.) in its technical report and GitHub. However, it lacks a unified, one-click reproduction script for all claimed figures. While evaluation settings (e.g., few-shot counts) are mentioned, the exact prompts and internal evaluation pipelines are not fully open-sourced, making exact reproduction of the reported scores challenging for independent auditors.

Identity Consistency

8.0 / 10

The model generally maintains a consistent identity as DeepSeek-V3 and correctly identifies its version and origin in most standard deployments. It is transparent about its nature as an AI and its MoE architecture. Some minor confusion has been noted in third-party agentic testing where the model occasionally struggles with self-awareness in complex scaffolds, but its core identity remains stable and verifiable through official API and weight metadata.

Downstream

18.5 / 30

License Clarity

6.0 / 10

The licensing is split: the code is under the permissive MIT license, but the model weights are governed by a custom 'DeepSeek Model License.' While this license explicitly allows commercial use and derivative works, it includes 'Use-based restrictions' and 'Accountability' clauses that are more restrictive than standard Open Source Initiative (OSI) licenses. The terms are public but create a more complex legal landscape than a standard Apache 2.0 or MIT license.

Hardware Footprint

7.5 / 10

Hardware requirements are well-documented by both the provider and the community. The technical report discusses the use of FP8 mixed precision, and official guides specify VRAM requirements for various configurations (e.g., ~700GB for FP8 inference). Third-party documentation (e.g., vLLM, SGLang) provides detailed quantization trade-offs (INT4, GGUF) and multi-node requirements, though the provider's own documentation could be more centralized regarding consumer-grade hardware limits.
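The ~700GB FP8 figure follows from the parameter count. The sketch below estimates memory for the weights only; it ignores the KV cache, activations, and runtime overhead, so real requirements are higher. The bytes-per-parameter values are the usual nominal sizes for each format, and the 80 GB per-GPU capacity and the helper names are hypothetical choices for illustration.

```python
import math

TOTAL_PARAMS = 671e9  # stated total parameter count

# Nominal storage per parameter; real quantized formats add some overhead.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus(required_gb, gpu_vram_gb=80.0):
    """Smallest number of GPUs whose combined VRAM holds required_gb."""
    return math.ceil(required_gb / gpu_vram_gb)


estimates = {name: weight_memory_gb(TOTAL_PARAMS, size)
             for name, size in BYTES_PER_PARAM.items()}

for name, gb in estimates.items():
    print(f"{name}: {gb:.1f} GB of weights, at least {min_gpus(gb)} x 80 GB GPUs")
```

In practice the GPU count is usually rounded up to a multiple of the node size and padded for the KV cache, which is why multi-node setups are commonly recommended.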

Versioning Drift

5.0 / 10

DeepSeek maintains a public changelog for its API and releases versioned weights (e.g., V3-0324, V3.1). However, the rapid release cycle and 'silent' updates to the hosted API (deepseek-chat) have led to reports of behavioral drift. While semantic versioning is partially used, the deprecation of older versions (like the original V3) happens quickly, sometimes leaving users with limited paths for long-term stability on a specific checkpoint.
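The section and total scores in this report are plain sums of the category scores. The short sketch below reproduces them from the numbers listed above; the dictionary layout is an illustrative choice, and the mapping from the total to the letter grade is not given on this page, so it is not modelled.

```python
# Category scores as listed in the report, each out of 10.
scores = {
    "Upstream": {
        "Architectural Provenance": 9.0,
        "Dataset Composition": 4.0,
        "Tokenizer Integrity": 8.5,
    },
    "Model": {
        "Parameter Density": 7.0,
        "Training Compute": 8.0,
        "Benchmark Reproducibility": 5.0,
        "Identity Consistency": 8.0,
    },
    "Downstream": {
        "License Clarity": 6.0,
        "Hardware Footprint": 7.5,
        "Versioning Drift": 5.0,
    },
}

# Each section total is the sum of its categories.
section_totals = {name: sum(cats.values()) for name, cats in scores.items()}
# Each section's maximum is 10 points per category.
section_max = {name: 10 * len(cats) for name, cats in scores.items()}
total = sum(section_totals.values())

for name in scores:
    print(f"{name}: {section_totals[name]} / {section_max[name]}")
print(f"Total: {total} / {sum(section_max.values())}")
```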

DeepSeek-V3 671B: Specifications and GPU VRAM Requirements