Total Parameters
671B
Context Length
128K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT License
Release Date
21 Aug 2025
Knowledge Cutoff
-
Active Parameters (per token)
37.0B
Number of Experts
257
Active Experts
8
Attention Structure
Multi-head Latent Attention (MLA)
Hidden Dimension Size
7168
Number of Layers
61
Attention Heads
-
Key-Value Heads
-
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
RoPE
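The expert counts listed above (257 experts, 8 active per token) describe sparse routing: for each token a router scores the experts and only the highest-scoring few are run. The following is a minimal, generic sketch of top-k routing, not this model's confirmed router. The function name `top_k_route`, the softmax-over-selected gating, and the random scores are all illustrative assumptions.

```python
import math
import random


def top_k_route(scores, k):
    """Pick the k highest-scoring experts and normalize their gate weights.

    Generic top-k gating with a softmax over the selected scores.
    This is an illustrative assumption, not the model's documented router.
    """
    # Indices of the k largest scores, best first.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


if __name__ == "__main__":
    random.seed(0)
    NUM_EXPERTS = 257    # "Number of Experts" from the spec list
    ACTIVE_EXPERTS = 8   # "Active Experts" from the spec list
    # Hypothetical router scores for a single token.
    scores = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    for expert_id, weight in top_k_route(scores, ACTIVE_EXPERTS):
        print(f"expert {expert_id:3d}  gate weight {weight:.3f}")
```

Only the selected experts' feed-forward blocks would run for that token, which is why the per-token active parameter count is far smaller than the total.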
VRAM requirements for different quantization methods and context sizes
A hybrid model that supports both "thinking" and "non-thinking" modes for chat, reasoning, and coding. It is a Mixture-of-Experts (MoE) model with a 128K-token context length that activates only a fraction of its parameters for each token.
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated per token. Its architecture uses Multi-head Latent Attention (MLA) and DeepSeekMoE for efficient inference and training. Innovations include an auxiliary-loss-free load-balancing strategy and a multi-token prediction training objective. The model was trained on 14.8T tokens.
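The parameter counts above give a quick lower bound on the memory needed just to hold the weights at a given precision. The sketch below is a back-of-the-envelope estimate only: the bit widths are nominal values chosen for illustration, and the result ignores the KV cache (which grows with context size), activations, and any per-format quantization overhead.

```python
# Rough weight-memory estimate. Ignores KV cache, activations, and runtime overhead.
TOTAL_PARAMS = 671e9    # total parameters, from the description above
ACTIVE_PARAMS = 37e9    # parameters activated per token, from the description above

# Nominal bits per weight (assumption: real quantization formats add some overhead).
BITS_PER_WEIGHT = {"FP16/BF16": 16, "FP8/INT8": 8, "INT4": 4}


def weight_memory_gb(num_params: float, bits: int) -> float:
    """Return the memory needed to store num_params weights, in GB (10^9 bytes)."""
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name:>10}: {weight_memory_gb(TOTAL_PARAMS, bits):,.1f} GB")
    print(f"Active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

All 671B weights must be resident even though only about 37B are used per token, so the MoE design reduces compute per token rather than the weight footprint.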
Ranking is for Local LLMs.
Rank
#3
Category | Benchmark | Score | Rank |
---|---|---|---|
General Knowledge | MMLU | 0.94 | 1 |
Coding | Aider Coding | 0.76 | 2 |
Professional Knowledge | MMLU Pro | 0.85 | 2 |
Graduate-Level QA | GPQA | 0.80 | 2 |
Overall Rank
#3
Coding Rank
#10