Total Parameters
671B
Context Length
128K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT License
Release Date
21 Aug 2025
Knowledge Cutoff
-
Attention
Attention Structure
Multi-head Latent Attention (MLA)
Attention Heads
128
Key-Value Heads
128
Attention Head Dimension
-
Position Embedding
RoPE
RoPE Theta
10,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
7,168
Number of Layers
61
FFN Intermediate Size (Dense)
18,432
Multi-Token Prediction Heads
1
Tokenizer
Vocabulary Size
129,280
Mixture of Experts
Active Parameters (per Token)
37.0B
Number of Experts
257
Active Experts
8
Shared Experts
1
FFN Intermediate Size (per Expert)
2,048
Dense Layers Before MoE
3
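The expert counts above describe sparse routing: each token is sent to 8 routed experts, and the shared expert is always applied. The sketch below illustrates generic top-k gating under stated assumptions. It reads the 257 listed experts as 256 routed plus 1 shared, uses made-up positive affinity scores, and normalizes weights over the selected experts. The names `route_token` and `scores` are hypothetical, and this is not presented as DeepSeek's actual router.

```python
import random

NUM_ROUTED_EXPERTS = 256  # assumption: 257 listed experts minus the 1 shared expert
ACTIVE_EXPERTS = 8        # routed experts selected per token (from the spec above)


def route_token(scores, k=ACTIVE_EXPERTS):
    """Pick the k highest-scoring experts and normalize their weights to sum to 1."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]


random.seed(0)
# Hypothetical per-expert affinity scores for a single token.
scores = [random.random() for _ in range(NUM_ROUTED_EXPERTS)]
selected = route_token(scores)

for expert_index, weight in selected:
    print(f"expert {expert_index:3d}  weight {weight:.3f}")
```

Only the selected experts' feed-forward blocks run for that token, which is why the per-token active parameter count is far below the total.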
A hybrid model that supports both "thinking" and "non-thinking" modes for chat, reasoning, and coding. It is a Mixture-of-Experts (MoE) model with a 128K-token context length and a sparse architecture that activates only part of the network per token.
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated per token. Its architecture uses Multi-head Latent Attention (MLA) and DeepSeekMoE for efficient inference and training. It also introduces an auxiliary-loss-free load-balancing strategy and a multi-token prediction objective, and was trained on 14.8T tokens.
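As a rough illustration of what these parameter counts imply, the following sketch estimates the memory needed for the weights alone at a few common precisions, along with the fraction of parameters active per token. The bytes-per-parameter values are generic assumptions, the helper name `weight_memory_gb` is hypothetical, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
TOTAL_PARAMS = 671e9   # total parameters, from the description above
ACTIVE_PARAMS = 37e9   # parameters activated per token

# Assumed storage cost per parameter for a few common weight formats.
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "fp8/int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, bytes_per_param):
    """Weights-only memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:10s} {weight_memory_gb(TOTAL_PARAMS, nbytes):8.1f} GB")

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active fraction per token: {active_fraction:.1%}")
```

All 671B parameters must still be stored, even though only about 5.5% of them participate in any single token's forward pass.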
Rank
#99
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| Agentic Coding | LiveBench Agentic | 0.47 | 24 |
| StackUnseen | ProLLM Stack Unseen | 0.481 | 24 |
| Web Development | WebDev Arena | 1418 | 29 |
| General Text | Text Arena | 1418 | 45 |
| Professional Knowledge | MMLU Pro | 0.84 | 55 |
Overall Rank
#99
Coding Rank
#68