DeepSeek-V3.1

Total Parameters

671B

Context Length

128K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT License

Release Date

21 Aug 2025

Knowledge Cutoff

-

Technical Specifications

Activated Parameters (per token)

37.0B

Number of Experts

257 (256 routed + 1 shared)

Active Experts

8

Attention Structure

Multi-head Latent Attention (MLA)

Hidden Dimension Size

7168

Number of Layers

61

Attention Heads

-

Key-Value Heads

-

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

RoPE (Rotary Position Embedding)
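
For reference, the two components named above, RMS normalization and a SwiGLU feed-forward unit, can be sketched in a few lines of PyTorch. The shapes and epsilon are illustrative assumptions; this is a generic sketch, not DeepSeek's code.

```python
# Generic sketch of RMSNorm and SwiGLU as used in many modern decoder stacks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Scale by the root-mean-square of the features; no mean subtraction.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: silu(gate projection) elementwise-multiplies the "up" projection.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```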

System Requirements

VRAM requirements for different quantization methods and context sizes
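
As a rough baseline, weight memory scales linearly with parameter count and bits per weight. The sketch below computes only that baseline under the assumption that all 671B parameters must be resident for inference (even though only about 37B are active per token); KV cache, activations, and runtime overhead are ignored, so real requirements are higher.

```python
# Back-of-the-envelope VRAM estimate for holding the weights alone.
# Assumption: every parameter stays resident; KV cache and overhead excluded.
TOTAL_PARAMS = 671e9  # total parameter count from the specifications above

def weight_vram_gb(total_params: float, bits_per_param: float) -> float:
    """Gigabytes needed just to store the weights at a given precision."""
    return total_params * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name:>5}: ~{weight_vram_gb(TOTAL_PARAMS, bits):,.0f} GB")
# FP16: ~1,342 GB    FP8: ~671 GB    INT4: ~336 GB
```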

DeepSeek-V3.1

A hybrid model that supports both "thinking" and "non-thinking" modes for chat, reasoning, and coding. It is a Mixture-of-Experts (MoE) model with a 128K-token context window and an inference-efficient architecture.
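
As a usage illustration: DeepSeek's hosted API is OpenAI-compatible, and the two modes are typically selected through different model aliases. The base URL and model names below are assumptions, not taken from this page; verify them against the official API documentation.

```python
# Hypothetical call sketch using the OpenAI Python SDK against an
# OpenAI-compatible endpoint. Base URL and model aliases are assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

# Non-thinking mode: direct chat-style answers (assumed alias "deepseek-chat").
chat = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize mixture-of-experts in one sentence."}],
)

# Thinking mode: the model reasons before answering (assumed alias "deepseek-reasoner").
reasoned = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Roughly what fraction of 671B is 37B?"}],
)

print(chat.choices[0].message.content)
print(reasoned.choices[0].message.content)
```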

About DeepSeek-V3

DeepSeek-V3 is a Mixture-of-Experts (MoE) language model comprising 671B total parameters, of which 37B are activated per token. Its architecture combines Multi-head Latent Attention (MLA) and DeepSeekMoE for efficient inference and training. Innovations include an auxiliary-loss-free load-balancing strategy and a multi-token prediction training objective; the model was pre-trained on 14.8T tokens.
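
The property described above, only about 37B of 671B parameters being active per token, comes from top-k expert routing. The sketch below is a minimal, generic top-k MoE layer in PyTorch with toy sizes so it runs anywhere; the gating scheme, expert shapes, and the single shared expert are illustrative assumptions, not DeepSeek-V3's actual implementation (which additionally uses auxiliary-loss-free balancing and MLA).

```python
# Minimal top-k MoE routing sketch (toy sizes; the real model reportedly uses
# d_model=7168, 256 routed experts, and k=8). Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # One always-on shared expert, in the spirit of DeepSeekMoE (assumption).
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                            # x: (n_tokens, d_model)
        scores = self.router(x)                      # (n_tokens, n_experts)
        top_w, top_i = scores.topk(self.k, dim=-1)   # each token picks k experts
        top_w = F.softmax(top_w, dim=-1)             # weights over the chosen k only
        out = self.shared(x)                         # shared expert sees every token
        for e, expert in enumerate(self.experts):    # routed experts see a subset
            tok, slot = (top_i == e).nonzero(as_tuple=True)
            if tok.numel():
                out[tok] = out[tok] + top_w[tok, slot, None] * expert(x[tok])
        return out

with torch.no_grad():
    y = TopKMoE()(torch.randn(4, 64))
    print(y.shape)  # torch.Size([4, 64])
```

Because each token is processed by only k routed experts plus the shared expert, compute and activated parameters per token stay a small fraction of the total parameter count, which is the efficiency argument made in the paragraph above.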


Evaluation Benchmarks

Rankings are computed among local LLMs only.

Rank

#3

Benchmark                               Score   Rank
General Knowledge (MMLU)                0.94    🥇 1
General Knowledge (unnamed in source)   0.76    🥈 2
Professional Knowledge (MMLU Pro)       0.85    🥈 2
Graduate-Level QA (GPQA)                0.80    🥈 2

Rankings

Overall Rank

#3 πŸ₯‰

Coding Rank

#10

GPU Requirements

Interactive calculator: choose a weight quantization method and a context size (1K to 125K tokens) to estimate the VRAM required and see recommended GPUs.