DeepSeek-V3.1

Total Parameters

671B

Context Length

128K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT License

Release Date

21 Aug 2025

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Multi-Head Attention

Attention Heads

128

Key-Value Heads

128

Attention Head Dimension

-

Position Embedding

RoPE

RoPE Theta

10,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU
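
The RoPE entry above (theta 10,000) can be illustrated with a minimal sketch of how rotary frequencies are derived and applied. This is the generic RoPE formulation, not code taken from DeepSeek-V3.1; the function names and the 8-dimensional example vector are hypothetical and chosen only for readability.

```python
import math

def rope_inverse_frequencies(head_dim: int, theta: float = 10_000.0) -> list[float]:
    """Return head_dim/2 inverse frequencies theta^(-2i/head_dim)."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def rope_rotate(vec: list[float], position: int, theta: float = 10_000.0) -> list[float]:
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position * frequency."""
    freqs = rope_inverse_frequencies(len(vec), theta)
    out: list[float] = []
    for i, freq in enumerate(freqs):
        angle = position * freq
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * cos_a - y * sin_a)
        out.append(x * sin_a + y * cos_a)
    return out

# Hypothetical 8-dimensional query vector, for illustration only.
query = [1.0, 0.0, 0.5, -0.5, 2.0, 1.0, 0.0, 3.0]
rotated = rope_rotate(query, position=5)
```

Each pair is rotated by an angle proportional to the token position, so the vector norm is unchanged and only relative offsets between positions affect the attention dot product.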

Dimensions

Hidden Dimension Size

7,168

Number of Layers

61

FFN Intermediate Size (Dense)

2,048

Multi-Token Prediction Heads

1

Tokenizer

Vocabulary Size

129,280

Mixture of Experts

Active Parameters per Token

37.0B

Number of Experts

257

Active Experts

8

Shared Experts

1

FFN Intermediate Size (per Expert)

2,048

Dense Layers Before MoE

3
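
The expert counts above (8 active experts, 1 shared expert) suggest top-k routing, which can be sketched as follows. This is a generic softmax top-k gate, not DeepSeek's gating function, whose details are not given on this page. The split of the 257 listed experts into 256 routed plus 1 shared is an assumption, and the random scores stand in for a real router's output.

```python
import math
import random

NUM_ROUTED = 256  # assumption: 257 listed experts minus the 1 shared expert
TOP_K = 8         # "Active Experts" on the sheet

def route(scores: list[float], k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the k highest-scoring routed experts and normalise their weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only; subtract the max for stability.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Placeholder router scores for a single token (hypothetical values).
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]
selected = route(scores)  # list of (expert_index, weight) pairs
```

In a layer of this kind, the token's output would be the weighted sum of the selected experts' outputs plus the shared expert's output, which is applied to every token regardless of routing.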

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE · Hidden: 7.2k · Context: 128K · Vocab: 129.3k)
→ 61 layers × [RMSNorm (Pre-Attention) → Multi-Head Attention (128 Q / 128 KV heads, head dim: 56) → residual add (+) → RMSNorm (Pre-FFN) → Sparse MoE FFN (8/257 experts, SwiGLU, intermediate: 2k) → residual add (+)]
→ Final RMSNorm → Output Logits

DeepSeek-V3.1

A hybrid model that supports both "thinking" and "non-thinking" modes for chat, reasoning, and coding. It is a Mixture-of-Experts (MoE) model with a 128K-token context length that activates 8 experts per token.

About DeepSeek-V3

DeepSeek-V3 is a Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated per token. Its architecture uses Multi-head Latent Attention and DeepSeekMoE for efficient inference and training. Innovations include an auxiliary-loss-free load balancing strategy and a multi-token prediction objective. The model was trained on 14.8T tokens.
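
The parameter counts quoted above imply that only a small share of the model is used for any one token. The arithmetic below is a minimal sketch; the compute estimate relies on the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass, which is an assumption and not a figure from this page.

```python
TOTAL_PARAMS = 671e9   # total parameters, from the description above
ACTIVE_PARAMS = 37e9   # parameters activated per token

# Share of the model that participates in a single token's forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active fraction: {active_fraction:.1%}")  # 5.5%

# Rule-of-thumb forward compute: about 2 FLOPs per active parameter per token.
forward_flops_per_token = 2 * ACTIVE_PARAMS
print(f"Approx. forward FLOPs per token: {forward_flops_per_token:.2e}")
```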

Evaluation Benchmarks

Rank

#99

Category                 Benchmark           Score   Rank
Agentic Coding           LiveBench Agentic   0.47    24
-                        -                   0.481   24
Web Development          WebDev Arena        1418    29
General Text             Text Arena          1418    45
Professional Knowledge   MMLU Pro            0.84    55

Rankings

Overall Rank

#99

Coding Rank

#68

Model Integrity

Total Score

B

68 / 100

GPU Requirements
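
A rough lower bound on VRAM can be estimated from the weight footprint alone: parameter count times bytes per parameter. The sketch below assumes the 671B total parameter count and a few common precisions; the precision names and the `weight_memory_gb` helper are illustrative, and the estimate excludes the KV cache, activations, and runtime overhead, all of which grow with context size.

```python
# Bytes needed to store one parameter at each precision (illustrative set).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

TOTAL_PARAMS = 671e9  # all experts count, not just those active per token

for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(TOTAL_PARAMS, precision)
    print(f"{precision}: {gb:,.1f} GB")
```

This gives about 1,342 GB at bf16, 671 GB at fp8, and 335.5 GB at 4-bit. Although only 37B parameters are active per token, any expert may be selected, so a typical deployment would need the full set of weights accessible to the GPUs.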

DeepSeek-V3.1: Specifications and GPU VRAM Requirements