DeepSeek-V4-Pro

Total Parameters

1.6T

Context Length

1,000K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT

Release Date

24 Apr 2026

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Multi-Head Attention

Attention Heads

128

Key-Value Heads

1

Attention Head Dimension

512

Position Embedding

Absolute Position Embedding

RoPE Theta

10,000

Sliding Window Attention

Yes

Sliding Window Size

128

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

7,168

Number of Layers

61

FFN Intermediate Size (Dense)

3,072

Multi-Token Prediction Heads

1

Tokenizer

Vocabulary Size

129,280

Mixture of Experts

Activated Parameters (per Token)

49.0B

Number of Experts

384

Active Experts

6

Shared Experts

1

FFN Intermediate Size (per Expert)

3,072

Dense Layers Before MoE

-

Architecture Diagram

Input Tokens
Token Embedding (Position: Absolute · Hidden: 7.2k · Context: 1,000k · Vocab: 129.3k)
x 61 layers:
    RMSNorm (Pre-Attention)
    Multi-Head Attention (128 Q / 1 KV heads · SW: 128 · Head dim: 512)
    + (residual connection)
    RMSNorm (Pre-FFN)
    Sparse MoE FFN (6/384 experts · SwiGLU · Intermediate: 3.1k)
    + (residual connection)
Final RMSNorm
Output Logits
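
The Sparse MoE FFN block in the diagram activates 6 of 384 routed experts per token. A minimal sketch of top-k expert selection is shown here to make that step concrete. The gating function (softmax over the selected logits), the random logits, and the names `route_token`, `NUM_EXPERTS`, and `TOP_K` are illustrative assumptions; this page does not describe how DeepSeek-V4-Pro actually scores or normalizes experts, and the shared expert is not modeled.

```python
import math
import random

NUM_EXPERTS = 384  # routed experts, from the spec table
TOP_K = 6          # active routed experts per token, from the spec table


def route_token(gate_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only.
    The real gating function is not given on this page.
    """
    # Indices of the top_k largest logits, highest first.
    chosen = sorted(
        range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True
    )[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(gate_logits[i] for i in chosen)
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Hypothetical router scores for a single token (seeded for repeatability).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routing = route_token(logits)

for expert_index, weight in routing:
    print(f"expert {expert_index:3d}  weight {weight:.3f}")
```

Only the selected experts' feed-forward networks run for that token, which is why the activated parameter count is far smaller than the total.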

DeepSeek-V4-Pro

DeepSeek-V4-Pro is DeepSeek's flagship open-source model, with 1.6T total parameters and 49B activated per token. It features a novel hybrid CSA+HCA attention mechanism that reaches a 1M-token context with only 27% of the inference FLOPs and 10% of the KV cache of DeepSeek-V3.2. In Think Max mode (DeepSeek-V4-Pro-Max), it achieves state-of-the-art open-source results: SWE-Bench Verified 80.6%, SWE-Bench Pro 55.4%, Terminal-Bench 2.0 67.9%, MRCR 1M 83.5%, GPQA Diamond 90.1%, LiveCodeBench 93.5%, and Codeforces Rating 3206. It supports three reasoning modes: Non-think, Think High, and Think Max. The model is available via API as deepseek-v4-pro and was released as open source under the MIT license on April 24, 2026.
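
To put the two parameter counts in proportion, the short calculation below derives the share of parameters activated per token. It uses only the 1.6T and 49B figures quoted above; the function and constant names are illustrative.

```python
TOTAL_PARAMS = 1.6e12   # total parameters, as quoted in the description
ACTIVE_PARAMS = 49e9    # parameters activated per token, as quoted


def active_fraction(active, total):
    """Fraction of the model's parameters used for a single token."""
    return active / total


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"{fraction:.2%} of parameters are active per token")  # prints 3.06% ...
```

Roughly 3% of the weights participate in each forward step, although all 1.6T still have to be stored.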

About DeepSeek V4

DeepSeek-V4 is DeepSeek's latest generation of highly efficient Mixture-of-Experts language models. It features a novel hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to dramatically improve long-context efficiency. The models are pre-trained on 32T+ tokens and then put through a comprehensive post-training pipeline that includes domain-specific expert cultivation and unified model consolidation. Both V4-Pro and V4-Flash support a 1M context length as standard, with three reasoning effort modes (Non-think, Think High, Think Max). The family was released as open source under the MIT license on April 24, 2026.


Evaluation Benchmarks

Rank

#76

No evaluation benchmarks for DeepSeek-V4-Pro available.

Rankings

Overall Rank

#76

Coding Rank

-

GPU Requirements
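
As a rough guide, the memory needed just to hold the weights can be estimated from the 1.6T total parameter count and the bytes stored per parameter. The sketch below is a back-of-the-envelope estimate under stated assumptions, not the output of the site's calculator: the bytes-per-parameter values are the usual nominal sizes for each format, and the estimate ignores the KV cache, activations, and runtime overhead, all of which grow with context size.

```python
TOTAL_PARAMS = 1.6e12  # total parameters, from the model description

# Nominal storage cost per parameter for common weight formats (assumed).
BYTES_PER_PARAM = {
    "FP16/BF16": 2.0,
    "FP8/INT8": 1.0,
    "INT4": 0.5,
}


def weight_memory_gb(num_params, bytes_per_param):
    """Weights-only memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(TOTAL_PARAMS, size)
    for name, size in BYTES_PER_PARAM.items()
}

for name, gigabytes in estimates.items():
    print(f"{name:10s} ~{gigabytes:,.0f} GB for weights alone")
```

Under these assumptions the weights alone need about 3,200 GB at 16-bit, 1,600 GB at 8-bit, and 800 GB at 4-bit precision, so any deployment has to be spread across multiple accelerators.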


DeepSeek-V4-Pro: Specifications and GPU VRAM Requirements