
DeepSeek-V3.1

Total Parameters

671B

Context Length

128K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT License

Release Date

21 Aug 2025

Knowledge Cutoff

-

Technical Specifications

Active Parameters

37.0B

Number of Experts

257

Active Experts

8

Attention Structure

Multi-head Latent Attention (MLA)

Hidden Dimension Size

7168

Number of Layers

61

Attention Heads

-

Key-Value Heads

-

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

RoPE

DeepSeek-V3.1

A hybrid model that supports both "thinking" and "non-thinking" modes for chat, reasoning, and coding. It is a Mixture-of-Experts (MoE) model with a 128K-token context window and an inference-efficient architecture.

About DeepSeek-V3

DeepSeek-V3 is a Mixture-of-Experts (MoE) language model comprising 671B parameters with 37B activated per token. Its architecture incorporates Multi-head Latent Attention and DeepSeekMoE for efficient inference and training. Innovations include an auxiliary-loss-free load-balancing strategy and a multi-token prediction objective; the model was pre-trained on 14.8T tokens.



Evaluation Benchmarks


Benchmark | Score | Rank

MMLU Pro (Professional Knowledge) | 0.84 | #4

WebDev Arena (Web Development) | 1418 | #14

Rankings

Overall Rank

#3 🥉

Coding Rank

#22

Model Transparency

Total Score

68 / 100 (Grade B)

DeepSeek-V3.1 Transparency Report


Audit Note

DeepSeek-V3.1 exhibits high transparency regarding its MoE architecture and training compute efficiency, providing technical details rarely seen in models of this scale. However, significant opacity remains concerning the specific composition of its 14.8T token training set and the reproducibility of its latest hybrid-mode benchmarks. The model's permissive licensing and clear self-identity are strong points, but the 'silent' nature of its updates complicates long-term reliability tracking.

Upstream

21.0 / 30

Architectural Provenance

8.0 / 10

DeepSeek-V3.1 is built upon the DeepSeek-V3 architecture, which is extensively documented in a 52-page technical report. It utilizes a Mixture-of-Experts (MoE) framework with Multi-head Latent Attention (MLA) and an auxiliary-loss-free load balancing strategy. The V3.1 variant specifically introduces a 'hybrid reasoning' capability, allowing the model to toggle between standard and chain-of-thought modes via chat templates. While the base architecture is highly transparent, the specific 'hybrid' training delta for V3.1 is less detailed than the original V3 pre-training documentation.

Dataset Composition

4.0 / 10

The model was trained on 14.8 trillion tokens for the base version, with V3.1 receiving an additional 840 billion tokens for long-context extension (32K and 128K phases). However, the specific sources of this data remain largely undisclosed beyond general categories like 'diverse high-quality data' and 'web, code, and math.' There is no public breakdown of dataset proportions (e.g., % CommonCrawl vs % GitHub) or specific filtering/cleaning code, which is a significant gap in upstream transparency.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly available on Hugging Face with a vocabulary size of 129,280 tokens. It uses a byte-level BPE approach and includes specific special tokens for the 'thinking' mode (e.g., <think> and </think>). The configuration files are fully accessible, allowing for independent verification of tokenization behavior and alignment with claimed language support (primarily English and Chinese).

Model

29.0 / 40

Parameter Density

7.0 / 10

DeepSeek-V3.1 is transparent about its MoE structure, disclosing a total of 671 billion parameters with 37 billion active parameters per token. The architectural breakdown (61 layers, 256 experts per layer, and 1 shared expert) is clearly stated in technical documentation. However, some third-party reports cite 685B total parameters (including MTP modules), creating slight confusion that requires careful reading of the technical report to resolve.
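The parameter accounting above can be sanity-checked with a short sketch. All figures come from this section (671B total, 37B active, 256 routed experts plus 1 shared expert per layer, 8 routed experts active per token); the helper names are illustrative, not part of any DeepSeek API.

```python
# Back-of-envelope MoE accounting using the figures quoted above.
TOTAL_PARAMS_B = 671.0   # total parameters, in billions
ACTIVE_PARAMS_B = 37.0   # parameters activated per token, in billions

ROUTED_EXPERTS = 256     # routed experts per MoE layer
SHARED_EXPERTS = 1       # always-active shared expert per layer
ACTIVE_ROUTED = 8        # routed experts selected per token

def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of weights touched per token in an MoE forward pass."""
    return active_b / total_b

def experts_active_per_layer() -> int:
    """Experts that fire per token in each MoE layer (routed + shared)."""
    return ACTIVE_ROUTED + SHARED_EXPERTS

frac = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
print(f"Active fraction: {frac:.1%}")  # roughly 5.5% of weights per token
print(f"Experts per layer per token: {experts_active_per_layer()}")
```

This is why a 671B-parameter model can serve tokens at roughly the cost of a ~37B dense model: only about one weight in eighteen is touched per forward pass.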

Training Compute

8.0 / 10

The provider offers unusually detailed compute metrics for a model of this scale. The technical report states the pre-training required 2.664M H800 GPU hours, with an additional 119K for context extension and 5K for post-training, totaling approximately 2.788M hours. They also provide an estimated training cost of $5.6M. While hardware specs are clear (H800 clusters), a formal carbon footprint calculation is missing.
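The GPU-hour totals above can be verified directly, and dividing the reported cost by total hours gives an implied rate per H800 GPU-hour. That per-hour rate is a derived estimate from this section's numbers, not an official DeepSeek figure.

```python
# Cross-check of the compute figures quoted above.
PRETRAIN_HOURS = 2_664_000    # pre-training H800 GPU hours
CONTEXT_EXT_HOURS = 119_000   # long-context extension phases
POST_TRAIN_HOURS = 5_000      # post-training
REPORTED_COST_USD = 5_600_000 # estimated training cost from the report

total_hours = PRETRAIN_HOURS + CONTEXT_EXT_HOURS + POST_TRAIN_HOURS
implied_rate = REPORTED_COST_USD / total_hours  # derived, not official

print(f"Total GPU hours: {total_hours:,}")        # 2,788,000
print(f"Implied $/H800 GPU-hour: {implied_rate:.2f}")
```

The totals reconcile exactly (2.664M + 119K + 5K = 2.788M hours), and the implied rate of about $2 per H800 GPU-hour is consistent with rental pricing assumptions rather than hardware purchase costs.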

Benchmark Reproducibility

5.0 / 10

While DeepSeek provides extensive benchmark results (MMLU, MATH, HumanEval) and some evaluation scripts on GitHub, third-party reproduction has shown inconsistent results. For instance, community tests on LiveCodeBench for V3.1-Base reported significantly lower scores than official claims. The lack of a unified, one-click reproduction suite for the V3.1 specific 'hybrid' benchmarks limits full verifiability.

Identity Consistency

9.0 / 10

The model consistently identifies itself as DeepSeek-V3.1 and is transparent about its dual-mode capabilities (thinking vs. non-thinking). It correctly handles system prompts to switch between these identities and does not exhibit the identity confusion common in models that are heavily distilled from competitors. Versioning is clearly maintained through Hugging Face and API endpoints.

Downstream

18.0 / 30

License Clarity

7.0 / 10

The model weights are released under the MIT License, which is highly permissive for both commercial and non-commercial use. However, there is some ambiguity regarding the 'DeepSeek Model License' mentioned in some repositories, which can conflict with the MIT header on Hugging Face. The terms for derivative works and output usage are generally clear but require cross-referencing multiple documents.

Hardware Footprint

6.0 / 10

VRAM requirements are documented for various quantization levels (FP8, BF16), with clear guidance that a full BF16 deployment requires significant resources (approx. 1.3TB VRAM). While community tools like Unsloth provide additional guidance for consumer hardware (e.g., 226GB for 2-bit), the official documentation focuses primarily on enterprise-grade H800/A100 clusters, leaving a gap for smaller-scale users.
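As a rough cross-check of the ~1.3TB BF16 figure, weights-only VRAM can be estimated as parameter count times bytes per weight. This is a minimal sketch that ignores KV cache, activations, and framework overhead, so real deployments need headroom beyond these numbers.

```python
# Weights-only VRAM estimate: parameters x bytes per parameter.
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """GiB of memory needed just to hold the model weights."""
    return params_billions * 1e9 * bytes_per_param / 1024**3

bf16_gb = weight_vram_gb(671, 2.0)  # 2 bytes per weight
fp8_gb = weight_vram_gb(671, 1.0)   # 1 byte per weight

print(f"BF16 weights: {bf16_gb:,.0f} GiB")  # ~1,250 GiB
print(f"FP8 weights:  {fp8_gb:,.0f} GiB")   # ~625 GiB
```

The weights alone come to roughly 1.25TB in BF16; the ~1.3TB deployment figure quoted above additionally covers runtime overhead. The community 2-bit figure (~226GB) similarly tracks 671B x 0.25 bytes plus quantization metadata.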

Versioning Drift

5.0 / 10

DeepSeek uses a versioning system (V3 -> V3-0324 -> V3.1), but the release of V3.1 was described as a 'silent launch' without a formal changelog or detailed migration guide. While weights are versioned on Hugging Face, there is limited documentation on how the model's behavior drifts over time due to the frequent, unannounced updates to the hosted API endpoints.

GPU Requirements

Full Calculator

Choose the quantization method for model weights

Context Size: 1,024 tokens

1k
63k
125k

VRAM Required:

Recommended GPUs

DeepSeek-V3.1: Specifications and GPU VRAM Requirements