DeepSeek-V3.2

Total Parameters

671B

Context Length

128K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT

Release Date

10 Jan 2026

Knowledge Cutoff

May 2025

Technical Specifications

Attention

Attention Structure

DeepSeek Sparse Attention

Attention Heads

128

Key-Value Heads

1

Attention Head Dimension

-

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

10,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

7,168

Number of Layers

61

FFN Intermediate Size (Dense)

18,432

Multi-Token Prediction Heads

1

Tokenizer

Vocabulary Size

129,280

Mixture of Experts

Active Parameters per Token

37.0B

Number of Experts

257

Active Experts

9

Shared Experts

1

FFN Intermediate Size (per Expert)

2,048

Dense Layers Before MoE

3

Architecture Diagram

Input Tokens
→ Token Embedding (Position: RoPE · Hidden: 7.2k · Context: 128k · Vocab: 129.3k)
→ x 61 layers: RMSNorm (Pre-Attention) → DeepSeek Sparse Attention (128 Q / 1 KV heads, Head dim: 56) → + (residual) → RMSNorm (Pre-FFN) → Sparse MoE FFN (9/257 experts, SwiGLU, Intermediate: 2k) → + (residual)
→ Final RMSNorm
→ Output Logits

DeepSeek-V3.2

DeepSeek-V3.2 is a large-scale Mixture-of-Experts (MoE) model optimized for agentic workflows and advanced reasoning tasks. It has 671 billion total parameters but activates only 37 billion for any given token. This sparse activation gives the model the representational capacity of a 671B-parameter network at a per-token compute cost closer to that of a much smaller dense model. The training objective incorporates Multi-Token Prediction (MTP), which densifies training signals and improves the model's ability to plan subsequent outputs in complex sequences.
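The sparsity described above can be put into numbers. The following is a minimal sketch using only figures quoted on this page (671B total and 37B active parameters, 257 experts with 9 active per token, 61 layers with 3 dense layers before the MoE layers). The function name `sparsity_summary` is hypothetical, and the split of the 9 active experts into 8 routed plus 1 shared is inferred from the listed values.

```python
# Figures quoted on this page; nothing here is read from the model itself.
TOTAL_PARAMS = 671e9     # total parameters
ACTIVE_PARAMS = 37e9     # parameters activated per token
TOTAL_EXPERTS = 257      # 256 routed + 1 shared
ACTIVE_EXPERTS = 9       # inferred: 8 routed + 1 shared per token
NUM_LAYERS = 61
DENSE_LAYERS = 3         # dense layers before the MoE layers


def sparsity_summary():
    """Return simple ratios that describe how sparse the model is per token."""
    return {
        "active_param_fraction": ACTIVE_PARAMS / TOTAL_PARAMS,
        "active_expert_fraction": ACTIVE_EXPERTS / TOTAL_EXPERTS,
        "moe_layers": NUM_LAYERS - DENSE_LAYERS,
    }


if __name__ == "__main__":
    for name, value in sparsity_summary().items():
        print(f"{name}: {value:.4f}" if isinstance(value, float) else f"{name}: {value}")
```

Under these figures, roughly 5.5% of the parameters and 3.5% of the experts take part in processing any single token.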

The architectural foundation of DeepSeek-V3.2 is DeepSeek Sparse Attention (DSA), which builds on the Multi-head Latent Attention (MLA) of earlier releases. MLA's low-rank compression of the Key-Value (KV) cache is retained, and DSA adds sparse token selection so that each query attends to a subset of the preceding tokens, reducing the memory and compute bottlenecks of long-context generation. The model also features an auxiliary-loss-free load balancing mechanism, which keeps expert utilization high without the performance trade-offs commonly associated with load-balancing penalty terms. This is achieved through a dynamically adjusted per-expert bias that is added to the token-to-expert affinity scores when selecting among the 256 routed experts; the single shared expert processes every token.
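The bias-based load balancing described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not DeepSeek's implementation: the names `route`, `update_bias`, and `simulate` are hypothetical, the fixed-step sign-based bias update is an assumed rule, the toy uses 16 experts with 2 active rather than the real configuration, and gate weights are assumed to come from the unbiased affinities.

```python
import random


def route(affinity, bias, k):
    """Pick k experts by (affinity + bias). The bias influences selection only."""
    ranked = sorted(range(len(affinity)),
                    key=lambda e: affinity[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    total = sum(affinity[e] for e in chosen)
    # Gate weights use the unbiased affinities, renormalised over the chosen experts.
    return {e: affinity[e] / total for e in chosen}


def update_bias(bias, load, step):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    return [b - step if n > mean_load else b + step if n < mean_load else b
            for b, n in zip(bias, load)]


def simulate(num_experts=16, k=2, tokens=512, rounds=200, step=0.01, seed=0):
    """Return, per round, the busiest expert's load relative to a perfectly even split."""
    rng = random.Random(seed)
    # Skewed affinity scales so that, without any bias, a few experts dominate.
    skew = [1.0 + 0.1 * e for e in range(num_experts)]
    bias = [0.0] * num_experts
    ideal = tokens * k / num_experts
    history = []
    for _ in range(rounds):
        load = [0] * num_experts
        for _ in range(tokens):
            affinity = [rng.random() * s for s in skew]
            for e in route(affinity, bias, k):
                load[e] += 1
        history.append(max(load) / ideal)
        bias = update_bias(bias, load, step)
    return history


if __name__ == "__main__":
    h = simulate()
    print(f"max/ideal load, first round: {h[0]:.2f}")
    print(f"max/ideal load, last 20 rounds (mean): {sum(h[-20:]) / 20:.2f}")
```

Because the bias only changes which experts are selected and never enters the gate weights, balancing happens without adding a penalty term to the training loss, which is the property the paragraph above attributes to the auxiliary-loss-free approach.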

Functionally, DeepSeek-V3.2 is designed to serve as a high-performance foundation for autonomous agents and complex problem-solving environments. It integrates a 'thinking' mode directly into tool-use scenarios, allowing for multi-step reasoning before executing external function calls. With a context window of 163,840 tokens and a training corpus comprising 14.8 trillion high-quality tokens, the model is suited for enterprise-grade applications requiring deep mathematical reasoning, competitive programming proficiency, and reliable multilingual generation. The release is governed by the MIT license, permitting broad use across both academic research and commercial production environments.

About DeepSeek-V3

DeepSeek-V3 is a Mixture-of-Experts (MoE) language model comprising 671B parameters with 37B activated per token. Its architecture incorporates Multi-head Latent Attention and DeepSeekMoE for efficient inference and training. Innovations include an auxiliary-loss-free load balancing strategy and a multi-token prediction objective, trained on 14.8T tokens.


Evaluation Benchmarks

Rank

#85

Category                Benchmark           Score   Rank
-                       -                   0.70    11
-                       -                   0.76    19
Agentic Coding          LiveBench Agentic   0.47    24
Professional Knowledge  MMLU Pro            0.83    27
Graduate-Level QA       GPQA                0.799   29
-                       -                   0.44    46
-                       -                   0.64    47
Web Development         WebDev Arena        1330    48
-                       -                   0.45    51

Rankings

Overall Rank

#85

Coding Rank

#35

Model Integrity

Total Score

B+

80 / 100

DeepSeek-V3.2 Model Integrity Report

Audit Note

DeepSeek-V3.2 exhibits a high level of technical transparency, particularly regarding its complex Mixture-of-Experts architecture and training compute efficiency. The model's use of a permissive MIT license and detailed disclosure of active vs. total parameters sets a strong industry standard for open-weights models. However, like many frontier models, it maintains significant opacity regarding the specific composition and sourcing of its massive 14.8 trillion token training corpus.

Upstream

22.0 / 30

Architectural Provenance

8.5 / 10

DeepSeek-V3.2 is extensively documented through a technical report and model cards. It explicitly builds on the DeepSeek-V3/V3.1-Terminus architecture, utilizing a Mixture-of-Experts (MoE) framework with 671B total and 37B active parameters. Key architectural innovations like DeepSeek Sparse Attention (DSA) and Multi-head Latent Attention (MLA) are detailed with mathematical formulations and diagrams in the official paper. The training methodology, including the use of Multi-Token Prediction (MTP) and an auxiliary-loss-free load balancing mechanism, is publicly described.

Dataset Composition

4.5 / 10

While the total token count (14.8 trillion) and general categories (web pages, e-books, code, math) are disclosed, the specific proportions of the dataset are not provided in detail. The documentation mentions 'enhancing the ratio' of math and code but lacks a granular percentage breakdown. Information on the '128K long context extension data' is described as 'aligned' with previous versions but lacks specific source disclosure. The model uses a 'Large-Scale Agentic Task Synthesis Pipeline' for post-training, which is documented as a methodology, but the underlying raw data sources remain largely opaque.

Tokenizer Integrity

9.0 / 10

The model uses the same tokenizer as DeepSeek-V3, which is publicly accessible via Hugging Face (LlamaTokenizerFast). The vocabulary size and special tokens (e.g., for tool use and reasoning blocks) are clearly defined in the tokenizer_config.json. Documentation explicitly notes the stability of the tokenizer across versions V3 to V3.2, and the 128K context window alignment is verified through public configuration files.

Model

33.0 / 40

Parameter Density

9.5 / 10

DeepSeek provides exemplary transparency regarding its MoE architecture. It clearly distinguishes between total parameters (671B) and active parameters (37B). The breakdown of experts (256 routed experts + 1 shared expert) is explicitly stated. This level of detail prevents the common 'parameter inflation' seen in other MoE models and allows for accurate computational modeling by third parties.

Training Compute

8.0 / 10

The technical report provides specific compute metrics, stating the use of 2,048 NVIDIA H800 GPUs over approximately two months. Total training compute is disclosed as 2.788 million GPU hours, with a cost estimate of ~$5.576M based on rental rates. While it lacks a formal third-party carbon audit, the disclosed compute intensity (180K GPU hours per trillion tokens) allows independent estimates of energy use and environmental impact.
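The figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only numbers from this page (2.788M GPU hours, ~$5.576M, 2,048 GPUs, 180K GPU hours per trillion tokens, 14.8T tokens). The function name `compute_summary` is hypothetical, and the wall-clock estimate assumes every GPU was busy for the whole run.

```python
# Figures quoted in the audit text above.
GPU_HOURS_TOTAL = 2.788e6          # total training GPU hours
COST_USD = 5.576e6                 # estimated cost in USD
NUM_GPUS = 2048                    # H800 GPUs in the cluster
GPU_HOURS_PER_TRILLION = 180e3     # GPU hours per trillion training tokens
CORPUS_TRILLIONS = 14.8            # training corpus size in trillions of tokens


def compute_summary():
    """Derive the implied rental rate, wall-clock time, and pre-training share."""
    return {
        # Cost divided by GPU hours gives the rental rate the estimate implies.
        "implied_usd_per_gpu_hour": COST_USD / GPU_HOURS_TOTAL,
        # Assumes all GPUs ran continuously for the whole period.
        "wall_clock_days": GPU_HOURS_TOTAL / NUM_GPUS / 24,
        # GPU hours attributable to the 14.8T-token corpus at the quoted intensity.
        "pretraining_gpu_hours": GPU_HOURS_PER_TRILLION * CORPUS_TRILLIONS,
    }


if __name__ == "__main__":
    for name, value in compute_summary().items():
        print(f"{name}: {value:,.2f}")
```

The results are internally consistent: the cost implies $2 per GPU hour, the total corresponds to about 57 days on 2,048 GPUs (in line with "approximately two months"), and the per-trillion-token rate accounts for about 2.664M of the 2.788M GPU hours.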

Benchmark Reproducibility

6.5 / 10

DeepSeek discloses scores across a wide array of standard benchmarks (MMLU-Pro, AIME 2025, SWE-Bench) and provides some evaluation scripts and chat templates on GitHub. However, the full evaluation pipeline and exact few-shot prompts for all benchmarks are not as comprehensively documented as the architecture. There is a noted discrepancy between internal math scores (~89% AIME) and some independent estimates, though the model's performance is generally verifiable through public leaderboards like LMSYS.

Identity Consistency

9.0 / 10

The model consistently identifies itself as a DeepSeek-developed AI and maintains version awareness (V3.2). It is transparent about its 'thinking' vs 'non-thinking' modes and the specific limitations of variants like 'Speciale' (which lacks tool-calling). There are no documented instances of the model claiming to be a competitor's product or denying its nature as an AI.

Downstream

25.0 / 30

License Clarity

9.5 / 10

The model is released under the MIT License, which is one of the most permissive and transparent open-source licenses available. The license terms are clearly stated on GitHub and Hugging Face, explicitly permitting commercial use, modification, and redistribution without the restrictive 'acceptable use' clauses often found in other 'open' weights models.

Hardware Footprint

8.5 / 10

Hardware requirements are well-documented for various deployment scenarios. Official and community guides (vLLM, SGLang) provide specific VRAM requirements for FP16 (~1.5TB) and 4-bit quantization (~350-400GB). The impact of context length on memory scaling is addressed, and the documentation provides clear guidance on multi-GPU configurations (e.g., 8x or 16x 80GB GPUs) required for efficient inference.
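A rough lower bound on memory can be derived from the parameter count alone. The sketch below is a weights-only estimate assuming 671B parameters stored at a uniform precision; the helper names `weight_memory_gb` and `gpus_needed` are hypothetical. It ignores the KV cache, activations, and framework overhead, which is presumably why the figures quoted above are somewhat higher.

```python
import math

TOTAL_PARAMS = 671e9  # total parameter count quoted on this page


def weight_memory_gb(bits_per_param, params=TOTAL_PARAMS):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


def gpus_needed(bits_per_param, gpu_gb=80, overhead=1.0):
    """Minimum GPU count to hold the weights; overhead > 1.0 adds headroom."""
    return math.ceil(weight_memory_gb(bits_per_param) * overhead / gpu_gb)


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits}-bit: {weight_memory_gb(bits):,.1f} GB weights, "
              f"at least {gpus_needed(bits)} x 80GB GPUs")
```

Raising `overhead` (for example to 1.2) gives a crude allowance for cache and activation memory, but the right value depends on context length, batch size, and the serving framework.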

Versioning Drift

7.0 / 10

DeepSeek maintains a clear versioning history (V3 -> V3.1 -> V3.2) with associated technical updates for each. While it lacks a formal, real-time 'drift' dashboard, the release of specific checkpoints (e.g., 0324, 0528) and the use of semantic-style versioning allow developers to track changes. The transition from experimental (Exp) to stable releases is documented, though detailed changelogs for minor weight updates could be more granular.

DeepSeek-V3.2: Specifications and GPU VRAM Requirements