MiniMax M2

Total Parameters

229B

Context Length

128K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT

Release Date

7 Nov 2025

Knowledge Cutoff

Jun 2024

Technical Specifications

Active Parameters

10.0B

Number of Experts

8

Active Experts

2

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

4096

Number of Layers

32

Attention Heads

32

Key-Value Heads

8

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

Rotary Position Embedding (RoPE)

MiniMax M2

MiniMax M2 is a sparse Mixture of Experts (MoE) transformer engineered by MiniMax for high-efficiency performance in complex coding and agentic workflows. The model carries 229 billion total parameters but activates only about 10 billion per token during inference, giving it a large store of knowledge at a fraction of the per-token compute of an equally sized dense model. This design lets it handle long-horizon tasks such as multi-file repository editing and iterative code-run-fix loops with latency profiles typically associated with much smaller dense models.
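The sparse-activation idea can be sketched as a toy top-2 router. The dimensions, expert count, and tanh expert networks below are illustrative placeholders, not the model's actual configuration:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route one token vector through only the top-k scoring experts."""
    logits = x @ gate_w                        # one routing score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected experts
    # Only top_k expert networks execute; the rest cost no compute this token.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
dim, n_experts = 16, 8
gate_w = rng.normal(size=(dim, n_experts))
experts = [(lambda W: (lambda v: np.tanh(v @ W)))(rng.normal(size=(dim, dim)))
           for _ in range(n_experts)]
y = moe_forward(rng.normal(size=dim), gate_w, experts)
```

However many experts the model stores, the per-token cost is bounded by the two that the router selects, which is the source of the favorable knowledge-to-throughput ratio described above.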

The model's technical foundation is a full-attention transformer stack that incorporates Rotary Position Embeddings (RoPE) for stable long-context handling. It uses Root Mean Square Layer Normalization (RMSNorm) and the SwiGLU (SiLU-gated) activation function for training stability and representational efficiency. Architecturally, it features 32 hidden layers with a hidden dimension of 4096, 32 attention heads sharing 8 key-value heads, and a Top-2 routing strategy that distributes each token across two of its expert modules. The 128,000-token context window supports the ingestion of large technical documents and extensive codebases, facilitating consistent reasoning over deep information hierarchies.
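RoPE's key property, that query-key dot products depend only on the relative offset between positions, can be checked with a minimal NumPy sketch. The half-split pairing below is one common formulation, not necessarily the model's exact implementation:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate feature pairs (x[i], x[i+half]) by a position-dependent angle."""
    half = x.shape[-1] // 2
    freqs = base ** (-2.0 * np.arange(half) / x.shape[-1])  # per-pair frequency
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# The score for (query at 3, key at 5) matches (query at 10, key at 12):
# both pairs are separated by the same relative offset of 2.
rng = np.random.default_rng(1)
q, k = rng.normal(size=8), rng.normal(size=8)
same_offset = np.allclose(rope(q, 3) @ rope(k, 5), rope(q, 10) @ rope(k, 12))
```

Because attention scores are invariant to absolute position, the mechanism extrapolates more gracefully over long contexts than learned absolute embeddings.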

Optimized for autonomous agent environments, MiniMax M2 provides native support for external tool integration through a structured reasoning trace system. The model maintains internal decision-making logs between turns, which allows it to recover from execution errors in shell environments or web-browsing tasks. Its efficient inference footprint makes it a candidate for deployment in continuous integration pipelines and integrated development environments where fast feedback cycles and low operational costs are required.
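The agentic pattern described here, tool calls interleaved with a preserved reasoning trace and execution errors fed back for self-correction, can be sketched as a generic loop. The client interface, message fields, and model name below are assumptions modeled on common OpenAI-style chat APIs, not MiniMax's official SDK:

```python
def agent_loop(client, tools, messages, max_turns=8):
    """Run a tool-use loop, keeping every assistant reply (reasoning included)
    in the message history so the model can build on its own decision log."""
    for _ in range(max_turns):
        reply = client.chat(model="MiniMax-M2", messages=messages, tools=tools)
        messages.append(reply)                  # preserve the reasoning trace
        if not reply.get("tool_calls"):
            return reply["content"]             # final answer, no more tools
        for call in reply["tool_calls"]:
            try:
                result = tools[call["name"]](**call["arguments"])
            except Exception as exc:            # surface failures for recovery
                result = f"error: {exc}"
            messages.append({"role": "tool", "name": call["name"],
                             "content": str(result)})
    return None
```

Appending the full assistant reply each turn, rather than discarding intermediate reasoning, is what lets the model consult its earlier decisions when a shell command or browse step fails.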

About MiniMax M2

MiniMax's efficient MoE models built for coding and agentic workflows.


Evaluation Benchmarks

Rank

#59

Benchmark                               Score   Rank
(unnamed benchmark)                     0.96    6
MMLU Pro (Professional Knowledge)       0.82    10
GPQA (Graduate-Level QA)                0.78    21
WebDev Arena (Web Development)          1347    29

Rankings

Overall Rank

#59

Coding Rank

#72

Model Transparency

Total Score

B- (63 / 100)

MiniMax M2 Transparency Report

Audit Note

MiniMax M2 exhibits a bifurcated transparency profile, offering high clarity on its sparse MoE architecture and hardware requirements while remaining almost entirely opaque regarding its training data and compute resources. The model's commitment to open weights under a permissive license and its consistent self-identification are significant strengths. However, the absence of a detailed technical report and the reliance on undisclosed datasets for a model of this scale represent critical transparency risks for enterprise and research adoption.

Upstream

17.5 / 30

Architectural Provenance

7.0 / 10

MiniMax M2 is explicitly documented as a sparse Mixture of Experts (MoE) transformer. Technical details are available in official blog posts and GitHub documentation, specifying 32 hidden layers, a hidden dimension of 4096, and a Top-2 routing strategy. It utilizes standard components like Rotary Position Embeddings (RoPE), RMSNorm, and SwiGLU activation. While the base architecture is well-described, a formal peer-reviewed technical paper detailing the specific pre-training methodology is absent, though references to the 'CISPO' reinforcement learning algorithm from the M1 predecessor are provided.

Dataset Composition

2.0 / 10

There is almost no public information regarding the specific training data sources or composition. Official sources state the model was 'trained on a sparse dataset' and mention 'reinforcement learning in hundreds of thousands of complex real-world environments,' but provide no breakdown of web, code, or book data proportions. No documentation exists for data filtering, cleaning, or collection methodologies, which is a significant transparency gap.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via the Hugging Face repository and integrated into the 'transformers' library. The vocabulary size is explicitly stated as 200,064 tokens. The tokenizer approach is verifiable through the provided 'tokenizer.json' and 'merges.txt' files, and it supports the claimed multilingual and coding-specific tokenization requirements.

Model

24.0 / 40

Parameter Density

9.0 / 10

MiniMax provides clear and consistent disclosure regarding parameter density. The model is documented as having 229-230 billion total parameters with approximately 10 billion active parameters per token during inference. This 23:1 sparsity ratio is a central part of their technical communication, and the architectural breakdown (layers, hidden dimensions, expert count) is provided in the model configuration files.
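The quoted sparsity ratio follows directly from the disclosed figures:

```python
total_params = 229e9    # total parameters disclosed for MiniMax M2
active_params = 10e9    # approximate parameters activated per token
sparsity_ratio = total_params / active_params   # 22.9, i.e. roughly 23:1
```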

Training Compute

1.0 / 10

No specific information is provided regarding the total GPU/TPU hours, hardware cluster specifications used for training, or the carbon footprint. While the company mentions the model is 'efficient' to train compared to dense models, there are no verifiable metrics or environmental impact data available to the public.

Benchmark Reproducibility

5.0 / 10

The model provides results for several standard benchmarks (SWE-bench Verified, Terminal-Bench, BrowseComp) and introduces a new benchmark called 'VIBE'. While some evaluation methodology notes are provided (e.g., using Claude Code as a scaffold, 128k context length), the exact evaluation code and full prompt sets for all benchmarks are not fully public, and some results rely on 'internal infrastructure' or 'internal benchmarks' like OctoCodingBench.

Identity Consistency

9.0 / 10

The model demonstrates strong identity consistency, correctly identifying itself as 'MiniMax-M2' or 'MiniMax-M2.1' in system prompts and documentation. It is transparent about its nature as an AI built by MiniMax and its specific optimization for coding and agentic tasks. Versioning is clearly maintained between the M2, M2.1, and M2.5 releases.

Downstream

21.5 / 30

License Clarity

8.0 / 10

The model is released under the MIT License (or a 'Modified-MIT' license in some repository tags), which is a permissive open-source license. The terms are clearly stated in the GitHub repository and Hugging Face model cards, explicitly allowing for commercial use. There is minor ambiguity in the 'Modified-MIT' naming in some metadata, but the actual license text provided is standard MIT.

Hardware Footprint

7.5 / 10

Hardware requirements are well-documented by both the provider and community partners (vLLM, Novita AI). VRAM requirements for different quantization levels (FP16, INT8, INT4) are provided, ranging from ~460GB for FP16 to ~115-130GB for 4-bit. Context length memory scaling is also addressed, with specific GPU cluster recommendations (e.g., 4x H100 for FP8) for various deployment scenarios.
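The quoted VRAM figures are consistent with a simple weights-only estimate. This sketch excludes KV cache and activation memory, which is why real deployments need additional headroom beyond these numbers:

```python
def weight_vram_gb(n_params, bits_per_weight):
    """Memory for the weights alone: params x bits, converted to gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

TOTAL = 229e9
fp16 = weight_vram_gb(TOTAL, 16)   # 458.0 GB, near the ~460 GB figure above
int4 = weight_vram_gb(TOTAL, 4)    # 114.5 GB, the low end of the 115-130 GB range
```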

Versioning Drift

6.0 / 10

MiniMax maintains a clear versioning history (M2 -> M2.1 -> M2.5) with associated changelogs highlighting improvements in code quality and instruction following. However, the 'silent' nature of some updates and the lack of a detailed, granular version history for the weights themselves (outside of major releases) prevents a higher score.
