Total Parameters
229B
Context Length
128K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT
Release Date
7 Nov 2025
Knowledge Cutoff
Jun 2024
Active Parameters
10.0B
Number of Experts
8
Active Experts
2
Attention Structure
Grouped-Query Attention (GQA)
Hidden Dimension Size
4096
Number of Layers
32
Attention Heads
32
Key-Value Heads
8
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
Rotary Position Embedding (RoPE)
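The head counts in the specification above (32 attention heads sharing 8 key-value heads) describe grouped-query attention, in which every group of 4 query heads reads the same K/V projections. The sketch below illustrates that sharing with NumPy; the single-token query, sequence length, and random weights are illustrative assumptions, not values from the model.

```python
import numpy as np

# Grouped-query attention sketch: 32 query heads, 8 shared KV heads,
# head_dim = 4096 hidden size / 32 heads = 128, as listed in the spec table.
n_heads, n_kv_heads, head_dim, seq_len = 32, 8, 128, 16
group = n_heads // n_kv_heads  # 4 query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_heads, 1, head_dim))        # one new query position
k = rng.standard_normal((n_kv_heads, seq_len, head_dim))
v = rng.standard_normal((n_kv_heads, seq_len, head_dim))

out = np.empty((n_heads, 1, head_dim))
for h in range(n_heads):
    kv = h // group                                    # map query head -> shared KV head
    scores = q[h] @ k[kv].T / np.sqrt(head_dim)        # (1, seq_len)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                       # softmax over positions
    out[h] = w @ v[kv]                                 # (1, head_dim)

print(out.shape)  # (32, 1, 128)
```

The practical payoff is the KV cache: only 8 K/V head pairs are stored per layer instead of 32, cutting cache memory by 4x at long context lengths.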
MiniMax M2 is a sparse Mixture of Experts (MoE) transformer model engineered by MiniMax for high-efficiency performance in complex coding and agentic workflows. By utilizing a total parameter count of 229 billion while only activating approximately 10 billion parameters per token during inference, the architecture achieves a high ratio of stored knowledge to computational throughput. This design permits the model to handle long-horizon tasks such as multi-file repository editing and iterative code-run-fix loops with the latency profiles typically associated with much smaller dense models.
The model's technical foundation is a full-attention mechanism that incorporates Rotary Position Embeddings (RoPE) for stable long-context handling. It utilizes Root Mean Square Layer Normalization (RMSNorm) and the SwiGLU activation function (a SiLU-gated linear unit) for training stability and representational efficiency. Architecturally, it features 32 hidden layers with a hidden dimension of 4096, employing a Top-2 routing strategy to distribute workloads across its expert modules. The 128,000-token context window supports the ingestion of large technical documents and extensive codebases, facilitating consistent reasoning over deep information hierarchies.
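The Top-2 routing described above can be sketched in a few lines: a learned router scores all experts for each token, and only the two highest-scoring experts are evaluated. This is a minimal illustration, not MiniMax's implementation; the hidden size is shrunk from 4096 for the demo, the experts are plain linear maps standing in for full SwiGLU feed-forward blocks, and the softmax renormalization over the selected gates is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d_model = 8, 2, 64   # d_model shrunk from 4096 for the demo

# Stand-in experts: linear maps in place of full SwiGLU feed-forward blocks.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02

x = rng.standard_normal(d_model)       # one token's hidden state
logits = x @ router_w                  # router score per expert, shape (8,)
top = np.argsort(logits)[-top_k:]      # indices of the 2 best-scoring experts
gate = np.exp(logits[top] - logits[top].max())
gate /= gate.sum()                     # renormalize gates over the selected pair

# Only the 2 chosen experts run; the other 6 contribute no compute this token.
y = sum(g * (x @ experts[i]) for g, i in zip(gate, top))
print(top.size, y.shape)
```

Because each token touches only 2 of the 8 experts, the per-token FLOP cost tracks the active-parameter count rather than the total, which is the mechanism behind the small-dense-model latency profile claimed above.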
Optimized for autonomous agent environments, MiniMax M2 provides native support for external tool integration through a structured reasoning trace system. The model maintains internal decision-making logs between turns, which allows it to recover from execution errors in shell environments or web-browsing tasks. Its efficient inference footprint makes it a candidate for deployment in continuous integration pipelines and integrated development environments where fast feedback cycles and low operational costs are required.
MiniMax's efficient MoE models built for coding and agentic workflows.
| Benchmark | Score | Rank |
|---|---|---|
| StackEval (ProLLM Stack Eval) | 0.96 | 6 |
| Professional Knowledge (MMLU Pro) | 0.82 | 10 |
| Graduate-Level QA (GPQA) | 0.78 | 21 |
| Web Development (WebDev Arena) | 1347 | 29 |
Overall Rank
#59
Coding Rank
#72
Total Score
63 / 100
MiniMax M2 exhibits a bifurcated transparency profile, offering high clarity on its sparse MoE architecture and hardware requirements while remaining almost entirely opaque regarding its training data and compute resources. The model's commitment to open weights under a permissive license and its consistent self-identification are significant strengths. However, the absence of a detailed technical report and the reliance on undisclosed datasets for a model of this scale represent critical transparency risks for enterprise and research adoption.
Architectural Provenance
MiniMax M2 is explicitly documented as a sparse Mixture of Experts (MoE) transformer. Technical details are available in official blog posts and GitHub documentation, specifying 32 hidden layers, a hidden dimension of 4096, and a Top-2 routing strategy. It utilizes standard components such as Rotary Position Embeddings (RoPE), RMSNorm, and the SwiGLU activation. While the base architecture is well described, a formal peer-reviewed technical paper detailing the specific pre-training methodology is absent, though references to the 'CISPO' reinforcement learning algorithm from the M1 predecessor are provided.
Dataset Composition
There is almost no public information regarding the specific training data sources or composition. Official sources state the model was 'trained on a sparse dataset' and mention 'reinforcement learning in hundreds of thousands of complex real-world environments,' but provide no breakdown of web, code, or book data proportions. No documentation exists for data filtering, cleaning, or collection methodologies, which is a significant transparency gap.
Tokenizer Integrity
The tokenizer is publicly accessible via the Hugging Face repository and integrated into the 'transformers' library. The vocabulary size is explicitly stated as 200,064 tokens. The tokenizer approach is verifiable through the provided 'tokenizer.json' and 'merges.txt' files, and it supports the claimed multilingual and coding-specific tokenization requirements.
Parameter Density
MiniMax provides clear and consistent disclosure regarding parameter density. The model is documented as having 229-230 billion total parameters with approximately 10 billion active parameters per token during inference. This 23:1 sparsity ratio is a central part of their technical communication, and the architectural breakdown (layers, hidden dimensions, expert count) is provided in the model configuration files.
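The sparsity ratio quoted above follows directly from the two disclosed numbers; a quick check:

```python
# 229B total parameters vs ~10B active per token implies the quoted ratio.
total_params = 229e9
active_params = 10e9
ratio = total_params / active_params
print(f"{ratio:.1f}:1")  # 22.9:1, rounded to the quoted ~23:1
```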
Training Compute
No specific information is provided regarding the total GPU/TPU hours, hardware cluster specifications used for training, or the carbon footprint. While the company mentions the model is 'efficient' to train compared to dense models, there are no verifiable metrics or environmental impact data available to the public.
Benchmark Reproducibility
The model provides results for several standard benchmarks (SWE-bench Verified, Terminal-Bench, BrowseComp) and introduces a new benchmark called 'VIBE'. While some evaluation methodology notes are provided (e.g., using Claude Code as a scaffold, 128k context length), the exact evaluation code and full prompt sets for all benchmarks are not fully public, and some results rely on 'internal infrastructure' or 'internal benchmarks' like OctoCodingBench.
Identity Consistency
The model demonstrates strong identity consistency, correctly identifying itself as 'MiniMax-M2' or 'MiniMax-M2.1' in system prompts and documentation. It is transparent about its nature as an AI built by MiniMax and its specific optimization for coding and agentic tasks. Versioning is clearly maintained between the M2, M2.1, and M2.5 releases.
License Clarity
The model is released under the MIT License (or a 'Modified-MIT' license in some repository tags), which is a permissive open-source license. The terms are clearly stated in the GitHub repository and Hugging Face model cards, explicitly allowing for commercial use. There is minor ambiguity in the 'Modified-MIT' naming in some metadata, but the actual license text provided is standard MIT.
Hardware Footprint
Hardware requirements are well-documented by both the provider and community partners (vLLM, Novita AI). VRAM requirements for different quantization levels (FP16, INT8, INT4) are provided, ranging from ~460GB for FP16 to ~115-130GB for 4-bit. Context length memory scaling is also addressed, with specific GPU cluster recommendations (e.g., 4x H100 for FP8) for various deployment scenarios.
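The published VRAM figures can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly total parameters times bytes per parameter, and the FP16 KV cache grows linearly with context length. This sketch ignores activation memory and runtime overhead, which is why the quoted ~460GB FP16 figure sits slightly above the raw weight size; the head dimension is derived from the spec table (4096 hidden / 32 heads).

```python
# Weight memory per quantization level: params x bytes/param (decimal GB).
total_params = 229e9
bytes_per_param = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}
weight_gb = {n: total_params * b / 1e9 for n, b in bytes_per_param.items()}
for n, gb in weight_gb.items():
    print(f"{n}: ~{gb:.0f} GB of weights")
# FP16 ~458 GB and INT4 ~115 GB line up with the published ~460GB / ~115-130GB.

# FP16 KV cache per token: K and V tensors x layers x KV heads x head_dim x 2 bytes.
layers, kv_heads, head_dim = 32, 8, 4096 // 32
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2
print(f"KV cache at 128K context: ~{kv_bytes_per_token * 128_000 / 1e9:.0f} GB")
```

The 8 KV heads (versus 32 attention heads) keep the cache term modest even at the full 128K context, which is what makes the long-context deployment scenarios above tractable on a single multi-GPU node.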
Versioning Drift
MiniMax maintains a clear versioning history (M2 -> M2.1 -> M2.5) with associated changelogs highlighting improvements in code quality and instruction following. However, the 'silent' nature of some updates and the lack of a detailed, granular version history for the weights themselves (outside of major releases) prevents a higher score.