Total Parameters
229B
Context Length
128K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT
Release Date
7 Nov 2025
Knowledge Cutoff
Jun 2024
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
32
Key-Value Heads
8
Attention Head Dimension
128
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
5,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
4,096
Number of Layers
32
FFN Intermediate Size (Dense)
1,536
Multi-Token Prediction Heads
3
Tokenizer
Vocabulary Size
200,064
Mixture of Experts
Active Parameters per Token
10.0B
Number of Experts
8
Active Experts
2
Shared Experts
-
FFN Intermediate Size (per Expert)
1,536
Dense Layers Before MoE
-
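The attention figures listed above are enough for a rough estimate of key-value cache size. The sketch below uses only the layer count, key-value head count, and head dimension from this page. The 16-bit cache precision and the reading of "128K" as 131,072 tokens are assumptions, not values the page states.

```python
# Values copied from the specification list above.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128

# Assumption: "128K" is read as 128 * 1024 tokens.
CONTEXT_TOKENS = 128 * 1024


def kv_cache_bytes(tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` positions.

    bytes_per_value=2 assumes a 16-bit cache (FP16 or BF16).
    """
    # One key vector and one value vector per KV head, per layer.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * tokens


per_token_kib = kv_cache_bytes(1) / 1024
full_context_gib = kv_cache_bytes(CONTEXT_TOKENS) / 1024**3
print(f"{per_token_kib:.0f} KiB per token, {full_context_gib:.0f} GiB at full context")
```

Under these assumptions a single sequence at full context needs about 16 GiB of cache on top of the weights. A lower-precision cache would shrink that figure proportionally.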
MiniMax M2 is a sparse Mixture of Experts (MoE) transformer model built by MiniMax for coding and agentic workflows. It has 229 billion total parameters but activates only about 10 billion per token during inference, so per-token compute scales with the active parameters rather than with the full model. This lets the model handle long-horizon tasks such as multi-file repository editing and iterative code-run-fix loops at latencies typically associated with much smaller dense models.
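The effect of sparse activation can be made concrete with a little arithmetic. The sketch below takes the 229 billion and 10 billion figures from the paragraph above. The estimate of roughly 2 FLOPs per active parameter per token for a forward pass is a common rule of thumb, used here as an assumption rather than a measured value for this model.

```python
TOTAL_PARAMS = 229e9   # total parameters, from the description above
ACTIVE_PARAMS = 10e9   # "approximately 10 billion" active per token

# Share of the weights that participate in each token's forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS


def forward_flops_per_token(params: float) -> float:
    """Rule-of-thumb forward cost: about 2 FLOPs per parameter used."""
    return 2 * params


sparse_cost = forward_flops_per_token(ACTIVE_PARAMS)
dense_cost = forward_flops_per_token(TOTAL_PARAMS)

print(f"active fraction: {active_fraction:.1%}")
print(f"compute vs. a dense model of the same size: {dense_cost / sparse_cost:.1f}x lower")
```

All 229 billion parameters still have to be held in memory, so the saving applies to compute per token, not to the memory footprint.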
The model uses full attention (no sliding window) with Rotary Position Embeddings (RoPE) for long-context handling. It applies Root Mean Square Normalization (RMSNorm) and the SwiGLU activation function, both common choices for stable training. It has 32 hidden layers with a hidden dimension of 4,096, and a top-2 router sends each token to 2 of the 8 experts in each MoE layer. A 128,000-token context window lets it take in large technical documents and extensive codebases in a single prompt.
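Top-2 routing can be illustrated with a toy example. The sketch below assumes a softmax gate whose two largest probabilities are renormalized to sum to 1. This is one common formulation and not necessarily the gating function MiniMax M2 uses. The function names and the scalar "experts" are hypothetical stand-ins for illustration.

```python
import math


def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    # Numerically stable softmax over all experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the two most probable experts and renormalize their weights.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits):
    """Weighted sum of the outputs of the two selected experts only."""
    return sum(weight * experts[i](x) for i, weight in top2_route(router_logits))


# Eight toy experts, matching the expert count listed on this page.
experts = [lambda x, k=k: x * (k + 1) for k in range(8)]
logits = [0.0, 0.0, 3.0, 0.0, 1.0, 0.0, 0.0, 0.0]

routes = top2_route(logits)
print(routes)
print(moe_forward(1.0, experts, logits))
```

Only the two selected experts are evaluated for a given token, which is what keeps the active parameter count far below the total.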
Optimized for autonomous agent environments, MiniMax M2 provides native support for external tool integration through a structured reasoning trace system. The model maintains internal decision-making logs between turns, which allows it to recover from execution errors in shell environments or web-browsing tasks. Its efficient inference footprint makes it a candidate for deployment in continuous integration pipelines and integrated development environments where fast feedback cycles and low operational costs are required.
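The idea of carrying a reasoning trace across turns can be sketched as a small agent loop. Everything here is hypothetical: `fake_model` and `fake_shell` are stand-ins written for this illustration, not MiniMax APIs, and the page does not describe the model's actual trace format.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    reasoning: str  # trace kept in the history and shown to the next turn
    command: str
    output: str
    ok: bool


def fake_shell(command):
    """Stand-in for a shell tool; only 'pytest -q' succeeds in this toy setup."""
    if command == "pytest -q":
        return True, "3 passed"
    return False, f"{command.split()[0]}: command not found"


def fake_model(history):
    """Stand-in for the model: chooses a command using the earlier turns."""
    if history and not history[-1].ok:
        last = history[-1]
        reasoning = (f"Previous attempt '{last.command}' failed with "
                     f"'{last.output}'; retrying with pytest.")
        return reasoning, "pytest -q"
    return "Run the test suite first.", "py.test -q"


def run_agent(max_turns=3):
    history = []
    for _ in range(max_turns):
        reasoning, command = fake_model(history)
        ok, output = fake_shell(command)
        # The reasoning is stored with the tool result rather than discarded.
        history.append(Turn(reasoning, command, output, ok))
        if ok:
            break
    return history


for turn in run_agent():
    print(turn.command, "->", turn.output)
```

Because each turn's reasoning and tool output stay in the history, the second turn can see why the first command failed and choose a different one.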
MiniMax's efficient MoE models built for coding and agentic workflows.
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| StackEval | ProLLM Stack Eval | 0.96 | 8 |
| StackUnseen | ProLLM Stack Unseen | 0.66 | 17 |
| Summarization | ProLLM Summarization | 0.739 | 20 |
| Graduate-Level QA | GPQA | 0.78 | 31 |
| Professional Knowledge | MMLU Pro | 0.82 | 57 |
| General Text | Text Arena | 1346 | 70 |
| Web Development | WebDev Arena | 1305 | 78 |
Overall Rank
#128
Coding Rank
#98