ApX logoApX logo

MiniMax M2

Active Parameters

229B

Context Length

128K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT

Release Date

7 Nov 2025

Knowledge Cutoff

Jun 2024

Technical Specifications

Attention

Attention Structure

Multi-Head Attention

Attention Heads

32

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

Absolute Position Embedding

RoPE Theta

5,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwigLU

Dimensions

Hidden Dimension Size

4,096

Number of Layers

32

FFN Intermediate Size (Dense)

1,536

Multi-Token Prediction Heads

3

Tokenizer

Vocabulary Size

200,064

Mixture of Experts

Total Expert Parameters

10.0B

Number of Experts

8

Active Experts

2

Shared Experts

-

FFN Intermediate Size (per Expert)

1,536

Dense Layers Before MoE

-

Architecture Diagram

Input TokensToken EmbeddingPosition: AbsoluteHidden: 4.1k · Context: 128K · Vocab: 200.1kx 32 layersRMSNormPre-AttentionMulti-Head Attention32Q / 8KV headsHead dim: 128+RMSNormPre-FFNSparse MoE FFN (2/8 experts)SwiGLUIntermediate: 1.5k+Final RMSNormOutput Logits

MiniMax M2

MiniMax M2 is a sparse Mixture of Experts (MoE) transformer model engineered by MiniMax for high-efficiency performance in complex coding and agentic workflows. By utilizing a total parameter count of 229 billion while only activating approximately 10 billion parameters per token during inference, the architecture achieves a high ratio of stored knowledge to computational throughput. This design permits the model to handle long-horizon tasks such as multi-file repository editing and iterative code-run-fix loops with the latency profiles typically associated with much smaller dense models.

The model's technical foundation is built on a full-attention mechanism that incorporates Rotary Position Embeddings (RoPE) for stable long-context handling. It utilizes Root Mean Square Layer Normalization (RMSNorm) and the SiLU (Swiglu) activation function to ensure training stability and representational efficiency. Architecturally, it features 32 hidden layers with a hidden dimension of 4096, employing a Top-2 routing strategy to distribute workloads across its internal expert modules. The integration of a 128,000-token context window supports the ingestion of large technical documents and extensive codebases, facilitating consistent reasoning over deep information hierarchies.

Optimized for autonomous agent environments, MiniMax M2 provides native support for external tool integration through a structured reasoning trace system. The model maintains internal decision-making logs between turns, which allows it to recover from execution errors in shell environments or web-browsing tasks. Its efficient inference footprint makes it a candidate for deployment in continuous integration pipelines and integrated development environments where fast feedback cycles and low operational costs are required.

About MiniMax M2

MiniMax's efficient MoE models built for coding and agentic workflows.


Other MiniMax M2 Models
  • No related models available

Evaluation Benchmarks

Rank

#128

BenchmarkScoreRank

0.96

8

0.66

17

0.739

20

Graduate-Level QA

GPQA

0.78

31

Professional Knowledge

MMLU Pro

0.82

57

General Text

Text Arena

1346

70

Web Development

WebDev Arena

1305

78

Rankings

Overall Rank

#128

Coding Rank

#98

Model Integrity

Total Score

B-

63 / 100

GPU Requirements

Full Calculator

Choose the quantization method for model weights

Context Size: 1,024 tokens

1k
63k
125k

VRAM Required:

Recommended GPUs