Active Parameters
15B
Context Length
256K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT
Release Date
10 Dec 2025
Knowledge Cutoff
Dec 2024
Attention
Attention Structure
Grouped-Query Attention (GQA)
Attention Heads
64
Key-Value Heads
8
Attention Head Dimension
128
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
640,000
Sliding Window Attention
Yes (interleaved with Global Attention at a 5:1 ratio)
Sliding Window Size
128 tokens
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
4,096
Number of Layers
48
FFN Intermediate Size (Dense)
11,008
Multi-Token Prediction Heads
1
Tokenizer
Vocabulary Size
151,680
Mixture of Experts
Total Expert Parameters
309.0B
Number of Experts
256
Active Experts
8
Shared Experts
-
FFN Intermediate Size (per Expert)
-
Dense Layers Before MoE
-
Xiaomi MiMo V2 Flash is a Mixture-of-Experts (MoE) language model built for reasoning, software engineering, and autonomous agentic workflows. Its sparse architecture holds 309 billion total parameters but activates only 15 billion (about 5%) per forward pass, pairing the capacity of a large model with inference cost closer to that of a much smaller dense model. Development focuses on high-throughput decoding, using structural changes that ease the compute and memory bottlenecks typical of large transformer models.
MiMo V2 Flash uses a hybrid attention mechanism that interleaves Sliding Window Attention (SWA) and Global Attention (GA) layers in a 5:1 ratio across its transformer blocks. The SWA layers use an aggressive 128-token window, which cuts KV-cache memory by nearly six-fold at long context compared with all-global attention, while a learnable attention sink bias helps keep long-context performance stable. The model also includes a native Multi-Token Prediction (MTP) module built from lightweight dense feed-forward blocks of 0.33 billion parameters each. MTP lets the model draft and verify several tokens in parallel, for a reported 2.0x to 2.6x increase in decoding throughput over conventional autoregressive generation.
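The "nearly six-fold" KV-cache figure can be checked with a rough back-of-the-envelope sketch. It takes the layer count, KV-head count, and head dimension from the spec sheet above. It assumes 2 bytes per cached element and the same KV shape in SWA and GA layers, and it ignores the attention sink. All three are illustrative assumptions, not confirmed details of the released model.

```python
def kv_cache_bytes(context_len, n_layers=48, swa_per_global=5, window=128,
                   kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate per-sequence KV-cache size (hybrid vs. all-global attention).

    Assumes identical KV shapes in every layer and 2-byte elements;
    both are simplifications made for this sketch.
    """
    n_global = n_layers // (swa_per_global + 1)   # one GA layer per 5 SWA layers
    n_swa = n_layers - n_global
    per_token = 2 * kv_heads * head_dim * bytes_per_elem  # keys + values, one layer
    swa_tokens = min(context_len, window)         # SWA layers keep only the window
    hybrid = per_token * (n_global * context_len + n_swa * swa_tokens)
    all_global = per_token * n_layers * context_len
    return hybrid, all_global


hybrid, all_global = kv_cache_bytes(256_000)
print(f"hybrid: {hybrid / 2**30:.2f} GiB, "
      f"all-global: {all_global / 2**30:.2f} GiB, "
      f"ratio: {all_global / hybrid:.2f}x")
```

Under these assumptions the ratio approaches 6x only when the context is much longer than the 128-token window. For contexts at or below the window size the two layouts cache the same amount.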
Pre-trained on a 27-trillion-token corpus using FP8 mixed precision, MiMo V2 Flash has a native sequence length of 32,000 tokens and supports context windows of up to 256,000 tokens. Post-training combines a Multi-Teacher On-Policy Distillation (MOPD) paradigm with large-scale reinforcement learning, targeting complex reasoning and multi-step tool use. This training is intended to make the model reliable in demanding technical scenarios such as long-document analysis and extended agentic interactions, positioning it as a resource-efficient open-weight option for researchers and developers.
MiMo-V2-Flash is a Mixture-of-Experts (MoE) model with a hybrid attention architecture designed for high-speed reasoning and agentic workflows. It targets state-of-the-art performance and uses Multi-Token Prediction (MTP) to significantly reduce inference costs. The model is optimized for long-context modeling and efficient inference.
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| Graduate-Level QA | GPQA | 0.837 | 15 |
| General Text | Text Arena | 1393 | 56 |
| Web Development | WebDev Arena | 1337 | 64 |
Overall Rank
#69
Coding Rank
#73
Total Score
68 / 100
MiMo V2 Flash exhibits strong transparency in its architectural design and parameter disclosure, providing technical depth rarely seen in large-scale MoE models. However, it suffers from significant opacity regarding training compute resources and the specific composition of its 27-trillion-token dataset. While hardware requirements and licensing are well-defined, the lack of environmental impact data and a detailed versioning history remains a weakness.
Architectural Provenance
The model's architecture is extensively documented in a 31-page technical report and official GitHub repository. It is a sparse Mixture-of-Experts (MoE) decoder-only transformer with 309B total parameters and 15B active parameters. Key innovations like the 5:1 hybrid ratio of Sliding Window Attention (SWA) and Global Attention (GA), the 128-token window size, and the native Multi-Token Prediction (MTP) module are detailed with specific layer configurations (M=8 hybrid blocks, N=5 SWA layers per block). The pre-training methodology using FP8 mixed precision and the post-training Multi-Teacher On-Policy Distillation (MOPD) paradigm are also clearly described.
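As a minimal sketch, the M=8, N=5 configuration described above can be expanded into a per-layer schedule. The ordering inside each block (five SWA layers followed by one GA layer) and the function name are assumptions made for illustration. The source states only the counts.

```python
def hybrid_layer_schedule(num_blocks=8, swa_per_block=5):
    """Expand M hybrid blocks into a flat list of per-layer attention types.

    Assumption: each block is `swa_per_block` SWA layers followed by a
    single GA layer; the real in-block ordering may differ.
    """
    schedule = []
    for _ in range(num_blocks):
        schedule.extend(["SWA"] * swa_per_block)
        schedule.append("GA")
    return schedule


layers = hybrid_layer_schedule()
print(len(layers), layers.count("SWA"), layers.count("GA"))
```

With these defaults the schedule has 48 layers, matching the layer count on the spec sheet, split into 40 SWA layers and 8 GA layers.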
Dataset Composition
While the total token count (27 trillion) and broad categories (multilingual web, code, math, and domain-specialized corpora) are disclosed, there is no specific percentage breakdown or detailed list of sources. The filtering and cleaning methodologies are mentioned in general terms but lack the granular documentation required for a high score. No sample data or specific dataset names are provided for verification.
Tokenizer Integrity
The tokenizer is publicly available as part of the model weights on Hugging Face, and its use is supported by major frameworks like SGLang and vLLM. However, official documentation lacks a detailed breakdown of the vocabulary size, tokenization algorithm (e.g., BPE vs. Unigram), or specific training data alignment. While functional, the technical specifics are less transparent than the model architecture.
Parameter Density
Xiaomi provides exemplary transparency regarding parameter counts. They clearly distinguish between total parameters (309B) and active parameters (15B). The architectural breakdown is precise, noting the 0.33B parameters per block in the MTP module and the expert configuration (256 routed experts, 8 activated per token). This level of detail prevents the common MoE 'parameter inflation' marketing trap.
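To illustrate what "256 routed experts, 8 activated per token" means in practice, here is a generic top-k routing sketch. It assumes softmax gating with renormalization over the selected experts, which is a common MoE scheme. The source does not describe MiMo V2 Flash's actual router, so the function name and gating choice are hypothetical.

```python
import math
import random


def route_token(router_logits, top_k=8):
    """Select the top_k experts for one token; return (expert_id, weight) pairs.

    Generic softmax-then-top-k gating, assumed for illustration only.
    """
    m = max(router_logits)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)             # renormalize over chosen experts
    return [(i, probs[i] / norm) for i in top]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(256)]  # stand-in router output
selected = route_token(logits)
print(f"{len(selected)} of {len(logits)} experts active "
      f"({len(selected) / len(logits):.1%} of routed experts)")
```

Only the selected experts run their feed-forward computation for that token, which is why the active parameter count (15B) is so much smaller than the total (309B).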
Training Compute
There is a significant lack of information regarding the compute resources used for training. While the use of FP8 mixed precision and modern NVIDIA GPUs is mentioned, there are no disclosures regarding total GPU/TPU hours, specific hardware cluster size, training duration, or the estimated carbon footprint. This is a major transparency gap.
Benchmark Reproducibility
The technical report provides comprehensive evaluation results across standard benchmarks (SWE-Bench Verified, AIME 2025, GPQA-Diamond, etc.) with specific settings (e.g., 3-shot for BBH). Evaluation code is partially available through the GitHub repository and integration with SGLang/vLLM recipes, though a single-click reproduction script for all reported numbers is not explicitly provided.
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as MiMo, an AI developed by Xiaomi, in its recommended system prompts. It maintains clear versioning between the 'Base' and 'Flash' variants and is transparent about its knowledge cutoff (December 2024). There are no reported instances of the model claiming a competitor's identity.
License Clarity
The model weights and inference code are released under the permissive MIT license, which is clearly stated on Hugging Face and the official blog. However, the GitHub repository contains an Apache-2.0 license file, creating a minor conflict in documentation, although both are open-source. Commercial use is explicitly allowed, and the terms are generally clear.
Hardware Footprint
Hardware requirements are well-documented for various use cases. Official guides specify VRAM needs for FP16 (618GB) and FP8, as well as consumer-grade requirements for quantized versions (e.g., 32GB VRAM for IQ3_XS). The impact of context length on memory (KV-cache reduction via SWA) is also technically explained, providing users with realistic deployment expectations.
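The FP16 figure follows directly from the total parameter count. The sketch below estimates weights-only memory from bits per parameter, using decimal gigabytes. It excludes KV cache, activations, and runtime overhead, so real deployments need more than these numbers.

```python
def weight_memory_gb(total_params, bits_per_param):
    """Weights-only memory in decimal GB; ignores KV cache and activations."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 309e9  # total parameters, from the spec sheet

for name, bits in [("FP16", 16), ("FP8", 8)]:
    print(f"{name}: {weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```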
Versioning Drift
The model uses a basic versioning system (V2 Flash), but there is no public, detailed changelog or semantic versioning history tracking minor updates or weight drifts. While the release is relatively new, the lack of a structured path for tracking future updates or accessing previous weight iterations limits the score.