MiMo V2 Flash

Active Parameters

15B

Context Length

256K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT

Release Date

10 Dec 2025

Knowledge Cutoff

Dec 2024

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

64

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

640,000

Sliding Window Attention

Yes (interleaved with Global Attention, 5:1)

Sliding Window Size

128

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

4,096

Number of Layers

48

FFN Intermediate Size (Dense)

11,008

Multi-Token Prediction Heads

1

Tokenizer

Vocabulary Size

151,680

Mixture of Experts

Total Expert Parameters

309.0B

Number of Experts

256

Active Experts

8

Shared Experts

-

FFN Intermediate Size (per Expert)

-

Dense Layers Before MoE

-

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE · Hidden: 4.1k · Context: 256K · Vocab: 151.7k)
x 48 layers: RMSNorm (Pre-Attention) → Grouped-Query Attention (64 Q / 8 KV heads, head dim: 128) → + → RMSNorm (Pre-FFN) → Sparse MoE FFN (8/256 experts, SwiGLU) → +
Final RMSNorm → Output Logits

MiMo V2 Flash

Xiaomi's MiMo V2 Flash is a Mixture-of-Experts (MoE) language model built for reasoning, software engineering, and autonomous agentic workflows. Its sparse architecture holds 309 billion total parameters but activates only 15 billion per forward pass, so it pairs the modeling capacity of a large system with inference cost closer to that of a much smaller dense model. The design prioritizes decoding throughput, using structural changes that ease the compute and memory bottlenecks of large transformer models.

MiMo V2 Flash uses a hybrid attention mechanism that interleaves Sliding Window Attention (SWA) and Global Attention (GA) layers in a 5:1 ratio across its transformer blocks. The sliding window is only 128 tokens, which cuts KV-cache memory by nearly six-fold relative to all-global attention at long context lengths, while a learnable attention sink bias keeps long-context performance stable. The model also includes a native Multi-Token Prediction (MTP) module built from lightweight dense feed-forward blocks of 0.33 billion parameters each. MTP lets the model generate and verify tokens in parallel, with a reported 2.0x to 2.6x increase in decoding throughput over conventional autoregressive generation.
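The "nearly six-fold" KV-cache figure can be sanity-checked with a rough count of cached tokens per layer. The sketch below is only an estimate: the layer count, KV heads, and head dimension are taken from the specification list on this page, while the 2-byte cache element size and the helper names are assumptions for illustration. It ignores the attention sink and any framework overhead.

```python
# Rough KV-cache comparison: all-global attention vs. the 5:1 SWA/GA hybrid.
# Spec values come from this page; bytes_per_elem=2 is an assumed cache dtype.

NUM_LAYERS = 48    # Number of Layers
KV_HEADS = 8       # Key-Value Heads
HEAD_DIM = 128     # Attention Head Dimension
WINDOW = 128       # sliding window size in tokens
SWA_PER_GA = 5     # five SWA layers for every global layer


def kv_cache_bytes(context_len, hybrid, bytes_per_elem=2):
    """Estimate KV-cache size in bytes for one sequence."""
    # Keys and values are both cached, hence the factor of 2.
    per_token_per_layer = 2 * KV_HEADS * HEAD_DIM * bytes_per_elem
    if not hybrid:
        return NUM_LAYERS * context_len * per_token_per_layer
    global_layers = NUM_LAYERS // (SWA_PER_GA + 1)
    swa_layers = NUM_LAYERS - global_layers
    # Global layers cache every token; SWA layers cache at most WINDOW tokens.
    cached_tokens = (global_layers * context_len
                     + swa_layers * min(context_len, WINDOW))
    return cached_tokens * per_token_per_layer


def reduction_factor(context_len):
    """How many times smaller the hybrid cache is than the all-global cache."""
    return kv_cache_bytes(context_len, False) / kv_cache_bytes(context_len, True)


for n in (1_024, 32_000, 256_000):
    print(f"{n:>7} tokens: {reduction_factor(n):.2f}x smaller")
```

Under these assumptions the reduction approaches, but never reaches, 6x as the context grows, because the 8 global layers still cache every token. At short contexts the saving is much smaller.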

Pre-trained on a 27 trillion token corpus using FP8 mixed precision, MiMo V2 Flash has a native sequence length of 32,000 tokens and supports context windows of up to 256,000 tokens. Post-training combines a Multi-Teacher On-Policy Distillation (MOPD) paradigm with large-scale reinforcement learning, targeting complex reasoning and multi-step tool use. This training is intended to make the model reliable in demanding technical scenarios such as document analysis and extended agentic interactions, and the weights are openly released for researchers and developers.

About MiMo V2

MiMo-V2-Flash is a Mixture-of-Experts (MoE) model with a hybrid attention architecture designed for high-speed reasoning and agentic workflows. It uses Multi-Token Prediction (MTP) to raise decoding throughput and reduce inference cost, and is optimized for long-context modeling and efficient inference.


Evaluation Benchmarks

Rank

#69

Category            Benchmark      Score   Rank
Graduate-Level QA   GPQA           0.837   15
General Text        Text Arena     1393    56
Web Development     WebDev Arena   1337    64

Rankings

Overall Rank

#69

Coding Rank

#73

Model Integrity

Total Score

B

68 / 100

MiMo V2 Flash Model Integrity Report

Audit Note

MiMo V2 Flash exhibits strong transparency in its architectural design and parameter disclosure, providing technical depth rarely seen in large-scale MoE models. However, it suffers from significant opacity regarding training compute resources and the specific composition of its 27-trillion-token dataset. While hardware requirements and licensing are well-defined, the lack of environmental impact data and a detailed versioning history remains a weakness.

Upstream

18.5 / 30

Architectural Provenance

8.0 / 10

The model's architecture is extensively documented in a 31-page technical report and official GitHub repository. It is a sparse Mixture-of-Experts (MoE) decoder-only transformer with 309B total parameters and 15B active parameters. Key innovations like the 5:1 hybrid ratio of Sliding Window Attention (SWA) and Global Attention (GA), the 128-token window size, and the native Multi-Token Prediction (MTP) module are detailed with specific layer configurations (M=8 hybrid blocks, N=5 SWA layers per block). The pre-training methodology using FP8 mixed precision and the post-training Multi-Teacher On-Policy Distillation (MOPD) paradigm are also clearly described.
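The layer configuration described above (M=8 hybrid blocks, N=5 SWA layers per block) can be expanded into a per-layer list to confirm that it matches the 48 layers in the specification. This is a minimal sketch: the function name is hypothetical, and placing the global layer last within each block is an assumption, since this page does not state the order.

```python
from collections import Counter


def hybrid_layer_pattern(num_blocks=8, swa_per_block=5):
    """Expand M hybrid blocks into a flat list of layer types.

    Assumption: each block is N sliding-window layers followed by one
    global-attention layer. The real ordering may differ.
    """
    pattern = []
    for _ in range(num_blocks):
        pattern.extend(["SWA"] * swa_per_block)
        pattern.append("GA")
    return pattern


layers = hybrid_layer_pattern()
counts = Counter(layers)
print(len(layers), counts["SWA"], counts["GA"])  # prints: 48 40 8
```

The totals (40 SWA layers and 8 GA layers) reproduce the 5:1 ratio regardless of the ordering chosen inside each block.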

Dataset Composition

4.5 / 10

While the total token count (27 trillion) and broad categories (multilingual web, code, math, and domain-specialized corpora) are disclosed, there is no specific percentage breakdown or detailed list of sources. The filtering and cleaning methodologies are mentioned in general terms but lack the granular documentation required for a high score. No sample data or specific dataset names are provided for verification.

Tokenizer Integrity

6.0 / 10

The tokenizer is publicly available as part of the model weights on Hugging Face, and its use is supported by major frameworks like SGLang and vLLM. However, official documentation lacks a detailed breakdown of the vocabulary size, tokenization algorithm (e.g., BPE vs. SentencePiece), or specific training data alignment. While functional, the technical specifics are less transparent than the model architecture.

Model

27.5 / 40

Parameter Density

9.0 / 10

Xiaomi provides exemplary transparency regarding parameter counts. They clearly distinguish between total parameters (309B) and active parameters (15B). The architectural breakdown is precise, noting the 0.33B parameters per block in the MTP module and the expert configuration (256 routed experts, 8 activated per token). This level of detail prevents the common MoE 'parameter inflation' marketing trap.
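The disclosed counts allow a quick sparsity check. The snippet below is illustrative arithmetic only, using the figures quoted on this page; the variable names are made up for the example.

```python
# Sparsity ratios from the figures quoted on this page.
TOTAL_PARAMS_B = 309     # total parameters, in billions
ACTIVE_PARAMS_B = 15     # parameters active per forward pass, in billions
ROUTED_EXPERTS = 256
ACTIVE_EXPERTS = 8

active_param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
active_expert_fraction = ACTIVE_EXPERTS / ROUTED_EXPERTS

print(f"active parameters: {active_param_fraction:.1%} of total")
print(f"active experts:    {active_expert_fraction:.1%} of routed experts")
```

The active-parameter share (about 4.9%) is higher than the active-expert share (about 3.1%). One plausible reason is that non-expert components such as attention and embeddings run for every token, though this page does not give the breakdown needed to confirm it.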

Training Compute

2.0 / 10

There is a significant lack of information regarding the compute resources used for training. While the use of FP8 mixed precision and modern NVIDIA GPUs is mentioned, there are no disclosures regarding total GPU/TPU hours, specific hardware cluster size, training duration, or the estimated carbon footprint. This is a major transparency gap.

Benchmark Reproducibility

7.0 / 10

The technical report provides comprehensive evaluation results across standard benchmarks (SWE-Bench Verified, AIME 2025, GPQA-Diamond, etc.) with specific settings (e.g., 3-shot for BBH). Evaluation code is partially available through the GitHub repository and integration with SGLang/vLLM recipes, though a single-click reproduction script for all reported numbers is not explicitly provided.

Identity Consistency

9.5 / 10

The model demonstrates high identity consistency, correctly identifying itself as MiMo, an AI developed by Xiaomi, in its recommended system prompts. It maintains clear versioning between the 'Base' and 'Flash' variants and is transparent about its knowledge cutoff (December 2024). There are no reported instances of the model claiming a competitor's identity.

Downstream

21.5 / 30

License Clarity

8.5 / 10

The model weights and inference code are released under the permissive MIT license, which is clearly stated on Hugging Face and the official blog. However, the GitHub repository contains an Apache-2.0 license file, creating a minor conflict in documentation, although both are open-source. Commercial use is explicitly allowed, and the terms are generally clear.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented for various use cases. Official guides specify VRAM needs for FP16 (618GB) and FP8, as well as consumer-grade requirements for quantized versions (e.g., 32GB VRAM for IQ3_XS). The impact of context length on memory (KV-cache reduction via SWA) is also technically explained, providing users with realistic deployment expectations.
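The FP16 figure follows directly from the parameter count. As a rough sketch, weight memory is total parameters times bytes per parameter; the helper below is hypothetical, uses decimal gigabytes, and covers weights only (no KV cache, activations, or runtime overhead).

```python
def weight_memory_gb(total_params_billion, bits_per_param):
    """Approximate weight storage in decimal GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bits_per_param / 8


print(weight_memory_gb(309, 16))  # FP16: 618.0
print(weight_memory_gb(309, 8))   # FP8:  309.0
```

Quantized formats do not map to a single bits-per-parameter value this cleanly, so the consumer-grade figures quoted above should be taken from the official guides rather than derived this way.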

Versioning Drift

5.0 / 10

The model uses a basic versioning system (V2 Flash), but there is no public, detailed changelog or semantic versioning history tracking minor updates or weight drifts. While the release is relatively new, the lack of a structured path for tracking future updates or accessing previous weight iterations limits the score.
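The category totals in this report can be cross-checked against the individual item scores. The snippet below simply re-adds the numbers listed above; the dictionary layout is an illustrative choice, and the rule the page uses to round the total to 68 is not stated.

```python
# Item scores as listed in the report: category -> (max points, item scores).
scores = {
    "Upstream": (30, [8.0, 4.5, 6.0]),
    "Model": (40, [9.0, 2.0, 7.0, 9.5]),
    "Downstream": (30, [8.5, 8.0, 5.0]),
}

category_totals = {name: sum(items) for name, (_, items) in scores.items()}
total = sum(category_totals.values())
max_total = sum(max_points for max_points, _ in scores.values())

for name, (max_points, _) in scores.items():
    print(f"{name}: {category_totals[name]} / {max_points}")
print(f"Total: {total} / {max_total}")
```

The sums match the reported 18.5, 27.5, and 21.5, giving 67.5 out of 100, which the page displays as 68.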

MiMo V2 Flash: Specifications and GPU VRAM Requirements