MiMo V2 Flash: Specifications and GPU VRAM Requirements

MiMo V2 Flash

Open Source

Open Weights

Active Parameters: 15B
Context Length: 256K
Modality: Text
Architecture: Mixture of Experts (MoE)
License: MIT
Release Date: 10 Dec 2025
Knowledge Cutoff: Dec 2024

Technical Specifications

Total Expert Parameters: 309.0B
Number of Experts: (not listed)
Active Experts: (not listed)
Attention Structure: Multi-Head Attention
Hidden Dimension Size: (not listed)
Number of Layers: (not listed)
Attention Heads: (not listed)
Key-Value Heads: (not listed)
Activation Function: (not listed)
Normalization: (not listed)
Position Embedding: Absolute Position Embedding

MiMo V2 Flash

MiMo V2 Flash is a Mixture-of-Experts (MoE) language model developed by Xiaomi. It has 309 billion total parameters but activates only 15 billion of them in each forward pass. This sparse activation is central to the design: it limits the compute spent per token while keeping the capacity of a much larger model. The model targets fast inference, complex reasoning, code generation, and multi-turn agentic workflows.
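The gap between total and active parameters matters for memory planning: sparse activation reduces compute per token, but all expert weights still have to be stored somewhere. The following is a minimal sketch of the weight-storage arithmetic, assuming every parameter is stored at one uniform bit width. The bit widths and the `weight_gb` helper are illustrative and not taken from the source, and the KV cache, activations, and quantization metadata are not counted.

```python
# Rough weight-memory estimate for a 309B-parameter MoE model.
# Assumption: every parameter is stored at the same bit width. Real
# quantization formats add scales/zero-points, and the KV cache and
# activations are not counted here.

TOTAL_PARAMS = 309e9   # total parameters, from the text
ACTIVE_PARAMS = 15e9   # parameters used per forward pass, from the text


def weight_gb(num_params: float, bits_per_param: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Illustrative bit widths only; the labels are not tied to a specific format.
BIT_WIDTHS = {"16-bit": 16, "8-bit": 8, "4-bit": 4}

if __name__ == "__main__":
    for label, bits in BIT_WIDTHS.items():
        print(f"{label}: {weight_gb(TOTAL_PARAMS, bits):.1f} GB of weights")
    print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under these assumptions the weights alone take 618 GB at 16 bits and 154.5 GB at 4 bits, so the total parameter count, not the active count, drives the VRAM budget.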

Architecturally, MiMo V2 Flash uses a hybrid attention mechanism that interleaves Sliding Window Attention (SWA) and Global Attention (GA) layers in a 5:1 ratio. With a 128-token sliding window, this design cuts KV-cache memory by nearly a factor of six, and a learnable attention sink bias preserves long-context performance. The model also includes a Multi-Token Prediction (MTP) module built on a lightweight dense feed-forward network with 0.33 billion parameters. The module enables parallel generation and verification of multiple tokens, for a reported 2.0x to 2.6x increase in decoding throughput over conventional autoregressive decoding. Post-training uses Multi-Teacher On-Policy Distillation (MOPD) and large-scale agentic reinforcement learning (RL) to improve performance on specialized tasks.
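The "nearly six-fold" KV-cache figure can be checked with simple counting. This is a sketch under stated assumptions: every layer caches the same number of bytes per token, a sliding-window layer keeps at most 128 tokens, and attention sinks and other overheads are ignored. The function names are illustrative, and the comparison is made per group of six layers because the total layer count is not given here.

```python
# KV-cache size of a hybrid SWA/GA stack relative to all-global attention.
# Assumptions: every layer caches the same number of bytes per token, and a
# sliding-window layer keeps at most WINDOW tokens. Attention sinks and
# other per-layer overheads are ignored.

SWA_PER_GROUP = 5   # sliding-window layers per group (5:1 ratio, from the text)
GA_PER_GROUP = 1    # global-attention layers per group
WINDOW = 128        # sliding-window size in tokens, from the text


def cached_tokens_hybrid(context_len: int) -> int:
    """Token entries cached by one group of 5 SWA layers + 1 GA layer."""
    swa = SWA_PER_GROUP * min(context_len, WINDOW)
    ga = GA_PER_GROUP * context_len
    return swa + ga


def cached_tokens_full(context_len: int) -> int:
    """Token entries cached if all six layers used global attention."""
    return (SWA_PER_GROUP + GA_PER_GROUP) * context_len


def kv_reduction(context_len: int) -> float:
    """How many times smaller the hybrid cache is than the all-global one."""
    return cached_tokens_full(context_len) / cached_tokens_hybrid(context_len)


if __name__ == "__main__":
    for n in (1_024, 32_000, 256_000):
        print(f"{n:>7} tokens: {kv_reduction(n):.2f}x smaller KV cache")
```

In this model the saving is zero for contexts shorter than the window and approaches, but never reaches, a factor of six as the context grows, which matches the "nearly six-fold" wording.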

MiMo V2 Flash was trained on 27 trillion tokens using FP8 mixed precision. It has a native sequence length of 32,000 tokens and supports a context window of up to 256,000 tokens. Combined with the small active parameter count and accelerated decoding, this long context suits applications such as document analysis and extended dialogue systems. The focus on efficiency in agentic scenarios, complex reasoning, and software engineering makes it a resource-efficient option for technical professionals and researchers.

About MiMo V2

MiMo-V2-Flash is a Mixture-of-Experts (MoE) model with a hybrid attention architecture, designed for high-speed reasoning and agentic workflows. It uses Multi-Token Prediction (MTP) to reduce inference costs while targeting state-of-the-art performance, and it is optimized for long-context modeling and efficient inference.
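One way to build intuition for how multi-token prediction can lower inference cost is to treat it as draft-and-verify decoding. The sketch below is a simplified model, not the model's actual decoding procedure. It assumes each drafted token is accepted independently with a fixed probability and that drafts are accepted left to right until the first rejection. The draft counts and acceptance rates are hypothetical, since the source gives neither.

```python
# Expected tokens emitted per decoding step under draft-and-verify decoding.
# Assumptions (not from the source): each drafted token is accepted
# independently with probability `accept_prob`, drafts are accepted left to
# right until the first rejection, and the main model always contributes
# one token per step. Draft overhead is ignored.


def expected_tokens_per_step(num_draft: int, accept_prob: float) -> float:
    """Sum of accept_prob**i for i in 0..num_draft."""
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    return sum(accept_prob ** i for i in range(num_draft + 1))


if __name__ == "__main__":
    # Hypothetical settings chosen only for illustration.
    for k in (1, 2, 3):
        for p in (0.6, 0.8, 0.9):
            print(f"draft={k} accept={p}: "
                  f"{expected_tokens_per_step(k, p):.2f} tokens/step")
```

The result is an upper bound on speedup in this toy model, because the cost of producing and verifying the drafts is not subtracted.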

Evaluation Benchmarks

Category	Benchmark	Score	Rank
Professional Knowledge	MMLU Pro	0.85	9
Graduate-Level QA	GPQA	0.84	39

Rankings

Overall Rank: #82
Coding Rank: (not listed)
