MiMo V2 Flash: Specifications and GPU VRAM Requirements

MiMo V2 Flash

Open source

Open weights

Active parameters: 15B
Context length: 256K
Modality: Text
Architecture: Mixture of Experts (MoE)
License: MIT
Release date: 10 Dec 2025
Training data cutoff: Dec 2024

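Technical Specifications

Total expert parameters: 309.0B
Number of experts: 256
Active experts: (not listed)
Attention structure: Multi-Head Attention
Hidden dimension size: 4096
Number of layers: (not listed)
Attention heads: (not listed)
Key-value heads: (not listed)
Activation function: SwiGLU
Normalization: RMS Normalization
Position embedding: Absolute Position Embedding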

MiMo V2 Flash

The Xiaomi MiMo V2 Flash is a high-efficiency Mixture-of-Experts (MoE) language model engineered for advanced reasoning, software engineering, and autonomous agentic workflows. Built upon a sparse architecture, the model incorporates a total of 309 billion parameters while activating only 15 billion parameters per forward pass, effectively balancing the modeling capacity of a large-scale system with the inference speed and operational efficiency of a significantly smaller dense model. Its development focus centers on high-throughput performance, achieving high decoding speeds through structural innovations designed to alleviate the computational and memory bottlenecks typically associated with large-scale transformer models.

Technically, MiMo V2 Flash uses a hybrid attention mechanism that interleaves Sliding Window Attention (SWA) and Global Attention (GA) layers in a 5:1 ratio across its transformer blocks. The SWA layers use an aggressive 128-token window, which reduces KV-cache memory requirements nearly six-fold compared to using global attention in every layer, while a learnable attention sink bias keeps long-context performance stable. The model also includes a native Multi-Token Prediction (MTP) module consisting of lightweight 0.33-billion-parameter dense feed-forward blocks. MTP enables parallel token generation and verification, with a reported 2.0x to 2.6x increase in decoding throughput over conventional autoregressive generation.
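The "nearly six-fold" KV-cache figure can be sanity-checked with a short calculation. The sketch below is not taken from the model's code: it assumes every layer stores the same amount of KV data per cached token, counts cached token positions for one group of five SWA layers plus one GA layer, and ignores the attention sink. The function names are hypothetical.

```python
def kv_cache_positions(context_len: int, window: int = 128,
                       swa_layers: int = 5, ga_layers: int = 1) -> tuple[int, int]:
    """Cached token positions for one layer group: (hybrid, all-global baseline).

    Assumes each layer stores the same amount of KV data per cached token.
    """
    # A sliding-window layer only keeps the most recent `window` tokens.
    swa = swa_layers * min(context_len, window)
    # A global-attention layer keeps every token in the context.
    ga = ga_layers * context_len
    hybrid = swa + ga
    # Baseline: the same number of layers, all using global attention.
    all_global = (swa_layers + ga_layers) * context_len
    return hybrid, all_global


def kv_reduction_factor(context_len: int) -> float:
    """How many times smaller the hybrid cache is than the all-global one."""
    hybrid, all_global = kv_cache_positions(context_len)
    return all_global / hybrid


if __name__ == "__main__":
    for n in (32_000, 256_000):
        print(f"{n} tokens: {kv_reduction_factor(n):.2f}x smaller KV cache")
```

Under these assumptions the factor approaches 6 as the context grows, because the five windowed layers contribute a fixed 640 positions while the single global layer grows with the context.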

Pre-trained on a 27-trillion-token corpus using FP8 mixed precision, MiMo V2 Flash has a native sequence length of 32,000 tokens and supports context windows of up to 256,000 tokens. Post-training uses a Multi-Teacher On-Policy Distillation (MOPD) paradigm and large-scale reinforcement learning, specifically targeting complex reasoning and multi-step tool use. This training is aimed at demanding technical scenarios such as document analysis and extended agentic interactions, positioning the model as a resource-efficient open-weight option for researchers and developers.

About MiMo V2

MiMo-V2-Flash is a Mixture-of-Experts (MoE) model with a hybrid attention architecture, designed for high-speed reasoning and agentic workflows. It uses Multi-Token Prediction (MTP) to significantly reduce inference costs while maintaining strong performance. The model is optimized for long-context modeling and efficient inference.

Other MiMo V2 Models

No related models

Evaluation Benchmarks

Rank: #40

Benchmark	Score	Rank
Graduate-Level QA (GPQA)	0.84	9

Coding rank: (not listed)

Model Transparency

Total score: 68 / 100
Upstream: 18.5 / 30
Model: 27.5 / 40
Downstream: 21.5 / 30

GPU Requirements
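A rough first-order estimate of the memory needed just to hold the weights is the total parameter count multiplied by the bytes per parameter of the chosen quantization. The sketch below assumes all 309B parameters stay resident on the GPU, since in an MoE model every expert must be loaded even though only 15B parameters are active per token. The format names and byte widths are illustrative assumptions, and KV cache, activations, and quantization metadata are ignored.

```python
# Assumed storage cost per parameter for a few common weight formats.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

TOTAL_PARAMS = 309e9  # total parameters stated for MiMo V2 Flash


def weight_memory_gb(total_params: float, fmt: str) -> float:
    """Weight-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[fmt] / 1e9


if __name__ == "__main__":
    for fmt in BYTES_PER_PARAM:
        print(f"{fmt}: about {weight_memory_gb(TOTAL_PARAMS, fmt):.1f} GB for weights")
```

Real deployments need headroom beyond these figures for the KV cache, which grows with context length, and for runtime buffers.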

Resources

Official documentation, release notes, paper, model weights, source code