| Specification | Value |
|---|---|
| Active Parameters | 15B |
| Context Length | 256K |
| Modality | Text |
| Architecture | Mixture of Experts (MoE) |
| License | MIT |
| Release Date | 10 Dec 2025 |
| Training Data Cutoff | Dec 2024 |
| Total Expert Parameters | 309.0B |
| Number of Experts | 256 |
| Active Experts | 8 |
| Attention Structure | Multi-Head Attention |
| Hidden Dimension Size | 4096 |
| Number of Layers | 48 |
| Attention Heads | 64 |
| Key-Value Heads | 8 |
| Activation Function | SwiGLU |
| Normalization | RMS Normalization |
| Position Embedding | Absolute Position Embedding |
The Xiaomi MiMo V2 Flash is a high-efficiency Mixture-of-Experts (MoE) language model engineered for advanced reasoning, software engineering, and autonomous agentic workflows. Built on a sparse architecture, the model incorporates 309 billion total parameters while activating only 15 billion per forward pass, balancing the modeling capacity of a large-scale system with the inference speed and operational efficiency of a much smaller dense model. Its development centers on throughput: structural innovations alleviate the computational and memory bottlenecks typically associated with large-scale transformers, enabling high decoding speeds.
Technically, MiMo V2 Flash introduces a hybrid attention mechanism that interleaves Sliding Window Attention (SWA) and Global Attention (GA) in a 5:1 ratio across its transformer blocks. This configuration utilizes an aggressive 128-token sliding window, which reduces KV-cache memory requirements by nearly six-fold compared to standard global attention, while a learnable attention sink bias ensures stable long-context performance. Furthermore, the model features a native Multi-Token Prediction (MTP) module consisting of lightweight 0.33 billion parameter dense feed-forward blocks. This MTP architecture facilitates parallel token generation and verification, resulting in a reported increase in decoding throughput by 2.0 to 2.6 times relative to conventional autoregressive generation methods.
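The interleaving and KV-cache arithmetic above can be sketched numerically. This is a minimal illustration, not the model's implementation: the 48-layer count, 128-token window, and 5:1 SWA-to-GA ratio come from the text, but the exact placement of the global layers within each group of six is an assumption.

```python
WINDOW = 128      # sliding-window size, per the description above
NUM_LAYERS = 48   # transformer blocks, per the spec table

def layer_kinds(num_layers: int, swa_per_ga: int = 5) -> list:
    """Assign each block a type: five sliding-window layers, then one global.
    (Assumed placement; the real schedule may differ.)"""
    return ["GA" if (i + 1) % (swa_per_ga + 1) == 0 else "SWA"
            for i in range(num_layers)]

kinds = layer_kinds(NUM_LAYERS)
print(kinds.count("SWA"), kinds.count("GA"))  # 40 sliding-window, 8 global

def kv_cache_ratio(context_len: int) -> float:
    """KV-cache footprint of all-global attention relative to the hybrid.
    Each SWA layer caches at most WINDOW tokens; each GA layer caches them all."""
    swa, ga = kinds.count("SWA"), kinds.count("GA")
    hybrid = swa * min(WINDOW, context_len) + ga * context_len
    full = NUM_LAYERS * context_len
    return full / hybrid

print(round(kv_cache_ratio(32_000), 1))  # ~5.9, i.e. "nearly six-fold"
```

At a 32K-token context this yields roughly a 5.9x KV-cache reduction, consistent with the "nearly six-fold" figure quoted above.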
Pre-trained on a massive 27 trillion token corpus using FP8 mixed precision, MiMo V2 Flash supports a native sequence length of 32,000 tokens and is capable of handling context windows up to 256,000 tokens. The post-training phase utilizes a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm and large-scale reinforcement learning, specifically targeting complex reasoning and multi-step tool use. This specialized training enables the model to perform reliably in demanding technical scenarios, such as document analysis and extended agentic interactions, making it a resource-optimized solution for researchers and developers requiring state-of-the-art performance in open-weight formats.
MiMo-V2-Flash is a Mixture-of-Experts (MoE) model with hybrid attention architecture designed for high-speed reasoning and agentic workflows. It features Multi-Token Prediction (MTP) to achieve state-of-the-art performance while significantly reducing inference costs. The model is optimized for long-context modeling and efficient inference.
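The draft-and-verify decoding style that MTP enables can be sketched with a toy loop. Everything below is a stand-in, assuming nothing about the real model: the "draft head" and "verifier" are placeholders, and the 0.8 per-token acceptance probability is an illustrative assumption chosen only to show how accepting several drafted tokens per main-model pass raises throughput.

```python
import random

random.seed(0)

def draft_tokens(prefix, k=4):
    # Stand-in for a lightweight MTP head proposing k candidate tokens.
    return [len(prefix) + i for i in range(k)]

def verify(prefix, candidates):
    # Stand-in for the main model checking all candidates in one forward pass.
    # Here each candidate is accepted with probability 0.8 (assumed), and
    # acceptance stops at the first rejection, as in draft-and-verify schemes.
    accepted = []
    for tok in candidates:
        if random.random() < 0.8:
            accepted.append(tok)
        else:
            break
    return accepted

def decode(n_tokens):
    out, steps = [], 0
    while len(out) < n_tokens:
        candidates = draft_tokens(out)
        kept = verify(out, candidates)
        out.extend(kept if kept else candidates[:1])  # always emit >= 1 token
        steps += 1
    return out[:n_tokens], steps

tokens, steps = decode(64)
print(len(tokens) / steps)  # tokens per main-model pass; > 1 means a speedup
```

With the assumed 0.8 acceptance rate, the loop emits a little over two tokens per verification pass on average, which is in the same ballpark as the 2.0-2.6x throughput gain reported above.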
Rank
#40
| Benchmark | Score | Rank |
|---|---|---|
| GPQA (Graduate-Level QA) | 0.84 | 9 |