
MiMo V2 Flash

Active Parameters: 15B
Context Length: 256K
Modality: Text
Architecture: Mixture of Experts (MoE)
License: MIT
Release Date: 10 Dec 2025
Training Data Cutoff: Dec 2024

Technical Specifications

Total Expert Parameters: 309.0B
Number of Experts: -
Active Experts: -
Attention Structure: Multi-Head Attention
Hidden Dimension Size: -
Number of Layers: -
Attention Heads: -
Key-Value Heads: -
Activation Function: -
Normalization: -
Position Embedding: Absolute Position Embedding

System Requirements

VRAM requirements for different quantization methods and context sizes

MiMo V2 Flash

MiMo V2 Flash is a Mixture-of-Experts (MoE) model developed by Xiaomi for high-efficiency, high-performance language processing. The model carries 309 billion total parameters but activates only 15 billion during each forward pass. This sparse activation is central to its design, keeping compute costs low while sustaining strong capability across natural language tasks. Its stated goals are faster inference, stronger complex reasoning, robust code generation, and support for multi-turn agentic workflows, balancing expansive scale with operational efficiency for demanding applications.
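To make the sparse-activation idea concrete, here is a minimal, illustrative top-k routing layer. The hidden sizes, expert count, and top-k value are assumptions chosen for the sketch, not MiMo V2 Flash's actual configuration; the point is only to show why active parameters can be a small fraction of total parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy top-k MoE layer: each token is processed by only `top_k` of
    `n_experts` feed-forward experts, so active parameters stay a small
    fraction of total parameters (the principle behind 15B-of-309B)."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: [tokens, d_model]
        scores = self.router(x)                    # [tokens, n_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # mix only the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens routed to expert e in this slot
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

# 64 tokens, each handled by only 2 of 16 experts in this layer.
layer = SparseMoELayer()
print(layer(torch.randn(64, 512)).shape)           # torch.Size([64, 512])
```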

Architecturally, MiMo V2 Flash interleaves Sliding Window Attention (SWA) and Global Attention (GA) layers in a 5:1 ratio. With an aggressive 128-token sliding window, this hybrid design cuts KV-cache memory requirements by nearly six-fold, while a learnable attention sink bias preserves long-context performance. The model also includes a Multi-Token Prediction (MTP) module built on a lightweight 0.33 billion parameter dense feed-forward network; it drafts and verifies several tokens in parallel, reportedly raising decoding throughput 2.0 to 2.6 times over conventional autoregressive decoding. Post-training combines Multi-Teacher Online Policy Distillation (MOPD) with large-scale agentic Reinforcement Learning (RL) to push performance on specialized tasks.
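The 128-token window and 5:1 layer ratio above can be illustrated with causal attention masks. The sketch below is a toy that builds masks only (no attention kernel, no sink bias) and is not Xiaomi's implementation; the sequence length is an arbitrary example.

```python
import torch

def hybrid_attention_masks(seq_len, window=128, swa_per_global=5):
    """Causal masks for one repeat of an interleaved SWA/GA layer pattern:
    `swa_per_global` sliding-window layers followed by one global layer."""
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]                 # attend to the past only
    in_window = (pos[:, None] - pos[None, :]) < window    # only the last `window` tokens
    swa_mask = causal & in_window
    return [swa_mask] * swa_per_global + [causal]

masks = hybrid_attention_masks(seq_len=1024)
# The last query position of a sliding-window layer sees 128 keys, while a
# global layer sees all 1024 -- which is where the reported ~6x KV-cache saving
# comes from when 5 of every 6 layers only cache the most recent 128 tokens.
print(masks[0][-1].sum().item(), masks[-1][-1].sum().item())   # 128 1024
```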

MiMo V2 Flash was trained on 27 trillion tokens using FP8 mixed precision. It supports a native sequence length of 32,000 tokens, extendable to a 256,000-token context window. This large context capacity, together with the small active parameter count and accelerated decoding, suits workloads that demand extensive contextual understanding, such as document analysis and long-running dialogue systems. Its emphasis on efficiency and on agentic, reasoning-heavy, and software-engineering tasks makes it a practical choice for technical professionals and researchers who need a powerful yet resource-conscious model.

About MiMo V2

MiMo-V2-Flash is a Mixture-of-Experts (MoE) model with a hybrid attention architecture designed for high-speed reasoning and agentic workflows. It features Multi-Token Prediction (MTP) to achieve state-of-the-art performance while significantly reducing inference costs, and it is optimized for long-context modeling and efficient inference.


Other MiMo V2 Models
  • No related models

Evaluation Benchmarks

Rankings apply to local LLMs.

No evaluation benchmarks are available for MiMo V2 Flash.

Rankings

Rank: -
Coding Rank: -

GPU Requirements

Required VRAM depends on the chosen weight quantization method and the context size; the original page provides an interactive calculator with recommended GPUs.
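Since the interactive calculator is not reproduced here, the back-of-envelope estimator below shows how weight quantization and context length drive memory use. The layer count, KV-head geometry, bytes-per-weight table, and KV-cache formula are all rough assumptions for illustration (the spec above leaves these fields blank); this is not the calculator's actual method or MiMo V2 Flash's true footprint.

```python
def estimate_vram_gb(total_params_b=309.0,        # total expert parameters, billions
                     quant="q4",                  # assumed quantization label
                     context_tokens=32_000,
                     n_layers=48,                 # assumption: layer count not published
                     n_kv_heads=8, head_dim=128,  # assumption: KV geometry not published
                     kv_bytes=2,                  # fp16/bf16 KV cache
                     sliding_window=128,
                     swa_fraction=5 / 6):         # 5 of every 6 layers use the 128-token window
    """Rough VRAM estimate = quantized weights + KV cache for a hybrid SWA/GA stack."""
    bytes_per_weight = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}[quant]
    weight_gb = total_params_b * 1e9 * bytes_per_weight / 1e9

    # KV cache: SWA layers cache at most `sliding_window` tokens, GA layers cache everything.
    kv_per_token = 2 * n_kv_heads * head_dim * kv_bytes      # K and V, bytes per layer
    swa_layers = round(n_layers * swa_fraction)
    ga_layers = n_layers - swa_layers
    kv_gb = (swa_layers * min(context_tokens, sliding_window)
             + ga_layers * context_tokens) * kv_per_token / 1e9
    return weight_gb + kv_gb

# Example: compare a 32K and a 256K context under assumed 4-bit weights.
for ctx in (32_000, 256_000):
    print(ctx, round(estimate_vram_gb(context_tokens=ctx), 1), "GB (rough)")
```

Under these assumptions the weight term dominates and the KV cache grows only with the global-attention layers, which is the practical payoff of the hybrid attention design described above.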