Total Parameters: 16B
Context Length: 128K
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: MIT License
Release Date: 10 Apr 2025
Knowledge Cutoff: -
Active Parameters: 3.0B
Number of Experts: -
Active Experts: 2
Attention Structure: Multi-Head Attention
Hidden Dimension Size: -
Number of Layers: -
Attention Heads: -
Key-Value Heads: -
Activation Function: -
Normalization: -
Position Embedding: Absolute Position Embedding
VRAM requirements for different quantization methods and context sizes
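These requirements can be approximated from first principles: weight memory scales with bytes per parameter, and the KV cache grows linearly with context length. The Python sketch below is a back-of-envelope estimator, not a published table; the hidden size and layer count it uses are placeholder guesses (those fields are unspecified above), and a MoE model must keep all 16B weights resident even though only ~3B are active per token.

```python
# Back-of-envelope VRAM estimator for a 16B-parameter MoE VLM.
# Placeholder assumptions (NOT from the spec above, which leaves these
# fields blank): hidden size 2048, 27 layers, bf16 KV cache.

BYTES_PER_PARAM = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

TOTAL_PARAMS = 16e9   # total parameter count from the spec table
HIDDEN = 2048         # hypothetical hidden dimension
LAYERS = 27           # hypothetical layer count
KV_BYTES = 2          # KV cache kept in bf16 regardless of weight quant

def vram_gib(quant: str, context_tokens: int) -> float:
    # All 16B MoE weights stay resident, so size them from the total.
    weights = TOTAL_PARAMS * BYTES_PER_PARAM[quant]
    # KV cache: 2 tensors (K and V) per layer, each context x hidden.
    kv_cache = 2 * LAYERS * context_tokens * HIDDEN * KV_BYTES
    overhead = 1.2    # ~20% headroom for activations and buffers
    return (weights + kv_cache) * overhead / 2**30

for quant in BYTES_PER_PARAM:
    for ctx in (8_192, 32_768, 131_072):
        print(f"{quant:>9} @ {ctx:>7} tokens: ~{vram_gib(quant, ctx):.1f} GiB")
```

At the full 128K context the KV cache becomes a significant fraction of the total, which is why long-context runs need noticeably more VRAM than the weights alone suggest.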
Kimi-VL-A3B-Thinking is an advanced vision-language model (VLM) developed by Moonshot AI, engineered to combine efficient parameter utilization with robust reasoning capabilities. The model is designed for complex problem-solving, particularly tasks that require multi-step cognitive processes. It functions as a multimodal intelligence system, capable of interpreting and reasoning across diverse visual and textual inputs, thereby extending the capabilities of large language models into visual domains. The "Thinking" variant is specifically enhanced through long chain-of-thought (CoT) supervised fine-tuning and reinforcement learning to strengthen its multi-step reasoning proficiency.
The architectural foundation of Kimi-VL-A3B-Thinking is a Mixture-of-Experts (MoE) configuration encompassing a total of 16 billion parameters. A notable characteristic of this architecture is its computational efficiency: only approximately 2.8 billion parameters are actively engaged during inference. The design incorporates an MoE language model, a custom native-resolution visual encoder termed MoonViT, and an MLP projector for modality fusion. The language processing component is derived from Moonshot AI's Moonlight LLM series and is initialized from a checkpoint pre-trained on a large text corpus. The MoonViT encoder enables the processing of high-resolution visual inputs, including both static images and dynamic video sequences.
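A toy illustration of how this sparsity works: in each MoE feed-forward layer, a router sends every token to a small subset of experts (the spec table above lists 2 active experts), so only those experts' weights participate in the computation for that token. The following PyTorch sketch is a generic top-2 MoE layer with placeholder dimensions, not Moonshot AI's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Generic top-2 mixture-of-experts feed-forward layer.

    Dimensions and expert count are placeholders, not Kimi-VL's real
    configuration; the point is that each token only exercises the
    weights of the experts it is routed to.
    """

    def __init__(self, dim=512, ffn_dim=2048, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                # x: (tokens, dim)
        gate = F.softmax(self.router(x), dim=-1)         # routing probabilities
        top_w, top_i = gate.topk(self.top_k, dim=-1)     # pick 2 experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, k] == e                  # tokens whose k-th pick is expert e
                if mask.any():
                    out[mask] += top_w[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Each token touches 2 of the 8 expert FFNs, i.e. only a fraction of
# the layer's expert weights, which is the source of the 16B-total /
# ~2.8B-active asymmetry described above.
layer = Top2MoELayer()
tokens = torch.randn(10, 512)
print(layer(tokens).shape)  # torch.Size([10, 512])
```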
This model is primarily applicable to advanced reasoning tasks, with a particular emphasis on mathematical problem-solving and long-chain thought processes. Its functional scope also includes multi-turn agent interaction scenarios, college-level image and video comprehension, and optical character recognition (OCR). The model maintains an extended context window of up to 128,000 tokens, which supports prolonged multi-turn conversations and the analysis of extensive documents or video content. This capability allows the model to process diverse input formats, such as single images, multiple images, and videos, while sustaining computational efficiency during operation. The model supports FlashAttention-2 and native FP16/bfloat16 precision for faster, more memory-efficient inference.
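A loading-and-inference sketch using Hugging Face Transformers is shown below. The repository id and the exact message schema are assumptions to be checked against the official model card; the flash_attention_2 and bfloat16 settings follow the capabilities described above.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Repo id inferred from the model name; verify on the Hugging Face Hub.
MODEL_ID = "moonshotai/Kimi-VL-A3B-Thinking"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,               # native bf16, per the description above
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
    trust_remote_code=True,                   # custom MoonViT/MoE code ships with the repo
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Message schema is an assumption; check the model card for the exact
# format its chat template expects.
image = Image.open("example.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Solve the problem shown in the image step by step."},
]}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=2048)
answer = output[0][inputs["input_ids"].shape[1]:]   # drop the echoed prompt
print(processor.batch_decode([answer], skip_special_tokens=True)[0])
```

A generous max_new_tokens budget matters for the "Thinking" variant, since its long chain-of-thought output can run far past a typical answer length.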
Kimi-VL by Moonshot AI is an efficient, open-source Mixture-of-Experts vision-language model. It pairs a native-resolution MoonViT encoder with an MoE language model that activates only about 2.8 billion of its 16 billion parameters during inference. The model handles high-resolution visual inputs and processes contexts of up to 128K tokens. The "Thinking" variant adds enhanced long-horizon reasoning.
No evaluation benchmarks are available for Kimi-VL-A3B-Thinking.