Total Parameters: 16B
Context Length: 128K
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: MIT License
Release Date: 10 Apr 2025
Knowledge Cutoff: -
Active Parameters: 3.0B
Number of Experts: -
Active Experts: 2
Attention Structure: Multi-Head Attention
Hidden Dimension Size: -
Number of Layers: -
Attention Heads: -
Key-Value Heads: -
Activation Function: -
Normalization: -
Position Embedding: Absolute Position Embedding
VRAM requirements for different quantization methods and context sizes
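These requirements can be approximated from first principles: weight memory scales with bytes per parameter, and the KV cache grows linearly with context length. The Python sketch below is a back-of-envelope estimator, not a published table; the hidden size and layer count it uses are placeholder guesses (those fields are unspecified above), and a MoE model must keep all 16B weights resident even though only ~3B are active per token.

```python
# Back-of-envelope VRAM estimator for a 16B-parameter MoE VLM.
# Placeholder assumptions (NOT from the spec above, which leaves these
# fields blank): hidden size 2048, 27 layers, bf16 KV cache.

BYTES_PER_PARAM = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

TOTAL_PARAMS = 16e9   # total parameter count from the spec table
HIDDEN = 2048         # hypothetical hidden dimension
LAYERS = 27           # hypothetical layer count
KV_BYTES = 2          # KV cache kept in bf16 regardless of weight quant

def vram_gib(quant: str, context_tokens: int) -> float:
    # All 16B MoE weights stay resident, so size them from the total.
    weights = TOTAL_PARAMS * BYTES_PER_PARAM[quant]
    # KV cache: 2 tensors (K and V) per layer, each context x hidden.
    kv_cache = 2 * LAYERS * context_tokens * HIDDEN * KV_BYTES
    overhead = 1.2    # ~20% headroom for activations and buffers
    return (weights + kv_cache) * overhead / 2**30

for quant in BYTES_PER_PARAM:
    for ctx in (8_192, 32_768, 131_072):
        print(f"{quant:>9} @ {ctx:>7} tokens: ~{vram_gib(quant, ctx):.1f} GiB")
```

At the full 128K context the KV cache becomes a significant fraction of the total, which is why long-context runs need noticeably more VRAM than the weights alone suggest.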
Kimi-VL-A3B-Thinking is an advanced vision-language model (VLM) developed by Moonshot AI, engineered to combine efficient parameter utilization with robust reasoning capabilities. The model is designed for complex problem-solving, particularly tasks that require multi-step cognitive processes. It functions as a multimodal intelligence system, capable of interpreting and reasoning across diverse visual and textual inputs, thereby extending the capabilities of large language models into visual domains. The "Thinking" variant is specifically enhanced through long chain-of-thought (CoT) supervised fine-tuning and reinforcement learning to strengthen its multi-step reasoning proficiency.
The architectural foundation of Kimi-VL-A3B-Thinking is a Mixture-of-Experts (MoE) configuration encompassing a total of 16 billion parameters. A notable characteristic of this architecture is its computational efficiency: only approximately 2.8 billion parameters are actively engaged during inference. The design incorporates an MoE language model, a custom native-resolution visual encoder termed MoonViT, and an MLP projector for modality fusion. The language processing component is derived from Moonshot AI's Moonlight LLM series and is initialized from a checkpoint pre-trained on a large text corpus. The MoonViT encoder enables the processing of high-resolution visual inputs, including both static images and dynamic video sequences.
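A toy illustration of how this sparsity works: in each MoE feed-forward layer, a router sends every token to a small subset of experts (the spec table above lists 2 active experts), so only those experts' weights participate in the computation for that token. The following PyTorch sketch is a generic top-2 MoE layer with placeholder dimensions, not Moonshot AI's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Generic top-2 mixture-of-experts feed-forward layer.

    Dimensions and expert count are placeholders, not Kimi-VL's real
    configuration; the point is that each token only exercises the
    weights of the experts it is routed to.
    """

    def __init__(self, dim=512, ffn_dim=2048, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                # x: (tokens, dim)
        gate = F.softmax(self.router(x), dim=-1)         # routing probabilities
        top_w, top_i = gate.topk(self.top_k, dim=-1)     # pick 2 experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, k] == e                  # tokens whose k-th pick is expert e
                if mask.any():
                    out[mask] += top_w[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Each token touches 2 of the 8 expert FFNs, i.e. only a fraction of
# the layer's expert weights, which is the source of the 16B-total /
# ~2.8B-active asymmetry described above.
layer = Top2MoELayer()
tokens = torch.randn(10, 512)
print(layer(tokens).shape)  # torch.Size([10, 512])
```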
This model is primarily applicable to advanced reasoning tasks, with a particular emphasis on mathematical problem-solving and long-chain thought processes. Its functional scope also includes multi-turn agent interaction scenarios, college-level image and video comprehension, and optical character recognition (OCR). The model maintains an extended context window of up to 128,000 tokens, which supports prolonged multi-turn conversations and the analysis of extensive documents or video content. This capability allows the model to process diverse input formats, such as single images, multiple images, and videos, while sustaining computational efficiency during operation. The model supports FlashAttention-2 and native FP16/bfloat16 precision for faster, more memory-efficient inference.
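A loading-and-inference sketch using Hugging Face Transformers is shown below. The repository id and the exact message schema are assumptions to be checked against the official model card; the flash_attention_2 and bfloat16 settings follow the capabilities described above.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Repo id inferred from the model name; verify on the Hugging Face Hub.
MODEL_ID = "moonshotai/Kimi-VL-A3B-Thinking"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,               # native bf16, per the description above
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
    trust_remote_code=True,                   # custom MoonViT/MoE code ships with the repo
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Message schema is an assumption; check the model card for the exact
# format its chat template expects.
image = Image.open("example.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Solve the problem shown in the image step by step."},
]}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=2048)
answer = output[0][inputs["input_ids"].shape[1]:]   # drop the echoed prompt
print(processor.batch_decode([answer], skip_special_tokens=True)[0])
```

A generous max_new_tokens budget matters for the "Thinking" variant, since its long chain-of-thought output can run far past a typical answer length.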
Kimi-VL by Moonshot AI is an efficient, open-source Mixture-of-Experts vision-language model. It pairs a native-resolution MoonViT encoder with an MoE language model that activates only about 2.8 billion of its 16 billion parameters during inference. The model handles high-resolution visual inputs and processes contexts of up to 128K tokens. The "Thinking" variant adds enhanced long-horizon reasoning.
No evaluation benchmarks are available for Kimi-VL-A3B-Thinking.