Active Parameters: ~2.8B
Context Length: 128K
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: MIT License
Release Date: 10 Apr 2025
Knowledge Cutoff: -
Total Parameters: 16B
Number of Experts: -
Active Experts: 2
Attention Structure: Multi-Head Attention
Hidden Dimension Size: -
Number of Layers: -
Attention Heads: -
Key-Value Heads: -
Activation Function: -
Normalization: -
Position Embedding: Absolute Position Embedding
VRAM requirements for different quantization methods and context sizes
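The interactive calculator itself does not carry over to text, but a back-of-the-envelope estimate is straightforward: all 16B weights must be resident in memory regardless of quantization (MoE routing reduces compute per token, not weight storage), plus a KV cache that grows with context. The Python sketch below is illustrative only; the layer, head, and dimension values are hypothetical placeholders, since the spec card above leaves them blank.

```python
# Rough VRAM estimate: weights + KV cache. Illustrative only; the
# per-layer dimensions below are hypothetical placeholders, since the
# spec card does not publish them for Kimi-VL-A3B-Thinking.

TOTAL_PARAMS = 16e9  # all 16B weights stay resident, even though only
                     # ~2.8B are active per token (MoE routing)

def weights_gib(bits_per_param: float) -> float:
    """Memory for model weights at a given quantization width."""
    return TOTAL_PARAMS * bits_per_param / 8 / 2**30

def kv_cache_gib(context_tokens: int,
                 n_layers: int = 27,    # placeholder values,
                 n_kv_heads: int = 16,  # not published on this card
                 head_dim: int = 128,
                 bytes_per_elem: int = 2) -> float:
    """KV cache: 2 tensors (K and V) * layers * kv_heads * head_dim per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_tokens * per_token / 2**30

for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4.5)]:
    total = weights_gib(bits) + kv_cache_gib(8192)
    print(f"{name:7s} ~{total:5.1f} GiB at 8K context")
```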
Kimi-VL-A3B-Thinking is an advanced vision-language model (VLM) developed by Moonshot AI, engineered to combine efficient parameter utilization with robust reasoning capabilities. The model is designed for complex problem-solving, particularly tasks that require multi-step reasoning. It functions as a multimodal system, able to interpret and reason across diverse visual and textual inputs, thereby extending the capabilities of large language models into visual domains. The "Thinking" variant is specifically enhanced through long chain-of-thought (CoT) supervised fine-tuning and reinforcement learning to strengthen its multi-step reasoning.
The architectural foundation of Kimi-VL-A3B-Thinking is a Mixture-of-Experts (MoE) configuration with 16 billion total parameters, of which only about 2.8 billion are activated during inference. The design combines an MoE language model, a native-resolution visual encoder called MoonViT, and an MLP projector that fuses the two modalities. The language component is derived from Moonshot AI's Moonlight LLM series, initialized from a checkpoint pre-trained on a large text corpus. The MoonViT encoder processes high-resolution visual inputs, including both static images and video sequences.
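To make the sparse-activation point concrete, the toy router below shows the mechanics of top-k MoE routing: a gating network scores every expert per token, and only the top-k experts' weights participate in the forward pass, which is why active parameters (~2.8B) sit far below total parameters (16B). This is a generic illustration, not Moonshot's implementation; all sizes are made up.

```python
import numpy as np

# Toy mixture-of-experts layer: illustrates why only a fraction of the
# parameters are "active" per token. Sizes are tiny and hypothetical.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 32, 8, 2

gate_w = rng.normal(size=(d_model, n_experts))            # router weights
experts = rng.normal(size=(n_experts, d_model, d_model))  # expert FFNs

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a single token vector through its top-k experts."""
    logits = x @ gate_w
    top = np.argsort(logits)[-top_k:]                      # chosen experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over winners
    # Only the selected experts' weight matrices are touched here;
    # the remaining n_experts - top_k experts stay idle for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.normal(size=d_model))
print(y.shape)  # (32,)
```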
This model is primarily suited to advanced reasoning tasks, with particular emphasis on mathematical problem-solving and long chains of thought. Its scope also covers multi-turn agent interaction, college-level image and video comprehension, and optical character recognition (OCR). The model maintains an extended context window of up to 128,000 tokens, which supports prolonged multi-turn conversations and the analysis of long documents or video content. It accepts diverse input formats, including single images, multiple images, and video, while sustaining computational efficiency. FlashAttention-2 and native FP16/bfloat16 precision are supported for faster, more memory-efficient inference (see the loading sketch below).
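A minimal loading sketch with Hugging Face transformers follows. The repo id moonshotai/Kimi-VL-A3B-Thinking, the trust_remote_code requirement, and the chat-template call follow the usual pattern for this model family, but are assumptions rather than details stated on this page.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

# Assumed Hugging Face repo id for the "Thinking" variant.
model_id = "moonshotai/Kimi-VL-A3B-Thinking"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # native bf16, per the card
    attn_implementation="flash_attention_2",  # needs the flash-attn package
    device_map="auto",
    trust_remote_code=True,                   # custom Kimi-VL modeling code
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Hypothetical input file for illustration.
image = Image.open("example.png")
messages = [{"role": "user",
             "content": [{"type": "image", "image": "example.png"},
                         {"type": "text", "text": "Solve the problem shown step by step."}]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(out[0], skip_special_tokens=True))
```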
Kimi-VL by Moonshot AI is an efficient, open-source Mixture-of-Experts vision-language model. It pairs a native-resolution MoonViT encoder with an MoE language model that activates only about 2.8 billion of its 16 billion parameters per token. The model handles high-resolution visual inputs and processes contexts up to 128K tokens. A "Thinking" variant provides enhanced long-horizon reasoning.
No evaluation benchmarks for Kimi-VL-A3B-Thinking are available.