Total Parameters: 16B
Active Parameters: 3.0B (2.8B per token in the MoE language decoder)
Context Length: 128K
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: MIT
Release Date: 10 Apr 2025
Knowledge Cutoff: -
Number of Experts: 384
Active Experts: 8
Attention Structure: Multi-Head Attention
Hidden Dimension Size: 2048
Number of Layers: 24
Attention Heads: 16
Key-Value Heads: 16
Activation Function: SwiGLU
Normalization: RMS Normalization
Position Embedding: Absolute Position Embedding
Kimi-VL-A3B-Instruct is a multimodal Mixture-of-Experts (MoE) vision-language model developed by Moonshot AI, designed for high-resolution visual perception and long-context reasoning. Its architecture pairs a native-resolution visual encoder, MoonViT, with a sparse MoE language decoder. This design lets the model process diverse inputs, including single and multi-image sets, video sequences, and long documents. The model is instruction-tuned for interactive chat and agentic workflows, emphasizing efficiency in both high-resolution image analysis and natural language understanding over extended sequences.
The model's language backbone is a sparse MoE named Moonlight, which contains 16 billion total parameters but activates only 2.8 billion per token. This sparsity comes from a routing mechanism that selects 8 of the 384 available experts for each token. The visual component, MoonViT, supports native-resolution processing up to 1792x1792 pixels, allowing the model to preserve fine detail for OCR and graphical analysis without forced resizing. The architecture incorporates a variable-length sequence attention mechanism compatible with FlashAttention, keeping computation efficient across images of varying aspect ratios and resolutions.
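The routing described above (selecting 8 of 384 experts per token) can be sketched as plain top-k gating. This is a generic illustration of the technique, not Moonshot AI's published implementation; the scoring function and normalization are assumptions:

```python
import math
import random

def moe_route(logits, num_active=8):
    """Top-k expert routing sketch: keep the highest-scoring `num_active`
    experts and softmax-normalize their gate weights so they sum to 1.
    Illustrative only -- not Moonlight's exact gating scheme."""
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-num_active:]
    m = max(logits[i] for i in top)                       # for numerical stability
    exp = [math.exp(logits[i] - m) for i in top]
    total = sum(exp)
    return top, [e / total for e in exp]

random.seed(0)
router_logits = [random.gauss(0.0, 1.0) for _ in range(384)]  # one score per expert
experts, gates = moe_route(router_logits)
print(experts)                 # indices of the 8 selected experts
print(round(sum(gates), 6))    # gate weights sum to 1.0
```

Because only the selected experts' feed-forward blocks run for a given token, compute per token scales with the 2.8B active parameters rather than the full 16B.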
Kimi-VL-A3B-Instruct is optimized for complex multimodal tasks such as document parsing, long-form video comprehension, and interactive GUI agent operations. Its large context window of 128,000 tokens enables the ingestion of multiple high-resolution images or lengthy video clips alongside extensive textual prompts. By combining the efficiency of MoE with high-resolution visual encoding, the model is suited for applications requiring detailed visual grounding and the ability to reason over long-form, multi-source information in a conversational or agentic context.
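To see how the 128K window relates to native-resolution images, a rough token budget can be sketched. The 14-pixel patch size and 2x2 patch-merge factor below are assumptions for illustration; the card does not state MoonViT's actual values:

```python
def image_tokens(width, height, patch=14, merge=2):
    """Estimate visual tokens for one image: tile it into patch x patch
    squares, then merge each merge x merge group of patches into one token.
    The patch and merge values are assumed, not taken from the model card."""
    return (width // patch // merge) * (height // patch // merge)

max_image = image_tokens(1792, 1792)       # largest native resolution -> 4096 tokens
context = 128_000                          # model context length
print(max_image, context // max_image)     # ~31 such images fit in the window
```

Under these assumptions, even a maximum-resolution 1792x1792 image consumes only a few percent of the context window, leaving ample room for multi-image prompts or long accompanying text.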
Kimi-VL by Moonshot AI is an efficient, open-source Mixture-of-Experts vision-language model. It employs a native-resolution MoonViT encoder and an MoE language model, activating 2.8 billion parameters. The model handles high-resolution visual inputs and processes contexts up to 128K tokens. A "Thinking" variant provides enhanced long-horizon reasoning.
No evaluation benchmarks are available for Kimi-VL-A3B-Instruct.