Total Parameters
16B
Context Length
128K
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
MIT
Release Date
10 Apr 2025
Knowledge Cutoff
-
Attention
Attention Structure
Multi-Head Attention
Attention Heads
16
Key-Value Heads
16
Attention Head Dimension
-
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
800,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
2,048
Number of Layers
24
FFN Intermediate Size (Dense)
1,408
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
163,840
Mixture of Experts
Total Expert Parameters
3.0B
Number of Experts
384
Active Experts
8
Shared Experts
2
FFN Intermediate Size (per Expert)
1,408
Dense Layers Before MoE
1
Kimi-VL-A3B-Instruct is a multimodal Mixture-of-Experts (MoE) vision-language model developed by Moonshot AI for high-resolution visual perception and long-context reasoning. It pairs a native-resolution visual encoder, MoonViT, with a sparse MoE language decoder. This design lets the model process single images, multi-image sets, video sequences, and long documents. The model is instruction-tuned for interactive chat and agentic workflows, with an emphasis on efficient high-resolution image analysis and language understanding over extended sequences.
The language backbone, Moonlight, is a sparse MoE model with 16 billion total parameters, of which only about 2.8 billion are active per token. The sparsity comes from a router that selects 8 experts for each token from a pool of 384. The visual component, MoonViT, processes images at native resolution up to 1792x1792 pixels, which preserves fidelity for OCR and detailed graphical analysis without forced resizing. The architecture uses a variable-length sequence attention mechanism that is compatible with FlashAttention, so images of different aspect ratios and resolutions can be handled efficiently.
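The routing step described above can be illustrated with a minimal sketch. It assumes plain top-k selection followed by a softmax over the selected logits, which is one common gating scheme and not necessarily the one Moonlight uses; the function name `route_token` and the toy logits are hypothetical.

```python
import math

NUM_EXPERTS = 384    # expert pool size quoted in the text
ACTIVE_EXPERTS = 8   # experts selected per token, as quoted in the text

def route_token(router_logits, k=ACTIVE_EXPERTS):
    """Pick the k highest-scoring experts and normalise their gate weights."""
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (an assumed gating choice).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Deterministic toy logits standing in for a learned router's output.
logits = [math.sin(0.37 * i) for i in range(NUM_EXPERTS)]
selected = route_token(logits)

for expert_id, weight in selected:
    print(f"expert {expert_id:3d}  gate weight {weight:.4f}")
```

Only the selected experts run their feed-forward computation for that token; their outputs are combined using the gate weights, which is why the per-token cost tracks the active rather than the total parameter count.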
Kimi-VL-A3B-Instruct is optimized for complex multimodal tasks such as document parsing, long-form video comprehension, and interactive GUI agent operations. Its 128,000-token context window lets it ingest multiple high-resolution images or lengthy video clips alongside extensive textual prompts. Combining MoE efficiency with high-resolution visual encoding suits the model to applications that need detailed visual grounding and reasoning over long, multi-source inputs in a conversational or agentic setting.
Kimi-VL by Moonshot AI is an efficient, open-source Mixture-of-Experts vision-language model. It employs a native-resolution MoonViT encoder and an MoE language model, activating 2.8 billion parameters. The model handles high-resolution visual inputs and processes contexts up to 128K tokens. A "Thinking" variant provides enhanced long-horizon reasoning.
No evaluation benchmarks are available for Kimi-VL-A3B-Instruct.
Overall Rank
-
Coding Rank
-
Total Score
69 / 100
Kimi-VL-A3B-Instruct exhibits strong transparency in its architectural design and parameter density, providing clear documentation of its MoE structure and vision-language integration. The model's use of a permissive MIT license and its consistent self-identification are exemplary. However, significant gaps remain regarding the specific sources of its training data and the total compute resources consumed during its development.
Architectural Provenance
The model architecture is extensively documented in the official technical report and GitHub repository. It utilizes a sparse Mixture-of-Experts (MoE) language backbone called 'Moonlight' (16B total, 2.8B active), which is explicitly stated to be similar to the DeepSeek-V3 architecture. The vision component, 'MoonViT', is a native-resolution encoder initialized from SigLIP-SO-400M and modified with 2D Rotary Positional Embeddings (RoPE) to handle high-resolution inputs up to 1792x1792. The integration via a two-layer MLP projector and the use of FlashAttention-compatible variable-length sequence attention are well-detailed.
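To make the variable-length attention idea concrete, the sketch below packs images of different resolutions into one patch sequence and computes the cumulative offsets that variable-length attention kernels typically take as input. The patch size of 14 is an assumption (the text does not state it), and `patch_count` and `pack` are hypothetical helper names.

```python
from itertools import accumulate

PATCH = 14  # assumed patch size in pixels; not stated in the text

def patch_count(height, width, patch=PATCH):
    """Number of patches for one image, rounding partial patches up."""
    rows = -(-height // patch)  # ceiling division
    cols = -(-width // patch)
    return rows * cols

def pack(image_sizes, patch=PATCH):
    """Return per-image patch counts and cumulative sequence offsets."""
    lengths = [patch_count(h, w, patch) for h, w in image_sizes]
    cu_seqlens = [0] + list(accumulate(lengths))
    return lengths, cu_seqlens

# Three images with different resolutions and aspect ratios (height, width).
lengths, cu_seqlens = pack([(1792, 1792), (448, 896), (224, 224)])
print(lengths)     # [16384, 2048, 256]
print(cu_seqlens)  # [0, 16384, 18432, 18688]
```

Each image keeps its own resolution and contributes a different number of patches; the offsets mark where one image ends and the next begins, so attention can be restricted to patches from the same image without padding every image to a common size.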
Dataset Composition
Moonshot AI provides a high-level breakdown of the training stages and token counts (5.2T text tokens for Moonlight pre-training, followed by 2.3T tokens of joint multimodal and text data). They disclose specific ratios for certain stages, such as upsampling long-context data to 25% during the activation phase. However, the exact sources of the 'pure-text' and 'multimodal' datasets remain described in general terms (e.g., 'curated set of multimodal instruction-response pairs') without a detailed list of public or proprietary sources, which limits full transparency.
Tokenizer Integrity
The tokenizer is publicly accessible via the Hugging Face repository ('tiktoken.model' and 'tokenization_moonshot.py'). It is a Byte Pair Encoding (BPE) tokenizer with a known vocabulary size and clear implementation details within the provided code. The repository includes a 'chat_template.jinja' file, ensuring transparency in how special tokens and conversation formats (ChatML) are handled.
Parameter Density
The model's parameter density is clearly and consistently disclosed across all official documentation. It is specified as a sparse MoE model with 16 billion total parameters and approximately 2.8 billion active parameters per token. The technical report further details the routing mechanism (8 experts selected from 384) and the architectural breakdown between the vision encoder and the language decoder.
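As a quick worked example, the figures quoted above give the following density ratios. The constants are taken directly from the text; the variable names are illustrative only.

```python
TOTAL_PARAMS = 16e9     # total parameters, as quoted
ACTIVE_PARAMS = 2.8e9   # approximate active parameters per token, as quoted
NUM_EXPERTS = 384       # expert pool size, as quoted
ACTIVE_EXPERTS = 8      # experts selected per token, as quoted

# Share of all parameters that participate in a single token's forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Share of the routed expert pool that is selected for a single token.
expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS

print(f"active parameter fraction: {active_fraction:.1%}")
print(f"routed expert fraction:    {expert_fraction:.2%}")
```

The active parameter fraction (about 17.5%) is much larger than the routed expert fraction (about 2%) because components such as attention, embeddings, shared experts, and any dense layers run for every token regardless of routing.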
Training Compute
Information regarding the specific training compute is conspicuously absent. While the technical report mentions the use of the 'Muon' optimizer for efficiency, it does not disclose the total GPU/TPU hours, the specific hardware cluster used for the Kimi-VL-A3B variant, or the carbon footprint. The lack of these metrics makes it impossible to verify the environmental impact or the exact resource intensity of the training process.
Benchmark Reproducibility
The technical report provides comprehensive results across a wide array of standard benchmarks (MMMU, MathVista, OSWorld, etc.) and specifies the evaluation settings (e.g., Temperature=0.0 for Instruct models). While the evaluation code is partially available in the GitHub repository, the exact prompts and few-shot examples for every benchmark are not fully itemized in a way that allows for push-button reproduction by third parties.
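The Temperature=0.0 setting matters for reproducibility because it conventionally means greedy decoding, which removes sampling randomness. The sketch below is a generic illustration of that convention, not the report's evaluation code; `sample` is a hypothetical helper.

```python
import math
import random

def sample(logits, temperature, rng=random):
    """Pick a token index; temperature 0.0 is treated as greedy argmax."""
    if temperature == 0.0:
        # Deterministic: always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise draw from the temperature-scaled softmax distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

toy_logits = [0.1, 2.0, -1.0]
greedy_picks = {sample(toy_logits, 0.0) for _ in range(100)}
print(greedy_picks)  # {1}
```

Even with greedy decoding, matching published numbers still depends on using the same prompts and few-shot examples, which is the gap noted above.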
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as a Moonshot AI product and maintaining version awareness (distinguishing between the 'Instruct' and 'Thinking' variants). Documentation clearly outlines the capabilities and intended use cases for each variant, and there are no documented instances of the model claiming to be a competitor's product.
License Clarity
The model is released under the highly permissive MIT License, which is clearly stated on the Hugging Face model card and in the GitHub repository. This license explicitly allows for commercial use, modification, and distribution. There are no conflicting terms or restrictive 'open weights' clauses that deviate from standard open-source definitions.
Hardware Footprint
Official documentation and community resources provide clear guidance on VRAM requirements for deployment. For example, FP16 inference at a 1K context window is noted to require approximately 42GB of VRAM, for which dual RTX 4090s or a single A6000 are recommended. The impact of quantization (INT8/INT4) is also discussed in community-led documentation, though official Moonshot documentation could be more explicit about the accuracy-performance tradeoffs of these quantizations.
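A rough back-of-the-envelope estimate shows where most of that memory goes. The sketch uses the 16B parameter count, 24 layers, and 16 key-value heads listed in the specification; the head dimension of 128 is an assumption (hidden size 2,048 divided by 16 heads), since the specification leaves it blank, and the helper names are hypothetical.

```python
def weight_gb(params, bits):
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bits / 8 / 1e9

def kv_cache_gb(tokens, layers=24, kv_heads=16, head_dim=128, bytes_per_value=2):
    """KV cache size for one sequence; head_dim=128 is an assumed value."""
    # The factor 2 covers keys and values.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

PARAMS = 16e9  # total parameters, as quoted

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_gb(PARAMS, bits):.1f} GB for weights")

print(f"KV cache at 1,024 tokens (FP16): {kv_cache_gb(1024):.2f} GB")
```

Under these assumptions the FP16 weights alone take about 32 GB and the KV cache at 1K context adds roughly 0.2 GB. The remainder of the quoted 42GB presumably covers activations, the vision encoder's working memory, and framework overhead, though the source does not break it down.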
Versioning Drift
Moonshot AI uses a form of versioning (e.g., the '2506' suffix for updated variants), and the Hugging Face repository maintains a commit history. However, there is no formal, centralized changelog that details specific behavioral changes or performance drift between minor updates. The transition from the original Kimi-VL to the 'Thinking' and '2506' versions is documented, but more granular tracking of weight updates is missing.