Kimi-VL-A3B-Instruct

Total Parameters

16B

Active Parameters

2.8B

Context Length

128K

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

MIT

Release Date

10 Apr 2025

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Multi-Head Attention

Attention Heads

16

Key-Value Heads

16

Attention Head Dimension

-

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

800,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

2,048

Number of Layers

24

FFN Intermediate Size (Dense)

1,408

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

163,840

Mixture of Experts

Total Expert Parameters

3.0B

Number of Experts

384

Active Experts

8

Shared Experts

2

FFN Intermediate Size (per Expert)

1,408

Dense Layers Before MoE

1

Architecture Diagram

Input Tokens

→ Token Embedding (Position: RoPE · Hidden: 2k · Context: 128k · Vocab: 163.8k)

→ 24 layers, each consisting of:

RMSNorm (Pre-Attention) → Multi-Head Attention (16 Q / 16 KV heads, head dim: 128) → + (residual add)

RMSNorm (Pre-FFN) → Sparse MoE FFN (8/384 experts, SwiGLU, intermediate: 1.4k) → + (residual add)

→ Final RMSNorm

→ Output Logits

Kimi-VL-A3B-Instruct

Kimi-VL-A3B-Instruct is a multimodal Mixture-of-Experts (MoE) vision-language model developed by Moonshot AI for high-resolution visual perception and long-context reasoning. It pairs a native-resolution visual encoder, MoonViT, with a sparse MoE language decoder. This design lets the model process single images, multi-image sets, video sequences, and long documents. The model is instruction-tuned for interactive chat and agentic workflows, with an emphasis on efficient high-resolution image analysis and language understanding over extended sequences.

The language backbone, named Moonlight, is a sparse MoE model with 16 billion total parameters, of which only 2.8 billion are active per token. This sparsity comes from a routing mechanism that selects 8 experts per token from a pool of 384. The visual component, MoonViT, processes images at native resolution up to 1792x1792 pixels, which preserves detail for OCR and fine-grained graphical analysis without forced resizing. The architecture uses a variable-length sequence attention mechanism that is compatible with FlashAttention, keeping computation efficient across images of different aspect ratios and resolutions.
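The routing step described above can be illustrated with a minimal top-k sketch. It uses the expert counts quoted on this page (8 of 384); the softmax-over-selected-logits gate, the random logits, and the function name `route_token` are assumptions for illustration only, since the page does not describe the actual gate function.

```python
import math
import random

NUM_EXPERTS = 384  # routed experts, as quoted on this page
TOP_K = 8          # experts selected per token, as quoted on this page


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    This gating choice is hypothetical; the real gate may differ.
    """
    # Indices of the top_k largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the selected experts.
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Stand-in router scores for one token (random, for illustration only).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
for expert_index, weight in selected:
    print(f"expert {expert_index:3d}  weight {weight:.3f}")
```

Only the selected experts' feed-forward blocks run for that token, and their outputs are combined using the weights; the remaining 376 routed experts are skipped.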

Kimi-VL-A3B-Instruct is optimized for complex multimodal tasks such as document parsing, long-form video comprehension, and interactive GUI agent operations. Its large context window of 128,000 tokens enables the ingestion of multiple high-resolution images or lengthy video clips alongside extensive textual prompts. By combining the efficiency of MoE with high-resolution visual encoding, the model is suited for applications requiring detailed visual grounding and the ability to reason over long-form, multi-source information in a conversational or agentic context.

About Kimi-VL

Kimi-VL by Moonshot AI is an efficient, open-source Mixture-of-Experts vision-language model. It combines a native-resolution MoonViT encoder with an MoE language model that activates 2.8 billion parameters per token. The model handles high-resolution visual inputs and processes contexts up to 128K tokens. A "Thinking" variant provides enhanced long-horizon reasoning.


Evaluation Benchmarks

No evaluation benchmarks are available for Kimi-VL-A3B-Instruct.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B

69 / 100

Kimi-VL-A3B-Instruct Model Integrity Report

Audit Note

Kimi-VL-A3B-Instruct exhibits strong transparency in its architectural design and parameter density, providing clear documentation of its MoE structure and vision-language integration. The model's use of a permissive MIT license and its consistent self-identification are exemplary. However, significant gaps remain regarding the specific sources of its training data and the total compute resources consumed during its development.

Upstream

21.5 / 30

Architectural Provenance

8.0 / 10

The model architecture is extensively documented in the official technical report and GitHub repository. It utilizes a sparse Mixture-of-Experts (MoE) language backbone called 'Moonlight' (16B total, 2.8B active), which is explicitly stated to be similar to the DeepSeek-V3 architecture. The vision component, 'MoonViT', is a native-resolution encoder initialized from SigLIP-SO-400M and modified with 2D Rotary Positional Embeddings (RoPE) to handle high-resolution inputs up to 1792x1792. The integration via a two-layer MLP projector and the use of FlashAttention-compatible variable-length sequence attention are well-detailed.
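As a rough illustration of the 2D RoPE idea mentioned above, the sketch below shows one common "axial" formulation: half of the channels are rotated by the patch's row index and the other half by its column index. This is an assumption for illustration; the page does not specify MoonViT's exact variant, and the base of 10000 and the function names are hypothetical.

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Apply standard 1D rotary embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Rotation angle for this channel pair: position * base^(-i/dim).
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Axial 2D RoPE: first half of the channels encodes row, second half column."""
    if len(vec) % 4 != 0:
        raise ValueError("vector length must be divisible by 4")
    half = len(vec) // 2
    return rotate_pairs(vec[:half], row, base) + rotate_pairs(vec[half:], col, base)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.1 * i for i in range(1, 9)]
k = [0.05 * (9 - i) for i in range(1, 9)]

# Attention scores depend only on the relative (row, col) offset between patches:
# both pairs below are offset by (2, 3), so the two scores should match.
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
score_b = dot(rope_2d(q, 12, 8), rope_2d(k, 10, 5))
print(abs(score_a - score_b) < 1e-9)
```

Because the encoding is relative in both axes, the same weights can be applied to images of different sizes and aspect ratios, which is consistent with the native-resolution design described here.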

Dataset Composition

5.0 / 10

Moonshot AI provides a high-level breakdown of the training stages and token counts (5.2T text tokens for Moonlight pre-training, followed by 2.3T tokens of joint multimodal and text data). They disclose specific ratios for certain stages, such as upsampling long-context data to 25% during the activation phase. However, the exact sources of the 'pure-text' and 'multimodal' datasets remain described in general terms (e.g., 'curated set of multimodal instruction-response pairs') without a detailed list of public or proprietary sources, which limits full transparency.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via the Hugging Face repository ('tiktoken.model' and 'tokenization_moonshot.py'). It is a Byte Pair Encoding (BPE) tokenizer with a known vocabulary size and clear implementation details within the provided code. The repository includes a 'chat_template.jinja' file, ensuring transparency in how special tokens and conversation formats (ChatML) are handled.

Model

26.0 / 40

Parameter Density

9.0 / 10

The model's parameter density is clearly and consistently disclosed across all official documentation. It is specified as a sparse MoE model with 16 billion total parameters and approximately 2.8 billion active parameters per token. The technical report further details the routing mechanism (8 experts selected from 384) and the architectural breakdown between the vision encoder and the language decoder.
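The density figures above reduce to simple arithmetic, sketched below using only the numbers quoted on this page. The variable names are illustrative.

```python
TOTAL_PARAMS = 16e9    # total parameters, as quoted on this page
ACTIVE_PARAMS = 2.8e9  # active parameters per token, as quoted on this page
NUM_EXPERTS = 384      # routed experts, as quoted on this page
ACTIVE_EXPERTS = 8     # routed experts selected per token, as quoted on this page

active_param_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
routed_expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS

print(f"{active_param_fraction:.1%} of parameters are active per token")
print(f"{routed_expert_fraction:.2%} of routed experts are selected per token")
```

The active-parameter share (about 17.5%) is much larger than the routed-expert share (about 2.08%). A plausible reading is that components such as attention, embeddings, the shared experts, and the initial dense layer run for every token regardless of routing, but the page does not itemize this breakdown.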

Training Compute

2.0 / 10

Information regarding the specific training compute is conspicuously absent. While the technical report mentions the use of the 'Muon' optimizer for efficiency, it does not disclose the total GPU/TPU hours, the specific hardware cluster used for the Kimi-VL-A3B variant, or the carbon footprint. The lack of these metrics makes it impossible to verify the environmental impact or the exact resource intensity of the training process.

Benchmark Reproducibility

6.0 / 10

The technical report provides comprehensive results across a wide array of standard benchmarks (MMMU, MathVista, OSWorld, etc.) and specifies the evaluation settings (e.g., Temperature=0.0 for Instruct models). While the evaluation code is partially available in the GitHub repository, the exact prompts and few-shot examples for every benchmark are not fully itemized in a way that allows for push-button reproduction by third parties.
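The Temperature=0.0 setting matters for reproducibility because it makes decoding deterministic. The sketch below assumes the common convention that a temperature of 0 means greedy (argmax) selection; the page does not say which inference stack was used, and the function name is hypothetical.

```python
import math


def next_token_distribution(logits, temperature):
    """Softmax over logits / temperature; temperature 0 is treated as greedy argmax."""
    if temperature == 0.0:
        # Division by zero is avoided by putting all probability on the best token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [1.0, 3.0, 2.0]
greedy = next_token_distribution(example_logits, 0.0)
sampled = next_token_distribution(example_logits, 1.0)
print(greedy)
print([round(p, 3) for p in sampled])
```

With greedy decoding the same prompt yields the same output on every run, so remaining differences between third-party results and reported scores would come from prompts, few-shot examples, or numerical details rather than sampling noise.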

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, correctly identifying itself as a Moonshot AI product and maintaining version awareness (distinguishing between the 'Instruct' and 'Thinking' variants). Documentation clearly outlines the capabilities and intended use cases for each variant, and there are no documented instances of the model claiming to be a competitor's product.

Downstream

21.5 / 30

License Clarity

9.5 / 10

The model is released under the highly permissive MIT License, which is clearly stated on the Hugging Face model card and in the GitHub repository. This license explicitly allows for commercial use, modification, and distribution. There are no conflicting terms or restrictive 'open weights' clauses that deviate from standard open-source definitions.

Hardware Footprint

7.0 / 10

Official documentation and community resources provide clear guidance on VRAM requirements for deployment. For example, FP16 inference at a 1K context window is reported to need approximately 42 GB of VRAM, for which dual RTX 4090s or an A6000 are recommended. The impact of quantization (INT8/INT4) is also discussed in community-led documentation, though official Moonshot documentation could be more explicit about the accuracy-performance tradeoffs of these quantizations.
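A back-of-the-envelope estimate helps put the VRAM figure in context. The sketch below uses the specifications listed on this page (16B total parameters, 24 layers, 16 KV heads, head dimension 128 from the architecture diagram) and assumes a plain, uncompressed multi-head KV cache stored in FP16. Those assumptions, and the helper names, are illustrative; real deployments may differ.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(total_params, precision):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9


def kv_cache_gb(tokens, layers=24, kv_heads=16, head_dim=128, bytes_per_value=2):
    """KV cache size in GB, assuming an uncompressed cache.

    The leading factor of 2 accounts for storing one key and one value
    vector per KV head, per layer, per token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


for precision in ("fp16", "int8", "int4"):
    print(f"{precision}: {weight_gb(16e9, precision):.1f} GB of weights")

for tokens in (1024, 131072):
    print(f"{tokens} tokens: {kv_cache_gb(tokens):.2f} GB of KV cache")
```

Under these assumptions the FP16 weights account for 32 GB of the reported 42 GB at 1K context; the page does not break down the remainder, which would include the vision encoder's working memory, activations, and framework overhead. The same estimate suggests the KV cache grows to roughly 26 GB at the full 128K context, so long-context use needs substantially more headroom than the 1K figure implies.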

Versioning Drift

5.0 / 10

Moonshot AI uses a form of versioning (e.g., the '2506' suffix for updated variants), and the Hugging Face repository maintains a commit history. However, there is no formal, centralized changelog that details specific behavioral changes or performance drift between minor updates. The transition from the original Kimi-VL to the 'Thinking' and '2506' versions is documented, but more granular tracking of weight updates is missing.
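The category and total scores in this report can be cross-checked by summing the individual criteria. The sketch below simply transcribes the scores listed above; the dictionary layout is illustrative.

```python
SCORES = {
    "Upstream": {
        "Architectural Provenance": 8.0,
        "Dataset Composition": 5.0,
        "Tokenizer Integrity": 8.5,
    },
    "Model": {
        "Parameter Density": 9.0,
        "Training Compute": 2.0,
        "Benchmark Reproducibility": 6.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 9.5,
        "Hardware Footprint": 7.0,
        "Versioning Drift": 5.0,
    },
}

# Sum each category, then sum the categories for the overall score.
category_totals = {name: sum(items.values()) for name, items in SCORES.items()}
total_score = sum(category_totals.values())

for name, subtotal in category_totals.items():
    print(f"{name}: {subtotal}")
print(f"Total: {total_score} / 100")
```

The sums match the reported subtotals (21.5 / 30, 26.0 / 40, 21.5 / 30) and the overall score of 69 / 100.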
