
Kimi-VL-A3B-Instruct

Total Parameters

16B

Context Length

128K

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

MIT

Release Date

10 Apr 2025

Knowledge Cutoff

-

Technical Specifications

Total Expert Parameters

3.0B

Number of Experts

384

Active Experts

8

Attention Structure

Multi-Head Attention

Hidden Dimension Size

2048

Number of Layers

24

Attention Heads

16

Key-Value Heads

16

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

Absolute Position Embedding

Kimi-VL-A3B-Instruct

Kimi-VL-A3B-Instruct is a multimodal Mixture-of-Experts (MoE) vision-language model developed by Moonshot AI, designed for high-resolution visual perception and long-context reasoning. The model operates on a base architecture that integrates a native-resolution visual encoder, termed MoonViT, with a sparse MoE language decoder. This design facilitates the processing of diverse inputs including single and multi-image sets, video sequences, and extensive document formats. The model is instruction-tuned to support interactive chat and agentic workflows, emphasizing efficiency in both high-resolution image analysis and natural language understanding across extended sequences.

Technically, the model utilizes a sparse MoE language backbone named Moonlight, which contains 16 billion total parameters but activates only 2.8 billion parameters per token. This sparsity is achieved through a routing mechanism that selects 8 experts from a total pool of 384 available experts. The visual component, MoonViT, supports native resolution processing up to 1792x1792 pixels, allowing the model to maintain high fidelity for OCR and detailed graphical analysis without forced resizing. The architecture incorporates a variable-length sequence attention mechanism that is compatible with FlashAttention, ensuring computational efficiency when handling images of various aspect ratios and resolutions.
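The 8-of-384 expert selection described above can be sketched as a simple top-k gating step. This is an illustrative simplification, not Moonlight's exact router: the softmax placement, any load-balancing terms, and the weight shapes here are assumptions chosen to match the listed hidden size (2048) and expert counts.

```python
import numpy as np

def topk_route(hidden, gate_weight, top_k=8):
    """Illustrative sparse-MoE routing: score all experts, keep the top k.

    hidden:      (dim,) hidden state for one token
    gate_weight: (num_experts, dim) router projection
    Returns the chosen expert indices and their normalized mixing weights.
    """
    logits = gate_weight @ hidden                # one score per expert
    top = np.argsort(logits)[-top_k:][::-1]      # indices of the k best experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                         # softmax over the selected experts only
    return top, probs

rng = np.random.default_rng(0)
dim, num_experts = 2048, 384                     # values from the spec table above
experts, weights = topk_route(rng.normal(size=dim),
                              rng.normal(size=(num_experts, dim)))
```

Each token's output is then the weighted sum of the 8 selected experts' outputs, which is why compute per token scales with the active parameters (about 2.8B) rather than the full 16B.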

Kimi-VL-A3B-Instruct is optimized for complex multimodal tasks such as document parsing, long-form video comprehension, and interactive GUI agent operations. Its large context window of 128,000 tokens enables the ingestion of multiple high-resolution images or lengthy video clips alongside extensive textual prompts. By combining the efficiency of MoE with high-resolution visual encoding, the model is suited for applications requiring detailed visual grounding and the ability to reason over long-form, multi-source information in a conversational or agentic context.
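One practical consequence of MoE sparsity is that it reduces per-token compute but not weight storage: all 16B parameters must still be resident in memory even though only about 2.8B are active per token. A back-of-envelope sketch of the weight footprint at common precisions (ignoring KV cache, activations, and framework overhead, so these are lower bounds):

```python
def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Rough memory needed just to hold the model weights.

    Excludes KV cache, activations, and runtime overhead -- an
    illustrative lower bound, not a deployment requirement.
    """
    return total_params * bytes_per_param / 1024**3

TOTAL = 16e9  # all 16B parameters stay resident despite ~2.8B active per token
for name, bpp in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: ~{weight_memory_gb(TOTAL, bpp):.1f} GB")
```

At fp16 this works out to roughly 30 GB for weights alone; long 128K-token contexts add KV-cache memory on top of this.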

About Kimi-VL

Kimi-VL by Moonshot AI is an efficient, open-source Mixture-of-Experts vision-language model. It employs a native-resolution MoonViT encoder and an MoE language model, activating only about 2.8 billion parameters per token. The model handles high-resolution visual inputs and processes contexts up to 128K tokens. A "Thinking" variant provides enhanced long-horizon reasoning.



Evaluation Benchmarks

No evaluation benchmarks for Kimi-VL-A3B-Instruct available.
