Kimi-VL-A3B-Thinking: Specifications and GPU VRAM Requirements

Open Source

Open Weights

Total Parameters

16B

Context Length

128K

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

MIT License

Release Date

10 Apr 2025

Knowledge Cutoff

Technical Specifications

Total Expert Parameters

3.0B

Number of Experts

Active Experts

Attention Structure

Multi-Head Attention

Hidden Dimension Size

Number of Layers

Attention Heads

Key-Value Heads

Activation Function

Normalization

Position Embedding

Absolute Position Embedding

System Requirements

VRAM requirements for different quantization methods and context sizes

Kimi-VL-A3B-Thinking

Kimi-VL-A3B-Thinking is a vision-language model (VLM) developed by Moonshot AI that pairs a small active-parameter footprint with strong multi-step reasoning. It interprets and reasons over both visual and textual inputs, extending the capabilities of large language models into visual domains. The "Thinking" variant is further trained with long chain-of-thought (CoT) supervised fine-tuning and reinforcement learning to strengthen its multi-step reasoning.

Kimi-VL-A3B-Thinking is built on a Mixture-of-Experts (MoE) architecture with 16 billion total parameters, of which only about 2.8 billion are active during inference. It combines three components: an MoE language model, a native-resolution vision encoder called MoonViT, and an MLP projector that connects the two modalities. The language model derives from Moonshot AI's Moonlight LLM series and starts from a checkpoint pre-trained on a large text corpus. MoonViT processes high-resolution visual inputs, including both still images and video.
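In a typical deployment all experts stay resident in memory even though only a fraction is used at a time, so the weight footprint follows the 16 billion total rather than the 2.8 billion active parameters. The following is a rough back-of-the-envelope sketch, not a measurement. The bytes-per-parameter values are the usual sizes for each format, the function name is illustrative, and the estimate ignores activations, the key-value cache, and quantization overhead, so real usage will be higher.

```python
# Usual storage size per parameter for common weight formats.
# These are general figures, not values taken from this page.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

TOTAL_PARAMS = 16e9    # total parameters, as stated in the description
ACTIVE_PARAMS = 2.8e9  # approximate active parameters, as stated in the description


def weight_memory_gb(total_params: float, fmt: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[fmt] / 1e9


for fmt in ("fp16", "int8", "int4"):
    print(f"{fmt}: ~{weight_memory_gb(TOTAL_PARAMS, fmt):.0f} GB for weights")

# Share of the parameters that does work on a given forward pass.
print(f"active share: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under these assumptions the weights alone need roughly 32 GB at FP16 or bfloat16, 16 GB at 8-bit, and 8 GB at 4-bit, while about 17.5% of the parameters are active at once.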

The model targets advanced reasoning tasks, particularly mathematical problem-solving and long chain-of-thought reasoning. It also covers multi-turn agent interaction, college-level image and video comprehension, and optical character recognition (OCR). Its context window of up to 128,000 tokens supports long multi-turn conversations and the analysis of long documents or videos. It accepts single images, multiple images, and video as input. The model supports Flash-Attention 2 and runs natively in FP16 or bfloat16 precision for faster, more efficient inference.
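Context length also affects memory through the key-value cache, which grows linearly with the number of tokens. The sketch below uses the generic formula for standard multi-head or grouped-query attention. The layer count, KV-head count, and head dimension are hypothetical placeholders chosen only for illustration; they are not this model's configuration, which this page does not list, and an architecture that compresses its cache would need less.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB (1 GB = 1e9 bytes) for one sequence.

    The leading factor of 2 accounts for storing one key and one value
    vector per KV head, per layer, per token.
    """
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * tokens / 1e9


# HYPOTHETICAL architecture values, for illustration only.
EXAMPLE = {"layers": 32, "kv_heads": 8, "head_dim": 128}

for tokens in (1_024, 32_000, 128_000):
    size = kv_cache_gb(tokens, **EXAMPLE)
    print(f"{tokens:>7} tokens: ~{size:.2f} GB of KV cache at 16-bit")
```

With these placeholder values the cache is negligible at 1,024 tokens but reaches roughly 16.8 GB at 128,000 tokens, which is why context size matters for VRAM alongside the choice of quantization.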

About Kimi-VL

Kimi-VL by Moonshot AI is an efficient, open-source Mixture-of-Experts vision-language model. It pairs a native-resolution MoonViT encoder with an MoE language model that activates about 2.8 billion parameters during inference. It handles high-resolution visual inputs and contexts of up to 128K tokens. A "Thinking" variant adds enhanced long-horizon reasoning.

Other Kimi-VL Models

Kimi-VL-A3B-Instruct

Evaluation Benchmarks

No evaluation benchmarks for Kimi-VL-A3B-Thinking available.
