Kimi Linear 48B A3B Instruct: Specifications and GPU VRAM Requirements

Open Source

Open Weights

Total Parameters

48B

Context Length

1,048,576 tokens

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT

Release Date

1 Nov 2025

Knowledge Cutoff

Oct 2024

Technical Specifications

Active Parameters

3.0B

Number of Experts

128

Active Experts

Attention Structure

Hybrid: Kimi Delta Attention (KDA) and Multi-Head Latent Attention (MLA), 3:1

Hidden Dimension Size

4096

Number of Layers

Attention Heads

Key-Value Heads

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

Absolute Position Embedding

Kimi Linear 48B A3B Instruct

Kimi Linear 48B A3B Instruct is a large language model built on a hybrid linear attention architecture, designed to ease the memory and compute costs that standard Transformer attention incurs on long sequences. Its core innovation is the interleaving of Kimi Delta Attention (KDA) layers with Multi-Head Latent Attention (MLA) layers in a 3:1 ratio, that is, three KDA layers for every MLA layer. KDA builds on the Gated DeltaNet framework by adding a channel-wise gating mechanism that controls memory decay independently for each feature dimension. Each KDA layer therefore behaves like a recurrent neural network (RNN) with a fixed-size state, so its memory footprint stays constant regardless of sequence length.
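To make the fixed-size state concrete, here is a minimal sketch of one recurrent step of a channel-wise gated delta rule. It is an illustration under assumptions, not the model's implementation: the function name `kda_step`, the argument names, the toy dimensions, and the random inputs are all hypothetical, and real kernels process tokens in chunks rather than one at a time.

```python
import random

def kda_step(S, q, k, v, alpha, beta):
    """One recurrent step of a channel-wise gated delta rule (illustrative only).

    S     : d_k x d_v state matrix, stored as a list of rows
    q, k  : query and key vectors of length d_k
    v     : value vector of length d_v
    alpha : per-channel decay gates of length d_k, each in (0, 1]
    beta  : scalar write strength in [0, 1]
    """
    d_k, d_v = len(k), len(v)
    # 1) Channel-wise decay: every key channel forgets at its own rate.
    S = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Delta rule: read what the decayed state currently predicts for k ...
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    # ... and write back only the correction toward the new value v.
    S = [[S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
         for i in range(d_k)]
    # 3) Output: query the updated state.
    o = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, o

# Toy run (hypothetical sizes): the state never grows with sequence length.
rng = random.Random(0)
d_k, d_v = 4, 3
S = [[0.0] * d_v for _ in range(d_k)]
for _ in range(1000):
    k = [rng.gauss(0, 1) for _ in range(d_k)]
    norm = sum(x * x for x in k) ** 0.5
    k = [x / norm for x in k]                       # unit-norm key
    q = [rng.gauss(0, 1) for _ in range(d_k)]
    v = [rng.gauss(0, 1) for _ in range(d_v)]
    alpha = [rng.uniform(0.9, 1.0) for _ in range(d_k)]
    S, o = kda_step(S, q, k, v, alpha, beta=rng.random())
print(len(S), len(S[0]))  # 4 3
```

After 1,000 tokens the state is still a 4 x 3 matrix, whereas a KV cache would hold 1,000 key/value pairs. The per-channel `alpha` vector is what distinguishes this sketch from a single scalar decay gate.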

The model uses a Mixture-of-Experts (MoE) architecture: of its 48 billion total parameters, approximately 3 billion are active for any given token. This sparsity, combined with the hybrid attention structure, enables high-throughput inference and efficient long-context processing. The KDA layers use a specialized chunkwise algorithm based on Diagonal-Plus-Low-Rank (DPLR) transition matrices, which improves hardware utilization on modern accelerators. Global dependency modeling is left to the periodic MLA layers, while the KDA layers maintain a compact recurrent state, so the model balances expressive power against linear scaling in sequence length.
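The sparsity described above comes from routing each token to only a few experts. The following is a generic top-k routing sketch, not the model's actual router: the function names, the scalar toy experts, the expert count of 8, and `k=2` are all assumptions chosen for readability.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)          # subtract max for stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the selected experts' outputs; the others are never run."""
    out = 0.0
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)
    return out

# Hypothetical toy setup: 8 scalar "experts", 2 active per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -0.5, 1.0, 0.0, -1.0, 0.3, 0.2]
print([i for i, _ in route_top_k(logits, k=2)])  # [1, 3]
print(moe_forward(1.0, experts, logits, k=2))
```

Only the selected experts execute, which is why the per-token compute tracks the roughly 3 billion active parameters (about 6% of 48 billion) rather than the full parameter count.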

In practice, Kimi Linear 48B A3B Instruct is an efficient option for tasks that need very long context windows, supporting up to 1 million tokens. Because only one layer in four keeps a Key-Value (KV) cache, the architecture cuts KV cache requirements by approximately 75% compared with standard multi-head attention models. The smaller cache allows substantially higher decoding speeds in long-sequence applications such as document analysis and complex reasoning. The weights are released under the MIT license, and the model remains compatible with standard training and fine-tuning workflows.
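The 75% figure follows from the 3:1 interleaving. The sketch below shows the arithmetic under simplifying assumptions: the layer count of 8 is hypothetical (the page does not state it), every caching layer is assumed to cost the same per token, and any further compression inside the MLA layers is ignored.

```python
def layer_schedule(n_layers, kda_per_mla=3):
    """Repeat [KDA x kda_per_mla, MLA] until n_layers layers are assigned."""
    period = kda_per_mla + 1
    return ["MLA" if (i + 1) % period == 0 else "KDA" for i in range(n_layers)]

def kv_cache_fraction(schedule):
    """Fraction of layers that still keep a per-token KV cache (the MLA layers)."""
    return schedule.count("MLA") / len(schedule)

schedule = layer_schedule(8)  # hypothetical depth, for illustration only
print(schedule)
# ['KDA', 'KDA', 'KDA', 'MLA', 'KDA', 'KDA', 'KDA', 'MLA']
print(1 - kv_cache_fraction(schedule))  # 0.75
```

The KDA layers contribute only their fixed-size recurrent state, so the cache that grows with context length shrinks to one quarter of the all-attention baseline.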

About Kimi Linear

Moonshot AI's hybrid linear attention architecture with Kimi Delta Attention for efficient long-context processing.

GPU Requirements
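As a rough starting point, the memory needed for the weights alone can be estimated from the 48 billion total parameters and the bytes stored per parameter. This is a back-of-the-envelope sketch, not a substitute for a full calculator: the function name and the precision table are assumptions, and the estimate excludes the KV cache, the recurrent state, activations, and any quantization overhead such as scales.

```python
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

TOTAL_PARAMS = 48e9  # total parameters, from the model description

for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_memory_gb(TOTAL_PARAMS, nbytes):.0f} GB")
# fp16/bf16: ~96 GB
# int8: ~48 GB
# int4: ~24 GB
```

Although only about 3 billion parameters are active per token, all experts must be resident in memory, so the estimate uses the total parameter count rather than the active one.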
