Total Parameters
48B
Context Length
1,048,576 tokens (1M)
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT
Release Date
1 Nov 2025
Knowledge Cutoff
Oct 2024
Active Parameters
3.0B
Number of Experts
128
Active Experts
8
Attention Structure
Hybrid Linear Attention (KDA + MLA)
Hidden Dimension Size
4096
Number of Layers
36
Attention Heads
32
Key-Value Heads
1
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
Absolute Position Embedding
Kimi Linear 48B A3B Instruct is a large-scale language model built on a hybrid linear attention architecture designed to overcome the memory and computational constraints of traditional Transformer models. The core innovation is the interleaving of Kimi Delta Attention (KDA) with Multi-Head Latent Attention (MLA) at a 3:1 ratio, i.e. three KDA layers for every MLA layer. KDA extends the Gated DeltaNet framework with a channel-wise gating mechanism that controls memory decay independently for each feature dimension. This configuration turns the attention mechanism into a finite-state recurrent neural network (RNN), providing a constant-size memory footprint regardless of sequence length.
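The channel-wise gated delta rule can be sketched as a simple recurrent state update. This is a minimal illustration, not the production kernel: the real KDA implementation is chunkwise and uses DPLR transition matrices, and the dimensions, gate parameterization, and function names here are assumptions chosen for clarity.

```python
import numpy as np

def kda_step(S, k, v, alpha, beta):
    """One recurrent step of a channel-wise gated delta rule
    (simplified sketch of the idea behind Kimi Delta Attention).

    S     : (d_k, d_v) recurrent state ("fast-weight" memory)
    k, v  : (d_k,), (d_v,) key/value for the current token
    alpha : (d_k,) per-channel decay gate in (0, 1) -- the KDA twist:
            each feature dimension forgets at its own rate, unlike the
            scalar decay of plain Gated DeltaNet
    beta  : scalar write strength in (0, 1)
    """
    S = alpha[:, None] * S                 # channel-wise forgetting
    pred = S.T @ k                         # what the memory currently recalls for k
    S = S + beta * np.outer(k, v - pred)   # delta-rule correction toward v
    return S

def kda_readout(S, q):
    """Output for query q: O(d_k * d_v) per token, independent of sequence length."""
    return S.T @ q
```

Because the state `S` has a fixed shape, processing a longer sequence never grows memory use, which is exactly the "finite-state RNN" property described above.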
The model utilizes a Mixture-of-Experts (MoE) architecture to manage its 48 billion total parameters, with approximately 3 billion parameters active during any single forward pass. This sparsity, combined with the hybrid attention structure, facilitates high-throughput inference and efficient long-context processing. The KDA layers employ a specialized chunkwise algorithm based on Diagonal-Plus-Low-Rank (DPLR) transition matrices, which optimizes hardware utilization on modern accelerators. By offloading global dependency modeling to periodic MLA layers while maintaining local and recurrent state through KDA, the model achieves a balance between expressive power and linear scaling.
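The MoE sparsity described above amounts to routing each token through a small subset of experts. The following is a hedged sketch of top-k routing using the card's numbers (128 experts, 8 active); the router design, weight shapes, and function names are illustrative assumptions, not Moonshot AI's actual implementation.

```python
import numpy as np

def moe_forward(x, gate_W, experts, top_k=8):
    """Sparse Mixture-of-Experts forward pass for one token (sketch).

    x      : (d,) token hidden state
    gate_W : (d, n_experts) router weights
    experts: list of callables, experts[i](x) -> (d,)
    """
    logits = x @ gate_W                        # router scores, (n_experts,)
    top = np.argsort(logits)[-top_k:]          # indices of the top_k chosen experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                               # softmax over the selected experts only
    # Only the chosen experts run, so only ~3B of the 48B parameters
    # participate in this forward pass.
    return sum(wi * experts[i](x) for wi, i in zip(w, top))
```

The compute cost per token scales with the 8 active experts rather than all 128, which is what makes the 48B-parameter model inference-cheap.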
From an implementation perspective, Kimi Linear 48B A3B Instruct serves as a high-efficiency alternative for tasks requiring extensive context windows, supporting up to 1 million tokens. The architecture significantly reduces Key-Value (KV) cache requirements by approximately 75% compared to standard multi-head attention models. This reduction in memory overhead allows for substantially higher decoding speeds in long-sequence applications, such as document analysis and complex reasoning, while maintaining compatibility with standard training and fine-tuning workflows via its open-source MIT-licensed implementation.
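The ~75% KV-cache reduction follows directly from the 3:1 interleave: only the MLA layers (1 in 4) accumulate a per-token cache, while KDA layers hold a fixed-size recurrent state. The arithmetic can be sketched as below; the head dimension and byte sizing are illustrative assumptions, and MLA actually caches a compressed latent rather than full K/V, so real numbers differ.

```python
def kv_cache_bytes(n_layers, kda_to_mla_ratio, seq_len, kv_heads, head_dim,
                   bytes_per_elem=2):
    """Rough growing-cache size for the hybrid stack vs. a full-attention stack.

    Simplified multi-head K/V sizing (fp16); KDA layers contribute nothing
    here because their recurrent state does not grow with seq_len.
    """
    per_layer = 2 * seq_len * kv_heads * head_dim * bytes_per_elem  # K and V
    mla_layers = n_layers // (kda_to_mla_ratio + 1)                 # 1 in 4 layers
    return mla_layers * per_layer, n_layers * per_layer             # hybrid, full

hybrid, full = kv_cache_bytes(n_layers=36, kda_to_mla_ratio=3,
                              seq_len=1_048_576, kv_heads=1, head_dim=128)
print(f"hybrid cache is {hybrid / full:.0%} of full attention")  # -> 25%
```

With 36 layers, 9 are MLA, so the growing cache is a quarter of the full-attention baseline, i.e. the ~75% reduction cited above.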
Moonshot AI's hybrid linear attention architecture with Kimi Delta Attention for efficient long-context processing.
No evaluation benchmarks are currently available for Kimi Linear 48B A3B Instruct.