Active Parameters
3B
Context Length
1,048,576 tokens (1M)
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT
Release Date
1 Nov 2025
Knowledge Cutoff
-
Total Parameters
48B
Number of Experts
-
Active Experts
-
Attention Structure
Hybrid (Kimi Delta Attention + Multi-Head Latent Attention)
Hidden Dimension Size
-
Number of Layers
-
Attention Heads
-
Key-Value Heads
-
Activation Function
-
Normalization
-
Position Embedding
None (NoPE; positional information is carried by the KDA layers)
Kimi Linear is a large language model from Moonshot AI, distinguished by its hybrid linear attention architecture. This variant, Kimi Linear 48B A3B Instruct, interleaves Kimi Delta Attention (KDA) layers with Multi-Head Latent Attention (MLA) layers. KDA is a linear attention mechanism that extends Gated DeltaNet with a finer-grained, channel-wise gate: memory decay rates are controlled independently for each feature dimension, giving tighter regulation of the finite-state recurrent neural network (RNN) memory.
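The sketch below illustrates the idea of a channel-wise gated delta-rule update, contrasted with a single scalar decay. It is a minimal, illustrative toy and not Moonshot AI's implementation; the function name, tensor shapes, and gating values are assumptions for demonstration only.

```python
import torch

def gated_delta_step(S, k, v, beta, alpha):
    """
    One recurrent step of a delta-rule memory update (illustrative sketch only).

    S     : (d_k, d_v) finite-state memory matrix
    k     : (d_k,)     key for the current token
    v     : (d_v,)     value for the current token
    beta  : scalar     write-strength gate in [0, 1]
    alpha : (d_k,)     per-channel decay gates in [0, 1]
            (a single scalar here would give a Gated-DeltaNet-style decay;
             the channel-wise vector is the finer-grained control described above)
    """
    # Channel-wise forgetting: each key dimension decays at its own rate.
    S = alpha.unsqueeze(-1) * S
    # Delta-rule correction: write only the part of v not already predicted by S.
    pred = k @ S                              # (d_v,) current prediction for this key
    S = S + beta * torch.outer(k, v - pred)
    return S

# Usage: roll the fixed-size memory over a short sequence of token features.
d_k, d_v, T = 8, 8, 4
S = torch.zeros(d_k, d_v)
for _ in range(T):
    k, v = torch.randn(d_k), torch.randn(d_v)
    alpha = torch.sigmoid(torch.randn(d_k))   # per-channel decay
    beta = torch.sigmoid(torch.randn(()))     # scalar write strength
    S = gated_delta_step(S, k, v, beta, alpha)
```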
The Kimi Linear architecture interleaves KDA and MLA layers in a 3:1 ratio, so every fourth layer is a full-attention MLA layer. This combination balances computational efficiency with the ability to mix global information. The chunkwise algorithm inside KDA achieves hardware efficiency through a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices, which significantly reduces computational overhead relative to the general DPLR formulation while remaining more consistent with the classical delta rule.
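A toy sketch of the 3:1 interleaving pattern follows; the layer labels and helper function are placeholders for illustration, not the actual Kimi Linear modules or configuration.

```python
def build_layer_pattern(num_layers: int, ratio: int = 3) -> list[str]:
    """Every (ratio + 1)-th layer is a full-attention MLA layer; the rest are KDA."""
    return [
        "MLA" if (i + 1) % (ratio + 1) == 0 else "KDA"
        for i in range(num_layers)
    ]

print(build_layer_pattern(8))
# ['KDA', 'KDA', 'KDA', 'MLA', 'KDA', 'KDA', 'KDA', 'MLA']
```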
The design of Kimi Linear is particularly suited to workloads that need extended context processing and high decoding throughput. Because only the periodic MLA layers keep a growing key-value (KV) cache, KV-cache requirements drop by up to 75%, mitigating a common bottleneck in transformer architectures. This efficiency gain lets the model handle contexts of up to 1 million tokens and deliver up to 6x faster decoding throughput at that length. Kimi Linear functions as a drop-in replacement for traditional full attention, offering better performance and efficiency on tasks with long input and output sequences, including reinforcement learning workloads.
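A small back-of-the-envelope calculation shows where the "up to 75%" figure comes from under the 3:1 interleave: only one layer in four keeps a per-token KV cache, while the KDA layers hold a fixed-size recurrent state. The layer count and per-token cache width below are illustrative assumptions, not the model's actual dimensions.

```python
def kv_cache_bytes(cache_layers, seq_len, kv_dim, bytes_per_elem=2):
    """Approximate KV-cache size for the layers that keep a per-token cache."""
    return cache_layers * seq_len * kv_dim * bytes_per_elem

# Illustrative numbers only; not Kimi Linear's real dimensions.
layers, seq_len, kv_dim = 48, 1_048_576, 1024

full_attention = kv_cache_bytes(layers, seq_len, kv_dim)       # every layer caches KV
hybrid = kv_cache_bytes(layers // 4, seq_len, kv_dim)          # only the 1-in-4 MLA layers do

print(f"KV cache reduction: {1 - hybrid / full_attention:.0%}")  # -> 75%
```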
Moonshot AI's hybrid linear attention architecture with Kimi Delta Attention for efficient long-context processing.
Ranking is for Local LLMs.
No evaluation benchmarks are available for Kimi Linear 48B A3B Instruct.
Overall Rank
-
Coding Rank
-