
Kimi Linear 48B A3B Instruct

Total Parameters

48B

Context Length

1,048,576 tokens (1M)

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT

Release Date

1 Nov 2025

Knowledge Cutoff

-

Technical Specifications

Active Parameters

3.0B

Number of Experts

-

Active Experts

-

Attention Structure

Hybrid Linear Attention (KDA + MLA)

Hidden Dimension Size

-

Number of Layers

-

Attention Heads

-

Key-Value Heads

-

Activation Function

-

Normalization

-

Position Embedding

None (NoPE in MLA layers)

System Requirements

VRAM requirements for different quantization methods and context sizes

Kimi Linear 48B A3B Instruct

Kimi Linear is a large language model from Moonshot AI, distinguished by its hybrid linear attention architecture. This variant, Kimi Linear 48B A3B Instruct, interleaves Kimi Delta Attention (KDA) layers with Multi-Head Latent Attention (MLA) layers. KDA is a linear attention mechanism that extends Gated DeltaNet with a finer-grained, channel-wise gating mechanism. This design lets memory decay rates be controlled independently for each feature dimension, giving tighter regulation of the finite-state recurrent neural network (RNN) memory.
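
To make the gating idea concrete, here is a minimal, illustrative sketch of a delta-rule memory with channel-wise decay, written as a plain token-by-token recurrence in NumPy. The function name, tensor shapes, and scheduling are assumptions for exposition only; the actual KDA kernel uses a hardware-efficient chunkwise formulation rather than this loop.

```python
import numpy as np

def channelwise_gated_delta_rule(q, k, v, beta, decay):
    """Toy recurrent form of a channel-wise-gated delta rule (illustrative only).

    q, k: (T, d_k) queries/keys; v: (T, d_v) values;
    beta: (T,) write strengths in (0, 1);
    decay: (T, d_k) per-channel forget gates in (0, 1).
    A Gated DeltaNet-style gate would instead be a single scalar per step.
    """
    d_k, d_v = k.shape[1], v.shape[1]
    S = np.zeros((d_k, d_v))                           # finite-state RNN memory
    outputs = []
    for t in range(k.shape[0]):
        S = decay[t][:, None] * S                      # channel-wise memory decay
        pred = S.T @ k[t]                              # memory's current guess for v_t
        S = S + beta[t] * np.outer(k[t], v[t] - pred)  # delta-rule correction
        outputs.append(S.T @ q[t])                     # read out with the query
    return np.stack(outputs)
```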

The Kimi Linear architecture interleaves KDA layers with MLA layers at a 3:1 ratio, balancing computational efficiency against the ability to capture global information. The chunkwise algorithm inside KDA achieves hardware efficiency through a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while staying more closely aligned with the classical delta rule.
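
As a rough picture of the 3:1 interleaving, the snippet below enumerates a hypothetical layer stack in which every fourth layer is MLA. The exact placement of MLA layers within each block is an assumption, not a detail taken from the published architecture.

```python
def layer_pattern(num_layers: int, kda_per_mla: int = 3) -> list:
    """Hypothetical helper: interleave KDA and MLA layers at a 3:1 ratio,
    placing one MLA layer after every three KDA layers."""
    return [
        "MLA" if (i + 1) % (kda_per_mla + 1) == 0 else "KDA"
        for i in range(num_layers)
    ]

print(layer_pattern(8))
# ['KDA', 'KDA', 'KDA', 'MLA', 'KDA', 'KDA', 'KDA', 'MLA']
```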

Kimi Linear is particularly well suited to workloads that require long contexts and high decoding throughput. By cutting key-value (KV) cache requirements by up to 75%, it mitigates a common bottleneck of transformer architectures; this lets the model handle contexts of up to 1 million tokens and deliver up to 6x higher decoding throughput at that scale. Kimi Linear functions as a drop-in replacement for full attention, improving performance and efficiency on tasks with long input and output sequences, including reinforcement learning workloads.
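
The "up to 75%" figure follows from the layer ratio under a simple assumption, sketched below: if only the MLA layers (one in every four) keep a per-token KV cache, the KDA layers maintain a fixed-size recurrent state, and each caching layer contributes equally, the cache shrinks to roughly a quarter of a full-attention stack.

```python
# Back-of-the-envelope KV-cache estimate. Assumptions: only MLA layers
# (1 in every 4 under the 3:1 interleaving) store a per-token KV cache,
# KDA layers hold a fixed-size recurrent state independent of sequence
# length, and all caching layers are weighted equally.
mla_fraction = 1 / (3 + 1)              # share of layers that still cache KV
kv_cache_reduction = 1 - mla_fraction   # relative to an all-full-attention stack
print(f"approximate KV cache reduction: {kv_cache_reduction:.0%}")  # 75%
```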

About Kimi Linear

Moonshot AI's hybrid linear attention architecture with Kimi Delta Attention for efficient long-context processing.



Evaluation Benchmarks

Rankings cover local LLMs only.

No evaluation benchmarks are available for Kimi Linear 48B A3B Instruct.

Rankings

Overall Rank

-

Coding Rank

-

GPU Requirements

VRAM requirements depend on the chosen weight quantization method and on the context size, which ranges from 1K up to 1,024K tokens.