Kimi Linear 48B A3B Instruct

Active Parameters

3B (48B total)

Context Length

1.05M

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT

Release Date

1 Nov 2025

Knowledge Cutoff

Oct 2024

System Requirements

VRAM requirements for different quantization methods and context sizes

1,024 tokens

102.31 GB VRAM

Consumer

5x RTX 4090

24GB VRAM

Datacenter

2x NVIDIA A100

80GB VRAM

Apple Silicon

1x Apple M3 Max

128GB VRAM

1,048,576 tokens

113.72 GB VRAM

Consumer

6x RTX 4090

24GB VRAM

Datacenter

2x NVIDIA A100

80GB VRAM

Apple Silicon

1x Apple M3 Max

128GB VRAM
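
The device counts listed above can be roughly reproduced with a ceiling division. The sketch below is an illustration only: `devices_needed` is a hypothetical helper, and the `usable_fraction` of 0.9 (10% headroom per device) is an assumption chosen because it matches the listed counts, not a figure stated on this page.

```python
import math

def devices_needed(required_gb, device_gb, usable_fraction=0.9):
    """Smallest device count whose usable memory covers the requirement.

    usable_fraction is an assumed headroom factor, not a published value.
    """
    usable_gb = device_gb * usable_fraction
    return math.ceil(required_gb / usable_gb)

# VRAM figures and per-device capacities as listed above.
DEVICES = (("RTX 4090", 24), ("NVIDIA A100", 80), ("Apple M3 Max", 128))
REQUIREMENTS = (("1,024 tokens", 102.31), ("1,048,576 tokens", 113.72))

for context, required_gb in REQUIREMENTS:
    counts = {name: devices_needed(required_gb, gb) for name, gb in DEVICES}
    print(context, counts)
```

Going from 1,024 to 1,048,576 tokens raises the listed requirement by only 11.41 GB (102.31 GB to 113.72 GB), so most of the footprint comes from the weights rather than the context.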

Architecture Diagram

Input Tokens -> Token Embedding (Position: Absolute; Hidden: 4.1k; Context: 1.05M; Vocab: 163.8k) -> 36 x [RMSNorm (Pre-Attention) -> Multi-Head Attention (32 Q / 1 KV heads, head dim: 72) -> residual add -> RMSNorm (Pre-FFN) -> Sparse MoE FFN (8/128 experts, SwiGLU, intermediate: 1k) -> residual add] -> Final RMSNorm -> Output Logits

Evaluation Benchmarks

No evaluation benchmarks for Kimi Linear 48B A3B Instruct available.

Rankings

Overall Rank

-

Coding Rank

-

About Kimi Linear 48B A3B Instruct

Kimi Linear 48B A3B Instruct is a large language model built on a hybrid linear attention architecture, designed to ease the memory and compute costs of standard Transformer attention. Its core is Kimi Delta Attention (KDA) interleaved with Multi-Head Latent Attention (MLA) at a 3:1 ratio, that is, three KDA layers for every MLA layer. KDA extends the Gated DeltaNet framework with a channel-wise gating mechanism, so memory decay is controlled independently for each feature dimension. Each KDA layer therefore behaves like a recurrent neural network (RNN) with a fixed-size state, and its memory footprint stays constant regardless of sequence length.
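
To make the fixed-size state concrete, here is a minimal single-head sketch of a gated delta rule with per-channel decay. It is an illustration under stated assumptions, not the model's kernel: the update form (decay the state per key channel, then apply a delta-rule correction) is assumed from the description above, and `kda_step` and its argument names are hypothetical.

```python
def kda_step(state, q, k, v, alpha, beta):
    """One recurrent step of a gated delta rule with per-channel decay.

    state: d_k x d_v matrix as a list of rows (the fixed-size memory)
    q, k:  query and key vectors of length d_k
    v:     value vector of length d_v
    alpha: per-channel decay gates of length d_k (assumed to lie in [0, 1])
    beta:  scalar write strength
    """
    d_k, d_v = len(k), len(v)
    # 1) Channel-wise decay: key channel i is scaled by its own gate alpha[i].
    decayed = [[alpha[i] * state[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Delta rule: read what the decayed memory currently returns for key k ...
    pred = [sum(k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    # ... then write back the beta-scaled correction toward the target value v.
    new_state = [
        [decayed[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
        for i in range(d_k)
    ]
    # 3) Read out with the query.
    out = [sum(q[i] * new_state[i][j] for i in range(d_k)) for j in range(d_v)]
    return new_state, out

d_k, d_v = 4, 2
state = [[0.0] * d_v for _ in range(d_k)]
key = [1.0, 0.0, 0.0, 0.0]
state, out = kda_step(state, q=key, k=key, v=[2.0, 3.0], alpha=[1.0] * d_k, beta=1.0)
print(out)  # [2.0, 3.0]: the value just written is read back exactly
```

However many tokens are processed, `state` stays a `d_k` x `d_v` matrix, which is why these layers need no cache that grows with context.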

The model utilizes a Mixture-of-Experts (MoE) architecture to manage its 48 billion total parameters, with approximately 3 billion parameters active during any single forward pass. This sparsity, combined with the hybrid attention structure, facilitates high-throughput inference and efficient long-context processing. The KDA layers employ a specialized chunkwise algorithm based on Diagonal-Plus-Low-Rank (DPLR) transition matrices, which optimizes hardware utilization on modern accelerators. By offloading global dependency modeling to periodic MLA layers while maintaining local and recurrent state through KDA, the model achieves a balance between expressive power and linear scaling.
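
The sparsity described above comes from routing each token to a small subset of experts. The sketch below shows generic top-k routing in plain Python. It is not the model's router: `route_top_k` is a hypothetical helper, normalizing the selected logits with a softmax is an assumption, and the shared expert (which is not routed) is left out.

```python
import math

def route_top_k(router_logits, k=8):
    """Pick the k highest-scoring experts and normalize their weights.

    Returns (expert_index, weight) pairs, highest weight first.
    The softmax over the selected logits is an assumed gating choice.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Subtract the max logit before exponentiating for numerical stability.
    peak = router_logits[top[0]]
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Toy logits for 128 experts; the values are arbitrary but all distinct.
logits = [((i * 37) % 128) / 128 for i in range(128)]
for expert, weight in route_top_k(logits, k=8):
    print(f"expert {expert}: weight {weight:.3f}")
```

Only the selected experts run their feed-forward pass for that token, which is how a model with 48 billion total parameters can use about 3 billion per forward pass.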

From an implementation perspective, Kimi Linear 48B A3B Instruct is a high-efficiency option for tasks that need very long context windows, supporting up to 1 million tokens. The architecture reduces Key-Value (KV) cache requirements by approximately 75% compared to standard multi-head attention models, because only the MLA layers (one in every four) keep a cache that grows with context length. The lower memory overhead allows substantially higher decoding speeds in long-sequence applications such as document analysis and complex reasoning, while the open-source, MIT-licensed implementation stays compatible with standard training and fine-tuning workflows.
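
The 75% figure follows from the 3:1 layer ratio. The sketch below works through that arithmetic under simplifying assumptions: every layer of a full-attention baseline is taken to hold an equally sized KV cache, each MLA layer is counted as one such cache, and the 36-layer count comes from the specifications on this page. `layer_pattern` and `kv_cache_reduction` are hypothetical helpers.

```python
def layer_pattern(num_layers, kda_per_block=3, mla_per_block=1):
    """Repeat a block of KDA layers followed by MLA layers across the stack."""
    block = ["KDA"] * kda_per_block + ["MLA"] * mla_per_block
    return [block[i % len(block)] for i in range(num_layers)]

def kv_cache_reduction(pattern):
    """Fraction of layers that need no growing KV cache (the KDA layers)."""
    return pattern.count("KDA") / len(pattern)

pattern = layer_pattern(36)
print(pattern.count("KDA"), pattern.count("MLA"))  # 27 9
print(kv_cache_reduction(pattern))                 # 0.75
```

The actual saving depends on how compact each MLA cache is relative to a standard attention cache, so this ratio is best read as a first-order estimate.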

Technical Specifications

Attention

Attention Structure

Hybrid: Kimi Delta Attention (KDA) and Multi-Head Latent Attention (MLA), 3:1

Attention Heads

32

Key-Value Heads

1

Attention Head Dimension

72

Position Embedding

Absolute Position Embedding

RoPE Theta

10,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

4,096

Number of Layers

36

FFN Intermediate Size (Dense)

1,024

Multi-Token Prediction Heads

0

Tokenizer

Vocabulary Size

163,840

Mixture of Experts

Total Expert Parameters

3.0B

Number of Experts

128

Active Experts

8

Shared Experts

1

FFN Intermediate Size (per Expert)

1,024

Dense Layers Before MoE

1

Model Integrity

Total Score

B

70 / 100

Kimi Linear 48B A3B Instruct Model Integrity Report

Audit Note

Kimi Linear 48B A3B Instruct demonstrates high transparency in its architectural design and parameter density, supported by a detailed technical report and a permissive MIT license. However, it suffers from significant opacity regarding its training data composition and the environmental impact of its compute resources. While hardware requirements are well-documented for deployment, the lack of a formal versioning system and full evaluation reproducibility limits its overall transparency profile.

Upstream

20.5 / 30

Architectural Provenance

8.5 / 10

The model's architecture is extensively documented in the technical report 'Kimi Linear: An Expressive, Efficient Attention Architecture' (arXiv:2510.26692). It details a hybrid linear attention mechanism (Kimi Delta Attention or KDA) interleaved with Multi-Head Latent Attention (MLA) in a 3:1 ratio. The report provides mathematical formulations for the KDA layers, including the use of Diagonal-Plus-Low-Rank (DPLR) transition matrices and channel-wise gating. The implementation is publicly available via the 'fla-core' library and official GitHub repository, allowing for architectural verification.

Dataset Composition

4.0 / 10

While the total token count (5.7 trillion tokens) and the name of the pretraining corpus (K2) are disclosed, composition details such as the percentage breakdown of web data, code, and books are missing. There is no public documentation of the filtering, cleaning, or deduplication methods used for the K2 corpus, and no sample of the training data is available for public inspection. What is disclosed stays at the level of a marketing-style description of 'high-quality data'.

Tokenizer Integrity

8.0 / 10

The tokenizer is publicly accessible through the Hugging Face repository and can be loaded using standard 'AutoTokenizer' calls. It is based on the same tokenizer used in the Kimi K2 series, with a known vocabulary size and support for multilingual inputs. Documentation exists regarding its integration with the 'fla-core' library, though a detailed report on its training data alignment is not explicitly provided in the main technical paper.
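
A minimal sketch of the 'AutoTokenizer' loading path mentioned above. The repository ID and the need for `trust_remote_code=True` are assumptions to check against the Hugging Face model page, and `load_kimi_tokenizer` is a hypothetical wrapper; running it requires the `transformers` package and network access.

```python
# Assumed Hugging Face repository ID; verify it on the model page before use.
DEFAULT_REPO_ID = "moonshotai/Kimi-Linear-48B-A3B-Instruct"

def load_kimi_tokenizer(repo_id=DEFAULT_REPO_ID):
    """Load the tokenizer through the standard transformers entry point."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoTokenizer

    # trust_remote_code=True lets transformers run tokenizer code shipped in
    # the repository; it is assumed to be required here.
    return AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)

if __name__ == "__main__":
    tokenizer = load_kimi_tokenizer()
    print(len(tokenizer))  # number of entries in the loaded vocabulary
```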

Model

26.5 / 40

Parameter Density

9.0 / 10

Moonshot AI provides precise details regarding parameter density. The model is a Mixture-of-Experts (MoE) with 48 billion total parameters and approximately 3 billion active parameters per forward pass. The architecture activates 8 out of 256 experts per token, including one shared expert. This level of detail regarding sparse vs. dense parameter counts is exemplary and verifiable through the model's configuration files.

Training Compute

2.0 / 10

Information regarding training compute is extremely limited. While the technical report mentions the use of the 'Muon' optimizer and 'WSD' learning rate schedule, it does not disclose the total GPU/TPU hours, the specific hardware cluster used (e.g., number of H100s), or the training duration. Furthermore, there is no public calculation of the carbon footprint or estimated financial cost of the training run.

Benchmark Reproducibility

6.5 / 10

The technical report includes results for several standard benchmarks (MMLU-Pro, RULER, RepoQA, MATH500) and specifies the context lengths used (e.g., 4k for MMLU-Pro, 128k for RULER). However, the exact evaluation prompts and few-shot examples are not fully disclosed in a dedicated evaluation repository. While some third-party verification is available through community testing on platforms like OpenRouter, a fully reproducible evaluation suite is not provided.

Identity Consistency

9.0 / 10

The model consistently identifies as 'Kimi' or a Moonshot AI product in standard instruction-following tasks. It maintains version awareness (Kimi Linear) and is transparent about its hybrid architecture and 1-million-token context window capabilities. There are no documented cases of the model claiming to be a competitor's model (e.g., GPT-4) or denying its nature as an AI.

Downstream

22.5 / 30

License Clarity

10.0 / 10

The model weights and the KDA kernel code are released under the MIT License, which is a standard, highly permissive open-source license. The terms are clear, allowing for both commercial use and derivative works without the restrictive 'semi-free' clauses found in previous Moonshot AI releases. This represents the highest level of licensing transparency.

Hardware Footprint

7.5 / 10

Moonshot AI and community contributors have provided detailed VRAM requirements for various configurations. Official documentation states that the model can run on a single 80GB VRAM GPU (like an A100 or H100) for basic inference. Third-party documentation on Hugging Face (e.g., cyankiwi) provides specific VRAM footprints for 4-bit AWQ quantization (approx. 28.4 GB) vs. full precision (approx. 91.5 GB), including context length scaling estimates.

Versioning Drift

5.0 / 10

The model uses a clear naming convention (Kimi-Linear-48B-A3B-Instruct), but there is no public, centralized changelog or semantic versioning system for tracking updates to the weights or the underlying KDA kernels. While the GitHub repository tracks code changes, the 'silent' update of model weights on Hugging Face remains a possibility, and there is no formal deprecation policy for older checkpoints.
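
The section and total scores in this report can be re-derived from the individual criteria. The sketch below sums them. Rounding half up for the displayed total is an assumption about how 69.5 becomes 70, and the rule that maps the score to the letter grade is not stated on this page, so it is not modeled.

```python
import math

# Criterion scores as listed in the report, grouped by section.
scores = {
    "Upstream": {
        "Architectural Provenance": 8.5,
        "Dataset Composition": 4.0,
        "Tokenizer Integrity": 8.0,
    },
    "Model": {
        "Parameter Density": 9.0,
        "Training Compute": 2.0,
        "Benchmark Reproducibility": 6.5,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 10.0,
        "Hardware Footprint": 7.5,
        "Versioning Drift": 5.0,
    },
}

section_totals = {name: sum(items.values()) for name, items in scores.items()}
total = sum(section_totals.values())
# Assumed display rule: round half up to a whole number.
displayed_total = math.floor(total + 0.5)

print(section_totals)   # {'Upstream': 20.5, 'Model': 26.5, 'Downstream': 22.5}
print(total)            # 69.5
print(displayed_total)  # 70
```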

About Kimi Linear

Moonshot AI's hybrid linear attention architecture with Kimi Delta Attention for efficient long-context processing.

