Total Parameters
48B (approx. 3B active)
Context Length
1.05M
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT
Release Date
1 Nov 2025
Knowledge Cutoff
Oct 2024
VRAM requirements for different quantization methods and context sizes
1,024 tokens
Consumer: 5x RTX 4090 (24GB VRAM each)
Datacenter: 2x NVIDIA A100 (80GB VRAM each)
Apple Silicon: 1x Apple M3 Max (128GB VRAM)
1,048,576 tokens
Consumer: 6x RTX 4090 (24GB VRAM each)
Datacenter: 2x NVIDIA A100 (80GB VRAM each)
Apple Silicon: 1x Apple M3 Max (128GB VRAM)
No evaluation benchmarks for Kimi Linear 48B A3B Instruct available.
Overall Rank
-
Coding Rank
-
Kimi Linear 48B A3B Instruct is a large language model built on a hybrid linear attention architecture, designed to ease the memory and compute constraints of standard Transformer attention. Its core design interleaves Kimi Delta Attention (KDA) layers with Multi-Head Latent Attention (MLA) layers in a 3:1 ratio. KDA builds on the Gated DeltaNet framework by adding a channel-wise gating mechanism that controls memory decay independently for each feature dimension. Each KDA layer behaves like a recurrent neural network (RNN) with a fixed-size state, so its memory footprint stays constant regardless of sequence length; only the MLA layers keep a cache that grows with the sequence.
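The recurrent view described above can be illustrated with a single-head, pure-Python sketch of a channel-wise gated delta-rule update. This is a simplification under stated assumptions, not the model's kernel: the function name `kda_step`, the scalar write strength `beta`, and the exact update form are illustrative choices, and a production implementation would use a chunkwise parallel algorithm rather than a token-by-token loop.

```python
def kda_step(state, q, k, v, alpha, beta):
    """One recurrent step of a channel-wise gated delta rule (illustrative).

    state: d_k x d_v matrix (list of lists) holding the associative memory
    q, k:  query and key vectors of length d_k
    v:     value vector of length d_v
    alpha: per-channel decay in [0, 1], length d_k (assumed gating form)
    beta:  scalar write strength in [0, 1]
    """
    d_k, d_v = len(state), len(state[0])
    # 1) Channel-wise decay: each key channel forgets at its own rate.
    decayed = [[alpha[i] * state[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) What the decayed memory currently returns for this key.
    pred = [sum(k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Delta rule: move the stored value for k toward v by a fraction beta.
    new_state = [
        [decayed[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
        for i in range(d_k)
    ]
    # 4) Read out with the query.
    out = [sum(q[i] * new_state[i][j] for i in range(d_k)) for j in range(d_v)]
    return new_state, out


state = [[0.0, 0.0], [0.0, 0.0]]
state, out = kda_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0],
                      alpha=[1.0, 1.0], beta=1.0)
print(out)  # [3.0, 4.0]
```

However many tokens are processed, `state` stays a `d_k x d_v` matrix, which is the sense in which the memory footprint of such a layer does not depend on sequence length.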
The model utilizes a Mixture-of-Experts (MoE) architecture to manage its 48 billion total parameters, with approximately 3 billion parameters active during any single forward pass. This sparsity, combined with the hybrid attention structure, facilitates high-throughput inference and efficient long-context processing. The KDA layers employ a specialized chunkwise algorithm based on Diagonal-Plus-Low-Rank (DPLR) transition matrices, which optimizes hardware utilization on modern accelerators. By offloading global dependency modeling to periodic MLA layers while maintaining local and recurrent state through KDA, the model achieves a balance between expressive power and linear scaling.
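The periodic placement of MLA layers can be pictured with a small sketch. It assumes the 36 layers listed in the specification table and assumes that each group of four layers ends with the MLA layer; the actual ordering inside a group is not stated here, and `layer_schedule` is a hypothetical helper.

```python
def layer_schedule(num_layers, kda_per_mla=3):
    """Return a list of layer types with one MLA layer after every
    `kda_per_mla` KDA layers (the position of MLA is an assumption)."""
    period = kda_per_mla + 1
    return ["MLA" if (i + 1) % period == 0 else "KDA" for i in range(num_layers)]


schedule = layer_schedule(36)
print(schedule[:8])
# ['KDA', 'KDA', 'KDA', 'MLA', 'KDA', 'KDA', 'KDA', 'MLA']
print(schedule.count("KDA"), schedule.count("MLA"))  # 27 9
```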
From an implementation perspective, Kimi Linear 48B A3B Instruct is a high-efficiency option for tasks that need very long context windows, supporting up to 1 million tokens. The architecture reduces Key-Value (KV) cache requirements by approximately 75% compared to standard multi-head attention models. The lower memory overhead allows substantially higher decoding speeds in long-sequence applications such as document analysis and complex reasoning, and the model remains compatible with standard training and fine-tuning workflows through its open-source, MIT-licensed implementation.
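One plausible reading of the 75% figure is that it follows from the 3:1 layer ratio: if only the MLA layers keep a cache that grows with sequence length, one layer in four contributes to it. The sketch below does that arithmetic under this assumption; it ignores the fixed-size KDA state and any per-layer differences in cache width, and `kv_cache_fraction` is a hypothetical helper.

```python
def kv_cache_fraction(kda_layers, mla_layers):
    """Fraction of layers that still hold a sequence-length-dependent cache,
    assuming KDA layers keep only a fixed-size recurrent state."""
    return mla_layers / (kda_layers + mla_layers)


remaining = kv_cache_fraction(kda_layers=3, mla_layers=1)
print(f"cache kept: {remaining:.0%}, reduction: {1 - remaining:.0%}")
# cache kept: 25%, reduction: 75%
```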
Attention
Attention Structure
Hybrid: Kimi Delta Attention (KDA) and Multi-Head Latent Attention (MLA), interleaved 3:1
Attention Heads
32
Key-Value Heads
1
Attention Head Dimension
72
Position Embedding
Absolute Position Embedding
RoPE Theta
10,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
4,096
Number of Layers
36
FFN Intermediate Size (Dense)
1,024
Multi-Token Prediction Heads
0
Tokenizer
Vocabulary Size
163,840
Mixture of Experts
Total Expert Parameters
3.0B
Number of Experts
256
Active Experts
8
Shared Experts
1
FFN Intermediate Size (per Expert)
1,024
Dense Layers Before MoE
1
Total Score
70 / 100
Kimi Linear 48B A3B Instruct demonstrates high transparency in its architectural design and parameter density, supported by a detailed technical report and a permissive MIT license. However, it suffers from significant opacity regarding its training data composition and the environmental impact of its compute resources. While hardware requirements are well-documented for deployment, the lack of a formal versioning system and full evaluation reproducibility limits its overall transparency profile.
Architectural Provenance
The model's architecture is extensively documented in the technical report 'Kimi Linear: An Expressive, Efficient Attention Architecture' (arXiv:2510.26692). It details a hybrid linear attention mechanism (Kimi Delta Attention or KDA) interleaved with Multi-Head Latent Attention (MLA) in a 3:1 ratio. The report provides mathematical formulations for the KDA layers, including the use of Diagonal-Plus-Low-Rank (DPLR) transition matrices and channel-wise gating. The implementation is publicly available via the 'fla-core' library and official GitHub repository, allowing for architectural verification.
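As a reading aid, the connection between channel-wise gating and a Diagonal-Plus-Low-Rank transition can be written out. The update form below is an assumed, simplified single-head version, with state S_t, key k_t, value v_t, per-channel decay alpha_t, and write strength beta_t; the report's own notation and details may differ.

```latex
\begin{aligned}
S_t &= \left(I - \beta_t k_t k_t^{\top}\right)\operatorname{Diag}(\alpha_t)\, S_{t-1}
       + \beta_t k_t v_t^{\top} \\
    &= \Bigl(\underbrace{\operatorname{Diag}(\alpha_t)}_{\text{diagonal}}
       - \underbrace{\beta_t k_t \,(\alpha_t \odot k_t)^{\top}}_{\text{rank one}}\Bigr) S_{t-1}
       + \beta_t k_t v_t^{\top}
\end{aligned}
```

The second line uses k_t k_t^T Diag(alpha_t) = k_t (alpha_t ⊙ k_t)^T, so the transition matrix applied to S_{t-1} is a diagonal matrix minus a rank-one term.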
Dataset Composition
While the total token count (5.7 trillion tokens) and the name of the pretraining corpus (K2) are disclosed, composition details such as the percentage breakdown of web data, code, and books are missing. There is no public documentation of the filtering, cleaning, or deduplication methods used for the K2 corpus, and no sample of the training data is available for public inspection. What is disclosed stays at the level of a marketing-style description of 'high-quality data'.
Tokenizer Integrity
The tokenizer is publicly accessible through the Hugging Face repository and can be loaded using standard 'AutoTokenizer' calls. It is based on the same tokenizer used in the Kimi K2 series, with a known vocabulary size and support for multilingual inputs. Documentation exists regarding its integration with the 'fla-core' library, though a detailed report on its training data alignment is not explicitly provided in the main technical paper.
Parameter Density
Moonshot AI provides precise details regarding parameter density. The model is a Mixture-of-Experts (MoE) with 48 billion total parameters and approximately 3 billion active parameters per forward pass. The architecture activates 8 out of 256 experts per token, including one shared expert. This level of detail regarding sparse vs. dense parameter counts is exemplary and verifiable through the model's configuration files.
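To make the sparse activation concrete, here is a generic softmax top-k routing sketch. It illustrates the general technique only: the model's actual router (scoring function, shared-expert handling, any load balancing) is not described here, and `top_k_route` is a hypothetical helper. The last line simply restates the 3B-of-48B ratio as a percentage.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their weights."""
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]


routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print([i for i, _ in routes])                         # [1, 3]
print(f"active share of parameters: {3 / 48:.2%}")    # 6.25%
```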
Training Compute
Information regarding training compute is extremely limited. While the technical report mentions the use of the 'Muon' optimizer and 'WSD' learning rate schedule, it does not disclose the total GPU/TPU hours, the specific hardware cluster used (e.g., number of H100s), or the training duration. Furthermore, there is no public calculation of the carbon footprint or estimated financial cost of the training run.
Benchmark Reproducibility
The technical report includes results for several standard benchmarks (MMLU-Pro, RULER, RepoQA, MATH500) and specifies the context lengths used (e.g., 4k for MMLU-Pro, 128k for RULER). However, the exact evaluation prompts and few-shot examples are not fully disclosed in a dedicated evaluation repository. While some third-party verification is available through community testing on platforms like OpenRouter, a fully reproducible evaluation suite is not provided.
Identity Consistency
The model consistently identifies as 'Kimi' or a Moonshot AI product in standard instruction-following tasks. It maintains version awareness (Kimi Linear) and is transparent about its hybrid architecture and 1-million-token context window capabilities. There are no documented cases of the model claiming to be a competitor's model (e.g., GPT-4) or denying its nature as an AI.
License Clarity
The model weights and the KDA kernel code are released under the MIT License, which is a standard, highly permissive open-source license. The terms are clear, allowing for both commercial use and derivative works without the restrictive 'semi-free' clauses found in previous Moonshot AI releases. This represents the highest level of licensing transparency.
Hardware Footprint
Moonshot AI and community contributors have provided detailed VRAM requirements for various configurations. Official documentation states that the model can run on a single 80GB VRAM GPU (like an A100 or H100) for basic inference. Third-party documentation on Hugging Face (e.g., cyankiwi) provides specific VRAM footprints for 4-bit AWQ quantization (approx. 28.4 GB) vs. full precision (approx. 91.5 GB), including context length scaling estimates.
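A rough weights-only estimate helps put such figures in context. The sketch below assumes every parameter is stored at the same bit width and counts nothing else, so it will not match the reported footprints exactly: real deployments also hold activations, caches, quantization metadata, and runtime buffers, and some layers may stay in higher precision. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Weights-only footprint in decimal GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


print(weight_memory_gb(48, 16))  # 96.0  (16-bit weights)
print(weight_memory_gb(48, 4))   # 24.0  (4-bit weights)
```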
Versioning Drift
The model uses a clear naming convention (Kimi-Linear-48B-A3B-Instruct), but there is no public, centralized changelog or semantic versioning system for tracking updates to the weights or the underlying KDA kernels. While the GitHub repository tracks code changes, the 'silent' update of model weights on Hugging Face remains a possibility, and there is no formal deprecation policy for older checkpoints.
Moonshot AI's hybrid linear attention architecture with Kimi Delta Attention for efficient long-context processing.