Total Parameters
1T
Context Length
128K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
Modified MIT License
Release Date
11 Jul 2025
Knowledge Cutoff
-
Attention
Attention Structure
Multi-head Latent Attention (MLA)
Attention Heads
64
Key-Value Heads
64
Attention Head Dimension
-
Position Embedding
RoPE
RoPE Theta
50,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
7,168
Number of Layers
61
FFN Intermediate Size (Dense)
2,048
Multi-Token Prediction Heads
0
Tokenizer
Vocabulary Size
163,840
Mixture of Experts
Active Parameters (per Token)
32.0B
Number of Experts
384
Active Experts
8
Shared Experts
1
FFN Intermediate Size (per Expert)
2,048
Dense Layers Before MoE
1
Kimi K2-Base is a foundational large language model developed by Moonshot AI, designed for researchers and developers who require a customizable base for specific applications. It is engineered to facilitate agentic tasks, encompassing advanced code generation, multi-step problem-solving, and the autonomous utilization of external tools and APIs. This model provides a robust platform for developing tailored AI systems across diverse domains, such as legal analysis, scientific research, and specialized conversational interfaces.
Architecturally, Kimi K2-Base is a Mixture-of-Experts (MoE) transformer. It has 1 trillion total parameters, of which 32 billion are activated per token. Each MoE layer holds 384 routed experts, and a router selects 8 of them for every token. A key innovation in its development is the MuonClip optimizer, developed by Moonshot AI, which addresses training instability in large-scale models by mitigating exploding attention logits. The model has 61 layers, a hidden dimension of 7,168, 64 attention heads, and SwiGLU activation functions.
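To make the "8 of 384 experts per token" selection concrete, here is a minimal sketch of generic top-k routing. It assumes a softmax over the selected gate logits for the mixing weights; that normalization, the `route_token` helper, and the random logits are illustrative assumptions and not Kimi K2's documented gating function.

```python
import math
import random

NUM_EXPERTS = 384  # routed experts per MoE layer (from the spec above)
TOP_K = 8          # experts selected per token (from the spec above)


def route_token(gate_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    This normalization is an illustrative assumption, not Kimi K2's
    documented gating function.
    """
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(gate_logits[i] for i in chosen)  # subtract the peak for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Hypothetical gate logits for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selection = route_token(logits)
```

Only the selected experts run their feed-forward pass for that token, which is why the active parameter count stays far below the total.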
Kimi K2-Base supports a context window of 128,000 tokens, allowing it to process extended inputs and long multi-turn interactions. This makes it suitable for applications that require extensive contextual understanding. The model is optimized for agentic use: interpreting goals and executing complex workflows without continuous human intervention. It was pre-trained on 15.5 trillion tokens, which supports its performance across knowledge, reasoning, and coding tasks.
Moonshot AI's Kimi K2 is a Mixture-of-Experts model featuring one trillion total parameters, activating 32 billion per token. Designed for agentic intelligence, it utilizes a sparse architecture with 384 experts and the MuonClip optimizer for training stability, supporting a 128K token context window.
Rank
#70
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| Summarization | ProLLM Summarization | 0.93 | 6 |
| General Knowledge | MMLU | 0.878 | 7 |
| Graduate-Level QA | GPQA | 0.758 | 34 |
Overall Rank
#70
Coding Rank
-
Total Score
66 / 100
Kimi K2-Base demonstrates strong transparency in its architectural specifications and parameter density, providing clear distinctions between total and active parameters. While the model's technical innovations and hardware requirements are well-documented, it suffers from significant opacity regarding its training data sources and compute resources. The transparency profile is further complicated by custom licensing terms and concerns regarding benchmark data integrity in its derivative models.
Architectural Provenance
Kimi K2-Base is explicitly documented as a Mixture-of-Experts (MoE) transformer with 1.04 trillion total parameters and 32 billion active parameters. The architecture is detailed in a technical report and GitHub documentation, specifying 61 layers, 384 experts (8 selected per token), and a hidden dimension of 7168. It utilizes Multi-head Latent Attention (MLA) and the SwiGLU activation function. A significant technical disclosure is the use of the 'MuonClip' optimizer with a 'qk-clip' mechanism to stabilize training at scale, which is a high level of transparency for a foundational model.
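The idea behind a 'qk-clip' style mechanism can be sketched as follows: when the largest attention logit exceeds a threshold, the query and key sides are each scaled down so that their product lands back on the threshold. This is a simplified illustration only. The threshold `TAU`, the even square-root split between queries and keys, and scaling the vectors directly (rather than the projection weights that produce them) are assumptions; the source does not spell out the exact rule.

```python
import math


def max_attention_logit(queries, keys):
    """Largest scaled dot product q.k / sqrt(d) over all query/key pairs."""
    d = len(queries[0])
    return max(
        sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
        for q in queries
        for k in keys
    )


def qk_clip_factor(max_logit, threshold):
    """Per-side scale factor for the query and key sides.

    Returns 1.0 when the logit is already within the threshold. Otherwise
    returns sqrt(threshold / max_logit), so that scaling both sides shrinks
    the largest logit to the threshold.
    """
    if max_logit <= threshold:
        return 1.0
    return math.sqrt(threshold / max_logit)


TAU = 100.0  # hypothetical threshold, chosen for illustration

# Toy queries and keys for one attention head (hypothetical values).
queries = [[30.0, 40.0], [1.0, 2.0]]
keys = [[30.0, 40.0], [0.5, -1.0]]

before = max_attention_logit(queries, keys)
scale = qk_clip_factor(before, TAU)
clipped_q = [[scale * x for x in q] for q in queries]
clipped_k = [[scale * x for x in k] for k in keys]
after = max_attention_logit(clipped_q, clipped_k)
```

Because every logit is multiplied by the same factor `scale * scale`, the relative ordering of attention scores is preserved while the largest one is bounded.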
Dataset Composition
While Moonshot AI discloses the scale of the pre-training data (15.5 trillion tokens), the specific composition and sources remain largely opaque. Documentation mentions a 'diverse mixture of web text, books, code, and multilingual content' but lacks a granular percentage breakdown or specific source names. The mention of an 'agentic data synthesis pipeline' for post-training provides some insight into the methodology for instruction tuning, but the upstream pre-training data provenance is described only in vague, high-level terms.
Tokenizer Integrity
The tokenizer is publicly accessible via the model weights on Hugging Face and is well-documented. It features a large vocabulary of 163,840 tokens (roughly 160K), supporting multilingual text and code. Independent analysis by third parties (e.g., Unsloth) has verified its regex patterns and handling of Chinese characters, noting its similarity to GPT-4o's tokenizer with specific optimizations for Han characters. The EOS token and chat templates are clearly defined in the repository.
Parameter Density
Moonshot AI provides exemplary transparency regarding parameter density. They clearly distinguish between the 1 trillion total parameters and the 32 billion active parameters per token. The documentation further breaks down the expert structure (384 total experts, 8 active per token, 1 shared expert) and provides specific dimensions for the attention and MoE layers. This prevents the common 'parameter inflation' marketing trap often seen with MoE models.
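The published dimensions can be cross-checked with simple arithmetic. The sketch below assumes each SwiGLU expert consists of three bias-free weight matrices (gate, up, down) of shape hidden x intermediate, and that every layer after the single dense layer is an MoE layer; both are assumptions about the layout rather than statements from the documentation.

```python
HIDDEN = 7_168            # hidden dimension size (from the spec)
EXPERT_FFN = 2_048        # FFN intermediate size per expert (from the spec)
LAYERS = 61               # number of layers (from the spec)
DENSE_LAYERS = 1          # dense layers before MoE (from the spec)
ROUTED_EXPERTS = 384      # routed experts per MoE layer (from the spec)
ACTIVE_ROUTED = 8         # routed experts selected per token (from the spec)
SHARED_EXPERTS = 1        # always-on shared experts (from the spec)

TOTAL_PARAMS = 1_000_000_000_000   # "1 trillion total parameters"
ACTIVE_PARAMS = 32_000_000_000     # "32 billion active parameters"

# Assumption: a SwiGLU expert has gate, up, and down projections, no biases.
params_per_expert = 3 * HIDDEN * EXPERT_FFN

moe_layers = LAYERS - DENSE_LAYERS

# Parameters held in routed experts across the whole model.
routed_expert_params = params_per_expert * ROUTED_EXPERTS * moe_layers

# Expert parameters actually used for one token (routed plus shared).
active_expert_params = params_per_expert * (ACTIVE_ROUTED + SHARED_EXPERTS) * moe_layers

active_param_fraction = ACTIVE_PARAMS / TOTAL_PARAMS     # share of parameters used per token
routed_expert_fraction = ACTIVE_ROUTED / ROUTED_EXPERTS  # share of routed experts used per token

print(f"routed expert parameters: {routed_expert_params / 1e12:.3f}T")
print(f"active expert parameters: {active_expert_params / 1e9:.1f}B")
print(f"active parameter share:   {active_param_fraction:.1%}")
print(f"routed experts selected:  {routed_expert_fraction:.2%}")
```

Under these assumptions the routed experts alone account for roughly 1 trillion parameters, consistent with the stated total. The active parameter share is higher than the share of routed experts selected because attention, embeddings, the shared expert, and the dense layer run for every token.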
Training Compute
Information regarding training compute is minimal. While unofficial reports and leaks suggest a training cost of approximately $4.6 million using H800 GPUs, Moonshot AI's official stance in AMAs is that costs are 'hard to quantify.' There is no public disclosure of total GPU/TPU hours, specific hardware cluster configurations, or the carbon footprint associated with the 15.5 trillion token training run.
Benchmark Reproducibility
The model provides a comprehensive list of benchmark results (MMLU, GPQA, SWE-bench, etc.) and a technical report. However, the evaluation code is not fully public, and exact prompts for all benchmarks are not disclosed. Moonshot AI provides 'Best Practices for Benchmarking' documentation with recommended settings (temperature, top_p, max_tokens), but the lack of a reproducible evaluation harness and the discovery of significant contamination in related 'Thinking' variants by independent labs (e.g., ETH Zurich) necessitate a cautious score.
Identity Consistency
Kimi K2-Base maintains a consistent identity across its documentation and API. It correctly identifies its versioning (K2 vs K2.5) and its nature as a foundational model. There are no documented instances of the base model claiming to be a competitor's model or misrepresenting its core MoE architecture. The model card clearly distinguishes between the Base, Instruct, and Thinking variants.
License Clarity
The model is released under a 'Modified MIT License.' The license is publicly available on GitHub and is largely permissive, allowing for research and commercial use. However, it includes a specific 'attribution' clause for high-scale commercial users (>100M MAU or >$20M monthly revenue), requiring prominent display of 'Kimi K2' on the UI. While clear, this custom modification moves it away from a standard Open Source definition.
Hardware Footprint
Hardware requirements are well-documented for various precisions. Official and third-party guides (e.g., ApX, RunPod) provide VRAM estimates for FP16 (~2.4TB), INT8 (~1TB), and INT4 (~600GB). The model was notably developed with 'Quantization-Aware Training' (QAT) for native INT4 support, and documentation explicitly discusses the memory trade-offs and the need for multi-GPU clusters (e.g., 8x H200/B200) to run the full 1T parameter model.
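A rough lower bound on memory can be derived from the parameter count and the bits stored per weight. The sketch below counts weights only and uses decimal terabytes; the helper name `weight_bytes` is illustrative. The higher figures quoted above presumably include runtime overhead such as KV cache, activations, or quantization metadata, though the source does not itemize it.

```python
TOTAL_PARAMS = 1_000_000_000_000  # "1T parameter model" (from the text above)

# Bits stored per weight at each precision.
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_bytes(num_params, bits):
    """Bytes needed to store the weights alone, with no runtime overhead."""
    return num_params * bits // 8


# Weight-only footprint in decimal terabytes (1 TB = 1e12 bytes).
weights_tb = {
    name: weight_bytes(TOTAL_PARAMS, bits) / 1e12
    for name, bits in BITS_PER_WEIGHT.items()
}

for name, tb in weights_tb.items():
    print(f"{name}: at least {tb:.1f} TB for weights")
```

Halving the bits per weight halves the weight footprint, which is why INT4 brings the full model within reach of a single multi-GPU node.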
Versioning Drift
Moonshot AI uses date-based versioning (e.g., 0711, 0905) and maintains a basic changelog on Hugging Face for newer iterations like K2.5. However, for the K2-Base model, there is limited documentation regarding weight updates or silent changes. While they provide deprecation notices for certain tokens or system prompts in the Instruct/Thinking versions, the Base model's versioning history is less granular.