Kimi K2-Base

Open Source

Open Weights

Active Parameters

Context Length

128K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

Modified MIT License

Release Date

11 Jul 2025

Knowledge Cutoff

System Requirements

VRAM requirements for different quantization methods and context sizes

Context Size	VRAM Required	Consumer (RTX 4090, 24GB VRAM)	Datacenter (NVIDIA A100, 80GB VRAM)	Apple Silicon (Apple M3 Max, 128GB VRAM)
1,024 tokens	2103.65 GB	140x	34x	30x
128,000 tokens	2370.15 GB	162x	39x	35x
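
The two VRAM totals listed above differ only in context size, so their difference gives a rough per-token memory cost for the longer context. The sketch below does that arithmetic; it assumes the figures are in decimal gigabytes and that memory grows linearly with context length, neither of which is stated on this page. The function name is illustrative.

```python
# VRAM totals copied from the figures above (assumed to be decimal GB).
VRAM_GB = {1_024: 2103.65, 128_000: 2370.15}


def per_token_overhead_mb(vram_by_context):
    """Extra memory per additional context token, in MB.

    Assumes memory grows linearly between the two measured context sizes.
    """
    (short_ctx, short_gb), (long_ctx, long_gb) = sorted(vram_by_context.items())
    extra_gb = long_gb - short_gb
    extra_tokens = long_ctx - short_ctx
    return extra_gb * 1000 / extra_tokens


overhead_mb = per_token_overhead_mb(VRAM_GB)
print(f"Extra VRAM from 1,024 to 128,000 tokens: {VRAM_GB[128_000] - VRAM_GB[1_024]:.2f} GB")
print(f"Approximate cost per additional token: {overhead_mb:.2f} MB")
```

Under these assumptions, almost all of the footprint is the fixed cost of holding the weights, and the context-dependent share is comparatively small.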

Evaluation Benchmarks

Rank

#70

Category	Benchmark	Score	Rank
Summarization	ProLLM Summarization	0.93	6
General Knowledge	MMLU	0.878	7
Graduate-Level QA	GPQA	0.758	34

Rankings

Overall Rank

#70

Coding Rank

About Kimi K2-Base

Kimi K2-Base is a foundational large language model developed by Moonshot AI for researchers and developers who need a customizable base for specific applications. It is engineered for agentic tasks, including code generation, multi-step problem-solving, and autonomous use of external tools and APIs. The model serves as a starting point for building tailored AI systems in domains such as legal analysis, scientific research, and specialized conversational interfaces.

Architecturally, Kimi K2-Base is a Mixture-of-Experts (MoE) transformer. It has 1 trillion total parameters, of which 32 billion are activated per token. The architecture contains 384 experts, with 8 selected dynamically for each token. A key innovation in its development is the MuonClip optimizer, developed by Moonshot AI, which addresses training instability in large-scale models by mitigating exploding attention logits. The model has 61 layers, a hidden dimension of 7,168, 64 attention heads, and SwiGLU activation functions.
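
To make "8 of 384 experts per token" concrete, here is a minimal sketch of generic top-k expert routing. It is not Moonshot AI's router: the gating function (a softmax over the selected experts' scores), the random router scores, and the name `route_token` are assumptions for illustration. Only the expert counts come from the text above.

```python
import math
import random

NUM_EXPERTS = 384  # from the description above
TOP_K = 8          # experts selected per token, from the description above


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and normalise their weights.

    Assumption: weights are a softmax over the selected experts only.
    Real routers may use a different gating function.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


random.seed(0)
# Hypothetical router scores for one token; a real model computes these
# from the token's hidden state.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)

for expert_index, weight in selected:
    print(f"expert {expert_index:3d}  weight {weight:.3f}")
```

Each token's output is then a weighted sum of the selected experts' outputs, which is why only a small share of the total parameters does work for any one token.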

Kimi K2-Base supports a context window of 128,000 tokens, allowing it to process long inputs and extended multi-turn interactions. Because only a fraction of its parameters is active for each token, inference cost is lower than the total parameter count suggests, while the long context suits applications that require extensive contextual understanding. The model is optimized for agentic intelligence, meaning it is intended to interpret goals and execute complex workflows without continuous human intervention. It was pre-trained on 15.5 trillion tokens, which supports its performance across knowledge, reasoning, and coding tasks.

Technical Specifications

Attention

Attention Structure

Multi-head Latent Attention (MLA)

Attention Heads

Key-Value Heads

Attention Head Dimension

Position Embedding

RoPE

RoPE Theta

50,000

Sliding Window Attention

Sliding Window Size

Sliding Window Ratio

Linear Attention

Linear Attention Ratio

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

7,168

Number of Layers

FFN Intermediate Size (Dense)

2,048

Multi-Token Prediction Heads

Tokenizer

Vocabulary Size

163,840

Mixture of Experts

Total Expert Parameters

32.0B

Number of Experts

384

Active Experts

Shared Experts

FFN Intermediate Size (per Expert)

2,048

Dense Layers Before MoE
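
The Position Embedding entries above list RoPE with a theta of 50,000. As a rough illustration of what theta controls, the sketch below computes the standard RoPE inverse frequencies and rotates one pair of values. The rotary dimension of 64 is a hypothetical value, since the attention head dimension is not filled in above, and any context-extension scaling the model may apply is not modelled.

```python
import math

ROPE_THETA = 50_000.0  # from the specifications above
ROTARY_DIM = 64        # hypothetical: the head dimension is not listed above


def rope_inverse_frequencies(dim, theta):
    """Standard RoPE schedule: one frequency per pair of dimensions."""
    return [theta ** (-2 * i / dim) for i in range(dim // 2)]


def rotate_pair(x0, x1, position, inv_freq):
    """Rotate one (x0, x1) pair by the angle position * inv_freq."""
    angle = position * inv_freq
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return x0 * cos_a - x1 * sin_a, x0 * sin_a + x1 * cos_a


inv_freqs = rope_inverse_frequencies(ROTARY_DIM, ROPE_THETA)
slowest_wavelength = 2 * math.pi / inv_freqs[-1]

print(f"fastest frequency: {inv_freqs[0]:.6f} rad/token")
print(f"slowest frequency: {inv_freqs[-1]:.8f} rad/token")
print(f"slowest wavelength: {slowest_wavelength:,.0f} tokens")
print("rotated pair at position 10:", rotate_pair(1.0, 0.0, 10, inv_freqs[1]))
```

A larger theta stretches the slowest wavelengths, which lets positions stay distinguishable over longer spans.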

Model Integrity

Total Score

66 / 100

Upstream

20.5 / 30

Model

26.0 / 40

Downstream

19.5 / 30

Kimi K2-Base Model Integrity Report

Audit Note

Kimi K2-Base demonstrates strong transparency in its architectural specifications and parameter density, providing clear distinctions between total and active parameters. While the model's technical innovations and hardware requirements are well-documented, it suffers from significant opacity regarding its training data sources and compute resources. The transparency profile is further complicated by custom licensing terms and concerns regarding benchmark data integrity in its derivative models.

Upstream

20.5 / 30

Architectural Provenance

8.0 / 10

Kimi K2-Base is explicitly documented as a Mixture-of-Experts (MoE) transformer with 1.04 trillion total parameters and 32 billion active parameters. The architecture is detailed in a technical report and GitHub documentation, specifying 61 layers, 384 experts (8 selected per token), and a hidden dimension of 7168. It utilizes Multi-head Latent Attention (MLA) and the SwiGLU activation function. A significant technical disclosure is the use of the 'MuonClip' optimizer with a 'qk-clip' mechanism to stabilize training at scale, which is a high level of transparency for a foundational model.
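
The idea behind clipping attention logits can be sketched in a few lines: when the largest query-key logit exceeds a threshold, shrink the query and key projection weights so that the largest logit lands back on the threshold. The code below is a simplified illustration of that idea, not the procedure from the technical report. The threshold `tau`, the even split of the scaling factor between the two weight matrices, the toy matrices, and all function names are assumptions.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def max_attention_logit(x, w_q, w_k):
    """Largest scaled dot-product logit between any query and any key."""
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    scale = math.sqrt(len(q[0]))
    return max(sum(qi * ki for qi, ki in zip(q_row, k_row)) / scale
               for q_row in q for k_row in k)


def qk_clip(w_q, w_k, max_logit, tau):
    """Rescale both projections so the largest logit equals tau.

    Assumption: the factor tau / max_logit is split evenly, so each
    matrix is multiplied by its square root.
    """
    if max_logit <= tau:
        return w_q, w_k
    factor = math.sqrt(tau / max_logit)
    return ([[v * factor for v in row] for row in w_q],
            [[v * factor for v in row] for row in w_k])


# Toy inputs with deliberately large weights (hypothetical values).
x = [[1.0, 2.0], [3.0, 4.0]]
w_q = [[10.0, 0.0], [0.0, 10.0]]
w_k = [[10.0, 0.0], [0.0, 10.0]]
TAU = 100.0  # hypothetical threshold

before = max_attention_logit(x, w_q, w_k)
w_q_clipped, w_k_clipped = qk_clip(w_q, w_k, before, TAU)
after = max_attention_logit(x, w_q_clipped, w_k_clipped)
print(f"max logit before: {before:.1f}, after: {after:.1f}")
```

Because logits are bilinear in the two projections, scaling each by the square root of the factor scales every logit by the full factor, so the relative ordering of attention scores is preserved.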

Dataset Composition

4.0 / 10

While Moonshot AI discloses the scale of the pre-training data (15.5 trillion tokens), the specific composition and sources remain largely opaque. Documentation mentions a 'diverse mixture of web text, books, code, and multilingual content' but lacks a granular percentage breakdown or specific source names. The mention of an 'agentic data synthesis pipeline' for post-training provides some insight into the methodology for instruction tuning, but the upstream pre-training data provenance is described only in vague, high-level terms.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via the model weights on Hugging Face and is well-documented. It has a large vocabulary of roughly 160,000 tokens (163,840 according to the Technical Specifications above), supporting multilingual text and code. Independent analysis by third parties (e.g., Unsloth) has verified its regex patterns and handling of Chinese characters, noting its similarity to GPT-4o's tokenizer with specific optimizations for Han characters. The EOS token and chat templates are clearly defined in the repository.

Model

26.0 / 40

Parameter Density

9.0 / 10

Moonshot AI provides exemplary transparency regarding parameter density. They clearly distinguish between the 1 trillion total parameters and the 32 billion active parameters per token. The documentation further breaks down the expert structure (384 total experts, 8 active per token, 1 shared expert) and provides specific dimensions for the attention and MoE layers. This prevents the common 'parameter inflation' marketing trap often seen with MoE models.
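
The expert figures can be sanity-checked against the headline parameter counts with some back-of-the-envelope arithmetic. The sketch below uses the hidden size (7,168) and per-expert FFN size (2,048) from the Technical Specifications. It assumes each SwiGLU expert has three bias-free weight matrices (gate, up, down), and it treats all 61 layers as MoE layers, which gives an upper bound because the number of dense layers is not listed on this page.

```python
HIDDEN = 7_168       # hidden dimension size
EXPERT_FFN = 2_048   # FFN intermediate size per expert
NUM_EXPERTS = 384    # routed experts per MoE layer
ACTIVE_ROUTED = 8    # routed experts selected per token
SHARED = 1           # shared expert, always active
NUM_LAYERS = 61      # upper bound on the number of MoE layers

MATRICES_PER_EXPERT = 3  # assumption: SwiGLU gate, up, and down projections

params_per_expert = MATRICES_PER_EXPERT * HIDDEN * EXPERT_FFN
routed_params_per_layer = NUM_EXPERTS * params_per_expert
active_expert_params_per_layer = (ACTIVE_ROUTED + SHARED) * params_per_expert

total_upper_bound = NUM_LAYERS * routed_params_per_layer
active_upper_bound = NUM_LAYERS * active_expert_params_per_layer

print(f"parameters per expert:          {params_per_expert:,}")
print(f"routed parameters per layer:    {routed_params_per_layer:,}")
print(f"routed parameters, all layers:  {total_upper_bound / 1e12:.2f}T (upper bound)")
print(f"active expert parameters:       {active_upper_bound / 1e9:.1f}B (upper bound)")
```

Under these assumptions the routed experts alone account for roughly one trillion parameters, while the experts active for a single token come to about 24 billion, leaving the remainder of the 32 billion active parameters to attention, embeddings, and other components.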

Training Compute

3.0 / 10

Information regarding training compute is minimal. While unofficial reports and leaks suggest a training cost of approximately $4.6 million using H800 GPUs, Moonshot AI's official stance in AMAs is that costs are 'hard to quantify.' There is no public disclosure of total GPU/TPU hours, specific hardware cluster configurations, or the carbon footprint associated with the 15.5 trillion token training run.
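
In the absence of official numbers, a common rule of thumb estimates training compute as about 6 floating-point operations per active parameter per training token. The sketch below applies it to the figures on this page. The factor of 6 is a general approximation and not something Moonshot AI has published, so the result is an order-of-magnitude estimate only.

```python
ACTIVE_PARAMS = 32e9      # active parameters per token, from this page
TRAIN_TOKENS = 15.5e12    # pre-training tokens, from this page
FLOPS_PER_PARAM_TOKEN = 6  # rule-of-thumb constant (assumption)


def estimate_training_flops(active_params, tokens,
                            flops_per_param_token=FLOPS_PER_PARAM_TOKEN):
    """Approximate total training FLOPs with the 6 * N * D heuristic."""
    return flops_per_param_token * active_params * tokens


flops = estimate_training_flops(ACTIVE_PARAMS, TRAIN_TOKENS)
print(f"Estimated training compute: {flops:.3e} FLOPs")
```

Converting this into GPU-hours would require the sustained throughput per GPU, which is not disclosed.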

Benchmark Reproducibility

5.0 / 10

Moonshot AI publishes a comprehensive list of benchmark results (MMLU, GPQA, SWE-bench, etc.) and a technical report. However, the evaluation code is not fully public, and exact prompts for all benchmarks are not disclosed. Although 'Best Practices for Benchmarking' documentation gives recommended settings (temperature, top_p, max_tokens), the lack of a reproducible evaluation harness and the discovery of significant contamination in related 'Thinking' variants by independent labs (e.g., ETH Zurich) necessitate a cautious score.

Identity Consistency

9.0 / 10

Kimi K2-Base maintains a consistent identity across its documentation and API. It correctly identifies its versioning (K2 vs K2.5) and its nature as a foundational model. There are no documented instances of the base model claiming to be a competitor's model or misrepresenting its core MoE architecture. The model card clearly distinguishes between the Base, Instruct, and Thinking variants.

Downstream

19.5 / 30

License Clarity

7.5 / 10

The model is released under a 'Modified MIT License.' The license is publicly available on GitHub and is largely permissive, allowing for research and commercial use. However, it includes a specific 'attribution' clause for high-scale commercial users (>100M MAU or >$20M monthly revenue), requiring prominent display of 'Kimi K2' on the UI. While clear, this custom modification moves it away from a standard Open Source definition.
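
The attribution condition as summarised above reduces to a simple either-or threshold check, sketched below. This encodes only the summary on this page (more than 100M monthly active users, or more than $20M in monthly revenue). The function name is illustrative, the treatment of values exactly at a threshold is an assumption, and the license text itself remains the authority.

```python
MAU_THRESHOLD = 100_000_000                 # monthly active users
MONTHLY_REVENUE_THRESHOLD_USD = 20_000_000  # US dollars per month


def requires_kimi_k2_attribution(monthly_active_users, monthly_revenue_usd):
    """True if either threshold from the summary above is exceeded."""
    return (monthly_active_users > MAU_THRESHOLD
            or monthly_revenue_usd > MONTHLY_REVENUE_THRESHOLD_USD)


print(requires_kimi_k2_attribution(50_000_000, 1_000_000))   # small deployment
print(requires_kimi_k2_attribution(150_000_000, 1_000_000))  # exceeds the MAU threshold
print(requires_kimi_k2_attribution(10_000_000, 25_000_000))  # exceeds the revenue threshold
```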

Hardware Footprint

7.0 / 10

Hardware requirements are well-documented for various precisions. Official and third-party guides (e.g., ApX, RunPod) provide VRAM estimates for FP16 (~2.4TB), INT8 (~1TB), and INT4 (~600GB). The model was notably developed with 'Quantization-Aware Training' (QAT) for native INT4 support, and documentation explicitly discusses the memory trade-offs and the need for multi-GPU clusters (e.g., 8x H200/B200) to run the full 1T parameter model.
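
A quick way to see where these estimates come from is to multiply the parameter count by the bytes stored per parameter. The sketch below does this for the weights only, using the 1.04 trillion total from the Architectural Provenance section. It ignores KV cache, activations, and quantization metadata, so it will not match the quoted figures exactly. The 141 GB per H200 is an assumption taken from NVIDIA's published specification and does not appear on this page.

```python
import math

TOTAL_PARAMS = 1.04e12  # total parameters, from this page
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}
H200_MEMORY_GB = 141    # assumption: per-GPU memory from NVIDIA's spec sheet


def weight_memory_gb(total_params, bytes_per_param):
    """Memory needed to hold the weights alone, in decimal GB."""
    return total_params * bytes_per_param / 1e9


def min_devices(required_gb, device_gb):
    """Smallest device count whose combined memory covers required_gb."""
    return math.ceil(required_gb / device_gb)


for precision, nbytes in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
    print(f"{precision}: {gb:,.0f} GB of weights, "
          f"at least {min_devices(gb, H200_MEMORY_GB)} x H200 for weights alone")
```

By this estimate the INT4 weights fit within an 8x H200 node with room left for the KV cache, which is consistent with the cluster size mentioned above.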

Versioning Drift

5.0 / 10

Moonshot AI uses date-based versioning (e.g., 0711, 0905) and maintains a basic changelog on Hugging Face for newer iterations like K2.5. However, for the K2-Base model, there is limited documentation regarding weight updates or silent changes. While they provide deprecation notices for certain tokens or system prompts in the Instruct/Thinking versions, the Base model's versioning history is less granular.

About Kimi K2

Moonshot AI's Kimi K2 is a Mixture-of-Experts model featuring one trillion total parameters, activating 32 billion per token. Designed for agentic intelligence, it utilizes a sparse architecture with 384 experts and the MuonClip optimizer for training stability, supporting a 128K token context window.