
Kimi K2-Instruct

Total Parameters

1T

Context Length

128K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

Modified MIT License

Release Date

11 Jul 2025

Knowledge Cutoff

-

Technical Specifications

Active Parameters per Token

32.0B

Number of Experts

384

Active Experts

8

Attention Structure

Multi-head Latent Attention (MLA)

Hidden Dimension Size

7168

Number of Layers

61

Attention Heads

64

Key-Value Heads

-

Activation Function

SwiGLU

Normalization

-

Position Embedding

RoPE

Kimi K2-Instruct

Kimi K2-Instruct is a Mixture-of-Experts (MoE) language model developed by Moonshot AI. It has 1 trillion total parameters, of which roughly 32 billion are activated for each token during inference. The model targets state-of-the-art agentic intelligence: sophisticated tool use, advanced code generation, and autonomous problem-solving across domains. As a post-trained instruction-following variant, Kimi K2-Instruct is optimized for general-purpose chat and complex agentic workflows, operating as a reflex-grade model that responds directly rather than through long chain-of-thought deliberation.

The architecture of Kimi K2-Instruct follows the Mixture-of-Experts paradigm, with 384 specialized experts of which 8 are dynamically selected per token during inference. The model comprises 61 layers and employs a Multi-head Latent Attention (MLA) mechanism with 64 attention heads. A key innovation in its training methodology is the MuonClip optimizer, developed by Moonshot AI, which maintained training stability across a 15.5-trillion-token pre-training run. The architecture prioritizes long-context efficiency, supporting a context window of 128,000 tokens. The activation function is SwiGLU, complemented by Rotary Position Embeddings (RoPE).
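The top-k expert selection described above can be sketched in a few lines. This is a generic illustration of softmax-gated top-k routing at the spec's scale (384 experts, 8 active per token), not Moonshot's actual router, whose gating details are not public:

```python
import math
import random

def moe_route(logits, k=8):
    """Generic top-k gating sketch: pick the k highest-scoring experts
    and softmax-normalize their logits into routing weights."""
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in topk)                 # subtract max for stability
    exps = [math.exp(logits[i] - m) for i in topk]
    z = sum(exps)
    return topk, [e / z for e in exps]

# Toy router at the spec's scale: 384 experts, 8 active per token.
random.seed(0)
router_logits = [random.gauss(0.0, 1.0) for _ in range(384)]
experts, weights = moe_route(router_logits, k=8)
print(len(experts), round(sum(weights), 6))   # 8 experts, weights sum to 1.0
```

Each token's FFN compute then touches only the selected experts' weights, which is how a 1T-parameter model can run with ~32B active parameters per token.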

Kimi K2-Instruct is engineered for demanding applications, including complex, multi-step reasoning tasks and analytical workflows that necessitate profound comprehension. Its capabilities encompass advanced code generation, ranging from foundational scripting to intricate software development and debugging, along with robust support for multilingual applications. The model exhibits strong tool-calling capabilities, enabling it to autonomously interpret user intentions and orchestrate external tools and APIs to accomplish intricate objectives. Practical use cases include automating development workflows, generating comprehensive data analysis reports, and facilitating interactive task planning by seamlessly integrating multiple external services.
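As a sketch of what such tool orchestration looks like from the caller's side, the snippet below assembles an OpenAI-style chat request with one tool attached. The tool name, parameter schema, and model identifier are invented for illustration; consult Moonshot's API documentation for the real values:

```python
import json

# Hypothetical tool definition in the OpenAI-compatible "function" schema.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = {
    "model": "kimi-k2-instruct",   # assumed model identifier
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",         # let the model decide whether to call the tool
}

print(json.dumps(payload)[:60])    # request body, ready to POST to a chat endpoint
```

When the model decides a tool is needed, it returns a structured tool call (name plus JSON arguments) instead of plain text; the caller executes the tool and feeds the result back as a follow-up message.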

About Kimi K2

Moonshot AI's Kimi K2 is a Mixture-of-Experts model featuring one trillion total parameters, activating 32 billion per token. Designed for agentic intelligence, it utilizes a sparse architecture with 384 experts and the MuonClip optimizer for training stability, supporting a 128K token context window.



Evaluation Benchmarks

Rank

#57

Category                Benchmark           Score   Rank
General Knowledge       MMLU                0.90    🥉 3
Professional Knowledge  MMLU Pro            0.83    7
Graduate-Level QA       GPQA                0.75    25
Agentic Coding          LiveBench Agentic   0.32    28

Rankings

Overall Rank

#57

Coding Rank

#17

Model Transparency

Total Score

B+

70 / 100

Kimi K2-Instruct Transparency Report


Audit Note

Kimi K2-Instruct exhibits high transparency in its architectural specifications and hardware requirements, providing clear distinctions between total and active parameters. However, it remains opaque regarding its training data composition and the environmental costs of its massive compute requirements. While the model is accessible and well-documented for deployment, its benchmark reproducibility is hindered by limited disclosure of evaluation prompts and known data contamination issues.

Upstream

21.0 / 30

Architectural Provenance

8.5 / 10

Moonshot AI provides high transparency regarding the Kimi K2-Instruct architecture. It is explicitly identified as a Mixture-of-Experts (MoE) model with 1 trillion total parameters and 32 billion active parameters. Documentation details 61 layers, 384 experts (8 active per token), and the use of Multi-head Latent Attention (MLA) with 64 heads. The training methodology is well-documented, specifically citing the use of the proprietary MuonClip optimizer to maintain stability during the 15.5 trillion token pre-training phase. The model is clearly identified as a post-trained instruction-following variant of the Kimi-K2-Base model.

Dataset Composition

3.5 / 10

While the scale of the training data is disclosed (15.5 trillion tokens), the specific composition and sources remain largely opaque. Official documentation describes the data as 'diverse' and mentions 'simulated multi-step tool interactions' for agentic training, but lacks a granular breakdown of data types (e.g., percentage of code, web, or academic papers). There is no public disclosure of specific datasets used for pre-training or the exact filtering and cleaning methodologies employed, which is a significant gap in upstream transparency.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly accessible via the official Hugging Face repository and integrated into standard libraries like vLLM and SGLang. Technical specifications are clearly stated, including a vocabulary size of 160,000 tokens and support for a 128,000 token context window (extended to 256k in later revisions). The tokenizer implementation has been updated to fix specific bugs related to special token encoding and multi-turn tool call templates, with these changes documented in the public changelog.

Model

27.0 / 40

Parameter Density

9.5 / 10

Transparency regarding parameter density is exemplary for an MoE model. Moonshot AI explicitly distinguishes between the 1 trillion total parameters and the 32 billion parameters activated per token. The documentation further breaks down the architecture into 384 experts, 8 selected experts per token, and 1 shared expert. Detailed dimensions for attention (7168) and MoE hidden layers (2048) are provided, ensuring no ambiguity regarding the model's sparse nature or active compute requirements.
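A back-of-envelope check of the published dimensions shows where the 1 trillion parameters live. The SwiGLU factor of 3 (gate, up, and down projections) is an assumption, and the estimate ignores attention, embedding, and any dense layers, so it only approximates the published totals; still, it makes the total/active split concrete:

```python
# Back-of-envelope parameter count from the published dimensions.
# Assumption: each expert is a SwiGLU FFN with 3 weight matrices
# (gate, up, down); attention and embedding parameters are ignored.
hidden, moe_inner = 7168, 2048          # attention dim, MoE hidden dim
layers, experts = 61, 384
active, shared = 8, 1                   # routed + shared experts per token

params_per_expert = 3 * hidden * moe_inner
total_expert_params = layers * experts * params_per_expert
active_expert_params = layers * (active + shared) * params_per_expert

print(f"per expert:     {params_per_expert / 1e6:.0f}M")
print(f"all experts:    {total_expert_params / 1e9:.0f}B")   # the bulk of the 1T total
print(f"active per tok: {active_expert_params / 1e9:.1f}B")  # attention etc. bring this toward ~32B
```

The expert FFNs alone account for roughly the whole 1T budget, while only about 24B of expert parameters fire per token; the shared attention and embedding weights make up the rest of the ~32B active figure.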

Training Compute

4.0 / 10

Information regarding training compute is limited to hardware types and general scale. While it is known the model was trained on large-scale GPU clusters (NVIDIA H100/H800/B200 mentioned in deployment contexts), the specific number of GPU hours, total energy consumption, and carbon footprint are not disclosed. The documentation focuses on the stability of the training process rather than the environmental or resource costs associated with the 15.5T token training run.

Benchmark Reproducibility

4.5 / 10

Moonshot AI reports extensive benchmark results (MMLU: 89.5%, SWE-bench Verified: 65.8%, MATH-500: 97.4%) and provides some evaluation parameters, such as output token lengths. However, the full evaluation code and exact prompts used for all benchmarks are not fully public. While third-party verification from platforms like Artificial Analysis exists, the lack of a comprehensive, reproducible evaluation suite in the official repository prevents a higher score. (Score adjusted for disclosed contamination issues).

Identity Consistency

9.0 / 10

The model demonstrates strong identity consistency, correctly identifying itself as 'Kimi, an AI assistant created by Moonshot AI' in official system prompts and API documentation. It maintains clear versioning (e.g., v1.0, 0711-preview, 0905-revision) and distinguishes itself from its 'Thinking' and 'Base' variants. There are no documented instances of the model claiming to be a competitor's product or misrepresenting its core identity.

Downstream

22.0 / 30

License Clarity

7.5 / 10

The model is released under a 'Modified MIT License.' While this permits broad use, including commercial application, it includes specific restrictions (e.g., attribution requirements for products with >100M users or >$20M monthly revenue). The license is clearly stated in the GitHub and Hugging Face repositories, but the 'modified' nature introduces some complexity compared to standard open-source licenses like Apache 2.0.

Hardware Footprint

8.0 / 10

Hardware requirements are thoroughly documented for various deployment scenarios. Official guidance specifies the need for approximately 1.1 TB of VRAM for FP8 weights and provides minimum configurations for quantized versions (e.g., 2-bit requiring ~381 GB). The documentation includes specific GPU cluster recommendations (e.g., 16x H800 for 128k context) and integrates with multiple inference engines (vLLM, SGLang, KTransformers), offering clear guidance on VRAM and system RAM trade-offs.
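The ~1.1 TB FP8 figure follows directly from parameter count times bytes per parameter. The sketch below assumes a 10% overhead for buffers and quantization scales (an assumption, not a documented number); real quantized builds keep some tensors at higher precision, which is why the documented 2-bit footprint (~381 GB) exceeds this naive floor:

```python
def weight_vram_gb(params_billions: float, bits_per_param: float,
                   overhead: float = 1.10) -> float:
    """Rough VRAM for model weights alone (no KV cache or activations).

    The 10% overhead for buffers and quantization scales is an assumption.
    """
    weight_bytes = params_billions * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 1e9

print(f"FP8  : {weight_vram_gb(1000, 8):6.0f} GB")   # about 1.1 TB, matching the docs
print(f"2-bit: {weight_vram_gb(1000, 2):6.0f} GB")   # naive floor; real builds need more
```

KV cache grows with context length on top of this, which is why long-context serving (e.g. 128k tokens) pushes the recommended cluster size well beyond what the weights alone require.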

Versioning Drift

6.5 / 10

Moonshot AI maintains a public changelog on Hugging Face, documenting updates such as chat template improvements and tokenizer bug fixes. Semantic-style versioning is used for major releases (K2, K2.5) and specific checkpoints (0711, 0905). However, the frequency of 'silent' updates to the API endpoints and the lack of detailed deprecation paths for older weights prevent a higher score in this category.

