ApX logoApX logo

Kimi K2 Thinking

Active Parameters

1T

Context Length

256K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

Modified MIT License

Release Date

7 Nov 2025

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Multi-Head Attention

Attention Heads

64

Key-Value Heads

64

Attention Head Dimension

-

Position Embedding

Absolute Position Embedding

RoPE Theta

50,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwigLU

Dimensions

Hidden Dimension Size

7,168

Number of Layers

61

FFN Intermediate Size (Dense)

2,048

Multi-Token Prediction Heads

0

Tokenizer

Vocabulary Size

163,840

Mixture of Experts

Total Expert Parameters

32.0B

Number of Experts

384

Active Experts

8

Shared Experts

1

FFN Intermediate Size (per Expert)

2,048

Dense Layers Before MoE

1

Architecture Diagram

Input TokensToken EmbeddingPosition: AbsoluteHidden: 7.2k · Context: 256k · Vocab: 163.8kx 61 layersRMSNormPre-AttentionMulti-Head Attention64Q / 64KV headsHead dim: 112+RMSNormPre-FFNSparse MoE FFN (8/384 experts)SwiGLUIntermediate: 2k+Final RMSNormOutput Logits

Kimi K2 Thinking

Kimi K2 Thinking is a language model developed by Moonshot AI, engineered as a specialized thinking agent designed to perform complex, multi-step reasoning and dynamic tool invocation. The model is trained to interleave chain-of-thought processes with function calls, enabling it to execute intricate workflows such as autonomous research, coding, and writing that can persist over hundreds of sequential actions without coherence degradation. A key design principle is its native INT4 quantization, which is applied via Quantization-Aware Training (QAT) to achieve efficient inference, contributing to lossless reductions in inference latency and GPU memory utilization.

Architecturally, Kimi K2 Thinking operates on a sparse Mixture-of-Experts (MoE) paradigm, encompassing a total of 1 trillion parameters, with 32 billion parameters activated per inference pass. The model's internal structure includes 61 layers and employs a Multi-Head Latent Attention (MLA) mechanism with 64 attention heads. The activation function utilized is SwiGLU, and it features a vocabulary size of 160,000 tokens. It incorporates 384 experts, selecting 8 experts per token during processing, and is optimized for persistent step-by-step reasoning within its architectural constraints.

The model is characterized by a substantial 256,000-token context window, allowing for the processing of extensive textual inputs, which is particularly beneficial for long-horizon tasks, complex debugging, or comprehensive document analysis. This extended context, combined with its robust tool orchestration capabilities, enables Kimi K2 Thinking to maintain stable goal-directed behavior across 200 to 300 consecutive tool invocations. This capacity addresses a common limitation in prior models, which often exhibit performance degradation after a significantly fewer number of sequential steps.

About Kimi K2

Moonshot AI's Kimi K2 is a Mixture-of-Experts model featuring one trillion total parameters, activating 32 billion per token. Designed for agentic intelligence, it utilizes a sparse architecture with 384 experts and the MuonClip optimizer for training stability, supporting a 128K token context window.


Other Kimi K2 Models

Evaluation Benchmarks

Rank

#56

BenchmarkScoreRank

Graduate-Level QA

GPQA

0.845

13

0.761

15

0.81

22

General Text

Text Arena

1451

25

Web Development

WebDev Arena

1430

26

0.52

30

0.63

32

Professional Knowledge

MMLU Pro

0.81

33

Agentic Coding

LiveBench Agentic

0.38

36

0.67

44

Rankings

Overall Rank

#56

Coding Rank

#54

Model Integrity

Total Score

B

64 / 100

Kimi K2 Thinking Model Integrity Report

Total Score

64

/ 100

B

Audit Note

Kimi K2 Thinking demonstrates strong transparency in its architectural specifications and parameter density, providing clear distinctions between its trillion-parameter scale and active compute. However, the model suffers from significant opacity regarding its training data composition and lacks a formal peer-reviewed technical paper to validate its training methodology. While the weights are accessible under a modified license, the lack of reproducible evaluation code and formal versioning history limits its overall transparency profile.

Upstream

19.5 / 30

Architectural Provenance

7.5 / 10

The model's architecture is explicitly documented as a sparse Mixture-of-Experts (MoE) transformer with 61 layers and 384 experts. Technical disclosures confirm the use of Multi-Head Latent Attention (MLA) with 64 attention heads and SwiGLU activation. While it is described as a reasoning-focused variant of the Kimi K2 family, the specific pre-training methodology and architectural lineage (noted by third parties as heavily influenced by DeepSeek-V3) are partially disclosed through technical blog posts and model cards, though a formal peer-reviewed paper is absent.

Dataset Composition

3.5 / 10

Moonshot AI discloses a total training volume of 15.5 trillion tokens for the Kimi K2 family. However, specific dataset composition (e.g., exact ratios of web, code, and books) and detailed data cleaning or filtering methodologies remain largely opaque. Claims of 'high-quality' and 'diverse' data are made without providing public access to sample data or comprehensive source breakdowns, which is a significant gap in transparency.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible on Hugging Face and integrated into major inference frameworks like vLLM. It uses a Tiktoken-based BPE approach with a clearly stated vocabulary size of 163,840 tokens. Documentation includes special token IDs (BOS/EOS) and chat templates, allowing for independent verification and alignment with the model's claimed language support.

Model

26.0 / 40

Parameter Density

9.0 / 10

Moonshot AI provides exemplary transparency regarding parameter density. The model is clearly defined as having 1 trillion total parameters with exactly 32 billion active parameters per token (selecting 8 experts out of 384). This distinction between total and active parameters is consistently maintained across official documentation, preventing the common 'parameter inflation' marketing trap.

Training Compute

4.0 / 10

Limited information is available regarding the training compute. While third-party reports and news sources estimate a training cost of approximately $4.6 million and roughly 2.8 million H800 GPU hours for the base model, Moonshot AI has not officially disclosed precise hardware utilization, carbon footprint calculations, or exact training duration in their primary documentation.

Benchmark Reproducibility

5.0 / 10

The model provides detailed benchmark results (e.g., 44.9% on HLE with tools, 71.3% on SWE-Bench Verified) and some recommended API settings for reproduction (temperature 1.0, top_p 0.95). However, the full evaluation code and exact prompt sets used for these internal benchmarks are not publicly hosted in a reproducible repository, and third-party audits have raised concerns regarding the consistency of these results.

Identity Consistency

8.0 / 10

The model maintains a consistent identity as 'Kimi K2 Thinking' and correctly identifies its role as a reasoning agent. It distinguishes itself from the non-thinking 'Instruct' variants and provides version-aware responses. There are no widespread reports of the model claiming to be a competitor's product (e.g., GPT-4), though its internal awareness of its specific training cutoff is not always precise.

Downstream

18.0 / 30

License Clarity

6.5 / 10

The model is released under a 'Modified MIT License.' While it allows for commercial use, it includes a restrictive clause requiring prominent UI attribution for entities exceeding 100 million monthly active users or $20 million in monthly revenue. This deviates from standard Open Source Definition (OSD) compliance, creating a 'semi-open' legal profile that requires careful legal review for large-scale enterprise adoption.

Hardware Footprint

7.0 / 10

Hardware requirements are well-documented for various deployment scenarios. Official and community guides specify VRAM needs for the native INT4 format (~594GB for weights) and provide scaling estimates for context window usage. Guidance is provided for running the model on enterprise clusters (8x H100/H200) as well as extreme quantization paths for consumer hardware, though accuracy tradeoffs for the latter are less formally documented.

Versioning Drift

4.5 / 10

Versioning follows a basic naming convention (Kimi K2 Thinking vs. Turbo), but a detailed, public-facing semantic changelog is missing. While new iterations like Kimi K2.5 are announced, there is no formal mechanism for users to track silent updates or behavior drift in the underlying API endpoints, making it difficult to maintain long-term production stability.

GPU Requirements

Full Calculator

Choose the quantization method for model weights

Context Size: 1,024 tokens

1k
125k
250k

VRAM Required:

Recommended GPUs

Kimi K2 Thinking: Specifications and GPU VRAM Requirements