Total Parameters
357B
Context Length
200K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT
Release Date
30 Sept 2025
Knowledge Cutoff
-
Active Parameters
32.0B
Number of Experts
-
Active Experts
-
Attention Structure
Grouped-Query Attention (GQA)
Hidden Dimension Size
5120
Number of Layers
-
Attention Heads
96
Key-Value Heads
-
Activation Function
-
Normalization
-
Position Embedding
Rotary Position Embedding (partial RoPE)
GLM-4.6 is a large language model developed by Z.ai for advanced AI applications. It is engineered to operate efficiently across a range of complex tasks, including sophisticated coding, long-context processing, and agentic operations. Bilingual support for English and Chinese extends its applicability across diverse linguistic contexts, and the model is intended as a robust foundation for intelligent systems capable of nuanced reasoning and autonomous interaction.
Architecturally, GLM-4.6 implements a Mixture-of-Experts (MoE) configuration, incorporating 357 billion total parameters, with 32 billion parameters actively utilized during a given forward pass. The model's design features a context window expanded to 200,000 tokens, enabling it to process and maintain coherence over substantial input sequences. Innovations within its attention mechanism include Grouped-Query Attention (GQA) with 96 attention heads, and the integration of a partial Rotary Position Embedding (RoPE) for positional encoding. Normalization is managed through QK-Norm, contributing to stabilized attention logits. These architectural choices aim to balance computational efficiency with enhanced performance in complex cognitive operations.
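The grouped-query attention and QK-Norm described above can be sketched as follows. This is an illustrative toy, not GLM-4.6's actual implementation: dimensions are small placeholder values (the real model uses 96 query heads and a 5120 hidden dimension), and causal masking and RoPE are omitted for brevity.

```python
import numpy as np

def gqa_qknorm_attention(x, wq, wk, wv, n_heads, n_kv_heads):
    """Toy grouped-query attention with QK-Norm.

    Each of the n_kv_heads key/value heads is shared by
    n_heads // n_kv_heads query heads, shrinking the KV cache.
    Queries and keys are L2-normalized per head (QK-Norm) before
    the dot product, which bounds the attention logits and helps
    stabilize training. No causal mask is applied in this sketch.
    """
    seq, d_model = x.shape
    d_head = d_model // n_heads
    group = n_heads // n_kv_heads

    q = (x @ wq).reshape(seq, n_heads, d_head)
    k = (x @ wk).reshape(seq, n_kv_heads, d_head)
    v = (x @ wv).reshape(seq, n_kv_heads, d_head)

    # QK-Norm: unit-length queries and keys -> bounded logits.
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k = k / np.linalg.norm(k, axis=-1, keepdims=True)

    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group  # the KV head shared by this query group
        logits = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        w = np.exp(logits - logits.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)  # softmax over keys
        out[:, h] = w @ v[:, kv]
    return out.reshape(seq, d_model)

# Toy configuration: 8 query heads sharing 2 KV heads.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32))
wq = rng.standard_normal((32, 32))
wk = rng.standard_normal((32, 8))
wv = rng.standard_normal((32, 8))
y = gqa_qknorm_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=2)
print(y.shape)  # (4, 32)
```

With 4 query heads per KV head, the key/value projections (and the KV cache at inference time) are a quarter the size of full multi-head attention, which matters at a 200K-token context.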
The operational characteristics of GLM-4.6 are optimized for real-world development workflows. It demonstrates superior coding performance, leading to more visually polished front-end generation and improved real-world application results. The model exhibits enhanced reasoning capabilities, which are further augmented by its integrated tool-use functionality during inference. This facilitates the creation of more capable agents proficient in search-based tasks and role-playing scenarios. Furthermore, GLM-4.6 achieves improved token efficiency, completing tasks with approximately 15% fewer tokens compared to its predecessor, GLM-4.5, thereby offering a more cost-effective inference profile.
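The claimed 15% token reduction translates directly into inference cost, since output is billed per token. A back-of-the-envelope sketch (the task size and per-token price below are placeholder values, not Z.ai's actual rates):

```python
def completion_cost(tokens, price_per_mtok):
    """Dollar cost of generating `tokens` output tokens."""
    return tokens / 1_000_000 * price_per_mtok

# Hypothetical task: the predecessor uses 10,000 output tokens;
# a 15% reduction completes it in 8,500. Price is assumed.
old_tokens = 10_000
new_tokens = int(old_tokens * 0.85)
price = 2.00  # $/M output tokens, placeholder for illustration

saving = completion_cost(old_tokens, price) - completion_cost(new_tokens, price)
print(f"{new_tokens} tokens, saving ${saving:.4f} per task")
```

The same 15% reduction also shortens wall-clock generation time proportionally, since decoding is roughly linear in output length.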
GLM-4 is a series of bilingual (English and Chinese) language models developed by Zhipu AI. The models feature extended context windows, superior coding performance, advanced reasoning capabilities, and strong agent functionalities. GLM-4.6 offers improvements in tool use and search-based agents.
| Benchmark | Score | Rank |
|---|---|---|
| Graduate-Level QA (GPQA) | 0.81 | 13 |
| Mathematics (LiveBench Mathematics) | 0.81 | 14 |
| Data Analysis (LiveBench Data Analysis) | 0.72 | 16 |
| Reasoning (LiveBench Reasoning) | 0.62 | 20 |
| Agentic Coding (LiveBench Agentic) | 0.35 | 24 |
| Coding (LiveBench Coding) | 0.71 | 25 |
Overall Rank
#36
Coding Rank
#52
Total Score
66 / 100
GLM-4.6 exhibits strong transparency in its architectural disclosure and licensing, providing clear distinctions between total and active parameters in its MoE design. While the model offers public access to weights and detailed agentic evaluation trajectories, it remains opaque regarding its training data sources and the specific compute resources utilized. Significant gaps exist in documenting the dataset's composition and the long-term stability of model behavior across minor version updates.
Architectural Provenance
GLM-4.6 is explicitly documented as a Mixture-of-Experts (MoE) transformer model, evolving from the GLM-4.5 architecture. Key architectural details are public, including the use of Grouped-Query Attention (GQA) with 96 heads, QK-Norm for stability, and a partial Rotary Position Embedding (RoPE). The model's transition to a single-stage Reinforcement Learning (RL) pipeline and the 'SLIME' framework for agentic training are also disclosed in technical presentations. However, while the high-level methodology is clear, the specific layer-by-layer configuration and the exact 'partial' nature of the RoPE implementation lack granular technical specifications in the primary model card.
Dataset Composition
Information regarding the training data is limited to high-level marketing descriptions. The provider mentions a '15 trillion token' pre-training corpus and highlights the inclusion of 'repo-level code contexts' and 'agentic reasoning data.' However, there is no public breakdown of the dataset's composition (e.g., specific percentages of web, code, or books), no disclosure of specific data sources, and no detailed documentation on the filtering or cleaning methodologies used to curate the 15T tokens.
Tokenizer Integrity
The tokenizer is publicly accessible via the official Hugging Face repository and GitHub. It supports a 200K context window and is optimized for bilingual (English/Chinese) tasks. Technical documentation notes a 15% improvement in token efficiency over its predecessor, GLM-4.5. The vocabulary and tokenization approach are verifiable through the provided source code, though detailed alignment studies between the tokenizer and the specific 15T token training set are not fully documented.
Parameter Density
The model provides exemplary transparency regarding its MoE structure, clearly distinguishing between the 357 billion total parameters and the 32 billion active parameters utilized during a forward pass. This prevents the common 'parameter inflation' seen in MoE marketing. The architectural breakdown (MoE with 32B active) is consistent across official documentation and third-party technical reviews.
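The total/active distinction can be made concrete: per token, only the routed experts' parameters participate in the forward pass, so compute cost scales with the active count while memory scales with the total. A quick ratio from the disclosed figures:

```python
total_b = 357.0   # total parameters, billions (disclosed)
active_b = 32.0   # active parameters per forward pass, billions (disclosed)

active_fraction = active_b / total_b
print(f"{active_fraction:.1%} of parameters active per token")
```

Roughly 9% of the parameters do the work for any single token, which is why the model's per-token compute resembles a ~32B dense model even though all 357B must reside in memory.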
Training Compute
There is almost no verifiable information regarding the specific compute resources used for training GLM-4.6. While the provider mentions the 'SLIME' framework for efficient RL, they do not disclose GPU/TPU hours, hardware cluster specifications, training duration, or the carbon footprint. Claims of 'efficiency' are made without the underlying compute data necessary for verification.
Benchmark Reproducibility
Z.ai provides a significant amount of evaluation data, including the public release of 'CC-Bench' trajectories (prompts, tool calls, and multi-turn logs) on Hugging Face, allowing scrutiny of agentic performance. It also reports scores on standard benchmarks such as AIME 25 and LiveCodeBench v6. However, the score is penalized because third-party testers report significant discrepancies from official claims on long-context tasks, and because there is no unified, one-click reproduction script for all cited benchmarks.
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as GLM-4.6 and maintaining version awareness. It does not attempt to mimic competitor identities (like GPT-4) in its weights or system prompts. Documentation clearly distinguishes between the base model and its multimodal (4.6V) or reasoning-specific variants, ensuring users know exactly which version they are interacting with.
License Clarity
The model weights and source code are released under the highly permissive MIT license, which is clearly stated on Hugging Face and GitHub. This license explicitly allows for commercial use, modification, and distribution with minimal restrictions. There are no known conflicting terms between the weight license and the inference code.
Hardware Footprint
Hardware requirements are well-documented for various use cases. Official guides specify that standard inference in FP8 requires 8x H100 or 4x H200 GPUs, while the full 200K context requires 16x H100s. Third-party quantization (GGUF/Ollama) provides additional data on VRAM needs for Q2 through Q8 precision levels. The only gap is the lack of official documentation on the specific accuracy-performance trade-offs for these different quantization levels.
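The memory arithmetic behind those GPU counts can be sketched. The 20% overhead factor below is an assumption for activations and runtime buffers, and KV-cache memory, which grows with context length, is excluded; that exclusion (plus the preference for power-of-two tensor-parallel sizes) is why the official 8x H100 figure exceeds this weights-only lower bound.

```python
import math

def min_gpus_for_weights(total_params_b, bytes_per_param, gpu_mem_gb,
                         overhead=1.2):
    """Rough lower bound on GPUs needed just to hold the weights.

    `overhead` is an assumed multiplier for activations and runtime
    buffers; KV cache for long contexts is NOT included.
    """
    weight_gb = total_params_b * bytes_per_param  # 1B params * 1 byte = 1 GB
    return math.ceil(weight_gb * overhead / gpu_mem_gb)

# FP8 weights (1 byte/param) on 80 GB H100-class GPUs:
# 357 GB of weights * 1.2 -> at least 6 GPUs for the weights alone.
print(min_gpus_for_weights(357, 1, 80))  # 6
```

Halving precision roughly halves the footprint, which is how aggressive GGUF quantizations (Q2-Q4) bring the model within reach of much smaller clusters, at an accuracy cost the official documentation does not quantify.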
Versioning Drift
The model follows a clear versioning path (4.5 to 4.6 to 4.7), and changelogs highlight major improvements like context expansion and token efficiency. However, there is a lack of detailed documentation regarding 'silent' updates to the safety filters or alignment layers, which users have noted can affect behavior over time. There is no formal system for accessing specific sub-versions or 'snapshots' once a new iteration is pushed to the main API.