Total Parameters
754B
Context Length
200K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
MIT
Release Date
7 Apr 2026
Knowledge Cutoff
-
Attention
Attention Structure
Multi-Layer Attention
Attention Heads
64
Key-Value Heads
64
Attention Head Dimension
64
Position Embedding
RoPE
RoPE Theta
1,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
6,144
Number of Layers
78
FFN Intermediate Size (Dense)
2,048
Multi-Token Prediction Heads
1
Tokenizer
Vocabulary Size
154,880
Mixture of Experts
Total Expert Parameters
40.0B
Number of Experts
257
Active Experts
9
Shared Experts
1
FFN Intermediate Size (per Expert)
2,048
Dense Layers Before MoE
3
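As a rough illustration of what the RoPE Theta and Attention Head Dimension entries imply, the sketch below computes the standard rotary inverse-frequency schedule, theta^(-2i/d). It assumes rotary embedding is applied across the full 64-dimensional head, which the specification does not confirm; the function name is hypothetical.

```python
import math

def rope_inverse_frequencies(rotary_dim: int, theta: float) -> list[float]:
    """Return theta^(-2i/d) for i = 0 .. d/2 - 1 (the standard RoPE schedule)."""
    if rotary_dim % 2 != 0:
        raise ValueError("rotary_dim must be even")
    return [theta ** (-(2 * i) / rotary_dim) for i in range(rotary_dim // 2)]

# Values taken from the specification list above.
# Applying RoPE to all 64 head dimensions is an assumption.
ROTARY_DIM = 64
ROPE_THETA = 1_000_000.0

inv_freq = rope_inverse_frequencies(ROTARY_DIM, ROPE_THETA)

# Wavelength, in token positions, of each rotated dimension pair.
wavelengths = [2 * math.pi / f for f in inv_freq]

print(f"frequency pairs:     {len(inv_freq)}")
print(f"shortest wavelength: {wavelengths[0]:.2f} positions")
print(f"longest wavelength:  {wavelengths[-1]:,.0f} positions")
```

A large theta stretches the slowest-rotating pairs to wavelengths far beyond the context length, which is the usual motivation for raising it in long-context models.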
GLM-5.1 is Z.ai's flagship model for long-horizon agentic coding tasks. It is built on a novel GlmMoeDSA architecture with 754B total parameters across 78 layers, using 256 routed experts plus 1 shared expert, of which 8 routed experts and the shared expert are active per token. The architecture combines Gated DeltaNet linear attention with standard attention and sparse MoE feed-forward networks, which keeps inference efficient at this scale. It reports a state-of-the-art 58.4% on SWE-Bench Pro, along with 63.5% on Terminal-Bench 2.0, 95.3% on AIME 2026, and 86.2% on GPQA-Diamond. The model is designed for up to 8 hours of sustained autonomous execution, breaking complex engineering tasks into iterative experiment-analyze-optimize loops. It supports a 200K context window and 128K maximum output tokens, and is available via API as glm-5.1 on Z.ai and BigModel.cn. Released April 7, 2026 under the MIT license.
GLM-5.1 is Z.ai's next-generation flagship model for agentic engineering, built on a novel hybrid MoE architecture (GlmMoeDSA) combining Gated DeltaNet linear attention layers with standard attention and sparse MoE feed-forward networks. It achieves state-of-the-art performance on SWE-Bench Pro (58.4%) and is designed for long-horizon autonomous tasks, capable of sustained execution for up to 8 hours. With 754B total parameters and a 200K context window, GLM-5.1 delivers strong performance across coding, reasoning, tool use, and agentic benchmarks. Released open-source under the MIT License.
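The context and output limits quoted above can be turned into a simple client-side budget check. This is a minimal sketch with several assumptions: "200K" and "128K" are read as 200,000 and 128,000 tokens, generated tokens are assumed to count against the context window, and `output_budget` is a hypothetical helper, not part of any Z.ai SDK.

```python
CONTEXT_WINDOW = 200_000     # "200K" read as 200,000 tokens (assumption)
MAX_OUTPUT_TOKENS = 128_000  # "128K" read as 128,000 tokens (assumption)

def output_budget(prompt_tokens: int, requested_output: int) -> int:
    """Return the largest output length that fits both limits.

    Assumes prompt and generated tokens share one context window.
    """
    if prompt_tokens < 0 or requested_output < 0:
        raise ValueError("token counts must be non-negative")
    remaining = CONTEXT_WINDOW - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone fills the context window")
    return min(requested_output, MAX_OUTPUT_TOKENS, remaining)

# A long prompt leaves less room than the output cap allows.
print(output_budget(prompt_tokens=150_000, requested_output=128_000))
# A short prompt is limited by the output cap instead.
print(output_budget(prompt_tokens=10_000, requested_output=200_000))
```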
Rank
#5
| Benchmark | Score | Rank |
|---|---|---|
| Web Development (WebDev Arena) | 1532 | ⭐ 7 |
| General Text (Text Arena) | 1475 | ⭐ 7 |
Overall Rank
#5
Coding Rank
#18
Total Score
68 / 100
GLM-5.1 exhibits strong transparency in licensing and architectural configuration, providing clear details on its Mixture-of-Experts structure and permissive MIT license. However, significant gaps remain regarding the specific composition of its 28.5T token training set and the total compute resources consumed during training. While benchmark performance is well-documented, the lack of full evaluation code limits independent reproducibility.
Architectural Provenance
GLM-5.1 is explicitly documented as a post-training refinement of the GLM-5 base model. The architecture, 'GlmMoeDSA', is a sophisticated hybrid combining Mixture-of-Experts (MoE) with DeepSeek Sparse Attention (DSA) and Gated DeltaNet linear attention. Technical reports and GitHub documentation detail the use of 78 layers and a specific configuration of 256 routed experts plus 1 shared expert. The transition from GLM-5's 744B to 754B parameters is noted as being driven by architectural optimizations for long-horizon agentic tasks. While the high-level methodology is clear, the specific weights of the hybrid attention blending are not fully disclosed.
Dataset Composition
Information regarding the training data is limited to high-level metrics. Official sources state the model was pre-trained on 28.5 trillion tokens, an increase from the 23T used for GLM-4.5. However, the specific breakdown of data sources (e.g., proportions of code, web, or academic data) is not publicly disclosed. Documentation explicitly lists data collection and labeling methodologies as 'Undisclosed' in technical specifications, though it mentions the use of multi-turn SFT and RL for post-training.
Tokenizer Integrity
The model's tokenizer is publicly accessible via the official GitHub repository and Hugging Face. It has a vocabulary of 154,880 tokens, and the model is documented to support 200K context windows with 128K max output tokens. The tokenizer is compatible with standard runtimes such as vLLM and Transformers, which allows independent verification of tokenization behavior and language support.
Parameter Density
The model's parameter density is well documented: it has 754 billion total parameters, of which 40 billion are active per token (8 routed experts + 1 shared expert). The breakdown of 256 routed experts plus 1 shared expert is clearly stated. However, public documentation gives less detail on how parameters are distributed between the Gated DeltaNet linear attention layers and the standard attention layers.
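The routing pattern described here (8 of 256 routed experts per token, plus 1 always-on shared expert) can be sketched in a few lines. This is an illustration only: the gating function is an assumption, since the source does not say how router scores are normalized, and `route_token` is a hypothetical name.

```python
import math
import random

NUM_ROUTED = 256   # routed experts (from the text)
TOP_K = 8          # routed experts active per token (from the text)
NUM_SHARED = 1     # shared expert, always active (from the text)

def route_token(router_logits: list[float], top_k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the top_k routed experts and softmax-normalize their gate weights.

    Softmax over the selected logits is an illustrative choice; the
    model's real gating function is not described in the source.
    """
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:top_k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Random logits stand in for the router's output on one token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]
selected = route_token(logits)

active_experts = len(selected) + NUM_SHARED
print("routed experts chosen:", [i for i, _ in selected])
print("experts run for this token:", active_experts, "of", NUM_ROUTED + NUM_SHARED)

# Parameter-level sparsity, using the totals quoted in the text.
TOTAL_PARAMS = 754e9
ACTIVE_PARAMS = 40e9
print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

With the quoted totals, roughly one parameter in nineteen participates in any single forward pass, which is where the inference savings of a sparse MoE come from.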
Training Compute
Compute transparency is low. Z.ai discloses that the model was trained entirely on Huawei Ascend chips using a novel asynchronous RL infrastructure called 'slime', but specific hardware hours, total accelerator-days, and carbon footprint data are absent. Z.ai provides no public estimates of total training cost or energy consumption.
Benchmark Reproducibility
Z.ai provides comprehensive results across major benchmarks including SWE-Bench Pro (58.4%), Terminal-Bench 2.0 (63.5%), and AIME 2026 (95.3%). While the benchmarks are named and versions are often specified, the exact evaluation code and full prompt sets used to achieve these specific scores are not fully public. Third-party verification is limited to API-based testing on platforms like OpenRouter, rather than full independent reproduction of the training-to-eval pipeline.
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as GLM-5.1 and maintaining awareness of its versioning and specific focus on agentic engineering. It does not exhibit confusion with competitor models in official documentation or API deployments. It is transparent about its limitations, such as being a text-only model despite the existence of multimodal variants like GLM-5V.
License Clarity
The model is released under the MIT License, which is one of the most permissive and clear open-source licenses available. This allows for unrestricted commercial use, modification, and distribution. There are no conflicting proprietary terms found in the official weight release on Hugging Face or the source code on GitHub.
Hardware Footprint
Hardware requirements are documented for the 1.51TB FP16 weights. Documentation and community resources provide guidance on running the model via quantization (GGUF, EXL2) on various hardware configurations, including multi-GPU setups. While VRAM requirements for the full model are clear (requiring enterprise-grade clusters), more detailed documentation on the accuracy-performance tradeoffs for specific quantization levels (Q4/Q8) would improve this score.
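The 1.51TB figure follows directly from the parameter count at 2 bytes per weight. The sketch below repeats that arithmetic for nominal 8-bit and 4-bit quantization. It covers weights only (no KV cache or activations), ignores the per-block scales and metadata that real GGUF or EXL2 files add, and uses an 80 GB device size purely as an illustrative assumption.

```python
import math

TOTAL_PARAMS = 754e9  # total parameters, from the text

# Nominal bits per weight; real quantized files carry extra scales and metadata.
BITS_PER_WEIGHT = {"FP16": 16, "Q8": 8, "Q4": 4}

DEVICE_BYTES = 80e9  # hypothetical 80 GB accelerator, for illustration only

def weight_bytes(num_params: float, bits: int) -> float:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits / 8

for name, bits in BITS_PER_WEIGHT.items():
    size = weight_bytes(TOTAL_PARAMS, bits)
    devices = math.ceil(size / DEVICE_BYTES)
    print(f"{name}: {size / 1e12:.2f} TB of weights, at least {devices} x 80 GB devices")
```

These are lower bounds: serving at long context lengths adds substantial memory for the KV cache on top of the weights.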
Versioning Drift
Z.ai uses a clear versioning scheme (GLM-5 to GLM-5.1) and provides changelogs highlighting the 28% coding performance improvement and the introduction of 'thinking mode'. However, the frequency of silent updates to the API endpoints and the availability of long-term support for older versions are not fully transparent, making it difficult for developers to guarantee long-term behavior stability.