GLM-5.1

Total Parameters

754B

Context Length

200K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT

Release Date

7 Apr 2026

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Multi-Layer Attention

Attention Heads

64

Key-Value Heads

64

Attention Head Dimension

64

Position Embedding

RoPE

RoPE Theta

1,000,000

Sliding Window Attention

No

Sliding Window Size

-
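
As a rough illustration of what the position-embedding settings above imply, the sketch below computes standard RoPE inverse frequencies from the listed head dimension (64) and theta (1,000,000). It assumes the textbook RoPE formulation applied to the full head dimension; the source does not say whether GLM-5.1 uses partial rotary dimensions or any frequency scaling, so treat the function names and the result as illustrative only.

```python
import math

HEAD_DIM = 64            # "Attention Head Dimension" from the listing
ROPE_THETA = 1_000_000   # "RoPE Theta" from the listing


def rope_inv_freqs(head_dim: int, theta: float) -> list:
    """Standard RoPE inverse frequencies: theta^(-2i/d) for i = 0 .. d/2 - 1."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def rope_angles(position: int, inv_freqs: list) -> list:
    """Rotation angle (radians) applied to each dimension pair at a position."""
    return [position * f for f in inv_freqs]


inv_freqs = rope_inv_freqs(HEAD_DIM, ROPE_THETA)

# Wavelength (in token positions) of the slowest-rotating dimension pair.
longest_wavelength = 2 * math.pi / inv_freqs[-1]

print(f"dimension pairs: {len(inv_freqs)}")
print(f"longest wavelength: {longest_wavelength:,.0f} positions")
```

Under these assumptions the slowest dimension pair completes one rotation over far more positions than the 200K context length, which is the usual motivation for choosing a large theta.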

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

6,144

Number of Layers

78

FFN Intermediate Size (Dense)

2,048

Multi-Token Prediction Heads

1
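
The attention and dimension figures listed so far are enough for a back-of-the-envelope KV-cache estimate. The sketch below assumes every one of the 78 layers keeps a conventional key/value cache with 64 KV heads of dimension 64 in 16-bit precision. That is an assumption: the model description mentions linear and sparse attention components, which would store less, so the result is best read as an upper bound rather than a measured figure.

```python
LAYERS = 78       # "Number of Layers"
KV_HEADS = 64     # "Key-Value Heads"
HEAD_DIM = 64     # "Attention Head Dimension"


def kv_cache_bytes(tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` positions.

    The leading factor 2 accounts for storing both K and V per layer.
    """
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value * tokens


per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(200_000)  # assumes "200K" means 200,000 tokens

print(f"per token: {per_token:,} bytes")
print(f"200K-token context: {full_context / 1e9:.1f} GB")
```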

Tokenizer

Vocabulary Size

154,880

Mixture of Experts

Total Expert Parameters

40.0B

Number of Experts

257

Active Experts

9

Shared Experts

1

FFN Intermediate Size (per Expert)

2,048

Dense Layers Before MoE

3
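
To see how the MoE figures above relate to the headline parameter count, the sketch below estimates the expert feed-forward parameters. It assumes each expert is a bias-free SwiGLU block with three weight matrices (gate, up, down), that the shared expert has the same shape as a routed expert, and that all 75 layers after the 3 dense ones are MoE layers. None of these details are confirmed by the listing, so the totals are estimates.

```python
HIDDEN = 6_144            # "Hidden Dimension Size"
EXPERT_FFN = 2_048        # "FFN Intermediate Size (per Expert)"
NUM_EXPERTS = 257         # 256 routed + 1 shared
ACTIVE_EXPERTS = 9        # 8 routed + 1 shared
LAYERS = 78
DENSE_LAYERS = 3          # "Dense Layers Before MoE"

MOE_LAYERS = LAYERS - DENSE_LAYERS

# SwiGLU uses three projections: gate and up (hidden -> ffn), down (ffn -> hidden).
params_per_expert = 3 * HIDDEN * EXPERT_FFN

total_expert_params = params_per_expert * NUM_EXPERTS * MOE_LAYERS
active_expert_params = params_per_expert * ACTIVE_EXPERTS * MOE_LAYERS

print(f"per expert:       {params_per_expert / 1e6:.1f}M")
print(f"all experts:      {total_expert_params / 1e9:.1f}B")
print(f"active per token: {active_expert_params / 1e9:.1f}B")
```

Under these assumptions the expert weights account for most of the 754B total, while the expert weights used per token are a small fraction of that; attention, embeddings, and the dense layers make up the remainder.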

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE · Hidden: 6.1k · Context: 200K · Vocab: 154.9k) → 78 layers × [RMSNorm (Pre-Attention) → Multi-Layer Attention (64 Q / 64 KV heads, head dim: 64) → residual add → RMSNorm (Pre-FFN) → Sparse MoE FFN (9/257 experts, SwiGLU, intermediate: 2k) → residual add] → Final RMSNorm → Output Logits

GLM-5.1

GLM-5.1 is Z.ai's flagship model for long-horizon agentic coding tasks. It is built on the GlmMoeDSA architecture, with 754B total parameters across 78 layers and 256 routed experts plus 1 shared expert (8 routed + 1 shared active per token). The architecture combines Gated DeltaNet linear attention with standard attention and sparse MoE feed-forward networks to keep inference efficient. It achieves a state-of-the-art 58.4% on SWE-Bench Pro, along with 63.5% on Terminal-Bench 2.0, 95.3% on AIME 2026, and 86.2% on GPQA-Diamond. The model is designed for sustained autonomous execution of up to 8 hours, breaking complex engineering tasks into iterative experiment-analyze-optimize loops. It supports a 200K context window and 128K max output tokens, and is available via API as glm-5.1 on Z.ai and BigModel.cn. It was released on April 7, 2026 under the MIT license.
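
The "8 routed + 1 shared active per token" pattern can be sketched as a generic top-k router. This is a toy illustration, not Z.ai's implementation: the gating function (softmax over the selected logits), the scalar "experts", and every name below are assumptions made for the example.

```python
import math
import random

NUM_ROUTED = 256   # routed experts, per the description
TOP_K = 8          # routed experts active per token


def route(router_logits, top_k=TOP_K):
    """Pick the top_k experts and normalise their scores with a softmax."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)          # for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, routed_experts, shared_expert):
    """Shared expert always runs; routed experts are mixed by gate weight."""
    out = shared_expert(x)
    for idx, weight in route(router_logits):
        out += weight * routed_experts[idx](x)
    return out


def make_expert(scale):
    # Hypothetical stand-in for an expert FFN: scales a scalar input.
    return lambda x: scale * x


random.seed(0)
routed_experts = [make_expert(random.uniform(-1, 1)) for _ in range(NUM_ROUTED)]
shared_expert = make_expert(1.0)
logits = [random.gauss(0, 1) for _ in range(NUM_ROUTED)]

selected = route(logits)
print("experts used:", len(selected) + 1)   # 8 routed + 1 shared
print("output:", moe_forward(2.0, logits, routed_experts, shared_expert))
```

Only 9 of the 257 expert blocks run for a given token, which is how a sparse MoE keeps per-token compute well below what the total parameter count suggests.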

About GLM-5.1

GLM-5.1 is Z.ai's next-generation flagship model for agentic engineering, built on a hybrid MoE architecture (GlmMoeDSA) that combines Gated DeltaNet linear attention layers with standard attention and sparse MoE feed-forward networks. It achieves state-of-the-art performance on SWE-Bench Pro (58.4%) and is designed for long-horizon autonomous tasks, with sustained execution of up to 8 hours. With 754B total parameters and a 200K context window, GLM-5.1 delivers strong performance across coding, reasoning, tool use, and agentic benchmarks. It is released as open source under the MIT License.

Evaluation Benchmarks

Rank

#5

Category          Benchmark      Score   Rank
Web Development   WebDev Arena   1532    7
General Text      Text Arena     1475    7

Rankings

Overall Rank

#5

Coding Rank

#18

Model Integrity

Total Score

B

68 / 100

GLM-5.1 Model Integrity Report

Audit Note

GLM-5.1 exhibits strong transparency in licensing and architectural configuration, providing clear details on its Mixture-of-Experts structure and permissive MIT license. However, significant gaps remain regarding the specific composition of its 28.5T token training set and the total compute resources consumed during training. While benchmark performance is well-documented, the lack of full evaluation code limits independent reproducibility.

Upstream

21.0 / 30

Architectural Provenance

8.0 / 10

GLM-5.1 is explicitly documented as a post-training refinement of the GLM-5 base model. The architecture, 'GlmMoeDSA', is a sophisticated hybrid combining Mixture-of-Experts (MoE) with DeepSeek Sparse Attention (DSA) and Gated DeltaNet linear attention. Technical reports and GitHub documentation detail the use of 78 layers and a specific configuration of 256 routed experts plus 1 shared expert. The transition from GLM-5's 744B to 754B parameters is noted as being driven by architectural optimizations for long-horizon agentic tasks. While the high-level methodology is clear, the specific weights of the hybrid attention blending are not fully disclosed.

Dataset Composition

4.0 / 10

Information regarding the training data is limited to high-level metrics. Official sources state the model was pre-trained on 28.5 trillion tokens, an increase from the 23T used for GLM-4.5. However, the specific breakdown of data sources (e.g., proportions of code, web, or academic data) is not publicly disclosed. Documentation explicitly lists data collection and labeling methodologies as 'Undisclosed' in technical specifications, though it mentions the use of multi-turn SFT and RL for post-training.

Tokenizer Integrity

9.0 / 10

The model's tokenizer is publicly accessible via the official GitHub repository and Hugging Face. It has a vocabulary size of 154,880 tokens, and the model is documented to support 200K context windows with 128K max output tokens. The tokenizer is compatible with standard runtimes like vLLM and Transformers, allowing for independent verification of tokenization behavior and language support alignment.

Model

24.0 / 40

Parameter Density

7.0 / 10

The model's parameter density is well-documented: it features 754 billion total parameters with 40 billion active parameters per token (8 routed experts + 1 shared expert). The architectural breakdown of 256 routed experts plus 1 shared expert is clearly stated. However, while the total and active counts are provided, the parameter distribution between the Gated DeltaNet linear attention layers and the standard attention layers is not broken down in public documentation.

Training Compute

3.0 / 10

Compute transparency is low. While it is disclosed that the model was trained entirely on Huawei Ascend chips using a novel asynchronous RL infrastructure called 'slime', specific hardware hours, total GPU/TPU days, and carbon footprint data are absent. There are no public estimates of the total training cost or energy consumption provided by Z.ai.

Benchmark Reproducibility

5.0 / 10

Z.ai provides comprehensive results across major benchmarks including SWE-Bench Pro (58.4%), Terminal-Bench 2.0 (63.5%), and AIME 2026 (95.3%). While the benchmarks are named and versions are often specified, the exact evaluation code and full prompt sets used to achieve these specific scores are not fully public. Third-party verification is limited to API-based testing on platforms like OpenRouter, rather than full independent reproduction of the training-to-eval pipeline.

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, correctly identifying itself as GLM-5.1 and maintaining awareness of its versioning and specific focus on agentic engineering. It does not exhibit confusion with competitor models in official documentation or API deployments. It is transparent about its limitations, such as being a text-only model despite the existence of multimodal variants like GLM-5V.

Downstream

23.0 / 30

License Clarity

10.0 / 10

The model is released under the MIT License, which is one of the most permissive and clear open-source licenses available. This allows for unrestricted commercial use, modification, and distribution. There are no conflicting proprietary terms found in the official weight release on Hugging Face or the source code on GitHub.

Hardware Footprint

7.0 / 10

Hardware requirements are documented for the 1.51TB FP16 weights. Documentation and community resources provide guidance on running the model via quantization (GGUF, EXL2) on various hardware configurations, including multi-GPU setups. While VRAM requirements for the full model are clear (requiring enterprise-grade clusters), more detailed documentation on the accuracy-performance tradeoffs for specific quantization levels (Q4/Q8) would improve this score.
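
The 1.51TB figure follows directly from the parameter count: 754B parameters at 16 bits each. The sketch below reproduces that arithmetic and extends it to 8-bit and 4-bit weights. It counts raw weight storage only, using decimal terabytes, and ignores quantization metadata (scales, zero points), KV cache, and activations, so real GGUF or EXL2 files will differ somewhat.

```python
TOTAL_PARAMS = 754e9  # total parameters, per the model description


def weight_terabytes(params: float, bits_per_param: float) -> float:
    """Raw weight storage in decimal terabytes (1 TB = 1e12 bytes)."""
    return params * bits_per_param / 8 / 1e12


for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_terabytes(TOTAL_PARAMS, bits):.3f} TB")
```

Even at 4 bits per weight the raw weights are in the hundreds of gigabytes, which is why the model requires multi-GPU setups at any quantization level.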

Versioning Drift

6.0 / 10

Z.ai uses a clear versioning scheme (GLM-5 to GLM-5.1) and provides changelogs highlighting the 28% coding performance improvement and the introduction of 'thinking mode'. However, the frequency of silent updates to the API endpoints and the availability of long-term support for older versions are not fully transparent, making it difficult for developers to guarantee long-term behavior stability.
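
The category subtotals and the overall score in this report can be checked by summing the individual criteria. The sketch below does that with the scores listed above; the dictionary layout is just one convenient way to organize them, and the rule that maps 68 to a "B" grade is not stated in the report, so it is not modeled here.

```python
# Criterion scores as listed in the report, each out of 10.
scores = {
    "Upstream": {
        "Architectural Provenance": 8.0,
        "Dataset Composition": 4.0,
        "Tokenizer Integrity": 9.0,
    },
    "Model": {
        "Parameter Density": 7.0,
        "Training Compute": 3.0,
        "Benchmark Reproducibility": 5.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 10.0,
        "Hardware Footprint": 7.0,
        "Versioning Drift": 6.0,
    },
}

subtotals = {category: sum(items.values()) for category, items in scores.items()}
maximums = {category: 10 * len(items) for category, items in scores.items()}
total = sum(subtotals.values())

for category in scores:
    print(f"{category}: {subtotals[category]} / {maximums[category]}")
print(f"Total: {total} / {sum(maximums.values())}")
```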
