Total Parameters
744B
Context Length
204.8K
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
MIT
Release Date
12 Feb 2026
Knowledge Cutoff
Dec 2025
Active Parameters
40.0B
Number of Experts
256
Active Experts
8
Attention Structure
Multi-Head Latent Attention (MLA)
Hidden Dimension Size
-
Number of Layers
80
Attention Heads
-
Key-Value Heads
-
Activation Function
-
Normalization
RMS Normalization
Position Embedding
Absolute Position Embedding
GLM-5 is a flagship multimodal foundation model developed by Z.ai, designed for complex systems engineering and long-horizon agentic workflows. Utilizing a Mixture-of-Experts (MoE) architecture, the model scales to 744 billion total parameters with approximately 40 billion parameters activated per token. This design facilitates high-capacity reasoning and specialized knowledge retrieval while maintaining the computational efficiency required for large-scale deployment. The model is trained on a massive 28.5 trillion token corpus, emphasizing high-quality code, technical documentation, and reasoning-dense data to support professional-grade software development and autonomous problem-solving.
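The per-token activation ratio implied by these figures can be checked with quick arithmetic (a sketch; only the headline numbers quoted in this card are used):

```python
# Quick consistency check of the headline MoE figures (sketch; uses only
# the numbers quoted in the card).
TOTAL_PARAMS = 744e9       # total parameters
ACTIVE_PARAMS = 40e9       # parameters activated per token
NUM_EXPERTS = 256
ROUTED_EXPERTS = 8

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
routed_ratio = ROUTED_EXPERTS / NUM_EXPERTS

print(f"active fraction per token: {active_fraction:.1%}")  # 5.4%
print(f"routed-expert ratio:       {routed_ratio:.1%}")     # 3.1%
# The active fraction (~5.4%) exceeds the routed-expert ratio (~3.1%)
# because dense components -- attention, embeddings, the shared expert --
# run for every token regardless of routing.
```

The gap between the two ratios is expected: only the expert FFNs are sparsely activated, while the rest of the network is dense.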
Technically, GLM-5 introduces several architectural innovations, most notably the integration of DeepSeek Sparse Attention (DSA). This mechanism optimizes the standard attention block by dynamically allocating computational resources, which significantly reduces the memory and compute overhead associated with processing long sequences. Additionally, the model leverages an asynchronous reinforcement learning infrastructure known as 'slime' during post-training. This framework decouples generation from training to improve iteration throughput, allowing the model to learn effectively from complex, multi-step interactions and dynamic environments.
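The cost saving behind sparse attention can be illustrated with a generic top-k variant, where each query attends only to its k highest-scoring keys. This is a minimal sketch of the general idea, not a reproduction of DSA's actual selection mechanism:

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k=4):
    """Each query attends only to its top_k highest-scoring keys.

    Generic top-k sparse attention sketch -- illustrative of restricting
    each query to a small key subset, not of DSA's exact mechanism.
    Shapes: q (n_q, d), k (n_k, d), v (n_k, d).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                         # (n_q, n_k)
    # Threshold at each row's k-th largest score; mask out the rest.
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k:].min(axis=-1, keepdims=True)
    masked = np.where(scores >= kth, scores, -np.inf)
    # Softmax over the surviving scores only.
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(2, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
out = topk_sparse_attention(q, k, v, top_k=4)
print(out.shape)  # (2, 8)
```

In a production kernel the selection step itself must be cheap (here it still scores all keys); the memory win for long sequences comes from never materializing full attention over the masked-out keys.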
Optimized for long-context stability, GLM-5 supports a context window of up to 204,800 tokens and is capable of generating up to 128,000 tokens in a single output. Its operational capabilities include advanced tool-use, real-time streaming, and structured output across frontend, backend, and data processing tasks. The model is released with open weights under the MIT License, enabling researchers and developers to perform local serving, fine-tuning, and integration into diverse agentic frameworks without vendor lock-in.
GLM-5 is the fifth generation of the General Language Model series developed by Z.ai. It represents a significant leap in multimodal foundation-model capabilities, combining advanced reasoning with long-horizon agentic performance across diverse systems engineering tasks.
| Benchmark | Score | Rank |
|---|---|---|
| Agentic Coding (LiveBench Agentic) | 0.55 | 🥉 3 |
| Web Development (WebDev Arena) | 1455 | ⭐ 6 |
| Mathematics (LiveBench Mathematics) | 0.83 | 10 |
| Reasoning (LiveBench Reasoning) | 0.69 | 15 |
Overall Rank
#16
Coding Rank
#15
Total Score
79 / 100
GLM-5 exhibits a high level of transparency, particularly regarding its complex MoE architecture and licensing. The technical documentation provides an unusually detailed breakdown of expert routing and attention mechanisms. While it excels in architectural and legal clarity, it remains less transparent about the specific environmental impact and total compute hours utilized during its massive 28.5T token training run.
Architectural Provenance
GLM-5 is extensively documented in the technical report 'GLM-5: from Vibe Coding to Agentic Engineering' (arXiv:2602.11354). The architecture is a Mixture-of-Experts (MoE) transformer with 80 layers and 256 experts. It introduces DeepSeek Sparse Attention (DSA) and Multi-Head Latent Attention (MLA) to optimize long-context performance. The post-training methodology, specifically the 'slime' asynchronous reinforcement learning infrastructure, is detailed with clear explanations of how generation and training are decoupled to improve iteration throughput.
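The decoupling described for 'slime' can be illustrated with a generic producer/consumer pattern. This is a single-process toy sketch of the idea only; the actual infrastructure is distributed across clusters and far more involved:

```python
import queue
import threading

# Toy illustration of decoupling generation from training: rollout
# workers push trajectories into a queue while the trainer consumes
# them asynchronously, so neither side blocks on the other's pace.
trajectories = queue.Queue(maxsize=8)
SENTINEL = None

def rollout_worker(n_episodes):
    """Stand-in for asynchronous trajectory generation."""
    for i in range(n_episodes):
        trajectories.put({"episode": i, "reward": i * 0.1})
    trajectories.put(SENTINEL)  # signal: no more data

def train_loop():
    """Consume trajectories as they arrive; each counts as one update."""
    updates = 0
    while (traj := trajectories.get()) is not SENTINEL:
        updates += 1  # stand-in for a gradient step on traj
    return updates

worker = threading.Thread(target=rollout_worker, args=(32,))
worker.start()
updates = train_loop()
worker.join()
print(updates)  # 32
```

The bounded queue is the key design element: it lets generation run ahead of training by a fixed amount, which is what improves iteration throughput without letting the training data grow arbitrarily stale.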
Dataset Composition
The model was trained on a 28.5 trillion token corpus. Documentation specifies a two-stage process: general pre-training (~27T tokens) and a mid-training phase for long-context and agentic data. The technical report provides a breakdown of data sources including web (refined via DCLM classifiers), code (Software Heritage snapshots), and math/science (papers and books). While specific percentage breakdowns for every category are not provided in a single table, the filtering and cleaning methodologies (PPL, deduplication, and LLM-based scoring) are well-documented.
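One of the cleaning steps mentioned, deduplication, can be sketched with simple content hashing. This shows only exact (normalized) dedup; the report's pipeline additionally uses perplexity filtering and LLM-based scoring, which are not reproduced here:

```python
import hashlib

def exact_dedup(docs):
    """Drop duplicates by hashing lightly normalized text.

    Illustrative exact-dedup sketch; real corpus pipelines typically add
    fuzzy dedup (e.g. MinHash) on top of this.
    """
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["def f(): pass", "Def f(): pass ", "print('hi')"]
deduped = exact_dedup(corpus)
print(len(deduped))  # 2 -- the first two normalize to the same hash
```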
Tokenizer Integrity
The tokenizer is publicly available via the official GitHub and Hugging Face repositories. It features a vocabulary size of 154,880 tokens. The tokenization approach is consistent with the GLM family's bilingual (Chinese/English) and code-heavy focus, and its integration is verified through public inference engines like vLLM and SGLang which support the model natively.
Parameter Density
Z.ai provides precise transparency regarding parameter density. The model contains 744 billion total parameters, with exactly 40 billion parameters activated per token. The architectural breakdown is highly detailed: 80 layers total, consisting of 3 dense layers and 77 MoE layers, with 256 experts (8 routed and 1 shared per MoE layer). This level of detail exceeds industry standards for MoE disclosure.
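Generic top-k routing over 256 experts can be sketched as follows. The gate here is random and the hidden size is illustrative; GLM-5's actual router weights, hidden dimension, load-balancing scheme, and shared-expert path are not public and are omitted:

```python
import numpy as np

def route_topk(hidden, gate_weights, k=8):
    """Pick k of n_experts per token via a learned gate.

    Generic MoE routing sketch: returns the chosen expert indices and
    softmax combine weights over the selected logits.
    hidden: (tokens, d), gate_weights: (d, n_experts).
    """
    logits = hidden @ gate_weights                     # (tokens, n_experts)
    topk_idx = np.argsort(logits, axis=-1)[:, -k:]     # k best experts per token
    sel = np.take_along_axis(logits, topk_idx, axis=-1)
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))  # softmax over selected
    w /= w.sum(axis=-1, keepdims=True)
    return topk_idx, w

rng = np.random.default_rng(1)
hidden = rng.normal(size=(4, 64))   # 4 tokens; hidden dim 64 is illustrative
gate = rng.normal(size=(64, 256))   # 256 experts, as documented
idx, w = route_topk(hidden, gate, k=8)
print(idx.shape, w.shape)  # (4, 8) (4, 8)
```

Each token's output would then be the weighted sum of its 8 routed experts' FFN outputs plus the always-on shared expert.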
Training Compute
While the technical report confirms the model was trained on Huawei Ascend AI chips (demonstrating hardware transparency), it lacks specific disclosure of total GPU/TPU hours, energy consumption, or carbon footprint calculations. The mention of 'massive resources' and 'different GPU clusters' for the slime infrastructure is descriptive but lacks the quantitative metrics required for a high score.
Benchmark Reproducibility
The model provides results for a wide array of benchmarks including SWE-bench Verified (77.8%), AIME 2026 (92.7%), and Vending Bench 2. Evaluation settings (temperature, top_p, context window) are disclosed in the technical report and model card. However, while evaluation code for some benchmarks is available in the repository, third-party verification for the most recent agentic benchmarks (like Vending Bench 2) is still emerging.
Identity Consistency
GLM-5 demonstrates high identity consistency, correctly identifying itself and its versioning in system prompts and documentation. It is transparent about its capabilities as a reasoning and agentic model and its limitations regarding long-context stability (200K tokens). There are no documented cases of the model claiming to be a competitor's product.
License Clarity
The model is released under the MIT License, which is explicitly stated in the official blog, the GitHub repository, and the Hugging Face model card. This is a highly permissive, standard open-source license with no conflicting commercial restrictions or 'open-ish' caveats, providing maximum clarity for derivative works and commercial use.
Hardware Footprint
Hardware requirements are well-documented for various precisions. Official documentation and community guides (e.g., Unsloth, vLLM) provide VRAM requirements for BF16 (~1.5TB), FP8 (~756GB), and various quantization levels (2-bit GGUF at 241GB). The impact of context length on KV-cache memory is also detailed, including the benefits of the DSA mechanism in reducing memory overhead.
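The quoted weight-memory figures are consistent with simple bytes-per-parameter arithmetic. This rough estimate covers weights only, ignoring KV cache, activations, and engine overhead:

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Rough weight footprint: parameters x bits per parameter.

    Ignores KV cache, activations, and serving-engine overhead, all of
    which push real VRAM usage higher than this floor.
    """
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("2-bit", 2)]:
    print(f"{name}: ~{weight_memory_gb(744, bits):,.0f} GB")
# BF16 -> 1,488 GB (~1.5 TB); FP8 -> 744 GB; 2-bit -> 186 GB.
# The quoted FP8 (~756 GB) and 2-bit GGUF (241 GB) figures sit above
# these floors because some tensors are kept at higher precision and
# format metadata adds overhead.
```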
Versioning Drift
The model follows a clear versioning path from GLM-4.5 to 5.0, and the GitHub repository maintains a basic changelog. However, as a relatively new release, there is limited long-term data on silent behavior drift or a formal deprecation policy for previous iterations. Semantic versioning is used, but the frequency of 'silent' weight updates in the early release phase remains a point of skepticism.