Total Parameters
744B
Context Length
204.8K
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
MIT
Release Date
12 Feb 2026
Knowledge Cutoff
Dec 2025
Active Parameters
40.0B
Number of Experts
256
Active Experts
8
Attention Structure
Multi-Head Latent Attention (MLA)
Hidden Dimension Size
-
Number of Layers
80
Attention Heads
-
Key-Value Heads
-
Activation Function
-
Normalization
RMS Normalization
Position Embedding
Absolute Position Embedding
GLM-5 is a flagship multimodal foundation model developed by Z.ai, designed for complex systems engineering and long-horizon agentic workflows. Utilizing a Mixture-of-Experts (MoE) architecture, the model scales to 744 billion total parameters with approximately 40 billion parameters activated per token. This design facilitates high-capacity reasoning and specialized knowledge retrieval while maintaining the computational efficiency required for large-scale deployment. The model is trained on a massive 28.5 trillion token corpus, emphasizing high-quality code, technical documentation, and reasoning-dense data to support professional-grade software development and autonomous problem-solving.
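The per-token activation ratio implied by these figures can be checked with quick arithmetic (a sketch; only the headline numbers quoted in this card are used):

```python
# Quick consistency check of the headline MoE figures (sketch; uses only
# the numbers quoted in the card).
TOTAL_PARAMS = 744e9       # total parameters
ACTIVE_PARAMS = 40e9       # parameters activated per token
NUM_EXPERTS = 256
ROUTED_EXPERTS = 8

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
routed_ratio = ROUTED_EXPERTS / NUM_EXPERTS

print(f"active fraction per token: {active_fraction:.1%}")  # 5.4%
print(f"routed-expert ratio:       {routed_ratio:.1%}")     # 3.1%
# The active fraction (~5.4%) exceeds the routed-expert ratio (~3.1%)
# because dense components -- attention, embeddings, the shared expert --
# run for every token regardless of routing.
```

The gap between the two ratios is expected: only the expert FFNs are sparsely activated, while the rest of the network is dense.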
Technically, GLM-5 introduces several architectural innovations, most notably the integration of DeepSeek Sparse Attention (DSA). This mechanism optimizes the standard attention block by dynamically allocating computational resources, which significantly reduces the memory and compute overhead associated with processing long sequences. Additionally, the model leverages an asynchronous reinforcement learning infrastructure known as 'slime' during post-training. This framework decouples generation from training to improve iteration throughput, allowing the model to learn effectively from complex, multi-step interactions and dynamic environments.
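The cost saving behind sparse attention can be illustrated with a generic top-k variant, where each query attends only to its k highest-scoring keys. This is a minimal sketch of the general idea, not a reproduction of DSA's actual selection mechanism:

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k=4):
    """Each query attends only to its top_k highest-scoring keys.

    Generic top-k sparse attention sketch -- illustrative of restricting
    each query to a small key subset, not of DSA's exact mechanism.
    Shapes: q (n_q, d), k (n_k, d), v (n_k, d).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                         # (n_q, n_k)
    # Threshold at each row's k-th largest score; mask out the rest.
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k:].min(axis=-1, keepdims=True)
    masked = np.where(scores >= kth, scores, -np.inf)
    # Softmax over the surviving scores only.
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(2, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
out = topk_sparse_attention(q, k, v, top_k=4)
print(out.shape)  # (2, 8)
```

In a production kernel the selection step itself must be cheap (here it still scores all keys); the memory win for long sequences comes from never materializing full attention over the masked-out keys.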
Optimized for long-context stability, GLM-5 supports a context window of up to 204,800 tokens and is capable of generating up to 128,000 tokens in a single output. Its operational capabilities include advanced tool-use, real-time streaming, and structured output across frontend, backend, and data processing tasks. The model is released with open weights under the MIT License, enabling researchers and developers to perform local serving, fine-tuning, and integration into diverse agentic frameworks without vendor lock-in.
GLM-5 is the fifth generation of the General Language Model series developed by Z.ai. It represents a significant leap in multimodal foundation-model capabilities, combining advanced reasoning with long-horizon agentic performance across diverse systems engineering tasks.
| Benchmark | Score | Rank |
|---|---|---|
| Agentic Coding (LiveBench Agentic) | 0.55 | 🥉 3 |
| Web Development (WebDev Arena) | 1455 | ⭐ 6 |
| Mathematics (LiveBench Mathematics) | 0.83 | 10 |
| Reasoning (LiveBench Reasoning) | 0.69 | 15 |
Overall Rank
#16
Coding Rank
#15
Total Score
79 / 100
GLM-5 exhibits a high level of transparency, particularly regarding its complex MoE architecture and licensing. The technical documentation provides an unusually detailed breakdown of expert routing and attention mechanisms. While it excels in architectural and legal clarity, it remains less transparent about the specific environmental impact and total compute hours utilized during its massive 28.5T token training run.
Architectural Provenance
GLM-5 is extensively documented in the technical report 'GLM-5: from Vibe Coding to Agentic Engineering' (arXiv:2602.11354). The architecture is a Mixture-of-Experts (MoE) transformer with 80 layers and 256 experts. It introduces DeepSeek Sparse Attention (DSA) and Multi-Head Latent Attention (MLA) to optimize long-context performance. The post-training methodology, specifically the 'slime' asynchronous reinforcement learning infrastructure, is detailed with clear explanations of how generation and training are decoupled to improve iteration throughput.
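The decoupling described for 'slime' can be illustrated with a generic producer/consumer pattern. This is a single-process toy sketch of the idea only; the actual infrastructure is distributed across clusters and far more involved:

```python
import queue
import threading

# Toy illustration of decoupling generation from training: rollout
# workers push trajectories into a queue while the trainer consumes
# them asynchronously, so neither side blocks on the other's pace.
trajectories = queue.Queue(maxsize=8)
SENTINEL = None

def rollout_worker(n_episodes):
    """Stand-in for asynchronous trajectory generation."""
    for i in range(n_episodes):
        trajectories.put({"episode": i, "reward": i * 0.1})
    trajectories.put(SENTINEL)  # signal: no more data

def train_loop():
    """Consume trajectories as they arrive; each counts as one update."""
    updates = 0
    while (traj := trajectories.get()) is not SENTINEL:
        updates += 1  # stand-in for a gradient step on traj
    return updates

worker = threading.Thread(target=rollout_worker, args=(32,))
worker.start()
updates = train_loop()
worker.join()
print(updates)  # 32
```

The bounded queue is the key design element: it lets generation run ahead of training by a fixed amount, which is what improves iteration throughput without letting the training data grow arbitrarily stale.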
Dataset Composition
The model was trained on a 28.5 trillion token corpus. Documentation specifies a two-stage process: general pre-training (~27T tokens) and a mid-training phase for long-context and agentic data. The technical report provides a breakdown of data sources including web (refined via DCLM classifiers), code (Software Heritage snapshots), and math/science (papers and books). While specific percentage breakdowns for every category are not provided in a single table, the filtering and cleaning methodologies (PPL, deduplication, and LLM-based scoring) are well-documented.
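One of the cleaning steps mentioned, deduplication, can be sketched with simple content hashing. This shows only exact (normalized) dedup; the report's pipeline additionally uses perplexity filtering and LLM-based scoring, which are not reproduced here:

```python
import hashlib

def exact_dedup(docs):
    """Drop duplicates by hashing lightly normalized text.

    Illustrative exact-dedup sketch; real corpus pipelines typically add
    fuzzy dedup (e.g. MinHash) on top of this.
    """
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["def f(): pass", "Def f(): pass ", "print('hi')"]
deduped = exact_dedup(corpus)
print(len(deduped))  # 2 -- the first two normalize to the same hash
```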
Tokenizer Integrity
The tokenizer is publicly available via the official GitHub and Hugging Face repositories. It features a vocabulary size of 154,880 tokens. The tokenization approach is consistent with the GLM family's bilingual (Chinese/English) and code-heavy focus, and its integration is verified through public inference engines like vLLM and SGLang which support the model natively.
Parameter Density
Z.ai provides precise transparency regarding parameter density. The model contains 744 billion total parameters, with exactly 40 billion parameters activated per token. The architectural breakdown is highly detailed: 80 layers total, consisting of 3 dense layers and 77 MoE layers, with 256 experts (8 routed and 1 shared per MoE layer). This level of detail exceeds industry standards for MoE disclosure.
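Generic top-k routing over 256 experts can be sketched as follows. The gate here is random and the hidden size is illustrative; GLM-5's actual router weights, hidden dimension, load-balancing scheme, and shared-expert path are not public and are omitted:

```python
import numpy as np

def route_topk(hidden, gate_weights, k=8):
    """Pick k of n_experts per token via a learned gate.

    Generic MoE routing sketch: returns the chosen expert indices and
    softmax combine weights over the selected logits.
    hidden: (tokens, d), gate_weights: (d, n_experts).
    """
    logits = hidden @ gate_weights                     # (tokens, n_experts)
    topk_idx = np.argsort(logits, axis=-1)[:, -k:]     # k best experts per token
    sel = np.take_along_axis(logits, topk_idx, axis=-1)
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))  # softmax over selected
    w /= w.sum(axis=-1, keepdims=True)
    return topk_idx, w

rng = np.random.default_rng(1)
hidden = rng.normal(size=(4, 64))   # 4 tokens; hidden dim 64 is illustrative
gate = rng.normal(size=(64, 256))   # 256 experts, as documented
idx, w = route_topk(hidden, gate, k=8)
print(idx.shape, w.shape)  # (4, 8) (4, 8)
```

Each token's output would then be the weighted sum of its 8 routed experts' FFN outputs plus the always-on shared expert.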
Training Compute
While the technical report confirms the model was trained on Huawei Ascend AI chips (demonstrating hardware transparency), it lacks specific disclosure of total GPU/TPU hours, energy consumption, or carbon footprint calculations. The mention of 'massive resources' and 'different GPU clusters' for the slime infrastructure is descriptive but lacks the quantitative metrics required for a high score.
Benchmark Reproducibility
The model provides results for a wide array of benchmarks including SWE-bench Verified (77.8%), AIME 2026 (92.7%), and Vending Bench 2. Evaluation settings (temperature, top_p, context window) are disclosed in the technical report and model card. However, while evaluation code for some benchmarks is available in the repository, third-party verification for the most recent agentic benchmarks (like Vending Bench 2) is still emerging.
Identity Consistency
GLM-5 demonstrates high identity consistency, correctly identifying itself and its versioning in system prompts and documentation. It is transparent about its capabilities as a reasoning and agentic model and its limitations regarding long-context stability (200K tokens). There are no documented cases of the model claiming to be a competitor's product.
License Clarity
The model is released under the MIT License, which is explicitly stated in the official blog, the GitHub repository, and the Hugging Face model card. This is a highly permissive, standard open-source license with no conflicting commercial restrictions or 'open-ish' caveats, providing maximum clarity for derivative works and commercial use.
Hardware Footprint
Hardware requirements are well-documented for various precisions. Official documentation and community guides (e.g., Unsloth, vLLM) provide VRAM requirements for BF16 (~1.5TB), FP8 (~756GB), and various quantization levels (2-bit GGUF at 241GB). The impact of context length on KV-cache memory is also detailed, including the benefits of the DSA mechanism in reducing memory overhead.
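The quoted weight-memory figures are consistent with simple bytes-per-parameter arithmetic. This rough estimate covers weights only, ignoring KV cache, activations, and engine overhead:

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Rough weight footprint: parameters x bits per parameter.

    Ignores KV cache, activations, and serving-engine overhead, all of
    which push real VRAM usage higher than this floor.
    """
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("2-bit", 2)]:
    print(f"{name}: ~{weight_memory_gb(744, bits):,.0f} GB")
# BF16 -> 1,488 GB (~1.5 TB); FP8 -> 744 GB; 2-bit -> 186 GB.
# The quoted FP8 (~756 GB) and 2-bit GGUF (241 GB) figures sit above
# these floors because some tensors are kept at higher precision and
# format metadata adds overhead.
```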
Versioning Drift
The model follows a clear versioning path from GLM-4.5 to 5.0, and the GitHub repository maintains a basic changelog. However, as a relatively new release, there is limited long-term data on silent behavior drift or a formal deprecation policy for previous iterations. Semantic versioning is used, but the frequency of 'silent' weight updates in the early release phase remains a point of skepticism.