
GLM-4.5-Air

Total Parameters

106B

Context Length

128K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT License

Release Date

28 Jul 2025

Knowledge Cutoff

Mar 2025

Technical Specifications

Active Parameters

12.0B

Number of Experts

129 (128 routed + 1 shared)

Active Experts

9

Attention Structure

Grouped-Query Attention (GQA)

Hidden Dimension Size

4096

Number of Layers

46

Attention Heads

96

Key-Value Heads

8

Activation Function

Swish

Normalization

RMS Normalization

Position Embedding

Rotary Position Embedding (RoPE)

GLM-4.5-Air

GLM-4.5-Air is a high-efficiency large language model developed by Z.ai as part of the GLM-4.5 series. It is designed to bridge the gap between massive-scale foundation models and the practical constraints of on-device or mid-range cloud deployments. Optimized primarily for agent-oriented workflows, the model prioritizes reasoning, complex instruction following, and code generation. It functions as a versatile engine for autonomous agents capable of multi-step planning and tool invocation, making it a viable selection for developers building sophisticated digital assistants and automated software engineering pipelines.

Architecturally, the model utilizes a sparse Mixture-of-Experts (MoE) framework, featuring 106 billion total parameters with only 12 billion active per forward pass. This design incorporates 128 routed experts and a specialized shared expert layer, activating 9 experts per token to maintain representational capacity while significantly reducing computational overhead. The transformer block is further enhanced by a Multi-Token Prediction (MTP) layer, which allows the model to predict several future tokens simultaneously. This implementation facilitates speculative decoding, which increases inference throughput and provides a responsive experience for real-time interactive applications.
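The routing scheme described above, top-8 of 128 routed experts plus one always-on shared expert, can be sketched in NumPy with toy dimensions. The router and expert weights below are random stand-ins for illustration, not the model's actual FFNs:

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 64          # toy hidden size (the real model uses 4096)
N_ROUTED = 128       # routed experts, as in GLM-4.5-Air
TOP_K = 8            # routed experts activated per token
# One shared expert is always active, giving 8 + 1 = 9 active experts.

def expert(seed, x):
    """A toy expert: a fixed random linear map standing in for an FFN."""
    w = np.random.default_rng(seed).standard_normal((HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)
    return x @ w

def moe_layer(x, router_w):
    """Route a token to its top-k routed experts plus the shared expert."""
    logits = x @ router_w                    # (N_ROUTED,) router scores
    top = np.argsort(logits)[-TOP_K:]        # indices of the k best experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                     # softmax over the selected experts
    out = expert(N_ROUTED, x)                # shared expert, always on
    for g, i in zip(gates, top):
        out += g * expert(i, x)              # weighted sum of routed experts
    return out, top

router_w = rng.standard_normal((HIDDEN, N_ROUTED))
token = rng.standard_normal(HIDDEN)
y, chosen = moe_layer(token, router_w)
print(len(chosen))   # 8 routed experts fire; the other 120 stay idle
```

Only the selected experts' weights participate in each forward pass, which is why the per-token compute tracks the 12B active parameters rather than the 106B total.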

Technical innovations in GLM-4.5-Air include the adoption of Grouped-Query Attention (GQA) with 96 attention heads and 8 key-value groups, reducing memory bandwidth requirements during long-context processing. The model supports a 128,000-token context window using Rotary Positional Embeddings (RoPE) and features a hybrid reasoning system. This system allows for a deliberate thinking mode, which executes a latent chain-of-thought process for analytical problem-solving, and a standard mode for immediate output. Native integration for function calling, web browsing, and code execution ensures the model can interact with external environments with high reliability.
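The GQA layout described above can be illustrated with the model's real head counts (96 query heads sharing 8 key-value heads) but a toy head dimension and sequence length:

```python
import numpy as np

N_Q_HEADS = 96      # query heads in GLM-4.5-Air
N_KV_HEADS = 8      # key-value heads (12 query heads share each KV head)
HEAD_DIM = 16       # toy head dimension for illustration
SEQ = 4             # toy sequence length

rng = np.random.default_rng(0)
q = rng.standard_normal((N_Q_HEADS, SEQ, HEAD_DIM))
k = rng.standard_normal((N_KV_HEADS, SEQ, HEAD_DIM))
v = rng.standard_normal((N_KV_HEADS, SEQ, HEAD_DIM))

# GQA: broadcast each KV head across its group of query heads.
group = N_Q_HEADS // N_KV_HEADS            # 12 query heads per KV head
k_full = np.repeat(k, group, axis=0)       # (96, SEQ, HEAD_DIM)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
out = weights @ v_full                           # (96, SEQ, HEAD_DIM)

# The KV cache stores only 8 heads instead of 96: a 12x memory saving,
# which is what eases bandwidth pressure at long context lengths.
print(out.shape)
```

The attention math is unchanged per query head; only the number of distinct K/V projections that must be cached shrinks.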

About GLM Family

General Language Models from Z.ai



Evaluation Benchmarks

Rank

#22

Benchmark | Score | Rank

MMLU Pro (Professional Knowledge) | 0.81 | 11

WebDev Arena (Web Development) | 1371 | 25

Rankings

Overall Rank

#22

Coding Rank

#35

Model Transparency

Total Score

B (70 / 100)

GLM-4.5-Air Transparency Report

Audit Note

GLM-4.5-Air demonstrates strong transparency regarding its MoE architecture and licensing, providing clear distinctions between active and total parameters under a permissive MIT license. However, the profile is weakened by a lack of disclosure regarding training compute resources and environmental impact, alongside only generalized descriptions of its 23-trillion-token training data. While reproducibility is supported by the release of specific agent trajectories, the broader evaluation suite remains partially opaque.

Upstream

22.0 / 30

Architectural Provenance

8.0 / 10

The model's architecture is extensively documented as a sparse Mixture-of-Experts (MoE) transformer. Technical details include the use of 128 routed experts with a specialized shared expert layer, Grouped-Query Attention (GQA) with 96 heads, and a Multi-Token Prediction (MTP) layer for speculative decoding. The implementation is integrated into the public 'transformers' library, and the 'Slime' training framework is open-sourced. While the base model is named and its evolution from the GLM-4 series is clear, the specific pretraining methodology for the 'Air' variant versus the flagship model is only partially differentiated in public reports.

Dataset Composition

5.0 / 10

Z.ai discloses a total training corpus of approximately 23 trillion tokens, with a breakdown of 15 trillion for general domain data and 8 trillion focused on code, reasoning, and math. Methodologies such as 'SemDeDup' for semantic deduplication and 'quality-aware sampling' are mentioned in technical reports. However, specific data sources (e.g., which specific web crawls or book datasets) are not named, and the exact proportions of the final mixture are not provided, falling short of high-transparency standards for dataset disclosure.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly available via the Hugging Face repository with a stated vocabulary size of 151,552 tokens. It is based on the tiktoken implementation and is fully integrated into standard inference frameworks like vLLM and SGLang. The vocabulary size and tokenization approach are consistent across official documentation and third-party implementation files.

Model

25.5 / 40

Parameter Density

8.5 / 10

The model clearly distinguishes between its total parameter count (106 billion) and its active parameters (12 billion per forward pass). It specifies the activation of 9 experts per token (8 routed + 1 shared). This level of detail for an MoE architecture is exemplary, though a full layer-by-layer parameter breakdown (e.g., attention vs. FFN weights) is not explicitly tabulated in the primary model card.
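The active-parameter accounting above reduces to simple arithmetic, shown here as a quick check:

```python
TOTAL_B = 106        # total parameters, billions
ACTIVE_B = 12        # active per forward pass, billions
ROUTED_TOP_K = 8     # routed experts selected per token
SHARED = 1           # always-on shared expert

active_experts = ROUTED_TOP_K + SHARED     # 9 experts fire per token
active_fraction = ACTIVE_B / TOTAL_B       # ~0.113 of weights touched per pass

print(active_experts, round(active_fraction, 3))
```

Roughly 11% of the weights participate in any single forward pass, which is the sparsity the per-token compute cost reflects.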

Training Compute

2.0 / 10

While the hardware used for fine-tuning and inference is mentioned (e.g., H100 and H200 clusters), there is no public disclosure of the total GPU/TPU hours required for the 23T token pretraining phase. Furthermore, there is no provided calculation of the carbon footprint or estimated total compute cost, which are requirements for a high score in this category.

Benchmark Reproducibility

6.0 / 10

Z.ai provides results for 12 industry-standard benchmarks and has released a specific dataset of 52 coding tasks with full agent trajectories for reproducibility. However, the evaluation code for the broader benchmark suite is not fully public, and some results rely on 'internal benchmarks' or specific frameworks (like Terminus) that require additional setup to verify. The use of 'Avg@32' for math benchmarks is noted but lacks the exact seeds/prompts for full bit-for-bit reproduction.
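Avg@k simply averages correctness over k independently sampled attempts per problem, rather than asking whether any attempt succeeded (pass@k). A minimal sketch, with made-up flags for one hypothetical problem:

```python
def avg_at_k(correct_flags):
    """Avg@k: mean pass rate over k sampled attempts on one problem.

    Unlike pass@k (did ANY attempt succeed?), Avg@k rewards models
    that are consistently right across samples.
    """
    return sum(correct_flags) / len(correct_flags)

# Hypothetical outcome for one math problem, sampled 32 times:
flags = [True] * 24 + [False] * 8
print(avg_at_k(flags))   # 0.75
```

Reproducing a reported Avg@32 figure bit-for-bit would additionally require the sampling seeds, temperature, and exact prompts, which is the gap noted above.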

Identity Consistency

9.0 / 10

The model consistently identifies as part of the GLM-4.5 series and maintains a clear distinction between its 'Thinking' and 'Non-Thinking' modes. It does not exhibit identity confusion with models from other providers (e.g., GPT or Claude) in official documentation or API responses. Versioning is clear within the Z.ai ecosystem.

Downstream

22.0 / 30

License Clarity

9.5 / 10

The model weights and associated code are released under the highly permissive MIT License, which explicitly allows for commercial use and derivative works. This is a significant improvement over previous GLM versions that used custom, more restrictive licenses. The license is clearly stated on Hugging Face, GitHub, and in official blog posts.

Hardware Footprint

7.5 / 10

VRAM requirements are well-documented for various precisions (BF16, FP8) and context lengths. Documentation specifies that the Air variant can run on 2x H100 GPUs for the full 128K context or even consumer-grade hardware (like a single RTX 4090) when using 4-bit quantization. While detailed quantization-accuracy tradeoff curves are missing, the basic requirements are realistic and verified by community testing.
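A back-of-the-envelope estimate of these requirements follows from the parameter count and weight precision alone, plus a KV-cache term. The head dimension of 128 used below is an assumption for illustration; it is not stated on this page:

```python
def weight_vram_gib(params_b: float, bits: int) -> float:
    """Memory for the model weights alone, in GiB."""
    return params_b * 1e9 * bits / 8 / 2**30

def kv_cache_gib(context: int, layers: int = 46, kv_heads: int = 8,
                 head_dim: int = 128, bits: int = 16) -> float:
    """KV cache: 2 tensors (K and V) per layer per KV head per token.
    head_dim=128 is an assumption, not a published figure."""
    return 2 * layers * kv_heads * head_dim * context * bits / 8 / 2**30

print(round(weight_vram_gib(106, 16), 1))   # BF16 weights: ~197.4 GiB
print(round(weight_vram_gib(106, 4), 1))    # 4-bit weights: ~49.4 GiB
print(round(kv_cache_gib(128_000), 1))      # 128K-token KV cache: ~22.5 GiB
```

Under these assumptions, FP8 weights (~98.7 GiB) plus a full-context KV cache fit comfortably on 2x H100, consistent with the documented configuration; 4-bit weights alone still exceed a single 24 GB consumer card, so single-GPU setups also depend on offloading or shorter contexts.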

Versioning Drift

5.0 / 10

The model uses a versioning system (GLM-4.5-Air), but a detailed, public changelog tracking specific weight updates or 'silent' alignment shifts is not maintained. While the release date is clear, there is no formal deprecation policy or historical version access for intermediate checkpoints during the post-training phase.

GPU Requirements

VRAM requirements scale with the chosen weight quantization (BF16, FP8, or 4-bit) and the configured context size, from 1K tokens up to the full 128K context. See the Hardware Footprint section above for representative configurations.