GLM-4-9B

Open Source

Open Weights

Parameters

9B

Context Length

128K

Modality

Text

Architecture

Dense

License

MIT License

Release Date

30 Jun 2024

Knowledge Cutoff

Apr 2024

System Requirements

VRAM requirements for different quantization methods and context sizes

1,024 tokens

20.44 GB VRAM

Consumer

1x RTX 4090

24GB VRAM

Datacenter

1x NVIDIA A100

80GB VRAM

Apple Silicon

1x Apple M3 Max

128GB unified memory

128,000 tokens

25.91 GB VRAM

Consumer

2x RTX 4090

24GB VRAM each

Datacenter

1x NVIDIA A100

80GB VRAM

Apple Silicon

1x Apple M3 Max

128GB unified memory

Evaluation Benchmarks

No evaluation benchmarks are available for GLM-4-9B.

About GLM-4-9B

GLM-4-9B is a 9-billion-parameter model in the General Language Model (GLM) series developed by Zhipu AI and the THUDM laboratory at Tsinghua University. It is designed to balance computational efficiency with language performance and supports 26 languages. Intended applications include high-throughput translation, automated content generation, and complex question answering. The weights are released under the MIT License, which permits broad community use and research.

GLM-4-9B is a dense transformer with several structural optimizations. It uses Grouped Query Attention (GQA) with 32 attention heads and 2 key-value heads, which shrinks the key-value cache and so reduces memory use during inference. The model was pre-trained on 10 trillion tokens with an autoregressive blank-infilling objective, which supports both prefix-based generation and bidirectional understanding. For long-context processing it uses Rotary Position Embeddings (RoPE), and its context window can be extended up to 128,000 tokens through YaRN (Yet another RoPE extensioN) scaling.
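To make the memory effect of GQA concrete, the following is a minimal sketch of the key-value cache size under the stated head counts. The head counts and head dimension come from this page, as does the layer count of 40 (reported in the Parameter Density section); the 2-byte cache entries (FP16/BF16) and the helper name `kv_cache_bytes` are assumptions for illustration, not the developers' method.

```python
# Sketch: KV-cache size for grouped-query vs. full multi-head attention.
# Assumption: cache entries are stored in 2 bytes (FP16/BF16).

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

LAYERS = 40      # layer count reported on this page
HEAD_DIM = 128   # attention head dimension
Q_HEADS = 32     # query heads (what full multi-head attention would cache)
KV_HEADS = 2     # key-value heads under GQA

gqa_per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
mha_per_token = kv_cache_bytes(1, LAYERS, Q_HEADS, HEAD_DIM)
gqa_full_context = kv_cache_bytes(128_000, LAYERS, KV_HEADS, HEAD_DIM)

print(f"GQA: {gqa_per_token / 1024:.0f} KiB per token")
print(f"MHA: {mha_per_token / 1024:.0f} KiB per token")
print(f"Reduction: {mha_per_token // gqa_per_token}x")
print(f"GQA cache at 128,000 tokens: {gqa_full_context / 2**30:.2f} GiB")
```

Under these assumptions the cache is 16 times smaller than it would be with 32 key-value heads, and a full 128,000-token cache comes to roughly 4.9 GiB. Real deployments add framework overhead and may use a different cache precision.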

Other architectural details include RMSNorm for layer normalization and the SiLU (Sigmoid Linear Unit) activation function, used inside a SwiGLU feed-forward network. Bias terms are omitted from all linear layers except the Query, Key, and Value projections, a choice intended to improve length extrapolation. The model is the foundation for specialized variants such as GLM-4-9B-Chat for human-aligned dialogue and GLM-4V-9B for multimodal vision-language tasks.
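The two building blocks named above can be sketched in a few lines of plain Python. This is an illustrative toy, not the model's implementation: the function names, the `eps` value, and the tiny weight matrices are all assumptions chosen so the example runs without any dependencies.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Divide x by its root mean square, then apply a learned per-channel gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a row-major matrix by a vector; no bias term is added."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x)), all projections bias-free."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

# Toy sizes: hidden size 2, intermediate size 3
# (the spec sheet lists 4,096 and 13,696 for the real model).
x = [1.0, -2.0]
normed = rms_norm(x, weight=[1.0, 1.0])
w_gate = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]   # hypothetical weights
w_up = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]    # hypothetical weights
w_down = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]    # hypothetical weights
print(swiglu_ffn(normed, w_gate, w_up, w_down))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so it needs only one reduction per vector. The SwiGLU block uses three weight matrices (gate, up, down) instead of the two in a classic feed-forward layer.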

Technical Specifications

Attention

Attention Structure

Grouped Query Attention (GQA)

Attention Heads

32

Key-Value Heads

2

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

Sliding Window Attention

Sliding Window Size

Sliding Window Ratio

Linear Attention

Linear Attention Ratio

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

4,096

Number of Layers

40

FFN Intermediate Size (Dense)

13,696

Multi-Token Prediction Heads

Tokenizer

Vocabulary Size

151,552

Model Integrity

Total Score

65 / 100

Upstream

21.0 / 30

Model

23.0 / 40

Downstream

20.5 / 30

GLM-4-9B Model Integrity Report

Total Score

65 / 100

Audit Note

GLM-4-9B exhibits strong transparency in its architectural design and tokenizer implementation, supported by a detailed technical report and open-source code. While its licensing and identity consistency are commendable, the model suffers from significant opacity regarding its training compute resources and the specific composition of its 10-trillion-token dataset. Reproducibility of benchmark results remains a challenge for the community due to incomplete disclosure of evaluation prompts and environments.

Upstream

21.0 / 30

Architectural Provenance

8.0 / 10

The GLM-4-9B architecture is documented in the 'ChatGLM: A Family of Large Language Models' technical report and the official GitHub repository. Both define the model as a dense transformer using Grouped Query Attention (GQA) with 32 attention heads and 2 KV heads. Specific structural choices, such as the removal of bias terms (except in QKV layers) for length extrapolation and the use of RMSNorm and SwiGLU, are explicitly stated. The pre-training objective, an autoregressive blank-infilling task, is a well-documented departure from standard causal decoders and makes the design lineage easy to trace.

Dataset Composition

4.0 / 10

While the total token count (10 trillion) and general language support (26 languages) are disclosed, the specific composition of the dataset lacks granular detail. The technical report mentions the data is 'mostly Chinese and English' but does not provide a percentage breakdown or name specific sources beyond general categories. There is no public documentation on the filtering, cleaning, or deduplication methodologies used for the 10T token corpus, which is a significant gap for a model of this scale.

Tokenizer Integrity

9.0 / 10

The tokenizer is fully transparent and publicly accessible via the 'tokenization_chatglm.py' script on Hugging Face. It uses a Tiktoken-based implementation with a clearly stated vocabulary size of 151,552 tokens. The code provides the exact regex patterns for tokenization and handles special tokens explicitly. This level of detail allows for full verification of how the model processes its claimed 26 supported languages.

Model

23.0 / 40

Parameter Density

7.0 / 10

The model is explicitly identified as a dense (non-Mixture-of-Experts) architecture with 9 billion total parameters. The technical report and configuration files confirm the layer count (40) and hidden dimension size (4,096). Because the configuration files and source code are public, the split of parameters between attention and feed-forward networks can be verified directly.
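As a rough cross-check of the parameter distribution, the dimensions listed on this page can be plugged into a back-of-the-envelope count. The sketch below is an estimate under stated assumptions, not an official breakdown: it assumes separate (untied) input and output embedding matrices, biases only on the QKV projection, three SwiGLU matrices per layer, and two RMSNorm gains per layer plus a final one. The function name `count_parameters` is hypothetical.

```python
def count_parameters(hidden=4096, layers=40, ffn=13696, vocab=151552,
                     q_heads=32, kv_heads=2, head_dim=128,
                     tied_embeddings=False):
    """Estimate parameter counts for a dense GQA + SwiGLU transformer."""
    q_dim = q_heads * head_dim            # width of the query projection
    kv_dim = kv_heads * head_dim          # width of each of the K and V projections
    qkv_out = q_dim + 2 * kv_dim
    qkv = hidden * qkv_out + qkv_out      # QKV weights plus QKV biases
    attn_out = q_dim * hidden             # output projection, assumed bias-free
    attention = qkv + attn_out
    feed_forward = 3 * hidden * ffn       # gate, up, and down matrices
    norms = 2 * hidden                    # two RMSNorm gain vectors per layer
    per_layer = attention + feed_forward + norms
    # Assumption: untied embeddings means one input and one output matrix.
    embeddings = vocab * hidden * (1 if tied_embeddings else 2)
    total = layers * per_layer + embeddings + hidden  # + final RMSNorm gain
    return {
        "attention_per_layer": attention,
        "ffn_per_layer": feed_forward,
        "per_layer": per_layer,
        "embeddings": embeddings,
        "total": total,
    }

counts = count_parameters()
for name, value in counts.items():
    print(f"{name:>20}: {value:,}")
print(f"total in billions: {counts['total'] / 1e9:.2f}")
```

With these assumptions the estimate lands a little above 9 billion, with the feed-forward matrices accounting for most of each layer and the two embedding matrices contributing over a billion parameters because of the 151,552-token vocabulary.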

Training Compute

2.0 / 10

Transparency regarding training compute is very low. The technical report mentions that the model was trained with 'less training compute' than some competitors but fails to disclose specific GPU/TPU hours, hardware types used for the 10T token pre-training, or the duration of the training run. No information regarding the carbon footprint or estimated cost of training is provided.

Benchmark Reproducibility

5.0 / 10

The technical report provides scores for standard benchmarks (MMLU, GSM8K, HumanEval), but reproduction is hindered by limited disclosure of exact prompts and few-shot settings. While some evaluation code is available in the 'composite_demo' and 'LongAlign' repositories, third-party users have reported difficulties in matching official scores (e.g., on LongBench-Chat), suggesting gaps in the documented evaluation environment or sampling parameters.

Identity Consistency

9.0 / 10

The GLM-4-9B model demonstrates high identity consistency, correctly identifying itself as a product of Zhipu AI and the THUDM lab. It maintains clear versioning (e.g., distinguishing between the base 8K version and the 128K/1M context variants) and does not exhibit the identity confusion common in models that claim to be GPT-4 or other competitors.

Downstream

20.5 / 30

License Clarity

8.0 / 10

The model weights are released under the MIT License, which is a highly permissive and clear open-source license. This is confirmed by official communications from the Z.ai organization on Hugging Face. However, there has been some historical ambiguity in repository documentation where Apache 2.0 and custom terms were mentioned, though the current official stance for the 9B weights is clearly MIT.

Hardware Footprint

7.5 / 10

Hardware requirements are well-documented by both the developers and the community. Official documentation specifies VRAM requirements for inference (approx. 18-21GB for FP16) and provides guidance for quantization (INT4/INT8). The impact of context length on memory is also addressed, with specific mentions of YaRN scaling for extending the context window, though detailed memory-scaling curves are not provided.
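The weight-only part of these figures follows from simple arithmetic: parameter count times bytes per parameter. The sketch below assumes the 9 billion parameter figure from this page and idealized storage of 2, 1, and 0.5 bytes per parameter; it ignores activations, the KV cache, quantization metadata, and runtime overhead, so real usage will be higher. The names `BYTES_PER_PARAM` and `weight_memory_gb` are illustrative.

```python
# Assumption: idealized bytes per parameter, with no quantization metadata.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params, precision):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

NUM_PARAMS = 9e9  # parameter count as stated on this page

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(NUM_PARAMS, precision):.1f} GB")
```

The FP16 result of 18 GB sits at the lower end of the documented 18-21GB range, which is consistent with the remainder being cache and runtime overhead.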

Versioning Drift

5.0 / 10

Releases are identified by date-stamped suffixes (e.g., GLM-4-9B-0414) rather than semantic version numbers, and a changelog is maintained in the GitHub repository. However, updates are irregular, and version transitions (such as the update requiring transformers >= 4.44.0) are documented mainly through README changes rather than a formal, centralized versioning system. There is limited public data on performance drift over time following safety or alignment updates.
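The category and total scores in this report can be tallied from the individual criterion scores. The sketch below copies the scores from the report; the round-half-up rule used to reach the displayed total is an assumption, since the page does not state how it rounds.

```python
import math

# Criterion scores as listed in the report above.
scores = {
    "Upstream": [8.0, 4.0, 9.0],        # provenance, dataset, tokenizer
    "Model": [7.0, 2.0, 5.0, 9.0],      # density, compute, benchmarks, identity
    "Downstream": [8.0, 7.5, 5.0],      # license, hardware, versioning
}

subtotals = {category: sum(values) for category, values in scores.items()}
total = sum(subtotals.values())

# Assumption: the displayed total rounds half up. Python's built-in round()
# uses round-half-to-even and would give a different integer for a .5 value.
displayed_total = math.floor(total + 0.5)

for category, subtotal in subtotals.items():
    print(f"{category}: {subtotal}")
print(f"Exact total: {total}, displayed: {displayed_total} / 100")
```

The subtotals match the 21.0, 23.0, and 20.5 shown for the three categories, and their sum of 64.5 is consistent with the displayed total of 65 under the assumed rounding.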
