GLM-130B

Parameters

130B

Context Length

2K

Modality

Text

Architecture

Dense

License

Apache 2.0 (code); GLM-130B Model License (weights)

Release Date

4 Aug 2022

Knowledge Cutoff

Jul 2022

Technical Specifications

Attention

Attention Structure

Multi-Head Attention

Attention Heads

-

Key-Value Heads

-

Attention Head Dimension

-

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

-

Sliding Window Attention

-

Sliding Window Size

-

Normalization

Deep Normalization

Activation Function

GeGLU (GLU with GELU)

Dimensions

Hidden Dimension Size

12,288

Number of Layers

70

FFN Intermediate Size (Dense)

-

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

150,000

Architecture Diagram

[Architecture diagram: input tokens → token embedding with position information → 70 layers, each with DeepNorm, multi-head attention, residual add, DeepNorm, feed-forward network (GELU), residual add → final DeepNorm → output logits. Hidden size: 12,288; context: 2K.]

GLM-130B

GLM-130B is a bidirectional dense model with 130 billion parameters, developed for both English and Chinese. It is pre-trained with the General Language Model (GLM) algorithm, which uses an autoregressive blank infilling objective: random contiguous spans of text are masked, and the model then predicts the masked spans autoregressively. The model targets natural language processing tasks including text comprehension, generation, and translation.
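The blank infilling objective described above can be illustrated with a minimal sketch. The function name and the `[MASK]`, `<sop>`, and `<eop>` token strings are placeholders chosen for illustration, not the identifiers used in the GLM-130B codebase, and real training samples span positions randomly rather than taking them as arguments.

```python
def make_blank_infilling_example(tokens, span_start, span_len,
                                 mask="[MASK]", sop="<sop>", eop="<eop>"):
    """Build one training example by blanking out a contiguous span.

    Returns (context, decoder_input, target):
      context       - the corrupted text, visible with bidirectional attention
      decoder_input - start marker followed by the span tokens
      target        - the span tokens followed by an end marker,
                      predicted left to right
    """
    span = tokens[span_start:span_start + span_len]
    context = tokens[:span_start] + [mask] + tokens[span_start + span_len:]
    decoder_input = [sop] + span
    target = span + [eop]
    return context, decoder_input, target


tokens = ["GLM", "is", "a", "bilingual", "model"]
context, decoder_input, target = make_blank_infilling_example(tokens, 2, 2)
print(context)        # ['GLM', 'is', '[MASK]', 'model']
print(decoder_input)  # ['<sop>', 'a', 'bilingual']
print(target)         # ['a', 'bilingual', '<eop>']
```

At each step the model sees the full corrupted context plus the span tokens generated so far, which is what lets one model combine bidirectional understanding with autoregressive generation.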

The architecture of GLM-130B includes several choices aimed at training stability and inference efficiency at this scale. It uses Rotary Positional Encoding (RoPE) for positional embeddings and, in its feed-forward networks (FFNs), a Gated Linear Unit (GLU) with the Gaussian Error Linear Unit (GeLU) activation, a combination known as GeGLU. For layer normalization it uses DeepNorm, a Post-Layer Normalization (Post-LN) variant that the authors report stabilizes the training of large language models.
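A minimal sketch of a DeepNorm-style residual connection may help make the Post-LN idea concrete. It assumes the form LayerNorm(α · x + sublayer(x)) with α = (2N)^(1/2) for N layers, which is how the scaling is commonly described for this model; treat that constant as an assumption. The layer norm here has no learnable gain or bias, and the lambda stands in for an attention or FFN sublayer.

```python
import math


def deepnorm_alpha(num_layers):
    # Assumed residual scaling factor: alpha = sqrt(2 * N).
    return math.sqrt(2 * num_layers)


def layer_norm(x, eps=1e-5):
    # Plain layer normalization over one vector, without learnable parameters.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def deepnorm_residual(x, sublayer, num_layers):
    # Post-LN: normalize AFTER adding the up-weighted residual to the sublayer output.
    alpha = deepnorm_alpha(num_layers)
    y = sublayer(x)
    return layer_norm([alpha * a + b for a, b in zip(x, y)])


x = [1.0, 2.0, 3.0, 4.0]
out = deepnorm_residual(x, lambda v: [0.5 * t for t in v], num_layers=70)
print(round(deepnorm_alpha(70), 3))  # 11.832
```

Up-weighting the residual branch keeps each sublayer's contribution small relative to the running hidden state, which is the intuition usually given for why this variant trains more stably than plain Post-LN.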

GLM-130B is designed so that inference fits on a single server: either 8 A100 GPUs (40 GB each) or 8 V100 GPUs (32 GB each). With INT4 quantization, inference is also possible on more accessible hardware, such as a single server with 4 RTX 3090 (24 GB) GPUs, with minimal performance degradation. The model was trained on over 400 billion text tokens, split equally between English and Chinese.

About GLM Family

General Language Models from Z.ai


Other GLM Family Models

Evaluation Benchmarks

No evaluation benchmarks for GLM-130B available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B+

72 / 100

GLM-130B Model Integrity Report

Audit Note

GLM-130B exhibits high transparency in its architectural design and hardware requirements, providing detailed technical specifications that exceed industry standards for large-scale models. However, its transparency profile is significantly weakened by a restrictive and legally ambiguous weight license and limited disclosure regarding the fine-grained cleaning of its Chinese training data. While the model is highly verifiable from a technical standpoint, its downstream utility is constrained by these licensing complexities.

Upstream

24.0 / 30

Architectural Provenance

8.5 / 10

The model's architecture is extensively documented in the ICLR 2023 paper 'GLM-130B: An Open Bilingual Pre-trained Model'. It explicitly details the use of the General Language Model (GLM) algorithm with a bidirectional attention mechanism and an autoregressive blank infilling objective. Technical innovations such as Rotary Positional Encoding (RoPE), GeGLU activation, and DeepNorm for layer normalization are clearly described. The pre-training procedure, including the 3D parallel strategy (data, tensor, and pipeline parallelism), is thoroughly documented with specific configurations provided.
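As an illustration of the GeGLU activation named above, the following sketch implements a gated feed-forward block on plain Python lists. The weight names (`w_gate`, `w_up`, `w_down`) and the tiny identity matrices are hypothetical and chosen only to keep the example checkable by hand; biases are omitted.

```python
import math


def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(weight, vec):
    # weight is a list of rows; returns weight @ vec.
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def geglu_ffn(x, w_gate, w_up, w_down):
    # GeGLU: GELU(W_gate x) gates (W_up x) elementwise, then W_down projects back.
    gate = [gelu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


identity = [[1.0, 0.0], [0.0, 1.0]]
out = geglu_ffn([1.0, -1.0], identity, identity, identity)
print([round(v, 4) for v in out])  # [0.8413, 0.1587]
```

Compared with a standard two-matrix FFN, the gated form uses three weight matrices, so implementations typically shrink the intermediate width to keep the parameter count comparable.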

Dataset Composition

6.5 / 10

The training data sources are disclosed as a balanced bilingual corpus of 400 billion tokens (200B English, 200B Chinese). Specific datasets named include the 1.2T Pile (English) and 1.0T WuDaoCorpora (Chinese), along with 250GB of additional Chinese web data. While the general breakdown and sources are provided, detailed documentation on the specific filtering and cleaning methodologies for the custom-crawled Chinese data is less comprehensive than the documentation for the model architecture itself.

Tokenizer Integrity

9.0 / 10

GLM-130B uses the 'icetk' tokenizer, which is publicly available and specifically designed for bilingual (English/Chinese) and multimodal tasks. The vocabulary size is precisely stated as 150,000 tokens, with a clear breakdown of token categories (20,000 image tokens, 130,000 text tokens). The tokenizer's training on a 25GB bilingual corpus is documented, and its implementation is accessible via the official GitHub repository, allowing for full verification of its behavior and language support.

Model

30.5 / 40

Parameter Density

8.0 / 10

The model is a dense architecture with 130 billion parameters. The parameter count is consistently stated across all official documentation. Detailed architectural specifications are provided, including 70 transformer layers and a hidden state dimension of 12,288. As a dense model, all parameters are active during inference, and this is clearly distinguished from sparse MoE models in the technical report. The impact of INT4 and INT8 quantization on parameter representation is also well-documented.
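The stated layer count and hidden size can be sanity-checked against the 130B figure with a back-of-the-envelope estimate. This sketch assumes 4·h² attention weights and an FFN budget of 8·h² per layer (the usual sizing when a gated FFN is scaled to match a standard one), ignores biases and normalization weights, and takes the 150,000-token vocabulary from the tokenizer section; it is an approximation, not the official accounting.

```python
def estimate_dense_params(num_layers, hidden, vocab_size):
    """Rough parameter count for a dense Transformer (no biases, no norm weights)."""
    attention = 4 * hidden * hidden   # Q, K, V and output projections
    ffn = 8 * hidden * hidden         # assumed FFN budget per layer
    embeddings = vocab_size * hidden  # token embedding table
    return num_layers * (attention + ffn) + embeddings


total = estimate_dense_params(num_layers=70, hidden=12288, vocab_size=150_000)
print(f"{total / 1e9:.1f}B parameters")  # 128.7B parameters
```

The estimate lands within a few percent of 130 billion, which is consistent with the 70-layer, 12,288-dimension configuration reported here.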

Training Compute

7.5 / 10

The training hardware is explicitly disclosed as a cluster of 96 NVIDIA DGX-A100 (8x40G) nodes, or 768 GPUs in total. The training duration is stated as roughly 60 days (May 6 to July 3, 2022). The paper provides detailed compute metrics, including hardware FLOPs utilization (HFU) of 43.3% and model FLOPs utilization (MFU) of 32.5%. While specific carbon footprint calculations are not provided in the primary paper, the level of hardware and duration disclosure is significantly higher than industry averages.
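These figures can be cross-checked with a rough compute estimate. The sketch below assumes the common approximation of 6 FLOPs per parameter per training token and a peak of 312 TFLOPS per A100 for 16-bit tensor-core math; both are assumptions brought in for illustration and are not stated on this page.

```python
def training_days(params, tokens, num_gpus, peak_flops_per_gpu, mfu):
    """Estimate wall-clock training time in days from model FLOPs utilization."""
    total_flops = 6 * params * tokens                 # forward + backward approximation
    sustained = num_gpus * peak_flops_per_gpu * mfu   # useful FLOPs per second
    return total_flops / sustained / 86_400           # seconds per day


days = training_days(
    params=130e9,
    tokens=400e9,
    num_gpus=96 * 8,             # 96 nodes with 8 GPUs each
    peak_flops_per_gpu=312e12,   # assumed A100 16-bit peak
    mfu=0.325,
)
print(f"{days:.1f} days")
```

Under these assumptions the estimate comes out at about 46 days, the same order as the stated duration. The remaining gap could reflect restarts, warm-up, or evaluation time, though the page does not break this down.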

Benchmark Reproducibility

6.0 / 10

The official GitHub repository includes evaluation code and bash scripts to reproduce results across 30+ tasks. Benchmark versions (e.g., MMLU, LAMBADA, BIG-bench-lite) are specified. However, the exact few-shot prompts and examples used for all 112 mentioned tasks are not fully centralized in a single accessible document, and third-party reports indicate that reproducing exact scores can be challenging due to environment sensitivities and prompt variations.

Identity Consistency

9.0 / 10

The model consistently identifies as GLM-130B and is transparent about its bilingual capabilities and the specific GLM architecture. It does not claim to be a different model (like GPT-4) and its documentation clearly outlines its limitations compared to larger or instruction-tuned models. Versioning is maintained through the GitHub repository and associated technical reports.

Downstream

17.5 / 30

License Clarity

4.0 / 10

The licensing structure is fragmented and contains significant restrictions. While the code is under Apache 2.0, the model weights are governed by a separate 'GLM-130B Model License'. This license restricts use to non-commercial research purposes only and includes vague, legally complex clauses prohibiting acts that 'undermine China's national security'. These terms conflict with the 'open source' marketing often associated with the project and create significant ambiguity for international users.

Hardware Footprint

8.5 / 10

Hardware requirements are exceptionally well-documented. The team provides specific VRAM requirements for FP16 (260GB), INT8, and INT4 (70GB) precision. They explicitly state the hardware configurations needed for inference, such as a single 8x A100 server for FP16 or 4x RTX 3090 GPUs for INT4. Quantization accuracy tradeoffs are documented with specific benchmark deltas (e.g., -0.74% on LAMBADA for INT4), providing high transparency for downstream deployment.
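The FP16 figure follows directly from the parameter count, as the short calculation below shows. It counts weight storage only; activations, key-value cache, and framework overhead are not modeled, which is presumably why the quoted INT4 requirement (70GB) is somewhat above the raw weight size. The helper names are illustrative.

```python
def weight_memory_gb(params, bits_per_weight):
    # Weight storage only, in decimal gigabytes.
    return params * bits_per_weight / 8 / 1e9


def fits(num_gpus, gb_per_gpu, required_gb):
    # True if the pooled GPU memory covers the requirement.
    return num_gpus * gb_per_gpu >= required_gb


PARAMS = 130e9
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(PARAMS, bits):.0f} GB")
# FP16: 260 GB
# INT8: 130 GB
# INT4: 65 GB

print(fits(8, 40, 260))  # True:  8x A100 40G covers FP16 weights
print(fits(8, 32, 260))  # False: 8x V100 32G falls short of FP16 weights
print(fits(4, 24, 70))   # True:  4x RTX 3090 covers the quoted INT4 figure
```

Because 8 x 32 GB is 256 GB, slightly less than the 260GB FP16 requirement, the 8x V100 server mentioned earlier presumably relies on a quantized format rather than FP16.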

Versioning Drift

5.0 / 10

The model uses basic versioning (v1.0) and maintains a GitHub repository for updates. However, it lacks a formal semantic versioning system or a detailed public changelog for weight updates. While major milestones are noted in the paper and blog, there is limited transparency regarding minor weight adjustments or silent updates that might affect performance consistency over time.
