Parameters
130B
Context Length
2K
Modality
Text
Architecture
Dense
License
Apache 2.0 (code); GLM-130B Model License (weights)
Release Date
4 Aug 2022
Knowledge Cutoff
Jul 2022
Attention
Attention Structure
Multi-Head Attention
Attention Heads
-
Key-Value Heads
-
Attention Head Dimension
-
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
-
Sliding Window Attention
-
Sliding Window Size
-
Normalization
Deep Normalization
Activation Function
GeGLU (GLU with GELU)
Dimensions
Hidden Dimension Size
12,288
Number of Layers
70
FFN Intermediate Size (Dense)
-
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
150,000
GLM-130B is a bidirectional dense model featuring 130 billion parameters, developed for both English and Chinese language processing. This model is pre-trained using the General Language Model (GLM) algorithm, which employs an autoregressive blank infilling objective. This pre-training approach involves masking random continuous spans of text and subsequently predicting these masked segments autoregressively. This methodology contributes to its performance in various natural language processing tasks, including text comprehension, generation, and translation.
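The blank infilling objective described above can be illustrated with a minimal sketch. It builds a single training pair from a token list: the corrupted input with one span replaced by a mask, and the span itself as the autoregressive target. The token names `[MASK]`, `[sop]`, and `[eop]` and the helper `blank_infill_example` are illustrative assumptions, not the project's actual identifiers, and real training samples multiple spans at random.

```python
MASK = "[MASK]"   # illustrative placeholder for a masked span
START = "[sop]"   # hypothetical start-of-piece marker
END = "[eop]"     # hypothetical end-of-piece marker


def blank_infill_example(tokens, span_start, span_len):
    """Build one blank-infilling training pair from a list of tokens.

    part_a: the input with the chosen span collapsed into a single mask
            token (the context the model reads bidirectionally).
    decoder_input / decoder_target: the masked span, shifted by one
            position so each step predicts the next token of the span.
    """
    span = tokens[span_start:span_start + span_len]
    part_a = tokens[:span_start] + [MASK] + tokens[span_start + span_len:]
    decoder_input = [START] + span
    decoder_target = span + [END]
    return part_a, decoder_input, decoder_target


tokens = "the quick brown fox jumps over the lazy dog".split()
part_a, dec_in, dec_out = blank_infill_example(tokens, span_start=2, span_len=3)
print(part_a)   # corrupted context
print(dec_in)   # what the model conditions on while generating
print(dec_out)  # what it is trained to predict
```

Here the span "brown fox jumps" is removed from the context and must be regenerated token by token, ending with the end-of-piece marker.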
The architectural design of GLM-130B incorporates specific innovations to enhance training stability and inference efficiency for a model of its scale. It utilizes Rotary Positional Encoding (RoPE) for positional embeddings and integrates the Gated Linear Unit (GLU) with the Gaussian Error Linear Unit (GeLU) activation function within its Feed-Forward Networks (FFNs). The model also employs DeepNorm for layer normalization, a Post-Layer Normalization (Post-LN) technique, which has been shown to stabilize the training of large language models.
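As a rough illustration of two of these components, the sketch below implements a GeGLU feed-forward block and a DeepNorm-style Post-LN residual update on a single small vector in plain Python. The weights are made-up toy values, the layer norm omits learned scale and bias, and the scaling factor `alpha = (2 * 70) ** 0.5` is an assumption for illustration; none of this is taken from the released code.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def geglu_ffn(x, w_gate, w_up, w_down):
    """GeGLU FFN: down( GELU(gate(x)) * up(x) ), elementwise product."""
    gate = [gelu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def layer_norm(x, eps=1e-5):
    """Layer normalization without learned scale or bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def deepnorm_residual(x, sublayer_out, alpha):
    """Post-LN residual with an up-weighted skip path: LN(alpha * x + f(x))."""
    return layer_norm([alpha * a + b for a, b in zip(x, sublayer_out)])


# Toy weights: hidden size 2, FFN inner size 3 (illustrative values only).
w_gate = [[0.5, -0.1], [0.3, 0.8], [-0.6, 0.2]]
w_up = [[0.2, 0.4], [-0.7, 0.1], [0.9, -0.3]]
w_down = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]

x = [1.0, -2.0]
alpha = (2 * 70) ** 0.5  # assumed scaling for a 70-layer model
y = deepnorm_residual(x, geglu_ffn(x, w_gate, w_up, w_down), alpha)
print(y)
```

Note that GeGLU uses three weight matrices per FFN block rather than two, which matters when estimating parameter counts.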
GLM-130B supports fast inference on a single server with eight A100 (40 GB) or eight V100 (32 GB) GPUs. With INT4 quantization, it can also run on more accessible hardware, such as a single server with four RTX 3090 (24 GB) GPUs, with minimal performance degradation. The model was trained on over 400 billion text tokens, split evenly between English and Chinese.
General Language Models from Z.ai
No evaluation benchmarks for GLM-130B available.
Overall Rank
-
Coding Rank
-
Total Score
72 / 100
GLM-130B exhibits high transparency in its architectural design and hardware requirements, providing detailed technical specifications that exceed industry standards for large-scale models. However, its transparency profile is significantly weakened by a restrictive and legally ambiguous weight license and limited disclosure regarding the fine-grained cleaning of its Chinese training data. While the model is highly verifiable from a technical standpoint, its downstream utility is constrained by these licensing complexities.
Architectural Provenance
The model's architecture is extensively documented in the ICLR 2023 paper 'GLM-130B: An Open Bilingual Pre-trained Model'. It explicitly details the use of the General Language Model (GLM) algorithm with a bidirectional attention mechanism and an autoregressive blank infilling objective. Technical innovations such as Rotary Positional Encoding (RoPE), GeGLU activation, and DeepNorm for layer normalization are clearly described. The pre-training procedure, including the 3D parallel strategy (data, tensor, and pipeline parallelism), is thoroughly documented with specific configurations provided.
Dataset Composition
The training data sources are disclosed as a balanced bilingual corpus of 400 billion tokens (200B English, 200B Chinese). Specific datasets named include the 1.2T Pile (English) and 1.0T WuDaoCorpora (Chinese), along with 250GB of additional Chinese web data. While the general breakdown and sources are provided, detailed documentation on the specific filtering and cleaning methodologies for the custom-crawled Chinese data is less comprehensive than the documentation for the model architecture itself.
Tokenizer Integrity
GLM-130B uses the 'icetk' tokenizer, which is publicly available and specifically designed for bilingual (English/Chinese) and multimodal tasks. The vocabulary size is precisely stated as 150,000 tokens, with a clear breakdown of token categories (20,000 image tokens, 130,000 text tokens). The tokenizer's training on a 25GB bilingual corpus is documented, and its implementation is accessible via the official GitHub repository, allowing for full verification of its behavior and language support.
Parameter Density
The model is a dense architecture with 130 billion parameters. The parameter count is consistently stated across all official documentation. Detailed architectural specifications are provided, including 70 transformer layers and a hidden state dimension of 12,288. As a dense model, all parameters are active during inference, and this is clearly distinguished from sparse MoE models in the technical report. The impact of INT4 and INT8 quantization on parameter representation is also well-documented.
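A back-of-the-envelope check shows how the stated layer count and hidden size relate to the 130 billion figure. The sketch below assumes a GeGLU feed-forward block with an inner size of 8/3 times the hidden dimension (a common convention, not a value given on this page), uses the 150,000-token vocabulary stated for the tokenizer, and ignores biases and normalization parameters.

```python
hidden = 12_288        # hidden dimension (stated above)
layers = 70            # transformer layers (stated above)
vocab = 150_000        # vocabulary size stated for the tokenizer

# Assumption: GeGLU FFN inner size of 8/3 * hidden, not given on this page.
ffn_inner = 8 * hidden // 3

attn_params = 4 * hidden * hidden        # Q, K, V and output projections
ffn_params = 3 * hidden * ffn_inner      # gate, up and down projections
per_layer = attn_params + ffn_params
embedding = vocab * hidden               # input embedding table

total = layers * per_layer + embedding
print(f"FFN inner size: {ffn_inner}")
print(f"Per-layer parameters: {per_layer:,}")
print(f"Estimated total: {total / 1e9:.1f}B")
```

Under these assumptions the estimate lands a little under 130 billion, which is consistent with the stated count once the omitted terms are accounted for.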
Training Compute
The training hardware is explicitly disclosed as a cluster of 96 NVIDIA DGX-A100 (8x40G) nodes. The training duration is stated as 60 days (May 6 to July 3, 2022). The paper provides detailed compute metrics, including hardware FLOPs utilization (HFU) of 43.3% and model FLOPs utilization (MFU) of 32.5%. While specific carbon footprint calculations are not provided in the primary paper, the level of hardware and duration disclosure is significantly higher than industry averages.
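These figures can be cross-checked with a rough estimate. The sketch below assumes the common approximation of 6 FLOPs per parameter per training token, a peak of 312 TFLOP/s per A100 for dense 16-bit tensor operations, and the 400 billion training tokens stated in the dataset section. It is an order-of-magnitude sanity check, not a reconstruction of the team's own accounting.

```python
nodes = 96
gpus_per_node = 8
gpus = nodes * gpus_per_node

params = 130e9            # model parameters
tokens = 400e9            # training tokens (stated in the dataset section)
mfu = 0.325               # model FLOPs utilization (stated above)

# Assumptions: 6 * N * D training FLOPs, 312 TFLOP/s peak per A100.
train_flops = 6 * params * tokens
peak_per_gpu = 312e12

sustained = gpus * peak_per_gpu * mfu    # cluster-wide useful FLOP/s
seconds = train_flops / sustained
days = seconds / 86_400

print(f"GPUs: {gpus}")
print(f"Training FLOPs: {train_flops:.2e}")
print(f"Estimated pure compute time: {days:.1f} days")
```

The estimate comes out at roughly 46 days of pure compute, which fits within the stated 60-day window.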
Benchmark Reproducibility
The official GitHub repository includes evaluation code and bash scripts to reproduce results across 30+ tasks. Benchmark versions (e.g., MMLU, LAMBADA, BIG-bench-lite) are specified. However, the exact few-shot prompts and examples used for all 112 mentioned tasks are not fully centralized in a single accessible document, and third-party reports indicate that reproducing exact scores can be challenging due to environment sensitivities and prompt variations.
Identity Consistency
The model consistently identifies as GLM-130B and is transparent about its bilingual capabilities and the specific GLM architecture. It does not claim to be a different model (like GPT-4) and its documentation clearly outlines its limitations compared to larger or instruction-tuned models. Versioning is maintained through the GitHub repository and associated technical reports.
License Clarity
The licensing structure is fragmented and contains significant restrictions. While the code is under Apache 2.0, the model weights are governed by a separate 'GLM-130B Model License'. This license restricts use to non-commercial research purposes only and includes vague, legally complex clauses prohibiting acts that 'undermine China's national security'. These terms conflict with the 'open source' marketing often associated with the project and create significant ambiguity for international users.
Hardware Footprint
Hardware requirements are exceptionally well-documented. The team provides specific VRAM requirements for FP16 (260GB), INT8, and INT4 (70GB) precision. They explicitly state the hardware configurations needed for inference, such as a single 8x A100 server for FP16 or 4x RTX 3090 GPUs for INT4. Quantization accuracy tradeoffs are documented with specific benchmark deltas (e.g., -0.74% on LAMBADA for INT4), providing high transparency for downstream deployment.
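The weight-memory figures follow directly from the parameter count and the bits per weight. The sketch below computes weights-only memory in decimal gigabytes and compares it with the aggregate VRAM of the server configurations mentioned on this page. The helper names are hypothetical, and activations, KV cache, and framework overhead are deliberately ignored, so real requirements are higher than these numbers.

```python
PARAMS = 130e9  # model parameters


def weight_gb(bits_per_weight):
    """Weights-only memory in decimal GB for a given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9


def fits(bits_per_weight, num_gpus, gb_per_gpu):
    """True if the weights alone fit in the server's combined VRAM."""
    return weight_gb(bits_per_weight) <= num_gpus * gb_per_gpu


fp16_gb = weight_gb(16)
int8_gb = weight_gb(8)
int4_gb = weight_gb(4)
print(f"FP16: {fp16_gb:.0f} GB, INT8: {int8_gb:.0f} GB, INT4: {int4_gb:.0f} GB")

# Server configurations mentioned on this page.
print("8x A100 40G, FP16:", fits(16, 8, 40))
print("8x V100 32G, FP16:", fits(16, 8, 32))
print("8x V100 32G, INT8:", fits(8, 8, 32))
print("4x RTX 3090 24G, INT4:", fits(4, 4, 24))
```

The FP16 result matches the 260GB figure, and the raw INT4 weights come to 65GB, slightly below the quoted 70GB. By this weights-only estimate, an 8x V100 (32G) server has 256GB in total and would need INT8 rather than FP16.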
Versioning Drift
The model uses basic versioning (v1.0) and maintains a GitHub repository for updates. However, it lacks a formal semantic versioning system or a detailed public changelog for weight updates. While major milestones are noted in the paper and blog, there is limited transparency regarding minor weight adjustments or silent updates that might affect performance consistency over time.