Parameters
6B
Context Length
32,768 tokens (32K)
Modality
Text
Architecture
Dense
License
Custom License (ChatGLM2-6B License)
Release Date
25 Jun 2023
Knowledge Cutoff
-
Attention
Attention Structure
Multi-Query Attention (MQA)
Attention Heads
32
Key-Value Heads
2
Attention Head Dimension
128
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
-
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
4,096
Number of Layers
28
FFN Intermediate Size (Dense)
13,696
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
65,024
ChatGLM2-6B is a bilingual large language model for conversational use in Chinese and English. It is the second iteration of the ChatGLM series developed by THUDM, is built on the General Language Model (GLM) framework, and targets dialogue generation and cross-lingual text processing. Its architectural choices let it run on consumer-grade hardware, which makes it accessible to developers and researchers working in hardware-constrained environments.
The architecture is a dense transformer with several changes from its predecessor. The key one is Multi-Query Attention (MQA), which shares a small number of key-value heads (2) across all 32 query heads and so significantly reduces the memory footprint of the KV cache during inference. The model also uses Rotary Position Embeddings (RoPE) to encode token positions and RMSNorm for improved training stability. FlashAttention was used during pre-training, which allows the architecture to support a 32K-token context window and process extended dialogue histories.
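The KV cache saving from sharing key-value heads can be estimated with simple arithmetic. The sketch below uses the figures listed in the specifications (28 layers, 32 query heads, 2 key-value heads, head dimension 128) and assumes the cache is stored in FP16 (2 bytes per value); the class and function names are illustrative and not part of the ChatGLM2 code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionSpec:
    # Values taken from the specification list; names are hypothetical.
    num_layers: int = 28
    num_query_heads: int = 32
    num_kv_heads: int = 2
    head_dim: int = 128


def kv_cache_bytes(spec: AttentionSpec, kv_heads: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence.

    Assumes FP16 storage (2 bytes per value) unless told otherwise.
    The leading factor 2 accounts for the separate K and V tensors.
    """
    return 2 * spec.num_layers * kv_heads * spec.head_dim * seq_len * bytes_per_value


spec = AttentionSpec()
seq_len = 32_768  # full context length

# Hypothetical baseline: one KV head per query head (standard multi-head attention).
full_mha = kv_cache_bytes(spec, spec.num_query_heads, seq_len)
# Shared KV heads, as listed in the specifications.
shared_kv = kv_cache_bytes(spec, spec.num_kv_heads, seq_len)

print(f"one KV head per query head: {full_mha / 2**30:.2f} GiB")
print(f"2 shared KV heads:          {shared_kv / 2**30:.3f} GiB")
print(f"reduction factor:           {full_mha // shared_kv}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to key-value heads; actual memory use also depends on batch size, cache dtype, and the inference runtime.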
Operating with 6 billion parameters, ChatGLM2-6B provides a balanced profile of performance and efficiency. It was pre-trained on a diverse dataset comprising 1.4 trillion tokens and refined through human preference alignment to enhance its conversational quality. The model is particularly suited for applications such as intelligent virtual assistants and localized chatbots, where low-latency inference and bilingual proficiency are primary requirements. Its open-weights nature and support for INT4 quantization further expand its utility for local deployment and integration into specialized NLP pipelines.
ChatGLM series models from Z.ai, based on the GLM architecture.
Rank
#156
| Benchmark | Score | Rank |
|---|---|---|
| Web Development (WebDev Arena) | 1024 | 91 |
Overall Rank
#156
Coding Rank
#123
Total Score
62 / 100
ChatGLM2-6B exhibits strong transparency in its architectural framework and hardware requirements, providing clear documentation on its transition to Multi-Query Attention and its suitability for consumer-grade GPUs. However, the model suffers from significant opacity regarding its training data sources and compute resources, relying on vague descriptions of data quality rather than verifiable composition breakdowns. While the open-weights nature and accessible tokenizer support developer integration, the use of a restrictive custom license and lack of detailed evaluation methodologies hinder its standing as a fully transparent open-source project.
Architectural Provenance
The model is explicitly identified as the second generation of the ChatGLM series, built on the General Language Model (GLM) framework. Technical documentation and the official GitHub repository detail significant architectural modifications from the first generation, including the adoption of Multi-Query Attention (MQA) for KV cache efficiency, Rotary Position Embeddings (RoPE), and RMSNorm for stability. While the pre-training objective (hybrid GLM objective) is named, the specific layer-by-layer configuration is primarily accessible through the open-source code rather than a formal peer-reviewed technical paper for this specific version.
Dataset Composition
The provider discloses that the model was pre-trained on 1.4 trillion bilingual (Chinese and English) tokens. However, there is no detailed breakdown of the dataset composition (e.g., percentages of web, code, or books). The specific sources of the data are not named, and the filtering or cleaning methodologies are described only in vague terms like 'more and better data' or 'high-quality data.' No sample data or specific data collection protocols are publicly available.
Tokenizer Integrity
The tokenizer is publicly accessible via the official repository and Hugging Face (tokenization_chatglm.py). It uses a SentencePiece-based approach with a documented vocabulary size of 64,793 tokens. The implementation details, including special tokens like <bos>, <eos>, and <pad>, are clearly defined in the source code. It is specifically optimized for bilingual support. Minor inconsistencies between the tokenizer and model config vocabulary sizes have been noted in community issues; the specification list on this page, for example, gives a vocabulary size of 65,024.
Parameter Density
The model is clearly stated to have 6 billion parameters. As a dense transformer architecture, the active parameters equal the total parameters. The architectural choices, such as the use of Multi-Query Attention, provide a clear understanding of how parameters are distributed across the attention mechanism versus feed-forward networks. However, a precise numerical breakdown of parameter counts per component (e.g., exact FFN vs. Attention split) is not explicitly provided in summary documentation, though it can be derived from the code.
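As a rough illustration of how the split could be derived, the sketch below estimates parameter counts from the dimensions listed in the specifications. It rests on several assumptions that should be checked against the released code: a fused QKV projection with bias, an output projection without bias, a SwiGLU feed-forward block with a fused gate/up projection, two RMSNorm weight vectors per layer plus a final one, and separate (untied) input and output embedding matrices. All function and variable names are hypothetical.

```python
HIDDEN = 4096        # hidden dimension size
LAYERS = 28          # number of layers
FFN = 13_696         # FFN intermediate size
VOCAB = 65_024       # vocabulary size from the specification list
KV_HEADS = 2         # key-value heads
HEAD_DIM = 128       # attention head dimension


def attention_params() -> int:
    # Assumed fused QKV projection: full-width queries plus narrow shared K and V.
    qkv_out = HIDDEN + 2 * KV_HEADS * HEAD_DIM
    qkv = HIDDEN * qkv_out + qkv_out      # weight + bias (bias is an assumption)
    out_proj = HIDDEN * HIDDEN            # no bias assumed
    return qkv + out_proj


def ffn_params() -> int:
    # Assumed SwiGLU layout: gate and up projections fused, then a down projection.
    return HIDDEN * (2 * FFN) + FFN * HIDDEN


def total_params() -> int:
    norms_per_layer = 2 * HIDDEN          # two RMSNorm weight vectors per layer
    per_layer = attention_params() + ffn_params() + norms_per_layer
    embeddings = 2 * VOCAB * HIDDEN       # untied input and output embeddings (assumed)
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm


attn_total = LAYERS * attention_params()
ffn_total = LAYERS * ffn_params()
total = total_params()

print(f"attention: {attn_total / 1e9:.2f}B ({attn_total / total:.1%})")
print(f"FFN:       {ffn_total / 1e9:.2f}B ({ffn_total / total:.1%})")
print(f"total:     {total / 1e9:.2f}B")
```

Under these assumptions the estimate lands near 6.2 billion parameters, with the feed-forward blocks holding the large majority of the weights.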
Training Compute
There is almost no transparency regarding the training compute. The provider does not disclose the total GPU/TPU hours, the specific hardware cluster used for the 1.4T token training, or the training duration. Environmental impact metrics, such as carbon footprint or energy consumption, are entirely absent from official documentation.
Benchmark Reproducibility
The provider reports scores on several standard benchmarks (MMLU, C-Eval, GSM8K, BBH) and provides a comparison to the previous version. While the benchmarks are named, the exact evaluation prompts, few-shot settings, and specific versions of the datasets used are not fully documented in a reproducible format. Third-party verification is available through public leaderboards, but the lack of an official evaluation script or detailed methodology limits full reproducibility.
Identity Consistency
The model consistently identifies itself as ChatGLM2-6B or an AI assistant developed by the GLM team. It maintains a clear versioning identity distinct from its predecessor and successor (ChatGLM3). There are no widespread reports of the model claiming to be a competitor's product (like GPT-4) or denying its nature as an AI, though its self-knowledge is limited to what was included in its alignment training.
License Clarity
The model uses a custom 'ChatGLM2-6B License.' The weights are open for academic research, and free commercial use is permitted once users complete a registration questionnaire. The license includes specific restrictions related to use cases that might 'undermine China's national security,' which introduces some legal ambiguity for international users compared to standard OSI licenses like Apache 2.0.
Hardware Footprint
Hardware requirements are well-documented. The provider explicitly states VRAM needs for different precision levels, noting that 6GB of VRAM is sufficient for INT4 quantization. Documentation includes specific performance gains from MQA and FlashAttention. Quantization impact is acknowledged, and community-driven tools (like the Hugging Face Model Memory Utility) provide further verifiable data on memory scaling.
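The 6GB figure for INT4 is consistent with a back-of-the-envelope estimate of weight memory. The sketch below assumes a nominal 6 billion parameters and idealized storage costs per precision level; it ignores activations, the KV cache, quantization scale factors, and any tensors a quantized checkpoint keeps at higher precision, so real usage will be higher. The names are illustrative.

```python
# Idealized bytes per parameter for each precision level (assumption).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

NOMINAL_PARAMS = 6e9  # "6B" as stated on this page


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


for precision in ("fp16", "int8", "int4"):
    gib = weight_memory_gib(NOMINAL_PARAMS, precision)
    print(f"{precision}: ~{gib:.1f} GiB for weights")
```

The gap between the INT4 weight estimate and a 6GB card is the headroom left for activations and the KV cache.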
Versioning Drift
The model follows a clear generational versioning (ChatGLM -> ChatGLM2 -> ChatGLM3). However, within the ChatGLM2-6B lifecycle, updates to checkpoints and code are often pushed to the main branch of the repository without rigorous semantic versioning or detailed changelogs for minor revisions. This makes tracking silent performance drift or behavioral changes difficult for developers relying on the latest 'main' branch.
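One way developers might guard against this kind of drift is to pin the exact repository revision when loading the model. The sketch below assumes the Hugging Face `transformers` library and the repository id `THUDM/chatglm2-6b`; the `load_pinned` helper is hypothetical, and the caller must supply a real commit hash or tag, which is not given here.

```python
MODEL_ID = "THUDM/chatglm2-6b"  # assumed Hugging Face repository id


def load_pinned(revision: str):
    """Load tokenizer and model at a fixed repository revision.

    `revision` should be a specific commit hash or tag chosen by the caller.
    """
    if not revision or revision == "main":
        raise ValueError("pass a specific commit hash or tag, not a moving branch")

    # Imported lazily so the validation above works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_ID, trust_remote_code=True, revision=revision
    )
    model = AutoModel.from_pretrained(
        MODEL_ID, trust_remote_code=True, revision=revision
    )
    return tokenizer, model
```

Pinning both the tokenizer and the model to the same revision keeps the remote modeling code and the weights in step, so a later push to the main branch does not silently change behavior.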