
ChatGLM2-6B

Parameters

6B

Context Length

32,768

Modality

Text

Architecture

Dense

License

Custom License (ChatGLM2-6B License)

Release Date

25 Jun 2023

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Multi-Query Attention

Attention Heads

32

Key-Value Heads

2

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

-

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

4,096

Number of Layers

28

FFN Intermediate Size (Dense)

13,696

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

65,024

Architecture Diagram

Input Tokens → Token Embedding (position: RoPE; hidden: 4,096; context: 32,768; vocab: 65,024) → 28 layers of [RMSNorm (pre-attention) → Multi-Query Attention (32 Q / 2 KV heads, head dim 128) → residual add → RMSNorm (pre-FFN) → Feed-Forward Network (SwiGLU, intermediate 13,696) → residual add] → Final RMSNorm → Output Logits

ChatGLM2-6B

ChatGLM2-6B is a bilingual large language model built for conversational use in both Chinese and English. It is the second iteration of the ChatGLM series developed by THUDM, is based on the General Language Model (GLM) framework, and targets dialogue generation and cross-lingual text processing. Its architectural choices keep memory use low enough to run on consumer-grade hardware, which makes it accessible to developers and researchers working in hardware-constrained environments.

The architecture is a dense transformer with several changes from its predecessor. The main one is Multi-Query Attention (MQA), which shares a small set of key and value heads (2 in this model) across the 32 query heads, shrinking the KV cache and speeding up inference. The model also uses Rotary Position Embeddings (RoPE) to encode relative token positions and RMSNorm for training stability. FlashAttention was used during pre-training, which made it practical to extend the context window to 32,768 tokens and so process long dialogue histories.
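To make the KV cache saving concrete, the following is a rough sketch that estimates cache size from the figures in the specifications above (28 layers, head dimension 128, 32 query heads, 2 key-value heads). The function name `kv_cache_bytes`, the FP16 assumption (2 bytes per value), and the batch size of 1 are illustrative assumptions, not values taken from the model's code.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """Estimate bytes needed to cache keys and values across all layers."""
    # Factor of 2: one cached tensor for keys and one for values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value * batch


GIB = 1024 ** 3
FULL_CONTEXT = 32_768  # context length listed in the specifications

# Cache with the 2 shared key-value heads listed for ChatGLM2-6B.
shared = kv_cache_bytes(layers=28, kv_heads=2, head_dim=128, seq_len=FULL_CONTEXT)

# Hypothetical baseline: one key-value head per query head (32 of them).
per_head = kv_cache_bytes(layers=28, kv_heads=32, head_dim=128, seq_len=FULL_CONTEXT)

print(f"2 KV heads:  {shared / GIB:.3f} GiB")
print(f"32 KV heads: {per_head / GIB:.3f} GiB")
print(f"reduction:   {per_head // shared}x")
```

Under these assumptions the cache scales linearly with the number of key-value heads, so going from 32 to 2 heads cuts it by a factor of 16 at any context length.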

Operating with 6 billion parameters, ChatGLM2-6B provides a balanced profile of performance and efficiency. It was pre-trained on a diverse dataset comprising 1.4 trillion tokens and refined through human preference alignment to enhance its conversational quality. The model is particularly suited for applications such as intelligent virtual assistants and localized chatbots, where low-latency inference and bilingual proficiency are primary requirements. Its open-weights nature and support for INT4 quantization further expand its utility for local deployment and integration into specialized NLP pipelines.

About ChatGLM

The ChatGLM series of models from Z.ai, based on the GLM architecture.


Evaluation Benchmarks

Rank

#156

Benchmark | Score | Rank

Web Development

WebDev Arena | 1024 | 91

Rankings

Overall Rank

#156

Coding Rank

#123

Model Integrity

Total Score

B-

62 / 100

ChatGLM2-6B Model Integrity Report

Audit Note

ChatGLM2-6B exhibits strong transparency in its architectural framework and hardware requirements, providing clear documentation on its transition to Multi-Query Attention and its suitability for consumer-grade GPUs. However, the model suffers from significant opacity regarding its training data sources and compute resources, relying on vague descriptions of data quality rather than verifiable composition breakdowns. While the open-weights nature and accessible tokenizer support developer integration, the use of a restrictive custom license and lack of detailed evaluation methodologies hinder its standing as a fully transparent open-source project.

Upstream

19.5 / 30

Architectural Provenance

7.5 / 10

The model is explicitly identified as the second generation of the ChatGLM series, built on the General Language Model (GLM) framework. Technical documentation and the official GitHub repository detail significant architectural modifications from the first generation, including the adoption of Multi-Query Attention (MQA) for KV cache efficiency, Rotary Position Embeddings (RoPE), and RMSNorm for stability. While the pre-training objective (hybrid GLM objective) is named, the specific layer-by-layer configuration is primarily accessible through the open-source code rather than a formal peer-reviewed technical paper for this specific version.

Dataset Composition

3.5 / 10

The provider discloses that the model was pre-trained on 1.4 trillion bilingual (Chinese and English) tokens. However, there is no detailed breakdown of the dataset composition (e.g., percentages of web, code, or books). The specific sources of the data are not named, and the filtering or cleaning methodologies are described only in vague terms like 'more and better data' or 'high-quality data.' No sample data or specific data collection protocols are publicly available.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via the official repository and Hugging Face (tokenization_chatglm.py). It uses a SentencePiece-based approach with a documented vocabulary size of 64,793 tokens. The implementation details, including special tokens like <bos>, <eos>, and <pad>, are clearly defined in the source code. It is specifically optimized for bilingual support, though community issues have noted minor inconsistencies between the tokenizer's vocabulary size and the one in the model config (the specifications above list 65,024).

Model

23.0 / 40

Parameter Density

7.0 / 10

The model is clearly stated to have 6 billion parameters. As a dense transformer architecture, the active parameters equal the total parameters. The architectural choices, such as the use of Multi-Query Attention, provide a clear understanding of how parameters are distributed across the attention mechanism versus feed-forward networks. However, a precise numerical breakdown of parameter counts per component (e.g., exact FFN vs. Attention split) is not explicitly provided in summary documentation, though it can be derived from the code.
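As a rough illustration of how such a breakdown can be derived, the sketch below counts parameters from the dimensions listed in the specifications above. The `Spec` class and `param_breakdown` function are hypothetical names. The sketch also assumes a bias on the fused query/key/value projection only, a gated (SwiGLU) feed-forward block with no biases, and separate input-embedding and output matrices. These are assumptions to check against the released code, not facts stated in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Dimensions copied from the specification table above."""
    hidden: int = 4096
    layers: int = 28
    q_heads: int = 32
    kv_heads: int = 2
    head_dim: int = 128
    ffn: int = 13696
    vocab: int = 65024


def param_breakdown(s: Spec = Spec()) -> dict:
    q_width = s.q_heads * s.head_dim        # query projection width
    kv_width = 2 * s.kv_heads * s.head_dim  # key plus value projection width
    # Fused QKV projection: weights plus an assumed bias vector.
    qkv = s.hidden * (q_width + kv_width) + (q_width + kv_width)
    attn_out = q_width * s.hidden           # attention output projection
    attention = (qkv + attn_out) * s.layers
    # SwiGLU: gate and up projections (2 * ffn wide), then a down projection.
    feed_forward = (s.hidden * 2 * s.ffn + s.ffn * s.hidden) * s.layers
    # Two RMSNorm weight vectors per layer plus the final RMSNorm.
    norms = 2 * s.hidden * s.layers + s.hidden
    # Assumed untied input embedding and output projection.
    embeddings = 2 * s.vocab * s.hidden
    parts = {
        "attention": attention,
        "feed_forward": feed_forward,
        "norms": norms,
        "embeddings": embeddings,
    }
    parts["total"] = sum(parts.values())
    return parts


breakdown = param_breakdown()
for name, count in breakdown.items():
    print(f"{name:>12}: {count:>13,} ({count / breakdown['total']:.1%})")
```

Under these assumptions the total lands a little above 6 billion, with the feed-forward blocks holding most of the parameters and attention a much smaller share, which is what sharing key-value heads would lead one to expect.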

Training Compute

2.0 / 10

There is almost no transparency regarding the training compute. The provider does not disclose the total GPU/TPU hours, the specific hardware cluster used for the 1.4T token training, or the training duration. Environmental impact metrics, such as carbon footprint or energy consumption, are entirely absent from official documentation.

Benchmark Reproducibility

5.0 / 10

The provider reports scores on several standard benchmarks (MMLU, C-Eval, GSM8K, BBH) and provides a comparison to the previous version. While the benchmarks are named, the exact evaluation prompts, few-shot settings, and specific versions of the datasets used are not fully documented in a reproducible format. Third-party verification is available through public leaderboards, but the lack of an official evaluation script or detailed methodology limits full reproducibility.

Identity Consistency

9.0 / 10

The model consistently identifies itself as ChatGLM2-6B or an AI assistant developed by the GLM team. It maintains a clear versioning identity distinct from its predecessor and successor (ChatGLM3). There are no widespread reports of the model claiming to be a competitor's product (like GPT-4) or denying its nature as an AI, though its self-knowledge is limited to what was included in its alignment training.

Downstream

19.0 / 30

License Clarity

6.5 / 10

The model uses a custom 'ChatGLM2-6B License.' The weights are open for academic research, and commercial use is free of charge but requires completing a registration questionnaire. The license also restricts use cases that might 'undermine China's national security,' which introduces some legal ambiguity for international users compared with standard OSI-approved licenses such as Apache 2.0.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented. The provider explicitly states VRAM needs for different precision levels, noting that 6GB of VRAM is sufficient for INT4 quantization. Documentation includes specific performance gains from MQA and FlashAttention. Quantization impact is acknowledged, and community-driven tools (like the Hugging Face Model Memory Utility) provide further verifiable data on memory scaling.
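For a back-of-the-envelope check of figures like these, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes a nominal 6 billion parameters and that every weight is stored at the stated bit width. Real quantized checkpoints may keep some tensors at higher precision, and activations, the KV cache, and framework overhead come on top, so treat the numbers as lower bounds. The names `BITS_PER_WEIGHT` and `weight_gib` are illustrative.

```python
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}
GIB = 1024 ** 3
NOMINAL_PARAMS = 6e9  # nominal parameter count from the specifications


def weight_gib(params: float, precision: str) -> float:
    """Memory for the weights alone, in GiB, at the given precision."""
    return params * BITS_PER_WEIGHT[precision] / 8 / GIB


for precision in BITS_PER_WEIGHT:
    print(f"{precision}: {weight_gib(NOMINAL_PARAMS, precision):.2f} GiB (weights only)")
```

On this estimate INT4 weights occupy roughly half of a 6GB card, which leaves room for activations and the KV cache and is consistent with the stated 6GB figure.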

Versioning Drift

4.5 / 10

The model follows a clear generational versioning (ChatGLM -> ChatGLM2 -> ChatGLM3). However, within the ChatGLM2-6B lifecycle, updates to checkpoints and code are often pushed to the main branch of the repository without rigorous semantic versioning or detailed changelogs for minor revisions. This makes tracking silent performance drift or behavioral changes difficult for developers relying on the latest 'main' branch.
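One way to guard against this kind of drift is to pin a specific commit when loading, using the `revision` argument that Hugging Face `from_pretrained` methods accept. The sketch below is a minimal helper under stated assumptions: the repository id `THUDM/chatglm2-6b` is assumed, `pinned_load_kwargs` is a hypothetical name, and the commit hash is a placeholder to be replaced with one you have audited.

```python
import re

MODEL_ID = "THUDM/chatglm2-6b"  # assumed Hugging Face repository id


def pinned_load_kwargs(revision: str) -> dict:
    """Build from_pretrained keyword arguments pinned to one commit."""
    # Require a full commit hash: branch names such as "main" can move.
    if not re.fullmatch(r"[0-9a-f]{40}", revision):
        raise ValueError("revision must be a full 40-character commit hash")
    # trust_remote_code is needed because the model ships custom code;
    # pinning the revision also pins that code.
    return {"revision": revision, "trust_remote_code": True}


# Usage (requires the transformers package and network access):
# from transformers import AutoModel, AutoTokenizer
# kwargs = pinned_load_kwargs("<40-character commit hash you have audited>")
# tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, **kwargs)
# model = AutoModel.from_pretrained(MODEL_ID, **kwargs)
```

Rejecting branch names is a deliberate design choice here: it turns an accidental dependency on the moving 'main' branch into an immediate error instead of a silent behavioral change.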
