Parameters: 6B
Context Length: 32,768 tokens
Modality: Text
Architecture: Dense
License: ChatGLM3-6B Model License
Release Date: 27 Oct 2023
Knowledge Cutoff: -
Attention
Attention Structure: Multi-Query Attention
Attention Heads: 32
Key-Value Heads: 2
Attention Head Dimension: 128
Position Embedding: Rotary Position Embedding (RoPE)
RoPE Theta: -
Sliding Window Attention: No
Sliding Window Size: -
Normalization: RMS Normalization
Activation Function: SwiGLU
Dimensions
Hidden Dimension Size: 4,096
Number of Layers: 28
FFN Intermediate Size (Dense): 13,696
Multi-Token Prediction Heads: -
Tokenizer
Vocabulary Size: 65,024
ChatGLM3-6B-32K is a large language model optimized for long-context understanding and generation. Developed jointly by Zhipu AI and Tsinghua University's KEG Lab, it is a variant of ChatGLM3-6B that extends the context window to 32,768 tokens. The longer window lets the model process full documents, long-form dialogues, and complex technical texts that exceed the 8K context of the base ChatGLM3-6B model.
The architecture is a 28-layer dense transformer. To stay stable and efficient over the extended context, it uses RMSNorm for normalization and Multi-Query Attention (MQA), which shrinks the key-value cache and speeds up inference. The main change in this variant is an updated Rotary Position Embedding (RoPE) configuration: a scaling factor on the base frequency (rope_ratio) preserves positional resolution across 32K tokens. The model is also trained with a long-text strategy applied during the conversation stage, although that procedure is described only in general terms.
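To make the rope_ratio idea concrete, the following is a minimal sketch of how scaling the RoPE base changes the per-dimension rotation frequencies. It assumes the common formulation inv_freq_i = 1 / base^(2i/d) with a default base of 10,000, that rope_ratio simply multiplies that base, and that rotation is applied across the full 128-dimensional head. The value 50 and the function names are illustrative assumptions, not figures taken from this page.

```python
import math

def rope_inv_freq(head_dim, base=10000.0, rope_ratio=1.0):
    """Inverse rotation frequencies, one per pair of head dimensions.

    Assumption: rope_ratio scales the base, so effective_base = base * rope_ratio.
    """
    effective_base = base * rope_ratio
    return [1.0 / (effective_base ** (i / head_dim)) for i in range(0, head_dim, 2)]

def longest_wavelength(inv_freq):
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    return 2 * math.pi / min(inv_freq)

default_freqs = rope_inv_freq(128)                  # unscaled base
scaled_freqs = rope_inv_freq(128, rope_ratio=50.0)  # hypothetical ratio

print(f"longest wavelength, default base: {longest_wavelength(default_freqs):,.0f} positions")
print(f"longest wavelength, scaled base:  {longest_wavelength(scaled_freqs):,.0f} positions")
```

A larger base slows every rotation except the first pair, so distant positions remain distinguishable over a longer window. That is the effect the rope_ratio adjustment is described as targeting.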
ChatGLM3-6B-32K natively supports tool invocation through function calling, code execution through an integrated code interpreter, and agent-based tasks. These features make it suitable for building AI agents that perform deep text analysis and multi-step reasoning. The weights are open for academic research, and free commercial use is permitted after completing a formal registration process.
The ChatGLM series of models from Z.ai is based on the GLM architecture.
No evaluation benchmarks are available for ChatGLM3-6B-32K.
Overall Rank: -
Coding Rank: -
Total Score: 62 / 100
ChatGLM3-6B-32K demonstrates strong transparency regarding its architecture and deployment requirements, providing clear guidance for local execution on consumer hardware. However, it suffers from significant opacity in its training data composition and compute resources, relying on vague descriptions of 'diverse data' rather than verifiable metrics. The use of a custom, registration-gated license for commercial use further complicates its status as a truly open model.
Architectural Provenance
The model is explicitly identified as a variant of the ChatGLM3-6B architecture, which is a 28-layer dense transformer. Technical documentation and the associated paper (GLM: General Language Model) describe the core architecture, including the use of RMSNorm and Multi-Query Attention (MQA). This specific 32K variant documents its modifications to the Rotary Position Embedding (RoPE) mechanism, specifically the adjustment of the base frequency (rope_ratio) to support extended context. However, the exact 'specialized methodology' for long-text coherence training is described in general terms rather than full procedural detail.
Dataset Composition
While the technical report for the ChatGLM family mentions a pre-training corpus of approximately 10 trillion tokens and identifies general sources such as books and Wikipedia, specific proportions for the ChatGLM3-6B-32K variant's training data are not disclosed. The documentation uses vague terms like 'more diverse training dataset' and 'more sufficient training steps' without providing a verifiable breakdown of the data mixture or specific filtering/cleaning protocols used for this version.
Tokenizer Integrity
The tokenizer is publicly accessible via the official GitHub repository and Hugging Face (tokenization_chatglm.py). It is a SentencePiece-based tokenizer with a vocabulary size of 65,024 tokens, matching the figure listed in the model specification. The implementation is well-documented, and the vocabulary size and special tokens (e.g., [MASK], [gMASK]) are explicitly defined and verifiable through the provided source code.
Parameter Density
The model is clearly defined as a 6 billion parameter dense transformer. Architectural details such as the number of layers (28), hidden dimension size (4096), and the use of MQA are public. While it is a dense model (so active parameters equal total parameters), the documentation is transparent about the trade-offs made to maintain this size while extending context, such as increasing FFN parameters to compensate for the reduced parameter count in the attention mechanism (GQA/MQA).
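The published dimensions can be cross-checked against the 6B figure with a rough weight count. The sketch below assumes untied input and output embeddings and a fused gate-and-up projection for SwiGLU, and it ignores biases and normalization weights. These are simplifying assumptions rather than details confirmed by this page, and the function name is illustrative.

```python
def estimate_params(hidden=4096, layers=28, ffn=13696, vocab=65024,
                    heads=32, kv_heads=2, head_dim=128):
    """Approximate weight count from the listed dimensions (biases and norms ignored)."""
    q_proj = hidden * heads * head_dim           # one query projection per attention head
    kv_proj = hidden * 2 * kv_heads * head_dim   # keys and values shared by 2 KV heads
    out_proj = heads * head_dim * hidden         # attention output projection
    attention = q_proj + kv_proj + out_proj

    # SwiGLU: gate and up projections (assumed fused), then the down projection.
    mlp = hidden * 2 * ffn + ffn * hidden

    embeddings = 2 * vocab * hidden              # input embedding + output head (assumed untied)
    total = layers * (attention + mlp) + embeddings
    return {
        "attention_per_layer": attention,
        "mlp_per_layer": mlp,
        "embeddings": embeddings,
        "total": total,
    }

counts = estimate_params()
print(f"total weights: {counts['total'] / 1e9:.2f}B")
print(f"MLP share per layer: {counts['mlp_per_layer'] / (counts['attention_per_layer'] + counts['mlp_per_layer']):.0%}")
```

Under these assumptions the total lands at roughly 6.2 billion weights, with the feed-forward block accounting for most of each layer. That is consistent with the trade-off described above: fewer key-value heads, and a wider FFN.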
Training Compute
There is almost no specific disclosure regarding the compute resources used for the ChatGLM3-6B-32K variant. While the general GLM papers discuss hardware for larger models (e.g., A100 clusters for GLM-130B), the specific GPU hours, hardware configuration, training duration, and carbon footprint for this 32K variant are not provided. Claims of 'more sufficient training steps' are unverifiable without these metrics.
Benchmark Reproducibility
The model provides scores for standard benchmarks like LongBench, MMLU, and C-Eval. However, the evaluation code and exact prompts used for the 32K variant's long-context testing are not fully public. While some general evaluation results are shared in the README, the lack of a comprehensive, reproducible evaluation suite for the extended context capabilities limits third-party verification.
Identity Consistency
The model consistently identifies itself as part of the ChatGLM3 family and is transparent about its specific role as a long-context (32K) variant. It does not exhibit identity confusion with other major models (like GPT-4) and clearly states its versioning (e.g., v1.1.0 in some distributions). Documentation explicitly guides users on when to use the 8K vs. 32K versions based on task requirements.
License Clarity
The model uses a custom 'ChatGLM3-6B Model License'. While it is described as 'open source' in marketing, it is actually an open-weights model with a restrictive license that requires a formal registration via a questionnaire for commercial use. This creates a hurdle for transparency compared to standard OSI licenses like Apache 2.0. The terms are governed by the laws of the People's Republic of China, which may introduce legal ambiguity for international users.
Hardware Footprint
Hardware requirements are well-documented. VRAM estimates for different quantization levels (FP16, INT8, INT4) are provided, with specific guidance that INT4 allows deployment on consumer GPUs with as little as 6GB-13GB VRAM depending on context usage. The impact of context length on memory scaling is acknowledged, and the repository provides tools for local quantization and deployment (e.g., chatglm.cpp).
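How memory scales with quantization and context length can be approximated from the listed dimensions. The sketch below is a back-of-the-envelope estimate, assuming roughly 6.2 billion weights, a key-value cache stored in 16-bit precision, and no allowance for activations or framework overhead. The function names and the parameter count are illustrative assumptions, so real deployments will need more memory than these figures suggest.

```python
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
GIB = 2 ** 30

def weight_bytes(n_params, quant):
    """Memory for the model weights at a given quantization level."""
    return n_params * BYTES_PER_WEIGHT[quant]

def kv_cache_bytes(tokens, layers=28, kv_heads=2, head_dim=128, bytes_per_value=2):
    """Key-value cache size: one K and one V vector per layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

def estimate_gib(n_params, quant, tokens):
    """Weights plus KV cache, in GiB (activations and runtime overhead excluded)."""
    return (weight_bytes(n_params, quant) + kv_cache_bytes(tokens)) / GIB

N_PARAMS = 6.2e9  # assumed approximate weight count
for quant in ("fp16", "int8", "int4"):
    for tokens in (1024, 8192, 32768):
        print(f"{quant:>4} @ {tokens:>6} tokens: {estimate_gib(N_PARAMS, quant, tokens):5.2f} GiB")
```

With two key-value heads, the cache costs 28,672 bytes per token in this model, so a full 32,768-token context adds 0.875 GiB at 16-bit precision. The same formula with 32 key-value heads would need sixteen times as much, which is why the reduced KV head count matters for long-context deployment on consumer GPUs.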
Versioning Drift
The model uses basic versioning (e.g., ChatGLM3-6B-32K), and the GitHub repository tracks changes. However, there is no detailed, formal changelog or semantic versioning system that documents specific weight updates or behavioral drift over time. Updates are often released as new model cards on Hugging Face without comprehensive delta documentation.