ChatGLM3-6B-32K

Parameters

6B

Context Length

32,768 tokens

Modality

Text

Architecture

Dense

License

ChatGLM3-6B Model License

Release Date

27 Oct 2023

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Multi-Query Attention (MQA, 2 key-value groups)

Attention Heads

32

Key-Value Heads

2

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

-

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

4,096

Number of Layers

28

FFN Intermediate Size (Dense)

13,696

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

65,024

Architecture Diagram

[Diagram: Input Tokens → Token Embedding (position: RoPE; hidden: 4.1k; context: 32.8k; vocab: 65k) → 28 layers of [RMSNorm (pre-attention) → Attention (32 Q / 2 KV heads, head dim: 128) → residual add → RMSNorm (pre-FFN) → Feed-Forward Network (SwiGLU, intermediate: 13.7k) → residual add] → Final RMSNorm → Output Logits]

ChatGLM3-6B-32K

ChatGLM3-6B-32K is a large language model optimized for long-context understanding and generation. Developed jointly by Zhipu AI and Tsinghua University's KEG Lab, it is a variant of ChatGLM3-6B that extends the context window to 32,768 tokens. The larger window lets the model process full documents, long-form dialogues, and complex technical texts that do not fit in the 8K context of the base model.

The model is a 28-layer dense transformer. To stay stable and efficient across the extended context, it uses RMSNorm for normalization and Multi-Query Attention (MQA), in which the 32 query heads share 2 key-value heads, to reduce inference cost. The main change in this variant is an updated Rotary Position Embedding (RoPE) mechanism, which scales the base frequency by a factor (rope_ratio) so that positions remain distinguishable over 32K tokens. The model is also trained with a method that targets long-text coherence during the conversation stage.
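The effect of scaling the RoPE base can be sketched in a few lines of plain Python. This is a minimal illustration, not the model's actual implementation: the function names, the pairing of consecutive dimensions, the 64 rotated dimensions, and the rope_ratio value of 50 are all assumptions chosen for the example. It shows that a larger base stretches the longest rotation wavelength, and that the attention score between a rotated query and key depends only on their relative distance.

```python
import math


def rope_inv_freq(n_elem, base=10000.0, rope_ratio=1.0):
    """Inverse frequency for each rotated pair of dimensions.

    rope_ratio multiplies the base; the value used for any real
    checkpoint is an assumption here and should be read from its config.
    """
    scaled_base = base * rope_ratio
    return [scaled_base ** (-i / n_elem) for i in range(0, n_elem, 2)]


def apply_rope(vec, pos, inv_freq):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos * inv_freq[i]."""
    out = []
    for i, freq in enumerate(inv_freq):
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(pos * freq), math.sin(pos * freq)
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


def longest_wavelength(inv_freq):
    """Number of positions before the slowest-rotating pair completes a cycle."""
    return 2 * math.pi / min(inv_freq)


if __name__ == "__main__":
    plain = rope_inv_freq(64)                    # unscaled base
    scaled = rope_inv_freq(64, rope_ratio=50.0)  # hypothetical ratio
    print("longest wavelength, unscaled:", round(longest_wavelength(plain)))
    print("longest wavelength, scaled:  ", round(longest_wavelength(scaled)))
```

Raising the base slows the rotation of every pair except the first, so distant positions are less likely to produce nearly identical rotation angles within a long window.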

ChatGLM3-6B-32K natively supports tool invocation through function calling, code execution via an integrated code interpreter, and agent-based tasks. These features make it suitable for building AI agents that perform deep text analysis and multi-step reasoning. The model's weights are open for academic research and are available for free commercial use after a formal registration process.

About ChatGLM

The ChatGLM series of models from Z.ai is based on the GLM architecture.

Evaluation Benchmarks

No evaluation benchmarks for ChatGLM3-6B-32K available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B-

62 / 100

ChatGLM3-6B-32K Model Integrity Report

Audit Note

ChatGLM3-6B-32K demonstrates strong transparency regarding its architecture and deployment requirements, providing clear guidance for local execution on consumer hardware. However, it suffers from significant opacity in its training data composition and compute resources, relying on vague descriptions of 'diverse data' rather than verifiable metrics. The use of a custom, registration-gated license for commercial use further complicates its status as a truly open model.

Upstream

20.0 / 30

Architectural Provenance

7.5 / 10

The model is explicitly identified as a variant of the ChatGLM3-6B architecture, which is a 28-layer dense transformer. Technical documentation and the associated paper (GLM: General Language Model) describe the core architecture, including the use of RMSNorm and Multi-Query Attention (MQA). This specific 32K variant documents its modifications to the Rotary Position Embedding (RoPE) mechanism, specifically the adjustment of the base frequency (rope_ratio) to support extended context. However, the exact 'specialized methodology' for long-text coherence training is described in general terms rather than full procedural detail.

Dataset Composition

4.0 / 10

While the technical report for the ChatGLM family mentions a pre-training corpus of approximately 10 trillion tokens and identifies general sources such as books and Wikipedia, specific proportions for the ChatGLM3-6B-32K variant's training data are not disclosed. The documentation uses vague terms like 'more diverse training dataset' and 'more sufficient training steps' without providing a verifiable breakdown of the data mixture or specific filtering/cleaning protocols used for this version.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via the official GitHub repository and Hugging Face (tokenization_chatglm.py). It is a SentencePiece-based tokenizer with a vocabulary size of 65,024 tokens, consistent with the vocabulary size listed in the technical specifications. The implementation is well-documented, and the vocabulary size and special tokens (e.g., [MASK], [gMASK]) are explicitly defined and verifiable through the provided source code.

Model

22.5 / 40

Parameter Density

7.0 / 10

The model is clearly defined as a 6 billion parameter dense transformer. Architectural details such as the number of layers (28), hidden dimension size (4096), and the use of MQA are public. While it is a dense model (so active parameters equal total parameters), the documentation is transparent about the trade-offs made to maintain this size while extending context, such as increasing FFN parameters to compensate for the reduced parameter count in the attention mechanism (GQA/MQA).
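The published dimensions can be cross-checked against the 6B figure with a rough weight count. The sketch below is an estimate under stated assumptions: it counts only the large weight matrices (no biases or normalization weights), treats the SwiGLU feed-forward block as three hidden-by-intermediate matrices, and assumes separate input and output embedding tables. The function name and its parameters are hypothetical.

```python
def estimate_params(hidden=4096, n_layers=28, n_q_heads=32, n_kv_heads=2,
                    head_dim=128, ffn=13696, vocab=65024,
                    tied_embeddings=False):
    """Approximate weight count from the spec sheet's dimensions.

    Biases and RMSNorm weights are ignored; whether the embedding
    tables are tied is an assumption controlled by tied_embeddings.
    """
    q_proj = hidden * n_q_heads * head_dim        # query projection
    kv_proj = 2 * hidden * n_kv_heads * head_dim  # shared key and value projections
    out_proj = n_q_heads * head_dim * hidden      # attention output projection
    attention = q_proj + kv_proj + out_proj
    feed_forward = 3 * hidden * ffn               # gate, up, and down matrices
    embeddings = vocab * hidden * (1 if tied_embeddings else 2)
    total = n_layers * (attention + feed_forward) + embeddings
    return {
        "attention_per_layer": attention,
        "ffn_per_layer": feed_forward,
        "embeddings": embeddings,
        "total": total,
    }


if __name__ == "__main__":
    counts = estimate_params()
    for name, value in counts.items():
        print(f"{name}: {value:,}")
```

Under these assumptions the total comes to about 6.24 billion weights, and each layer's feed-forward block holds roughly 4.7 times as many weights as its attention block, which matches the trade-off described above.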

Training Compute

2.0 / 10

There is almost no specific disclosure regarding the compute resources used for the ChatGLM3-6B-32K variant. While the general GLM papers discuss hardware for larger models (e.g., A100 clusters for GLM-130B), the specific GPU hours, hardware configuration, training duration, and carbon footprint for this 32K variant are not provided. Claims of 'more sufficient training steps' are unverifiable without these metrics.

Benchmark Reproducibility

4.5 / 10

The model provides scores for standard benchmarks like LongBench, MMLU, and C-Eval. However, the evaluation code and exact prompts used for the 32K variant's long-context testing are not fully public. While some general evaluation results are shared in the README, the lack of a comprehensive, reproducible evaluation suite for the extended context capabilities limits third-party verification.

Identity Consistency

9.0 / 10

The model consistently identifies itself as part of the ChatGLM3 family and is transparent about its specific role as a long-context (32K) variant. It does not exhibit identity confusion with other major models (like GPT-4) and clearly states its versioning (e.g., v1.1.0 in some distributions). Documentation explicitly guides users on when to use the 8K vs. 32K versions based on task requirements.

Downstream

19.0 / 30

License Clarity

6.0 / 10

The model uses a custom 'ChatGLM3-6B Model License'. While it is described as 'open source' in marketing, it is actually an open-weights model with a restrictive license that requires a formal registration via a questionnaire for commercial use. This creates a hurdle for transparency compared to standard OSI licenses like Apache 2.0. The terms are governed by the laws of the People's Republic of China, which may introduce legal ambiguity for international users.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented. VRAM estimates for different quantization levels (FP16, INT8, INT4) are provided, with specific guidance that INT4 allows deployment on consumer GPUs with as little as 6GB-13GB VRAM depending on context usage. The impact of context length on memory scaling is acknowledged, and the repository provides tools for local quantization and deployment (e.g., chatglm.cpp).
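How context length drives memory can be illustrated with a key-value cache estimate built from the attention figures in the specification table (28 layers, 2 key-value heads, head dimension 128). This is a back-of-the-envelope sketch, not the calculator used on this page: the function name is hypothetical, and it assumes a 2-byte (FP16) cache and ignores weights, activations, and framework overhead.

```python
def kv_cache_bytes(seq_len, n_layers=28, n_kv_heads=2, head_dim=128,
                   bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 covers the separate key and value tensors;
    bytes_per_elem=2 assumes an FP16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    for tokens in (1024, 16384, 32768):
        mib = kv_cache_bytes(tokens) / 2**20
        print(f"{tokens:>6} tokens: {mib:.0f} MiB")
    # Same model shape if every one of the 32 query heads had its own KV head
    full = kv_cache_bytes(32768, n_kv_heads=32) / 2**30
    print(f"32768 tokens with 32 KV heads: {full:.0f} GiB")
```

With 2 key-value heads the cache at the full 32,768-token window is 896 MiB per sequence, against 14 GiB if all 32 heads kept their own keys and values, which is why the reduced key-value head count matters for long-context deployment on consumer GPUs.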

Versioning Drift

5.0 / 10

The model uses basic versioning (e.g., ChatGLM3-6B-32K), and the GitHub repository tracks changes. However, there is no detailed, formal changelog or semantic versioning system that documents specific weight updates or behavioral drift over time. Updates are often released as new model cards on Hugging Face without comprehensive delta documentation.
