Parameters: 9B
Context Length: 128K
Modality: Multimodal
Architecture: Dense
License: MIT License
Release Date: 15 Jan 2024
Knowledge Cutoff: -
Attention Structure: Multi-Head Attention
Hidden Dimension Size: 4096
Number of Layers: 40
Attention Heads: 32
Key-Value Heads: 32
Activation Function: -
Normalization: -
Position Embedding: Absolute Position Embedding
VRAM requirements vary with the quantization method applied to the model weights and with the context size; a rough estimate can be derived from the specification table above, as sketched below.
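As a back-of-the-envelope illustration (not an official calculator), the following Python sketch estimates VRAM from the table above: weight memory scales with parameter count times bits per weight, and the KV cache scales with layers × KV heads × head dimension × context length. The overhead factor and the quantization labels are assumptions for illustration.

```python
def estimate_vram_gib(
    params_b: float = 9.0,         # parameter count in billions (from the table)
    bits_per_weight: float = 4.0,  # e.g. 16 (FP16), 8 (Q8), 4 (Q4)
    context_tokens: int = 1024,
    num_layers: int = 40,          # from the table
    kv_heads: int = 32,            # from the table
    head_dim: int = 4096 // 32,    # hidden size / attention heads = 128
    kv_bytes: int = 2,             # FP16 KV cache (assumption)
    overhead: float = 1.1,         # ~10% for activations/buffers (assumption)
) -> float:
    """Rough VRAM estimate in GiB for model weights plus KV cache."""
    weight_bytes = params_b * 1e9 * bits_per_weight / 8
    # KV cache: one K and one V tensor per layer, per token
    kv_cache_bytes = 2 * num_layers * kv_heads * head_dim * context_tokens * kv_bytes
    return (weight_bytes + kv_cache_bytes) * overhead / 2**30

for bits, name in [(16, "FP16"), (8, "Q8"), (4, "Q4")]:
    print(f"{name}: ~{estimate_vram_gib(bits_per_weight=bits):.1f} GiB at 1K context")
```

At 4-bit quantization this works out to roughly 5 GiB at a 1K context; longer contexts grow the KV-cache term linearly.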
GLM-4V, developed by Z.ai, is the multimodal member of the GLM-4 series, designed to process and interpret high-resolution image and video data alongside textual input. The architecture integrates visual and linguistic features deeply enough that the model can perform complex multimodal tasks without degrading its natural language processing capabilities, providing a unified framework for understanding diverse data modalities.
Technically, GLM-4V combines a Visual Encoder, an MLP Projector, and a Language Decoder. The Visual Encoder, typically a modified Vision Transformer (ViT), processes images and videos at arbitrary aspect ratios and resolutions up to 4K. The MLP Projector translates visual features into a format compatible with the language model, and the stack may incorporate techniques such as 3D rotary position embedding (3D-RoPE) for enhanced spatial understanding. The Language Decoder is based on the underlying GLM architecture and generates coherent textual responses by integrating the processed visual and textual information.
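To make the three-stage pipeline concrete, here is a minimal PyTorch sketch. Only the 4096 hidden size, 32 attention heads, and layer count come from the table above; the module depths, vision dimension, and vocabulary size are placeholder assumptions, and this is not Z.ai's implementation.

```python
import torch
import torch.nn as nn

class GLM4VStylePipeline(nn.Module):
    """Illustrative visual-encoder -> MLP-projector -> decoder pipeline.

    Hidden size (4096) and head count (32) follow the spec table; all other
    dimensions and depths are simplified stand-ins, not Z.ai's implementation.
    Causal masking is omitted for brevity.
    """

    def __init__(self, vision_dim=1792, hidden_dim=4096, vocab_size=151552):
        super().__init__()
        # Stand-in for the modified ViT visual encoder (real one is far deeper).
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=16, batch_first=True),
            num_layers=2,
        )
        # MLP projector: maps visual features into the LM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )
        self.token_embedding = nn.Embedding(vocab_size, hidden_dim)
        # Stand-in for the 40-layer GLM language decoder.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=32, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_patches, input_ids):
        vis = self.projector(self.visual_encoder(image_patches))  # (B, P, 4096)
        txt = self.token_embedding(input_ids)                     # (B, T, 4096)
        seq = torch.cat([vis, txt], dim=1)  # visual tokens prepended to text
        return self.lm_head(self.decoder(seq))  # next-token logits
```

The key design point is that the projector is the only bridge between modalities: once visual features are mapped into the decoder's embedding space, image tokens and text tokens flow through the same transformer stack.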
GLM-4V is engineered to support a range of practical applications, including visual question answering, image captioning, and complex object detection. Its capabilities extend to video understanding, where it incorporates temporal information to analyze sequences effectively. The model's design focuses on enabling robust performance in tasks requiring both visual perception and advanced linguistic reasoning, such as interactive tutoring for STEM subjects or generating step-by-step solutions from visual problems.
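For example, a visual question answering call might look like the sketch below, which follows the usage pattern published for the glm-4v-9b checkpoint on Hugging Face; the image path and prompt are placeholders, and the exact API may differ between releases.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "THUDM/glm-4v-9b"  # GLM-4V checkpoint on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
).eval()

# "diagram.png" is a placeholder path for any local image.
image = Image.open("diagram.png").convert("RGB")
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": "Explain this diagram step by step."}],
    add_generation_prompt=True, tokenize=True, return_tensors="pt", return_dict=True,
).to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```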
GLM (General Language Model) is the model family from Z.ai. No evaluation benchmarks are available for GLM-4V, so it currently holds no overall or coding rank among local LLMs.