Total Parameters
28B
Context Length
131,072 tokens (128K)
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
30 Jun 2025
Knowledge Cutoff
-
Active Parameters
3.0B
Number of Experts
130
Active Experts
14
Attention Structure
Grouped-Query Attention
Hidden Dimension Size
-
Number of Layers
28
Attention Heads
20
Key-Value Heads
4
Activation Function
-
Normalization
-
Position Embedding
Rotary Position Embedding (RoPE)
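The attention figures above (20 query heads, 4 key/value heads) correspond to grouped-query attention, in which each key/value head is shared by a group of five query heads to shrink the KV cache. The sketch below illustrates the mechanism; the hidden size of 2,560 (20 heads × 128 dims) is an illustrative assumption, since the actual hidden dimension is not listed above.

```python
import torch

def grouped_query_attention(x, wq, wk, wv, num_heads=20, num_kv_heads=4):
    """GQA sketch: each of the 4 KV heads is shared by a group of 5 query heads."""
    bsz, seq_len, _ = x.shape
    head_dim = wq.shape[1] // num_heads
    group = num_heads // num_kv_heads  # 5 query heads per KV head

    q = (x @ wq).view(bsz, seq_len, num_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(bsz, seq_len, num_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(bsz, seq_len, num_kv_heads, head_dim).transpose(1, 2)

    # Broadcast each KV head to its group of query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    return (torch.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(bsz, seq_len, -1)

# Illustrative sizes (assumed): hidden = 2560, head_dim = 128.
hidden, head_dim = 2560, 128
x = torch.randn(1, 16, hidden)
wq = torch.randn(hidden, 20 * head_dim)
wk = torch.randn(hidden, 4 * head_dim)
wv = torch.randn(hidden, 4 * head_dim)
print(grouped_query_attention(x, wq, wk, wv).shape)  # torch.Size([1, 16, 2560])
```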
VRAM requirements for different quantization methods and context sizes
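As a rough guide, resident memory is dominated by the (quantized) weights plus the KV cache, and an MoE model must keep all 28B parameters in memory even though only about 3B are active per token. The sketch below is a back-of-the-envelope estimator under stated assumptions (head dimension of 128, bf16 KV cache); real deployments add activation and framework overhead on top of this.

```python
def estimate_vram_gb(total_params=28e9, bits_per_weight=4,
                     context_len=131_072, n_layers=28, n_kv_heads=4,
                     head_dim=128, kv_bytes=2):
    """Back-of-the-envelope VRAM estimate: quantized weights + bf16 KV cache.

    head_dim=128 is an assumption (the hidden size is not published above).
    All 28B MoE parameters must be resident, not just the ~3B active ones.
    """
    weights = total_params * bits_per_weight / 8                      # bytes
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_len
    return (weights + kv_cache) / 1e9                                  # decimal GB

for bits in (16, 8, 4, 2):
    print(f"{bits}-bit weights, 131K context: "
          f"~{estimate_vram_gb(bits_per_weight=bits):.1f} GB")
```

At full 131K context the bf16 KV cache alone contributes roughly 7.5 GB under these assumptions, which is why 4-bit weights (~14 GB) still land around 21 GB total.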
ERNIE-4.5-VL-28B-A3B is a member of Baidu's ERNIE 4.5 model family, a recent collection of large-scale multimodal foundation models. This variant is a lightweight vision-language model that processes both textual and visual inputs, targeting tasks such as image comprehension, text generation informed by visual context, and cross-modal reasoning. It is designed to balance performance with computational efficiency, making it suitable for enterprise applications and real-world deployments that require robust multimodal capabilities.
Architecturally, ERNIE-4.5-VL-28B-A3B is constructed upon a fine-grained Mixture-of-Experts (MoE) backbone, a key innovation across the ERNIE 4.5 series. This heterogeneous MoE structure facilitates joint training on textual and visual modalities. It incorporates modality-isolated routing and employs techniques such as router orthogonal loss and multimodal token-balanced loss to prevent interference between modalities and ensure effective representation and mutual reinforcement during training. The model further benefits from modality-specific post-training optimizations, including supervised fine-tuning, direct preference optimization, and Reinforcement Learning with Verifiable Rewards (RLVR), to enhance its performance in vision-language tasks. Visual inputs are processed by a variable-resolution Vision Transformer (ViT) encoder, with representations then projected into a shared embedding space via an adapter.
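As a concrete, heavily simplified illustration of modality-isolated routing, the toy layer below routes text tokens over a pool of text experts and vision tokens over a separate pool of vision experts, while a few shared experts process every token. Expert counts and dimensions are illustrative rather than the model's actual configuration, and the auxiliary objectives mentioned above (router orthogonal loss, multimodal token-balanced loss) are omitted.

```python
import torch


class ModalityIsolatedMoE(torch.nn.Module):
    """Toy modality-isolated MoE layer: text tokens are routed only over text
    experts, vision tokens only over vision experts, and a few shared experts
    see every token. Sizes are illustrative, not the model's actual config."""

    def __init__(self, hidden=64, n_text=4, n_vision=4, n_shared=2, top_k=2):
        super().__init__()
        make = lambda n: torch.nn.ModuleList(
            torch.nn.Linear(hidden, hidden) for _ in range(n)
        )
        self.text_experts = make(n_text)
        self.vision_experts = make(n_vision)
        self.shared_experts = make(n_shared)
        self.text_router = torch.nn.Linear(hidden, n_text)
        self.vision_router = torch.nn.Linear(hidden, n_vision)
        self.top_k = top_k

    def _route(self, x, router, experts):
        # Pick the top-k experts per token and mix their outputs.
        weights, idx = router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

    def forward(self, x, is_vision):
        # x: (tokens, hidden); is_vision: (tokens,) bool modality mask.
        out = sum(e(x) for e in self.shared_experts)
        text, vision = ~is_vision, is_vision
        if text.any():
            out[text] = out[text] + self._route(x[text], self.text_router, self.text_experts)
        if vision.any():
            out[vision] = out[vision] + self._route(x[vision], self.vision_router, self.vision_experts)
        return out


layer = ModalityIsolatedMoE()
tokens = torch.randn(10, 64)
modality = torch.tensor([False] * 6 + [True] * 4)  # 6 text, 4 vision tokens
print(layer(tokens, modality).shape)  # torch.Size([10, 64])
```

Because each modality has its own router and expert pool, gradients from image tokens never update the text experts' routing, which is the interference the isolation is meant to prevent.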
ERNIE-4.5-VL-28B-A3B supports both "thinking" and "non-thinking" modes, offering flexibility in how much explicit reasoning it performs. The model demonstrates proficiency in visual perception, document and chart understanding, and visual knowledge, maintaining strong performance across these tasks. Efficient inference is achieved through multi-expert parallel collaboration and convolutional code quantization, enabling 4-bit/2-bit lossless quantization for deployment across a range of hardware platforms. Its 131,072-token context window supports long-form inputs, extended conversations, and complex reasoning that combines textual knowledge with visual perception.
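A minimal inference sketch using Hugging Face transformers is shown below. It assumes the baidu/ERNIE-4.5-VL-28B-A3B-PT checkpoint (the post-trained chat variant) with trust_remote_code enabled; the message schema and the enable_thinking flag follow the pattern described on the model card but should be treated as assumptions, so consult the card for the authoritative preprocessing calls.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "baidu/ERNIE-4.5-VL-28B-A3B-PT"  # post-trained chat checkpoint

# trust_remote_code pulls in Baidu's custom multimodal model/processor classes.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 weights; quantized builds cut VRAM further
    device_map="auto",
    trust_remote_code=True,
)

# One user turn mixing an image and a question (schema assumed from the model card).
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "chart.png"}},
        {"type": "text", "text": "Summarize the trend shown in this chart."},
    ],
}]

# enable_thinking toggles the "thinking" vs. "non-thinking" mode (assumed flag).
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = processor(
    text=[prompt], images=[Image.open("chart.png")], return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```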
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are available for ERNIE-4.5-VL-28B-A3B.