Total Parameters: 28B
Context Length: 131,072 tokens
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: Apache 2.0
Release Date: 30 Jun 2025
Knowledge Cutoff: Dec 2024
Active Parameters: 3.0B
Number of Experts: 130
Active Experts: 14
Attention Structure: Grouped-Query Attention
Hidden Dimension Size: 3584
Number of Layers: 28
Attention Heads: 20
Key-Value Heads: 4
Activation Function: SwiGLU
Normalization: RMS Normalization
Position Embedding: Rotary Position Embedding (RoPE)
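The "A3B" suffix reflects the spec table above: of the 28B total parameters, roughly 3B are activated per token. A quick sanity check, using only values from the table, shows that the fraction of active experts roughly tracks the fraction of active parameters (this is only an approximation, since attention and embedding weights are shared rather than per-expert):

```python
# Values taken from the spec table above; variable names are illustrative.
total_params_b = 28.0    # total parameters, in billions
active_params_b = 3.0    # parameters activated per forward pass, in billions
total_experts = 130
active_experts = 14

expert_ratio = active_experts / total_experts    # fraction of experts used per token
param_ratio = active_params_b / total_params_b   # fraction of weights used per token

print(f"{expert_ratio:.3f} vs {param_ratio:.3f}")  # the two ratios are close
```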
ERNIE-4.5-VL-28B-A3B is a multimodal Mixture-of-Experts (MoE) foundation model developed by Baidu to provide advanced vision-language understanding within an efficient computational envelope. This model variant is designed to bridge the gap between high-capacity reasoning and deployable inference by activating only a subset of its total parameters during any given forward pass. It supports sophisticated multimodal tasks including document and chart interpretation, fine-grained visual perception, and temporal analysis of video sequences. A distinguishing feature is its integration of a 'thinking' mode, which utilizes multi-step reasoning processes to address complex queries that require a deeper semantic alignment between visual and textual data.
Technically, the model is built upon a heterogeneous MoE architecture that facilitates joint pre-training on disparate modalities without interference. This is achieved through modality-isolated routing and the application of router orthogonal loss and multimodal token-balanced loss, ensuring that vision and language experts develop specialized representations while reinforcing mutual understanding. The visual component utilizes a variable-resolution Vision Transformer (ViT) encoder that projects visual features into a shared embedding space. The architecture incorporates Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE) to manage its extensive 131,072-token context length, while post-training optimizations such as Direct Preference Optimization (DPO) and Reinforcement Learning with Verifiable Rewards (RLVR) further refine its alignment and reasoning accuracy.
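The GQA configuration listed above (20 query heads sharing 4 key-value heads) can be sketched in a few lines. This is an illustrative NumPy toy, not the model's actual implementation; the head dimension and sequence length below are arbitrary assumptions:

```python
import numpy as np

def gqa(q, k, v):
    """Grouped-query attention: each KV head serves a group of query heads.

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    With 20 query heads and 4 KV heads, each KV head serves 5 query heads.
    """
    group = q.shape[0] // k.shape[0]         # query heads per KV head (here 5)
    k = np.repeat(k, group, axis=0)          # share each KV head across its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)            # softmax over key positions
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((20, 8, 64))         # 20 query heads (per the table)
k = rng.standard_normal((4, 8, 64))          # 4 KV heads (per the table)
v = rng.standard_normal((4, 8, 64))
out = gqa(q, k, v)                           # shape (20, 8, 64)
```

The benefit is KV-cache size: only 4 heads' worth of keys and values are stored per layer instead of 20, a 5x reduction at this model's long context lengths.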
From a performance and deployment perspective, ERNIE-4.5-VL-28B-A3B is engineered for high throughput and multi-hardware compatibility using the PaddlePaddle framework. It supports 4-bit and 2-bit lossless quantization through convolutional code quantization, enabling efficient execution on hardware with limited memory. The model's reasoning capabilities are enhanced by 'Thinking with Images' functionality, allowing the system to autonomously call tools such as image zooming or external searches to resolve fine-grained details or long-tail visual knowledge. These attributes make it particularly effective for enterprise-grade multimodal agents, industrial visual grounding, and STEM-focused problem-solving scenarios.
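For intuition on why low-bit weights matter here, the sketch below shows plain symmetric round-to-nearest 4-bit quantization. This is NOT the convolutional code quantization used by the model, only a baseline illustration of the memory/precision trade-off that 4-bit storage involves:

```python
import numpy as np

def quantize_int4(w):
    """Symmetric round-to-nearest quantization to the int4 range [-8, 7]."""
    scale = np.abs(w).max() / 7.0            # map the largest weight to +/-7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()                # bounded by about scale / 2
```

Each weight now needs 4 bits instead of 32, an 8x reduction; methods like CCQ are designed to recover the accuracy that naive rounding of this kind gives up.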
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks for ERNIE-4.5-VL-28B-A3B are available.