Total Parameters
424B
Context Length
131,072 tokens
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
30 Jun 2025
Knowledge Cutoff
Jun 2025
Active Parameters
47.0B
Number of Experts
128
Active Experts
16
Attention Structure
Grouped-Query Attention
Hidden Dimension Size
4096
Number of Layers
54
Attention Heads
64
Key-Value Heads
8
Activation Function
Swish
Normalization
RMS Normalization
Position Embedding
Rotary Position Embedding (RoPE)
ERNIE-4.5-VL-424B-A47B-Base is the flagship multimodal foundation model in Baidu's ERNIE 4.5 family. This variant functions as a base model, pre-trained for cross-modal reasoning and high-fidelity understanding of text, images, and videos. It employs a heterogeneous Mixture-of-Experts (MoE) framework that scales the system to 424 billion total parameters while maintaining computational efficiency by activating only 47 billion parameters per token. The model is engineered for complex multimodal workflows, including content analysis, visual-language reasoning, and long-context information processing across diverse data types.
The technical core of the model revolves around a novel multimodal heterogeneous MoE structure that integrates modality-isolated routing and shared parameter layers. This architecture utilizes modality-specific experts to preserve the unique characteristics of textual and visual data while employing shared attention mechanisms to foster mutual reinforcement between modalities. To ensure stable and balanced learning during large-scale pre-training, the model incorporates a router orthogonal loss and multimodal token-balanced loss, preventing any single modality from dominating the gradient updates. The vision stack is further enhanced by a variable-resolution Vision Transformer (ViT) encoder and an adapter that projects visual features into a unified embedding space, supported by 2D Rotary Position Embeddings (RoPE) for precise spatial grounding.
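The modality-isolated routing described above can be illustrated with a minimal sketch: each modality keeps its own router and expert pool, and every token is dispatched to its top-k experts with softmax-normalized gate weights. This is a generic top-k MoE router written for illustration, not ERNIE's actual implementation; the function names and shapes are assumptions, with only the expert counts (128 experts, 16 active) taken from the spec table above.

```python
import numpy as np

def top_k_route(hidden, router_weights, k=16):
    """Dispatch each token to its top-k experts via softmax gating.

    hidden: (n_tokens, d_model) token representations
    router_weights: (d_model, n_experts) router projection for one modality
    Returns per-token expert indices and gate weights that sum to 1.
    """
    logits = hidden @ router_weights                    # (n_tokens, n_experts)
    topk_idx = np.argsort(logits, axis=-1)[:, -k:]      # top-k expert ids per token
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    gates = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)          # renormalize over chosen experts
    return topk_idx, gates

# Modality isolation: text and vision tokens never share a router,
# so each modality's experts specialize independently (illustrative sizes).
rng = np.random.default_rng(0)
d_model = 64
routers = {
    "text": rng.normal(size=(d_model, 128)),    # 128 text experts, 16 active
    "vision": rng.normal(size=(d_model, 128)),  # separate vision expert pool
}
text_tokens = rng.normal(size=(10, d_model))
idx, gates = top_k_route(text_tokens, routers["text"], k=16)
```

The router orthogonal loss and token-balanced loss mentioned above would be added on top of `logits` during training to keep expert utilization even; they are omitted here for brevity.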
Optimized for high-performance deployment, ERNIE-4.5-VL-424B-A47B-Base is built upon the PaddlePaddle framework and supports advanced inference techniques like multi-expert parallel collaboration and convolutional code quantization. This enables the model to achieve near-lossless 4-bit and 2-bit quantization, allowing for the deployment of this large-scale system on more accessible hardware configurations. With an expansive context window of 131,072 tokens and support for both thinking and non-thinking inference modes, the model is suitable for industrial-grade applications requiring deep semantic reasoning over long-form documents or intricate video sequences.
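Convolutional code quantization is Baidu's own scheme and is not reproduced here; as a point of reference, the sketch below shows plain blockwise symmetric 4-bit quantization, the simpler baseline such methods improve on. Every group of 64 weights shares one floating-point scale and each weight is stored as an integer in [-7, 7]. All names and block sizes are illustrative assumptions.

```python
import numpy as np

def quantize_4bit(weights, block=64):
    """Blockwise symmetric 4-bit quantization (illustrative baseline).

    Each block of `block` consecutive weights shares one fp32 scale;
    values are rounded to integers in [-7, 7] (a signed 4-bit range).
    """
    flat = weights.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)   # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(flat / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    """Recover approximate fp32 weights from integer codes and scales."""
    return (q * scale).reshape(shape).astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(128, 128)).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, w.shape)
max_err = np.abs(w - w_hat).max()   # bounded by half a quantization step per block
```

With symmetric rounding the per-weight error is at most half a scale step, which is why near-lossless behavior requires the more sophisticated codebook-style schemes the model actually ships with.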
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models built on a heterogeneous Mixture-of-Experts (MoE) architecture that shares parameters across modalities while reserving dedicated parameters for each, supporting efficient language and multimodal processing.
No evaluation benchmarks for ERNIE-4.5-VL-424B-A47B-Base available.