Total Parameters: 424B
Context Length: 131,072 tokens
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: Apache 2.0
Release Date: 30 Jun 2025
Knowledge Cutoff: Jun 2025
Active Parameters: 47.0B
Number of Experts: 128
Active Experts: 16
Attention Structure: Grouped-Query Attention
Hidden Dimension Size: -
Number of Layers: 54
Attention Heads: 64
Key-Value Heads: 8
Activation Function: -
Normalization: RMS Normalization
Position Embedding: Absolute Position Embedding
ERNIE-4.5-VL-424B-A47B is a multimodal foundation model developed by Baidu and the flagship variant of the ERNIE 4.5 family. It is engineered to process and generate content across textual and visual modalities using a large-scale Mixture of Experts (MoE) framework. By combining 424 billion total parameters with sparse activation of 47 billion parameters per token, the model maintains high-capacity representation while keeping per-token compute manageable. Its design targets applications requiring advanced reasoning, comprehensive document analysis, and sophisticated multimodal conversational interaction.
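To make the sparsity trade-off concrete, the back-of-the-envelope snippet below (illustrative arithmetic only, using the parameter counts from the specification above) computes the fraction of parameters activated per token:

```python
# Sparse-activation ratio for ERNIE-4.5-VL-424B-A47B,
# using the parameter counts listed in the spec above.
total_params = 424e9    # total parameters
active_params = 47e9    # parameters activated per token
print(f"active fraction per token: {active_params / total_params:.1%}")  # ~11.1%
```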
The model employs a heterogeneous MoE architecture that differentiates between text and vision processing while maintaining a unified hidden state. It incorporates 128 experts in total, including 64 specialized experts for text and 64 for vision, with a routing mechanism that selects 8 active experts per modality for each token. To ensure effective cross-modal integration without performance degradation in specific domains, the system utilizes shared self-attention layers and shared experts alongside modality-isolated routing. The attention mechanism is based on Grouped Query Attention (GQA) with 64 heads and 8 key-value heads, optimized for a context window of 131,072 tokens.
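The modality-isolated routing described above can be illustrated with a minimal NumPy sketch: each modality gets its own gating matrix over its own 64 experts, so text and vision tokens never compete for the same experts. This is a toy illustration, not Baidu's implementation; the hidden size and router weights are made up, and the shared experts and shared self-attention layers are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 512                 # illustrative size; the real hidden dimension is not published here
EXPERTS_PER_MODALITY = 64    # 64 text experts + 64 vision experts = 128 total
TOP_K = 8                    # experts activated per token within a modality

# One router (gating matrix) per modality, so routing is modality-isolated.
routers = {
    "text": rng.standard_normal((HIDDEN, EXPERTS_PER_MODALITY)),
    "vision": rng.standard_normal((HIDDEN, EXPERTS_PER_MODALITY)),
}

def route(token: np.ndarray, modality: str):
    """Return the top-k expert indices and their normalized gate weights."""
    logits = token @ routers[modality]                 # (64,) gating logits
    top = np.argsort(logits)[-TOP_K:]                  # indices of the 8 highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())  # numerically stable softmax over the top-k
    return top, weights / weights.sum()

token = rng.standard_normal(HIDDEN)
for modality in ("text", "vision"):
    experts, gates = route(token, modality)
    print(modality, experts, gates.round(3))
```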
Training and inference are facilitated by the PaddlePaddle deep learning framework, supporting industrial-grade deployment through 4-bit and 2-bit lossless quantization. The architecture supports two distinct operational modes: a standard inference mode for rapid perception tasks and a reasoning-heavy mode for complex logical problems. Primary use cases involve visual question answering, complex chart and document interpretation, and automated multimodal content generation. The inclusion of 2D rotary position embeddings (RoPE) in the vision encoder and absolute position embeddings in the transformer backbone ensures precise spatial and sequential modeling across diverse input types.
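For the vision encoder's 2D rotary position embeddings, the sketch below shows one common way to extend RoPE to image patches: split each feature vector in half and rotate one half by the patch's row index and the other by its column index. This is a generic illustration of the 2D RoPE idea, not ERNIE's exact frequency schedule or dimension split.

```python
import numpy as np

def rope_1d(x: np.ndarray, pos: int) -> np.ndarray:
    """Standard 1D RoPE on an even-length vector: rotate each
    consecutive pair (x[2i], x[2i+1]) by pos * theta_i."""
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))   # (d/2,) rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: np.ndarray, row: int, col: int) -> np.ndarray:
    """2D RoPE for a vision patch: the first half of the feature
    vector encodes the row coordinate, the second half the column."""
    half = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[:half], row), rope_1d(x[half:], col)])

patch = np.random.default_rng(0).standard_normal(64)
print(rope_2d(patch, row=3, col=7)[:4])
```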
The Baidu ERNIE 4.5 family consists of ten large-scale model variants. The MoE variants employ a heterogeneous Mixture-of-Experts architecture that shares parameters across modalities while also reserving dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are currently available for ERNIE-4.5-VL-424B-A47B.