Active Parameters
47B
Context Length
131,072 tokens
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
30 Jun 2025
Knowledge Cutoff
Jun 2025
Total Parameters
424B
Number of Experts
128
Active Experts
16
Attention Structure
Grouped-Query Attention
Hidden Dimension Size
-
Number of Layers
54
Attention Heads
64
Key-Value Heads
8
Activation Function
-
Normalization
RMS Normalization
Position Embedding
Rotary Position Embedding (RoPE)
VRAM requirements for different quantization methods and context sizes
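As a rough guide, weight storage scales linearly with quantization width, and the KV cache scales linearly with context length. The sketch below estimates both; the head dimension of 128 is an assumption (the hidden dimension is not listed above), and activation memory and framework overhead are ignored, so treat the results as lower bounds.

```python
# Back-of-envelope VRAM estimate for ERNIE-4.5-VL-424B-A47B.
# HEAD_DIM = 128 is an assumption; the card does not list the hidden size.
TOTAL_PARAMS = 424e9                  # all 424B weights must be resident
N_LAYERS, N_KV_HEADS, HEAD_DIM = 54, 8, 128

def weights_gib(bits_per_param: float) -> float:
    """Weight storage at a given quantization width."""
    return TOTAL_PARAMS * bits_per_param / 8 / 2**30

def kv_cache_gib(ctx: int, bytes_per_elem: int = 2) -> float:
    """K and V caches across all layers at bf16, with GQA's 8 KV heads."""
    return N_LAYERS * N_KV_HEADS * HEAD_DIM * ctx * 2 * bytes_per_elem / 2**30

for name, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"{name:>5} weights: ~{weights_gib(bits):,.0f} GiB")
print(f"KV cache @ 131,072 tokens: ~{kv_cache_gib(131_072):.0f} GiB")  # ~27 GiB
```

Note that MoE reduces per-token compute (only 16 of 128 experts activate, corresponding to 47B parameters), but all 424B parameters must still fit in memory or be offloaded.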
ERNIE 4.5 is a family of large-scale multimodal foundation models developed by Baidu, designed to integrate and process information across textual and visual modalities. The ERNIE-4.5-VL-424B-A47B variant targets advanced comprehension and generation, supporting applications that demand complex understanding and creative output from diverse data types, including sophisticated conversational AI, multimodal content creation, and intelligent analysis systems.
This variant employs a heterogeneous Mixture of Experts (MoE) architecture comprising 424 billion total parameters, of which 47 billion are activated per token. Its key architectural innovation is a design that shares parameters across modalities while also reserving dedicated expert parameters for each modality, enhancing multimodal understanding without compromising performance on text-only tasks. The model has 54 layers and uses Grouped-Query Attention (GQA) with 64 attention heads and 8 key-value heads. Its positional encoding integrates multimodal positional embeddings for unified hidden states and applies 2D rotary position embedding (RoPE) within the vision encoder. Text and vision features are routed to distinct sets of experts, while shared experts and the self-attention layers process all tokens, facilitating cross-modal knowledge integration. The architecture includes 64 text experts and 64 vision experts, with 8 experts activated per modality per token. Training is further stabilized by modality-isolated routing, a router orthogonal loss, and a multimodal token-balanced loss, which prevent interference between modalities.
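The modality-isolated routing idea can be illustrated with a minimal PyTorch sketch. Everything here is an illustrative assumption rather than Baidu's implementation: module names and dimensions are invented, single linear layers stand in for the real FFN experts, and the auxiliary losses mentioned above are omitted.

```python
# Toy sketch of modality-isolated MoE routing: text and vision tokens are
# routed within separate expert pools, while a shared expert sees all tokens.
# NOT Baidu's implementation; shapes and modules are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityIsolatedMoE(nn.Module):
    def __init__(self, d_model=1024, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleDict({
            m: nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
            for m in ("text", "vision")
        })
        self.routers = nn.ModuleDict({
            m: nn.Linear(d_model, n_experts) for m in ("text", "vision")
        })
        self.shared_expert = nn.Linear(d_model, d_model)  # applied to every token

    def forward(self, x, is_vision):
        # x: (n_tokens, d_model); is_vision: (n_tokens,) bool mask
        out = self.shared_expert(x)
        for name, mask in (("text", ~is_vision), ("vision", is_vision)):
            if not mask.any():
                continue
            tok = x[mask]
            gate = F.softmax(self.routers[name](tok), dim=-1)
            w, idx = gate.topk(self.top_k, dim=-1)   # top-8 experts per token
            w = w / w.sum(dim=-1, keepdim=True)
            mixed = torch.zeros_like(tok)
            for k in range(self.top_k):
                for e in idx[:, k].unique().tolist():
                    sel = idx[:, k] == e
                    mixed[sel] += w[sel, k].unsqueeze(-1) * self.experts[name][e](tok[sel])
            out[mask] = out[mask] + mixed
        return out

# Example: 10 tokens, the last 4 of which are vision tokens
layer = ModalityIsolatedMoE()
y = layer(torch.randn(10, 1024), torch.tensor([False] * 6 + [True] * 4))
print(y.shape)  # torch.Size([10, 1024])
```

Keeping each modality's router and expert pool separate is what lets the vision pathway specialize without perturbing text-expert utilization, which is the interference the router orthogonal and token-balanced losses also guard against.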
ERNIE-4.5-VL-424B-A47B is engineered for tasks requiring cross-modal comprehension and generation, supporting a context length of 131,072 tokens. Its large parameter base and efficient MoE design let the model process extensive, complex inputs, enabling deep semantic understanding and coherent long-form generation across text and images. The model offers distinct "thinking" and "non-thinking" modes to accommodate different reasoning approaches. Potential use cases include multimodal content generation, advanced dialogue systems, visual question answering, document and chart understanding, and general multimodal analysis where synthesizing different data types is critical. For inference efficiency, the model supports quantized deployment, including lossless 4-bit and 2-bit modes. The entire ERNIE 4.5 family, including this variant, is built on the PaddlePaddle deep learning framework, which contributes to high-performance inference and streamlined deployment.
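For local experimentation, the checkpoint can presumably be loaded through Hugging Face transformers as well. The repo id, flags, and dtype below are assumptions to verify against the official model card, and serving a 424B-parameter MoE realistically requires a multi-GPU node.

```python
# Minimal loading sketch via Hugging Face transformers. The repo id and
# trust_remote_code requirement are assumptions about how this variant is
# published; check the Hub model card before relying on them.
# device_map="auto" requires the accelerate package.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "baidu/ERNIE-4.5-VL-424B-A47B-PT"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # use the checkpoint's native precision
    device_map="auto",       # shard across available GPUs
    trust_remote_code=True,  # the VL architecture ships custom code
)

inputs = tokenizer("Describe the ERNIE 4.5 MoE design in one sentence.",
                   return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```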
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are currently available for ERNIE-4.5-VL-424B-A47B.