
ERNIE-4.5-VL-424B-A47B-Base

Total Parameters

424B

Context Length

131,072 tokens

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

30 Jun 2025

Knowledge Cutoff

Jun 2025

Technical Specifications

Active Parameters per Token

47.0B

Number of Experts

128

Active Experts

16

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

4096

Number of Layers

54

Attention Heads

64

Key-Value Heads

8

Activation Function

Swish

Normalization

RMS Normalization

Position Embedding

Rotary Position Embedding (RoPE)
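The attention figures above (64 query heads sharing 8 key-value heads) imply a KV cache 8x smaller than full multi-head attention. A minimal NumPy sketch of that grouping, using toy sequence and head dimensions rather than the model's real ones:

```python
import numpy as np

# Illustrative grouped-query attention shapes (not Baidu's implementation):
# 64 query heads share 8 key-value heads, so each KV head serves a
# group of 64 / 8 = 8 query heads, shrinking the KV cache 8x.
n_q_heads, n_kv_heads, head_dim, seq = 64, 8, 64, 16
group = n_q_heads // n_kv_heads  # 8 query heads per KV head

rng = np.random.default_rng(0)
q = rng.normal(size=(n_q_heads, seq, head_dim))
k = rng.normal(size=(n_kv_heads, seq, head_dim))  # cached: only 8 heads
v = rng.normal(size=(n_kv_heads, seq, head_dim))

# Broadcast each KV head across its group of query heads.
k_rep = np.repeat(k, group, axis=0)               # (64, seq, head_dim)
v_rep = np.repeat(v, group, axis=0)
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)         # softmax over keys
out = weights @ v_rep                             # (64, seq, head_dim)
print(out.shape)  # (64, 16, 64)
```

Only `k` and `v` need caching during generation, which is where the 8x memory saving comes from.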

ERNIE-4.5-VL-424B-A47B-Base

ERNIE-4.5-VL-424B-A47B-Base is the flagship multimodal foundation model in Baidu's ERNIE 4.5 family, characterized by its massive scale and advanced architectural design. This variant functions as a base model, pre-trained for comprehensive cross-modal reasoning and high-fidelity understanding of text, images, and videos. It employs a heterogeneous Mixture-of-Experts (MoE) framework that enables the system to scale to 424 billion parameters while maintaining computational efficiency by activating only 47 billion parameters per token. The model is specifically engineered to handle complex multimodal workflows, including content analysis, sophisticated visual-language reasoning, and long-context information processing across diverse data types.
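The sparse activation described above can be sketched with a toy top-k router. Beyond the 128-expert / 16-active split stated in the specs, the dimensions and expert networks here are illustrative stand-ins, not ERNIE's actual configuration:

```python
import numpy as np

# Toy MoE layer: 128 experts, 16 routed per token, so roughly 16/128
# of the expert parameters are exercised for any given token.
n_experts, top_k, d_model = 128, 16, 32
rng = np.random.default_rng(0)
router_w = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(scale=0.02, size=(d_model, d_model))
           for _ in range(n_experts)]

def moe_forward(x):
    logits = x @ router_w                     # one routing score per expert
    chosen = np.argsort(logits)[-top_k:]      # indices of the top-16 experts
    g = logits[chosen]
    g = np.exp(g - g.max())
    g /= g.sum()                              # renormalized softmax gates
    # Only the 16 chosen experts actually run.
    return sum(w * (x @ experts[i]) for w, i in zip(g, chosen))

y = moe_forward(rng.normal(size=d_model))
print(y.shape)  # (32,)
```

This is how a 424B-parameter model can cost only ~47B parameters of compute per token: the router selects a small, input-dependent subset of experts.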

The technical core of the model revolves around a novel multimodal heterogeneous MoE structure that integrates modality-isolated routing and shared parameter layers. This architecture utilizes modality-specific experts to preserve the unique characteristics of textual and visual data while employing shared attention mechanisms to foster mutual reinforcement between modalities. To ensure stable and balanced learning during large-scale pre-training, the model incorporates a router orthogonal loss and multimodal token-balanced loss, preventing any single modality from dominating the gradient updates. The vision stack is further enhanced by a variable-resolution Vision Transformer (ViT) encoder and an adapter that projects visual features into a unified embedding space, supported by 2D Rotary Position Embeddings (RoPE) for precise spatial grounding.
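The exact form of the router orthogonal loss is not spelled out here; one plausible version, shown purely as an illustration and not as Baidu's formula, penalizes correlation between the routing directions of different experts so their gating signals stay decorrelated:

```python
import numpy as np

def router_orthogonal_loss(router_w):
    # router_w: (n_experts, d_model). Normalize each expert's routing
    # vector, then penalize off-diagonal entries of the Gram matrix
    # W W^T -- zero when all routing directions are mutually orthogonal.
    # This is an assumed, illustrative form of the loss.
    w = router_w / np.linalg.norm(router_w, axis=1, keepdims=True)
    gram = w @ w.T
    off_diag = gram - np.diag(np.diag(gram))
    return float((off_diag ** 2).sum())

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
print(router_orthogonal_loss(w) > 0)        # random directions overlap
print(router_orthogonal_loss(np.eye(8)))    # orthonormal rows -> 0.0
```

A penalty of this shape keeps experts from collapsing onto the same routing direction, which is one way to keep expert utilization balanced during pre-training.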

Optimized for high-performance deployment, ERNIE-4.5-VL-424B-A47B-Base is built upon the PaddlePaddle framework and supports advanced inference techniques like multi-expert parallel collaboration and convolutional code quantization. This enables the model to achieve near-lossless 4-bit and 2-bit quantization, allowing for the deployment of this large-scale system on more accessible hardware configurations. With an expansive context window of 131,072 tokens and support for both thinking and non-thinking inference modes, the model is suitable for industrial-grade applications requiring deep semantic reasoning over long-form documents or intricate video sequences.
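To make the low-bit arithmetic concrete, here is a generic round-to-nearest 4-bit weight quantizer. This is not Baidu's convolutional code quantization, which is a more sophisticated scheme; it only illustrates the memory/precision trade-off that 4-bit deployment exploits:

```python
import numpy as np

def quantize_4bit(w):
    # Symmetric int4: map weights into the integer range [-7, 7]
    # with a single per-tensor scale (per-group scales are common
    # in practice, but omitted here for brevity).
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, s = quantize_4bit(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"mean abs reconstruction error: {err:.3f}")
```

Each weight shrinks from 16 bits to 4, cutting weight memory roughly 4x at the cost of a small reconstruction error, which is why a 424B-parameter model becomes deployable on far more modest hardware.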

About ERNIE 4.5

The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.



Evaluation Benchmarks

No evaluation benchmarks are available for ERNIE-4.5-VL-424B-A47B-Base.


Model Transparency

Total Score

B+

72 / 100
