
ERNIE-4.5-VL-424B-A47B

Total Parameters

424B

Context Length

131,072 tokens

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

30 Jun 2025

Knowledge Cutoff

Jun 2025

Technical Specifications

Active Parameters

47.0B

Number of Experts

128

Active Experts

16

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

-

Number of Layers

54

Attention Heads

64

Key-Value Heads

8

Activation Function

-

Normalization

RMS Normalization

Position Embedding

Absolute Position Embedding

ERNIE-4.5-VL-424B-A47B

ERNIE-4.5-VL-424B-A47B is a multimodal foundation model developed by Baidu, representing the flagship variant of the ERNIE 4.5 family. It is engineered to process and generate content across textual and visual modalities using a large-scale Mixture of Experts (MoE) framework. By combining 424 billion total parameters with a sparse activation of 47 billion parameters per token, the model maintains high-capacity representation while keeping per-token compute manageable. Its design targets applications requiring advanced reasoning, comprehensive document analysis, and sophisticated multimodal conversational interactions.

The model employs a heterogeneous MoE architecture that differentiates between text and vision processing while maintaining a unified hidden state. It incorporates 128 experts in total, including 64 specialized experts for text and 64 for vision, with a routing mechanism that selects 8 active experts per modality for each token. To ensure effective cross-modal integration without performance degradation in specific domains, the system utilizes shared self-attention layers and shared experts alongside modality-isolated routing. The attention mechanism is based on Grouped Query Attention (GQA) with 64 heads and 8 key-value heads, optimized for a context window of 131,072 tokens.
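The modality-isolated routing described above can be sketched in a few lines. This is an illustrative simplification, not Baidu's implementation: it assumes a plain linear gate per modality with softmax-normalized top-k scores, so text tokens are scored only against the text experts and vision tokens only against the vision experts.

```python
import numpy as np

def route_tokens(hidden, gate_weights, k=8):
    """Top-k routing for one modality: each token selects the k experts
    with the highest gate scores, softmax-normalized into mixing weights."""
    logits = hidden @ gate_weights                       # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]           # indices of k best experts
    scores = np.take_along_axis(logits, topk, axis=-1)   # their raw gate scores
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return topk, weights

rng = np.random.default_rng(0)
d_model = 64                                  # toy hidden size, not the real one
text_gate = rng.standard_normal((d_model, 64))    # 64 text experts
vision_gate = rng.standard_normal((d_model, 64))  # 64 vision experts

tokens = rng.standard_normal((4, d_model))
# Modality isolation: a text token never sees the vision gate, and vice
# versa; 8 experts fire per modality, matching the card's 16 active total.
text_experts, text_w = route_tokens(tokens, text_gate, k=8)
print(text_experts.shape)
```

The isolation is what lets each expert pool specialize without cross-modal interference, while the shared attention layers and shared experts (not shown) carry the cross-modal signal.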

Training and inference are facilitated by the PaddlePaddle deep learning framework, supporting industrial-grade deployment through 4-bit and 2-bit lossless quantization. The architecture supports two distinct operational modes: a standard inference mode for rapid perception tasks and a reasoning-heavy mode for complex logical problems. Primary use cases involve visual question answering, complex chart and document interpretation, and automated multimodal content generation. The inclusion of 2D rotary position embeddings (RoPE) in the vision encoder and absolute position embeddings in the transformer backbone ensures precise spatial and sequential modeling across diverse input types.

About ERNIE 4.5

The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.



Evaluation Benchmarks

No evaluation benchmarks for ERNIE-4.5-VL-424B-A47B available.

