ERNIE-4.5-VL-424B-A47B: Specifications and GPU VRAM Requirements

Open Source

Open Weights

Total Parameters: 424B
Context Length: 131,072 tokens
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: Apache 2.0
Release Date: 30 Jun 2025
Knowledge Cutoff: Jun 2025

Technical Specifications

Active Parameters: 47.0B per token
Number of Experts: 128 (64 text, 64 vision)
Active Experts: 8 per modality, per token
Attention Structure: Grouped-Query Attention
Hidden Dimension Size: not specified
Number of Layers: 54
Attention Heads: 64
Key-Value Heads: 8
Activation Function: not specified
Normalization: RMS Normalization

Position Embedding: Multimodal positional embedding (2D RoPE in the vision encoder)

ERNIE-4.5-VL-424B-A47B

ERNIE 4.5 is a family of large-scale multimodal foundation models developed by Baidu that processes both text and visual input. The ERNIE-4.5-VL-424B-A47B variant targets comprehension and generation over mixed inputs, with intended applications including conversational AI, multimodal content creation, and analysis systems that combine text and images.

This variant uses a heterogeneous Mixture of Experts (MoE) architecture with 424 billion total parameters, of which 47 billion are activated per token. The design shares some parameters across modalities while reserving dedicated expert parameters for each modality, which is intended to improve multimodal understanding without degrading performance on text-only tasks. The model has 54 layers and uses Grouped-Query Attention (GQA) with 64 attention heads and 8 key-value heads. For position encoding, it applies multimodal positional embeddings to the unified hidden states and 2D rotary position embedding (RoPE) inside the vision encoder. Text and vision features are routed to separate sets of experts, while shared experts and the self-attention layers process all tokens, which allows knowledge to transfer across modalities. There are 64 text experts and 64 vision experts, and 8 experts are selected per token within each modality. Training additionally uses modality-isolated routing, a router orthogonal loss, and a multimodal token-balanced loss to limit interference between modalities.
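The routing rule described above (separate expert pools per modality, 8 experts chosen per token) can be illustrated with a small sketch. It is not Baidu's implementation: the function name `route_token`, the expert index layout (text experts 0 to 63, vision experts 64 to 127), and the random scores are all assumptions made for illustration, and shared experts are left out.

```python
import random

NUM_TEXT_EXPERTS = 64    # from the description above
NUM_VISION_EXPERTS = 64  # from the description above
TOP_K = 8                # routed experts selected per token, per the description


def route_token(modality, router_scores):
    """Return the global indices of the TOP_K experts chosen for one token.

    Hypothetical layout: text tokens may only pick experts 0..63 and vision
    tokens only experts 64..127, which models modality-isolated routing.
    """
    if modality == "text":
        offset, count = 0, NUM_TEXT_EXPERTS
    elif modality == "vision":
        offset, count = NUM_TEXT_EXPERTS, NUM_VISION_EXPERTS
    else:
        raise ValueError(f"unknown modality: {modality!r}")
    if len(router_scores) != count:
        raise ValueError("expected one score per expert in the modality's pool")
    # Rank the experts in this modality's pool by score and keep the best TOP_K.
    ranked = sorted(range(count), key=lambda i: router_scores[i], reverse=True)
    return sorted(offset + i for i in ranked[:TOP_K])


# Random scores stand in for a learned router layer (assumption).
rng = random.Random(0)
text_choice = route_token("text", [rng.random() for _ in range(NUM_TEXT_EXPERTS)])
vision_choice = route_token("vision", [rng.random() for _ in range(NUM_VISION_EXPERTS)])
print("text token   ->", text_choice)
print("vision token ->", vision_choice)
```

Because each token only ever scores the experts in its own pool, a vision token can never displace a text expert, which is the property the modality-isolated design is meant to guarantee.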

ERNIE-4.5-VL-424B-A47B supports a context length of 131,072 tokens, so it can process long and complex inputs that mix text and images. It offers separate "thinking" and "non-thinking" modes for different reasoning needs. Potential use cases include multimodal content generation, dialogue systems, visual question answering, document and chart understanding, and general multimodal analysis. To reduce inference cost, the model supports quantized deployment, including 4-bit and 2-bit quantization that Baidu describes as lossless. The entire ERNIE 4.5 family, including this variant, is built on the PaddlePaddle deep learning framework, which also underpins its inference and deployment tooling.
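To give a rough sense of what quantization buys, the sketch below estimates the memory needed for the weights alone at several precisions. It assumes every one of the 424B parameters is stored at the stated bit width and ignores quantization metadata (scales, zero points), activations, and the KV cache, so real deployments will need more. The helper name `weight_memory_gb` is hypothetical.

```python
TOTAL_PARAMS = 424_000_000_000  # total parameters, from the description above


def weight_memory_gb(total_params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"{label:>5}: {weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")

# Prints:
#  BF16: 848 GB
# 8-bit: 424 GB
# 4-bit: 212 GB
# 2-bit: 106 GB
```

The estimate uses the total parameter count rather than the 47B active parameters because, unless experts are offloaded, all experts have to be resident in memory even though only a subset runs for any given token.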

About ERNIE 4.5

The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They use a heterogeneous Mixture-of-Experts (MoE) architecture that shares parameters across modalities while keeping dedicated parameters for each modality, supporting both language and multimodal processing.

Evaluation Benchmarks

No evaluation benchmarks for ERNIE-4.5-VL-424B-A47B available.

GPU Requirements
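Besides the weights, VRAM use grows with context length through the key-value cache. The sketch below estimates that cache for a single sequence from the layer count (54) and key-value head count (8) given in the model description. The head dimension of 128 and the 16-bit cache precision are assumptions, since neither is stated on this page, and the function name `kv_cache_bytes` is hypothetical.

```python
NUM_LAYERS = 54      # from the model description
NUM_KV_HEADS = 8     # from the model description (Grouped-Query Attention)
HEAD_DIM = 128       # ASSUMPTION: not stated on this page
BYTES_PER_VALUE = 2  # ASSUMPTION: 16-bit (FP16/BF16) cache entries


def kv_cache_bytes(context_tokens,
                   num_layers=NUM_LAYERS,
                   num_kv_heads=NUM_KV_HEADS,
                   head_dim=HEAD_DIM,
                   bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading factor of 2 accounts for storing both a key and a value
    # vector per layer, per KV head, per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


for tokens in (1_024, 65_536, 131_072):
    print(f"{tokens:>7} tokens: {kv_cache_bytes(tokens) / 2**30:.2f} GiB")

# Prints (under the assumptions above):
#    1024 tokens: 0.21 GiB
#   65536 tokens: 13.50 GiB
#  131072 tokens: 27.00 GiB
```

With 64 query heads sharing 8 key-value heads, the cache is one eighth the size it would be under standard multi-head attention with the same head dimension, which is the main memory benefit of GQA at long context.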
