ERNIE-4.5-VL-424B-A47B

Total Parameters

424B

Active Parameters

47B

Context Length

131,072 tokens

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

30 Jun 2025

Knowledge Cutoff

Jun 2025

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

64

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

Absolute Position Embedding

RoPE Theta

500,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

Swish

Dimensions

Hidden Dimension Size

8,192

Number of Layers

54

FFN Intermediate Size (Dense)

28,672

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

103,424

Mixture of Experts

Total Expert Parameters

47.0B

Number of Experts

128

Active Experts

16

Shared Experts

-

FFN Intermediate Size (per Expert)

-

Dense Layers Before MoE

3

Architecture Diagram

Input Tokens
→ Token Embedding (Position: Absolute · Hidden: 8.2k · Context: 131.1k · Vocab: 103.4k)
→ × 54 layers:
    RMSNorm (Pre-Attention) → Grouped-Query Attention (64Q / 8KV heads, Head dim: 128) → +
    RMSNorm (Pre-FFN) → Sparse MoE FFN (16/128 experts, Swish) → +
→ Final RMSNorm
→ Output Logits
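The attention figures in this specification (54 layers, 8 key-value heads, head dimension 128, 131,072-token context) are enough for a rough key-value cache estimate. The sketch below is only that: it assumes a BF16 cache (2 bytes per value), assumes every layer caches keys and values at these sizes, and ignores the vision encoder and any framework overhead. The function name `kv_cache_bytes` is illustrative and not part of any ERNIE toolkit.

```python
def kv_cache_bytes(tokens, layers=54, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions of one sequence."""
    # The factor 2 covers one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3

per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(131_072)
# Hypothetical comparison: the cache if all 64 query heads kept their own K/V.
ungrouped_full_context = kv_cache_bytes(131_072, kv_heads=64)

print(f"per token:        {per_token} bytes")
print(f"full context:     {full_context / GIB:.1f} GiB")
print(f"without grouping: {ungrouped_full_context / GIB:.1f} GiB")
```

Under these assumptions, sharing each key-value head among 8 query heads cuts the cache to one eighth of what 64 independent key-value heads would need.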

ERNIE-4.5-VL-424B-A47B

ERNIE-4.5-VL-424B-A47B is a multimodal foundation model developed by Baidu and the flagship variant of the ERNIE 4.5 family. It is engineered to process and generate content across textual and visual modalities using a large-scale Mixture of Experts (MoE) framework. Of its 424 billion total parameters, only 47 billion are activated per token, so the model retains high representational capacity while keeping per-token compute well below that of a dense model of the same size. Its design targets applications that require advanced reasoning, comprehensive document analysis, and sophisticated multimodal conversation.

The model employs a heterogeneous MoE architecture that differentiates between text and vision processing while maintaining a unified hidden state. It incorporates 128 experts in total, 64 specialized for text and 64 for vision, with a routing mechanism that selects 8 active experts per modality for each token. To integrate the two modalities without one degrading the other, the system combines shared self-attention layers and shared experts with modality-isolated routing. The attention mechanism is Grouped-Query Attention (GQA) with 64 query heads and 8 key-value heads, optimized for a context window of 131,072 tokens.
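The modality-isolated routing described above can be illustrated with a small sketch. This is not Baidu's implementation: the expert index layout (text experts 0 to 63, vision experts 64 to 127), the single 128-entry logit vector, and the softmax over the selected logits are all assumptions made for illustration, and the shared experts and self-attention are omitted. `route_token` is a hypothetical helper.

```python
import math
import random

NUM_TEXT_EXPERTS = 64    # assumed layout: experts 0..63 serve text tokens
NUM_VISION_EXPERTS = 64  # assumed layout: experts 64..127 serve vision tokens
TOP_K = 8                # routed experts chosen per token within its modality


def route_token(router_logits, modality, top_k=TOP_K):
    """Return [(expert_id, weight), ...] for one token.

    router_logits: 128 scores, one per expert (hypothetical router output).
    modality: "text" or "vision"; restricts the candidate expert pool.
    """
    if modality == "text":
        pool = range(0, NUM_TEXT_EXPERTS)
    elif modality == "vision":
        pool = range(NUM_TEXT_EXPERTS, NUM_TEXT_EXPERTS + NUM_VISION_EXPERTS)
    else:
        raise ValueError(f"unknown modality: {modality!r}")

    # Top-k selection happens only inside the token's own modality pool.
    chosen = sorted(pool, key=lambda e: router_logits[e], reverse=True)[:top_k]

    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - peak) for e in chosen]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(chosen, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(128)]
text_route = route_token(logits, "text")
vision_route = route_token(logits, "vision")
print([e for e, _ in text_route])
print([e for e, _ in vision_route])
```

The point of the isolation is visible in the output: a text token can never be dispatched to a vision expert, and vice versa, even when the same logits are used.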

Training and inference run on the PaddlePaddle deep learning framework, and the model can be deployed at industrial scale with 4-bit and 2-bit quantization that Baidu describes as lossless. The model supports two operational modes: a standard (non-thinking) mode for rapid perception tasks and a reasoning (thinking) mode for complex logical problems. Primary use cases are visual question answering, complex chart and document interpretation, and automated multimodal content generation. The vision encoder uses 2D rotary position embeddings (RoPE) for spatial modeling, and the transformer backbone uses absolute position embeddings for sequential modeling.

About ERNIE 4.5

The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.


Evaluation Benchmarks

No evaluation benchmarks for ERNIE-4.5-VL-424B-A47B available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B

70 / 100

ERNIE-4.5-VL-424B-A47B Model Integrity Report

Audit Note

ERNIE-4.5-VL-424B-A47B exhibits high transparency in its architectural design and licensing, providing precise parameter counts and a permissive Apache 2.0 license. However, it remains opaque about its specific training data composition and the total environmental impact of its compute resources. While the model is well supported by deployment toolkits, more granular disclosure of evaluation prompts and data provenance is needed to reach exemplary transparency levels.

Upstream

19.5 / 30

Architectural Provenance

8.0 / 10

The model's architecture is extensively documented in the ERNIE 4.5 Technical Report (June 2025). It details a novel 'multimodal heterogeneous MoE' structure that differentiates between text and vision processing while sharing self-attention layers. Specific architectural components are disclosed, including a 630M parameter ViT encoder, 128 total experts (64 text, 64 vision), and the use of Grouped Query Attention (GQA) with 64 heads. The report also specifies the use of 2D RoPE for vision and absolute position embeddings for the backbone.

Dataset Composition

3.0 / 10

While the technical report mentions the model was trained on 'trillions of tokens' and 'high-quality Chinese textual and visual data,' it lacks a specific breakdown of dataset sources, proportions, or detailed filtering methodologies. The description remains at a high level (e.g., 'extensive pre-training on visual concepts'), which falls into the category of 'general data categories mentioned' without verifiable composition details.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via the PaddlePaddle and Hugging Face repositories. Documentation specifies a vocabulary size of 103,424 tokens. The tokenizer is aligned with the model's multilingual (Chinese/English) focus and is integrated into standard inference frameworks like vLLM and FastDeploy, allowing for direct verification of tokenization behavior.

Model

28.0 / 40

Parameter Density

9.0 / 10

Baidu provides precise figures for both total and active parameters: 424 billion total parameters with 47 billion active parameters per token. The documentation further breaks down the MoE structure (128 experts, 8 active per modality) and the vision encoder (630M parameters). This level of detail for a sparse architecture is exemplary.
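The total and active parameter counts give a quick sense of the model's sparsity. The sketch below uses the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass; that rule is an approximation introduced here, not a figure from Baidu's documentation, and it ignores attention cost over long contexts.

```python
TOTAL_PARAMS = 424e9   # total parameters, from the model card
ACTIVE_PARAMS = 47e9   # parameters activated per token, from the model card

# Share of the weights that participate in processing any one token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Rule-of-thumb forward cost: about 2 FLOPs per active parameter per token.
forward_flops_per_token = 2 * ACTIVE_PARAMS
# Same estimate for a hypothetical dense model with all 424B parameters active.
dense_flops_per_token = 2 * TOTAL_PARAMS

print(f"active fraction: {active_fraction:.1%}")
print(f"forward FLOPs per token (sparse): {forward_flops_per_token:.2e}")
print(f"forward FLOPs per token (dense):  {dense_flops_per_token:.2e}")
```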

Training Compute

4.0 / 10

The technical report discloses the use of 2,016 NVIDIA H800 GPUs for pre-training the largest language model variant and mentions a Model FLOPs Utilization (MFU) of 47%. However, it does not provide the total training duration in hours, the specific energy consumption, or a calculated carbon footprint for the VL-424B variant specifically, leaving significant gaps in environmental transparency.
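The two disclosed figures (2,016 GPUs and 47% MFU) can be turned into a sustained-throughput estimate once a per-GPU peak is chosen. The peak used below, 989 TFLOP/s of dense BF16 per H800, is an assumption taken from typical vendor specifications rather than from the technical report, and the figures apply to the largest language model variant, not necessarily to the VL-424B variant. Because the total training FLOPs are not disclosed, `training_days` is a hypothetical helper that only shows how a duration would follow from such a number.

```python
NUM_GPUS = 2016               # from the technical report
MFU = 0.47                    # Model FLOPs Utilization, from the technical report
PEAK_FLOPS_PER_GPU = 989e12   # ASSUMPTION: dense BF16 peak per H800, in FLOP/s


def sustained_flops(num_gpus=NUM_GPUS, mfu=MFU, peak=PEAK_FLOPS_PER_GPU):
    """Cluster-wide useful FLOP/s implied by the GPU count, peak rate, and MFU."""
    return num_gpus * peak * mfu


def training_days(total_flops, **kwargs):
    """Days needed to spend `total_flops` at the sustained rate (hypothetical helper)."""
    return total_flops / sustained_flops(**kwargs) / 86_400


cluster = sustained_flops()
print(f"{cluster / 1e18:.3f} EFLOP/s sustained")
```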

Benchmark Reproducibility

6.0 / 10

Baidu reports results on standard benchmarks (MathVista, MMMU, MMLU-Pro, GSM8K) and provides some evaluation details in the technical report. While they open-source the 'ERNIEKit' for fine-tuning and 'FastDeploy' for inference, the exact evaluation scripts and full prompt sets used to achieve the reported state-of-the-art scores are not fully centralized for one-click reproduction.

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, with clear versioning (ERNIE 4.5) and variant labeling (VL-424B-A47B). It distinguishes between 'thinking' and 'non-thinking' modes in its system prompts and API parameters. Documentation and model cards consistently reflect its capabilities as a multimodal MoE model without claiming identities of other providers.

Downstream

22.0 / 30

License Clarity

9.5 / 10

The model weights and associated development toolkits (ERNIEKit, FastDeploy) are explicitly released under the Apache License 2.0, as verified across GitHub, Hugging Face, and the official technical report. This is a highly permissive, standard open-source license that clearly allows for both research and commercial use.

Hardware Footprint

7.5 / 10

Hardware requirements are well-documented for various precisions. Documentation specifies that 8x 80GB GPUs are required for 4-bit quantization, while native BF16 requires significantly more (up to 16x 80GB or 8x 140GB). The impact of 2-bit and 4-bit 'lossless' quantization is discussed, and VRAM estimates are provided for different deployment scenarios.
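A back-of-the-envelope check of these GPU counts is straightforward: weight memory is roughly total parameters times bits per parameter. The sketch below counts weights only, treats 1 GB as 10^9 bytes, and ignores the KV cache, activations, and quantization metadata, so real deployments need headroom beyond these numbers. The cluster sizes are the ones quoted above; the helper name `weight_gb` is illustrative.

```python
TOTAL_PARAMS = 424e9  # total parameters, from the model card


def weight_gb(bits_per_param, total_params=TOTAL_PARAMS):
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return total_params * bits_per_param / 8 / 1e9


PRECISIONS = {"BF16": 16, "4-bit": 4, "2-bit": 2}
CLUSTERS_GB = {"8x 80GB": 8 * 80, "16x 80GB": 16 * 80, "8x 140GB": 8 * 140}

for name, bits in PRECISIONS.items():
    need = weight_gb(bits)
    # A cluster "fits" here only in the sense that weights alone are below capacity.
    fits = [label for label, capacity in CLUSTERS_GB.items() if capacity > need]
    print(f"{name:>5}: {need:6.0f} GB of weights, fits in: {', '.join(fits)}")
```

On these assumptions, BF16 weights alone exceed the 640 GB of an 8x 80GB node, which is consistent with the larger configurations quoted for native precision.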

Versioning Drift

5.0 / 10

Baidu uses a clear naming convention and versioning for the ERNIE 4.5 family. However, there is limited public information regarding a formal changelog for weight updates or a public policy for managing model drift over time. While the release is recent, the infrastructure for tracking long-term version history and deprecation is not as transparent as its architectural documentation.
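The category totals in this report can be checked by summing the sub-scores. The short sketch below does only that; it assumes the headline figure is the plain sum of the three categories, and it does not attempt to reproduce how the letter grade is assigned, since the report does not state that mapping.

```python
SCORES = {
    "Upstream": {
        "Architectural Provenance": 8.0,
        "Dataset Composition": 3.0,
        "Tokenizer Integrity": 8.5,
    },
    "Model": {
        "Parameter Density": 9.0,
        "Training Compute": 4.0,
        "Benchmark Reproducibility": 6.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 9.5,
        "Hardware Footprint": 7.5,
        "Versioning Drift": 5.0,
    },
}

# Sum each category, then sum the categories for the overall score.
category_totals = {name: sum(items.values()) for name, items in SCORES.items()}
total = sum(category_totals.values())

for name, value in category_totals.items():
    print(f"{name}: {value}")
print(f"Total: {total}")
```

The categories sum to 19.5, 28.0, and 22.0, matching the report, and the overall 69.5 is consistent with the displayed 70 / 100 if the headline score is rounded.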
