ERNIE-4.5-VL-424B-A47B-Base

Total Parameters

424B (47B active per token)

Context Length

131,072 tokens

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

30 Jun 2025

Knowledge Cutoff

Jun 2025

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

64

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

Absolute Position Embedding

RoPE Theta

500,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

Swish

Dimensions

Hidden Dimension Size

4,096

Number of Layers

54

FFN Intermediate Size (Dense)

28,672

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

103,424

Mixture of Experts

Active Parameters per Token

47.0B

Number of Experts

128

Active Experts

16

Shared Experts

-

FFN Intermediate Size (per Expert)

-

Dense Layers Before MoE

3

Architecture Diagram

[Diagram: Input Tokens -> Token Embedding (position: absolute; hidden: 4.1k; context: 131.1k; vocab: 103.4k) -> 54 layers of {RMSNorm (pre-attention) -> Grouped-Query Attention (64 Q / 8 KV heads, head dim 128) -> residual add -> RMSNorm (pre-FFN) -> Sparse MoE FFN (16/128 experts, Swish) -> residual add} -> Final RMSNorm -> Output Logits]
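To make the effect of grouped-query attention concrete, the sketch below estimates the key-value cache for one full-length sequence from the figures listed above (54 layers, 8 KV heads, head dimension 128, 131,072 tokens). It assumes a 16-bit cache and that every layer caches keys and values; neither detail is stated on this page, and the function name is illustrative only.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value


GIB = 1024 ** 3

# Grouped-query attention: only the 8 KV heads are cached.
gqa = kv_cache_bytes(layers=54, kv_heads=8, head_dim=128, context_tokens=131_072)

# Hypothetical comparison: one KV head per query head (64), as in plain
# multi-head attention. This is not a configuration of the model.
mha = kv_cache_bytes(layers=54, kv_heads=64, head_dim=128, context_tokens=131_072)

print(f"GQA cache: {gqa / GIB:.1f} GiB")
print(f"MHA cache: {mha / GIB:.1f} GiB")
```

Under these assumptions the cache shrinks by the ratio of query heads to KV heads (64 / 8 = 8x).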

ERNIE-4.5-VL-424B-A47B-Base

ERNIE-4.5-VL-424B-A47B-Base is the flagship multimodal foundation model in Baidu's ERNIE 4.5 family. It is a base model, pre-trained for cross-modal reasoning over text, images, and video. Its heterogeneous Mixture-of-Experts (MoE) design scales to 424 billion total parameters while activating only 47 billion per token, which keeps per-token compute well below that of a dense model of the same size. Target workloads include content analysis, visual-language reasoning, and long-context processing across mixed data types.

The core of the model is a multimodal heterogeneous MoE structure that combines modality-isolated routing with shared parameter layers. Modality-specific experts preserve the distinct characteristics of textual and visual data, while shared attention layers let the two modalities reinforce each other. To keep large-scale pre-training stable and balanced, the model adds a router orthogonal loss and a multimodal token-balanced loss, which prevent any single modality from dominating the gradient updates. The vision stack consists of a variable-resolution Vision Transformer (ViT) encoder and an adapter that projects visual features into a unified embedding space, with 2D Rotary Position Embeddings (RoPE) for spatial grounding.
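As a rough illustration of modality-isolated routing, the sketch below selects experts for a single token only from the pool that matches its modality. It uses the 64 text / 64 vision split and 8 active experts per modality reported in the integrity audit on this page. The expert numbering, the function name, and the softmax over the selected logits are assumptions made for illustration, not details taken from Baidu's implementation.

```python
import math


def route_token(logits, modality, experts_per_modality=64, top_k=8):
    """Pick top_k experts for one token from its own modality's pool.

    logits: one router score per expert in the token's modality pool.
    Returns (global_expert_id, weight) pairs whose weights sum to 1.
    """
    if len(logits) != experts_per_modality:
        raise ValueError("expected one logit per expert in the modality pool")
    # Hypothetical layout: text experts are 0..63, vision experts are 64..127.
    offset = {"text": 0, "vision": experts_per_modality}[modality]
    # Indices of the top_k highest-scoring experts within the pool.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the selected logits only (one common convention).
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(offset + i, e / total) for i, e in zip(top, exps)]


# Toy router scores; a real router would compute these from the hidden state.
scores = [float(i % 7) for i in range(64)]
text_choice = route_token(scores, "text")
vision_choice = route_token(scores, "vision")
print([expert for expert, _ in vision_choice])
```

Because a vision token never receives scores for text experts (and vice versa), the two pools cannot compete for the same token, which is the isolation property the paragraph above describes.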

For deployment, ERNIE-4.5-VL-424B-A47B-Base is built on the PaddlePaddle framework and supports inference techniques such as multi-expert parallel collaboration and convolutional code quantization. The latter enables near-lossless 4-bit and 2-bit quantization, which allows deployment on more accessible hardware configurations. With a context window of 131,072 tokens and support for both thinking and non-thinking inference modes, the model is suited to industrial applications that require reasoning over long documents or long video sequences.

About ERNIE 4.5

The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.

Evaluation Benchmarks

No evaluation benchmarks are available for ERNIE-4.5-VL-424B-A47B-Base.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B+

72 / 100

ERNIE-4.5-VL-424B-A47B-Base Model Integrity Report

Audit Note

ERNIE 4.5 VL 424B A47B demonstrates a high level of architectural and licensing transparency, providing a detailed technical report and a permissive Apache 2.0 license. The model is exemplary in its disclosure of MoE parameter density, clearly distinguishing between total and active parameters. However, it maintains significant opacity regarding the specific composition of its training datasets and the total compute resources consumed during its development.

Upstream

21.0 / 30

Architectural Provenance

8.0 / 10

The model's architecture is extensively documented in the ERNIE 4.5 Technical Report (June 2025). The report details a novel multimodal heterogeneous Mixture-of-Experts (MoE) framework and identifies a 630M-parameter ViT encoder with adaptive resolution and 2D RoPE. It describes the 'modality-isolated routing' technique and the integration of 128 experts (64 text, 64 vision) with 8 active experts per modality. It also specifies the use of Grouped Query Attention (GQA) and the PaddlePaddle framework for training. The base model is clearly a 'from-scratch' pre-trained foundation model, and the specific initialization of the vision encoder versus the transformer backbone is well-defined.

Dataset Composition

4.5 / 10

Documentation mentions the model was trained on 'trillions of tokens' and incorporates 'high-quality Chinese textual and visual data' alongside global web data. While it describes a 'human-model-in-the-loop' iterative cycle for data refinement and mentions general categories (STEM, Chinese visual knowledge, web), it lacks a precise percentage breakdown of the dataset (e.g., exact ratios of code, web, books). Specific data sources are not named individually, and the filtering methodology is described in high-level conceptual terms rather than reproducible technical specifications.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via the Hugging Face repository in both PaddlePaddle and PyTorch (Transformer-style) formats. The technical report and model cards confirm support for a 131,072 token context window. Vocabulary size and tokenization approach are verifiable through the provided `tokenizer_config.json` and `vocab.json` files in the official 'baidu/ERNIE-4.5-VL-424B-A47B-Base-PT' repository, ensuring alignment with the claimed multilingual (Chinese/English) support.

Model

28.5 / 40

Parameter Density

9.0 / 10

Baidu provides exemplary transparency regarding parameter counts for this MoE model. It explicitly states a total of 424 billion parameters, of which 47 billion are active per token. The architectural breakdown is detailed, specifying 128 total experts with a 64/64 split between modalities and the activation of 8 experts per modality for each token. This avoids the 'parameter inflation' ambiguity common in other MoE releases.
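The density figures above reduce to simple ratios. The short sketch below computes them from the numbers quoted in this report; the constant and function names are illustrative only.

```python
TOTAL_PARAMS = 424e9   # total parameters, as stated above
ACTIVE_PARAMS = 47e9   # parameters active per token, as stated above
EXPERTS_PER_MODALITY = 64
ACTIVE_EXPERTS_PER_MODALITY = 8


def fraction(part, whole):
    """Share of `whole` represented by `part`."""
    return part / whole


param_share = fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
expert_share = fraction(ACTIVE_EXPERTS_PER_MODALITY, EXPERTS_PER_MODALITY)

print(f"Active parameters per token: {param_share:.1%}")
print(f"Active experts per modality: {expert_share:.1%}")
```

The two ratios need not match, because attention layers, embeddings, and any dense layers are used by every token regardless of routing.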

Training Compute

4.0 / 10

The technical report mentions the use of 'heterogeneous hybrid parallelism' and 'FP8 mixed-precision training' on the PaddlePaddle framework. It discloses a Model FLOPs Utilization (MFU) of 47% for the largest language model variant. However, it fails to provide the total number of GPU/TPU hours, the specific hardware cluster size (e.g., number of H100 nodes used for the full pre-training run), or a calculated carbon footprint, which are requirements for a high score in this category.
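For context on what an MFU figure implies, the sketch below converts it into a per-device training throughput using the widely used approximation of 6 FLOPs per active parameter per token (forward plus backward pass, attention FLOPs ignored). The peak FLOPs value is a placeholder rather than a disclosed hardware figure, and applying the 47B active-parameter count to the variant for which the 47% MFU was reported is an assumption.

```python
def tokens_per_second_per_device(mfu, peak_flops, active_params):
    """Training throughput implied by a Model FLOPs Utilization figure.

    mfu: fraction of peak FLOPs spent on model computation (0..1).
    peak_flops: peak FLOPs per second of one accelerator.
    active_params: parameters used per token.
    """
    # Approximation: 6 FLOPs per active parameter per trained token.
    flops_per_token = 6 * active_params
    return mfu * peak_flops / flops_per_token


# Placeholder peak throughput; NOT a figure from the technical report.
HYPOTHETICAL_PEAK_FLOPS = 1e15

rate = tokens_per_second_per_device(0.47, HYPOTHETICAL_PEAK_FLOPS, 47e9)
print(f"{rate:.0f} tokens/s per device (hypothetical hardware)")
```

Without the cluster size and training duration, which the report does not disclose, this per-device rate cannot be turned into a total compute budget.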

Benchmark Reproducibility

6.0 / 10

The model provides results on standard benchmarks like MMMU, MathVista, and CV-Bench, and the technical report includes some hyperparameters (e.g., 2 FPS, 480 max frames for video). While evaluation code is partially available through the ERNIEKit and PaddlePaddle repositories, and inference is supported in vLLM, the exact prompts and few-shot examples used to achieve the reported SOTA scores are not fully disclosed in a centralized, reproducible format.
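The video hyperparameters mentioned above (2 FPS, at most 480 frames) can be read as a sampling rule. The sketch below is one plausible interpretation: sample at 2 FPS, and fall back to evenly spaced frames once the cap is reached. The fallback behaviour and the function name are assumptions; the report's exact procedure is not described here.

```python
def sample_timestamps(duration_s, fps=2.0, max_frames=480):
    """Timestamps (in seconds) of the frames to extract from a video."""
    if duration_s <= 0:
        return []
    wanted = int(duration_s * fps)
    # Cap the frame count; always keep at least one frame.
    count = max(1, min(wanted, max_frames))
    # Frames are evenly spaced; spacing equals 1/fps until the cap is hit.
    step = duration_s / count
    return [i * step for i in range(count)]


short_clip = sample_timestamps(60)    # one-minute clip, below the cap
long_clip = sample_timestamps(600)    # ten-minute clip, above the cap
print(len(short_clip), len(long_clip))
```

At 2 FPS the cap is reached at 240 seconds (480 / 2), so longer videos are sampled more sparsely under this interpretation.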

Identity Consistency

9.5 / 10

The model exhibits high identity consistency, correctly identifying as part of the ERNIE 4.5 family. Documentation clearly distinguishes between the 'Base' (pre-trained) and 'Chat' (post-trained) variants, as well as the 'Thinking' and 'Non-thinking' operational modes. Versioning is clear across the 10 distinct variants in the family, and there is no evidence of the model claiming to be a competitor's product.

Downstream

22.5 / 30

License Clarity

10.0 / 10

The model weights, code, and development toolkits are released under the highly permissive Apache License 2.0. This is explicitly stated in the technical report, the GitHub repository, and the Hugging Face model cards. The license allows for both research and commercial use without the restrictive 'acceptable use policies' or 'research-only' clauses often found in other 'open' weights releases.

Hardware Footprint

7.5 / 10

Hardware requirements are well-documented for various configurations. Official guides specify that ~945 GB of VRAM is needed for the 424B model and recommend 12x H100 (80GB) GPUs. They also provide clear guidance on 4-bit and 8-bit quantization (wint4/wint8) using the 'convolutional code quantization' algorithm, noting that 80GB x 8 resources are required for quantized deployment. Context length memory scaling is addressed through the mention of GQA and FlashAttention-like optimizations.
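The figures above can be sanity-checked with a weights-only estimate. The sketch below multiplies the 424B parameter count by the bits per weight for each precision; it assumes 16-bit weights for the unquantized case and ignores activations, KV cache, and runtime overhead, so it gives a lower bound rather than a deployment requirement. The dictionary and function names are illustrative.

```python
# Bits per stored weight; "bf16" for the unquantized case is an assumption.
BITS_PER_WEIGHT = {"bf16": 16, "wint8": 8, "wint4": 4}


def weight_gb(total_params, precision):
    """Decimal gigabytes needed to hold the weights alone."""
    return total_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


estimates = {p: weight_gb(424e9, p) for p in BITS_PER_WEIGHT}
for precision, gigabytes in estimates.items():
    print(f"{precision}: {gigabytes:.0f} GB of weights")
```

The 16-bit estimate sits below the ~945 GB quoted above, which is consistent with that figure also covering memory beyond the weights, and 12 x 80 GB provides 960 GB in total.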

Versioning Drift

5.0 / 10

Baidu uses a clear naming convention (e.g., 4.5-VL-424B-A47B-Base) and maintains a GitHub repository for the ERNIE 4.5 family. However, there is no public, detailed changelog or semantic versioning system for weight updates (e.g., v4.5.1). While the release is recent, the infrastructure for tracking silent weight updates or behavior drift over time is not yet as robust as established open-source projects.
