Total Parameters
424B
Context Length
131,072 tokens
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
30 Jun 2025
Knowledge Cutoff
Jun 2025
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
64
Key-Value Heads
8
Attention Head Dimension
-
Position Embedding
Absolute Position Embedding
RoPE Theta
500,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
Swish
Dimensions
Hidden Dimension Size
8,192
Number of Layers
54
FFN Intermediate Size (Dense)
28,672
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
103,424
Mixture of Experts
Active Parameters per Token
47.0B
Number of Experts
128
Active Experts
16
Shared Experts
-
FFN Intermediate Size (per Expert)
-
Dense Layers Before MoE
3
ERNIE-4.5-VL-424B-A47B is a multimodal foundation model developed by Baidu and the flagship variant of the ERNIE 4.5 family. It processes and generates content across text and visual modalities using a large-scale Mixture of Experts (MoE) framework. The model has 424 billion total parameters, of which about 47 billion are activated per token, so it keeps a large representational capacity while limiting per-token compute. It targets applications that require complex reasoning, comprehensive document analysis, and multimodal conversation.
The model employs a heterogeneous MoE architecture that differentiates between text and vision processing while maintaining a unified hidden state. It incorporates 128 experts in total, including 64 specialized experts for text and 64 for vision, with a routing mechanism that selects 8 active experts per modality for each token. To ensure effective cross-modal integration without performance degradation in specific domains, the system utilizes shared self-attention layers and shared experts alongside modality-isolated routing. The attention mechanism is based on Grouped Query Attention (GQA) with 64 heads and 8 key-value heads, optimized for a context window of 131,072 tokens.
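The modality-isolated routing described above can be sketched roughly as follows. This is an illustration only, not Baidu's implementation: the softmax-then-top-k gating, the weight renormalization, the expert numbering (text experts 0-63, vision experts 64-127), and the names `route_token`, `text_choice`, and `vision_choice` are all assumptions made for the example. Only the counts (64 experts per modality, 8 active per modality) come from the description.

```python
import math
import random

NUM_EXPERTS_PER_MODALITY = 64  # from the description: 64 text + 64 vision experts
TOP_K = 8                      # active experts per modality for each token


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, modality):
    """Select TOP_K experts from the pool that matches the token's modality.

    router_logits holds one score per expert in that modality's pool.
    Returns (global_expert_id, weight) pairs. The id layout is hypothetical:
    text experts are 0-63 and vision experts are 64-127.
    """
    if modality not in ("text", "vision"):
        raise ValueError("modality must be 'text' or 'vision'")
    if len(router_logits) != NUM_EXPERTS_PER_MODALITY:
        raise ValueError("expected one logit per expert in the pool")

    offset = 0 if modality == "text" else NUM_EXPERTS_PER_MODALITY
    probs = softmax(router_logits)
    # Indices of the TOP_K highest-probability experts within this pool.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:TOP_K]
    # Renormalize so the selected weights sum to 1 (an assumed gating choice).
    norm = sum(probs[i] for i in top)
    return [(offset + i, probs[i] / norm) for i in top]


rng = random.Random(0)
text_choice = route_token([rng.gauss(0, 1) for _ in range(64)], "text")
vision_choice = route_token([rng.gauss(0, 1) for _ in range(64)], "vision")
print([i for i, _ in text_choice])
print([i for i, _ in vision_choice])
```

Because each token is routed only within its own pool, a text token in this sketch can never select a vision expert and vice versa; cross-modal mixing would happen in the shared self-attention layers rather than in the router.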
Training and inference run on the PaddlePaddle deep learning framework, which supports industrial-grade deployment with 4-bit and 2-bit quantization that Baidu describes as lossless. The model offers two operating modes: a standard (non-thinking) mode for fast perception tasks and a reasoning (thinking) mode for complex logical problems. Primary use cases include visual question answering, chart and document interpretation, and automated multimodal content generation. The vision encoder uses 2D rotary position embeddings (RoPE) and the transformer backbone uses absolute position embeddings, which together provide spatial and sequential position information across input types.
The Baidu ERNIE 4.5 family consists of ten model variants, spanning language and multimodal models. The larger variants use a heterogeneous Mixture-of-Experts (MoE) architecture that shares some parameters across modalities while dedicating others to a specific modality, supporting efficient language and multimodal processing.
No evaluation benchmarks are available for ERNIE-4.5-VL-424B-A47B.
Overall Rank
-
Coding Rank
-
Total Score
70 / 100
ERNIE-4.5-VL-424B-A47B is highly transparent about its architectural design and licensing, providing precise parameter counts and a permissive Apache 2.0 license. However, it remains opaque about the composition of its training data and the environmental impact of its training compute. While the model is well supported by deployment toolkits, more granular disclosure of evaluation prompts and data provenance is needed to reach exemplary transparency.
Architectural Provenance
The model's architecture is extensively documented in the ERNIE 4.5 Technical Report (June 2025). It details a novel 'multimodal heterogeneous MoE' structure that differentiates between text and vision processing while sharing self-attention layers. Specific architectural components are disclosed, including a 630M parameter ViT encoder, 128 total experts (64 text, 64 vision), and the use of Grouped Query Attention (GQA) with 64 heads. The report also specifies the use of 2D RoPE for vision and absolute position embeddings for the backbone.
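To illustrate why Grouped Query Attention matters at this context length, the following back-of-the-envelope sketch estimates the key-value cache size from the figures listed in the specification fields (54 layers, 64 attention heads, 8 key-value heads, hidden size 8,192, 131,072-token context). The head dimension is not listed on this page, so the value of 128 (hidden size divided by head count) is an assumption, as are BF16 storage (2 bytes per value) and the helper name `kv_cache_bytes`.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 covers the separate key and value tensors.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


LAYERS = 54
QUERY_HEADS = 64
KV_HEADS = 8
HIDDEN = 8192
HEAD_DIM = HIDDEN // QUERY_HEADS  # assumption: 128, not stated in the spec sheet
CONTEXT = 131_072

gqa_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
# Hypothetical baseline: full multi-head attention with one KV head per query head.
mha_bytes = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, CONTEXT)

print(f"GQA cache at full context: {gqa_bytes / 2**30:.0f} GiB")
print(f"MHA cache at full context: {mha_bytes / 2**30:.0f} GiB")
```

Under these assumptions, sharing each key-value head across 8 query heads cuts the per-sequence cache by a factor of 8 (64 / 8) relative to a full multi-head layout.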
Dataset Composition
While the technical report mentions the model was trained on 'trillions of tokens' and 'high-quality Chinese textual and visual data,' it lacks a specific breakdown of dataset sources, proportions, or detailed filtering methodologies. The description remains at a high level (e.g., 'extensive pre-training on visual concepts'), which falls into the category of 'general data categories mentioned' without verifiable composition details.
Tokenizer Integrity
The tokenizer is publicly accessible via the PaddlePaddle and Hugging Face repositories. Documentation specifies a vocabulary size of 103,424 tokens. The tokenizer is aligned with the model's multilingual (Chinese/English) focus and is integrated into standard inference frameworks like vLLM and FastDeploy, allowing for direct verification of tokenization behavior.
Parameter Density
Baidu provides precise figures for both total and active parameters: 424 billion total parameters with 47 billion active parameters per token. The documentation further breaks down the MoE structure (128 experts, 8 active per modality) and the vision encoder (630M parameters). This level of detail for a sparse architecture is exemplary.
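As a quick check of how sparse the model is, the figures above can be turned into activation ratios. This is simple arithmetic on the numbers quoted in this section; the variable names are illustrative.

```python
TOTAL_PARAMS_B = 424           # total parameters, in billions
ACTIVE_PARAMS_B = 47           # parameters activated per token, in billions
EXPERTS_PER_MODALITY = 64      # 128 experts split evenly between text and vision
ACTIVE_PER_MODALITY = 8        # experts selected per modality for each token

active_param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
active_expert_fraction = ACTIVE_PER_MODALITY / EXPERTS_PER_MODALITY

print(f"Active parameters per token: {active_param_fraction:.1%}")
print(f"Active experts within a modality pool: {active_expert_fraction:.1%}")
```

The two ratios need not match, since the active-parameter count also includes components that run for every token, such as the attention layers.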
Training Compute
The technical report discloses the use of 2,016 NVIDIA H800 GPUs for pre-training the largest language model variant and reports a Model FLOPs Utilization (MFU) of 47%. However, it does not give the total training duration, the energy consumption, or an estimated carbon footprint for the VL-424B variant, leaving significant gaps in environmental transparency.
Benchmark Reproducibility
Baidu reports results on standard benchmarks (MathVista, MMMU, MMLU-Pro, GSM8K) and provides some evaluation details in the technical report. While they open-source the 'ERNIEKit' for fine-tuning and 'FastDeploy' for inference, the exact evaluation scripts and full prompt sets used to achieve the reported state-of-the-art scores are not fully centralized for one-click reproduction.
Identity Consistency
The model demonstrates high identity consistency, with clear versioning (ERNIE 4.5) and variant labeling (VL-424B-A47B). It distinguishes between 'thinking' and 'non-thinking' modes in its system prompts and API parameters. Documentation and model cards consistently reflect its capabilities as a multimodal MoE model without claiming identities of other providers.
License Clarity
The model weights and associated development toolkits (ERNIEKit, FastDeploy) are explicitly released under the Apache License 2.0, as verified across GitHub, Hugging Face, and the official technical report. This is a highly permissive, standard open-source license that clearly allows for both research and commercial use.
Hardware Footprint
Hardware requirements are well-documented for various precisions. Documentation specifies that 8x 80GB GPUs are required for 4-bit quantization, while native BF16 requires significantly more (up to 16x 80GB or 8x 140GB). The impact of 2-bit and 4-bit 'lossless' quantization is discussed, and VRAM estimates are provided for different deployment scenarios.
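The GPU counts above are consistent with a rough weight-only estimate, sketched below. It assumes all 424 billion parameters are stored at the stated precision and ignores the KV cache, activations, quantization scales, and framework overhead, so real requirements will be higher. The helper names `weight_gb` and `fits` are hypothetical.

```python
TOTAL_PARAMS_BILLION = 424


def weight_gb(params_billion, bits_per_weight):
    """Weight storage in decimal GB: parameters x bits, at 8 bits per byte."""
    return params_billion * bits_per_weight / 8


def fits(required_gb, num_gpus, gb_per_gpu):
    """True if the weights alone fit in the pooled GPU memory."""
    return required_gb <= num_gpus * gb_per_gpu


bf16 = weight_gb(TOTAL_PARAMS_BILLION, 16)  # 848.0 GB
int4 = weight_gb(TOTAL_PARAMS_BILLION, 4)   # 212.0 GB
int2 = weight_gb(TOTAL_PARAMS_BILLION, 2)   # 106.0 GB

print(f"BF16: {bf16} GB, 4-bit: {int4} GB, 2-bit: {int2} GB")
print("4-bit on 8 x 80 GB:", fits(int4, 8, 80))
print("BF16 on 8 x 80 GB:", fits(bf16, 8, 80))
print("BF16 on 16 x 80 GB:", fits(bf16, 16, 80))
print("BF16 on 8 x 140 GB:", fits(bf16, 8, 140))
```

On this estimate, BF16 weights (848 GB) exceed the 640 GB available on 8x 80GB GPUs but fit in either of the larger configurations, while 4-bit weights (212 GB) leave substantial headroom on 8x 80GB for the KV cache and activations.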
Versioning Drift
Baidu uses a clear naming convention and versioning for the ERNIE 4.5 family. However, there is limited public information regarding a formal changelog for weight updates or a public policy for managing model drift over time. While the release is recent, the infrastructure for tracking long-term version history and deprecation is not as transparent as its architectural documentation.