Total Parameters: 424B
Context Length: 131,072 tokens
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: Apache 2.0
Release Date: 30 Jun 2025
Knowledge Cutoff: Jun 2025

Attention
Attention Structure: Grouped-Query Attention
Attention Heads: 64
Key-Value Heads: 8
Attention Head Dimension: 128
Position Embedding: Rotary Position Embedding (RoPE)
RoPE Theta: 500,000
Sliding Window Attention: No
Sliding Window Size: -
Normalization: RMS Normalization
Activation Function: Swish

Dimensions
Hidden Dimension Size: 4,096
Number of Layers: 54
FFN Intermediate Size (Dense): 28,672
Multi-Token Prediction Heads: -

Tokenizer
Vocabulary Size: 103,424

Mixture of Experts
Active Parameters (per token): 47.0B
Number of Experts: 128 (64 text, 64 vision)
Active Experts: 16 (8 per modality)
Shared Experts: -
FFN Intermediate Size (per Expert): -
Dense Layers Before MoE: 3

ERNIE-4.5-VL-424B-A47B-Base is the flagship multimodal foundation model in Baidu's ERNIE 4.5 family. It is a base model, pre-trained for cross-modal reasoning over text, images, and video. Its heterogeneous Mixture-of-Experts (MoE) design scales to 424 billion total parameters while activating only 47 billion per token, which keeps per-token compute well below that of a dense model of the same size. Target workloads include content analysis, visual-language reasoning, and long-context processing across mixed data types.
The technical core of the model revolves around a novel multimodal heterogeneous MoE structure that integrates modality-isolated routing and shared parameter layers. This architecture utilizes modality-specific experts to preserve the unique characteristics of textual and visual data while employing shared attention mechanisms to foster mutual reinforcement between modalities. To ensure stable and balanced learning during large-scale pre-training, the model incorporates a router orthogonal loss and multimodal token-balanced loss, preventing any single modality from dominating the gradient updates. The vision stack is further enhanced by a variable-resolution Vision Transformer (ViT) encoder and an adapter that projects visual features into a unified embedding space, supported by 2D Rotary Position Embeddings (RoPE) for precise spatial grounding.
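To make "modality-isolated routing" concrete, here is a toy sketch in plain Python. It assumes two separate expert pools (64 text, 64 vision) and top-8 selection within a token's own pool, as summarized elsewhere on this page. The router weights, the toy width `D_MODEL`, the softmax-over-selected gating, and the function name `route_token` are all illustrative assumptions, not Baidu's implementation.

```python
import math
import random

NUM_TEXT_EXPERTS = 64     # pool sizes as summarized on this page
NUM_VISION_EXPERTS = 64
TOP_K = 8                 # active experts per modality

def route_token(hidden, is_vision, w_text, w_vision, top_k=TOP_K):
    """Return (global_expert_ids, gate_weights) for a single token.

    hidden:            list of floats (token activation)
    w_text / w_vision: hypothetical router weights, one row per expert
    """
    # A token only ever sees the router for its own modality.
    pool, offset = (w_vision, len(w_text)) if is_vision else (w_text, 0)
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in pool]
    top = sorted(range(len(pool)), key=lambda e: logits[e], reverse=True)[:top_k]
    # Softmax over the selected experts only (assumed gating scheme).
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]
    total = sum(exps)
    # Vision experts are numbered after the text experts in the global id space.
    return [e + offset for e in top], [x / total for x in exps]

random.seed(0)
D_MODEL = 16  # toy width, far smaller than the real model

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

w_text = rand_matrix(NUM_TEXT_EXPERTS, D_MODEL)
w_vision = rand_matrix(NUM_VISION_EXPERTS, D_MODEL)
token = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]

text_ids, text_gates = route_token(token, False, w_text, w_vision)
vision_ids, vision_gates = route_token(token, True, w_text, w_vision)
```

The same activation routed as text lands only on ids 0 to 63, and as vision only on ids 64 to 127, which is the isolation property the paragraph describes. The attention layers, which the text says are shared, are not modeled here.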
Optimized for high-performance deployment, ERNIE-4.5-VL-424B-A47B-Base is built upon the PaddlePaddle framework and supports advanced inference techniques like multi-expert parallel collaboration and convolutional code quantization. This enables the model to achieve near-lossless 4-bit and 2-bit quantization, allowing for the deployment of this large-scale system on more accessible hardware configurations. With an expansive context window of 131,072 tokens and support for both thinking and non-thinking inference modes, the model is suitable for industrial-grade applications requiring deep semantic reasoning over long-form documents or intricate video sequences.
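A rough estimate of what a full 131,072-token context costs in key-value cache can be derived from the spec sheet values (54 layers, 8 key-value heads, head dimension 128). The sketch assumes a 16-bit cache (2 bytes per value) and a single sequence; the function name is hypothetical, and real runtimes may page or quantize the cache differently.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys, one for values, in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value

GIB = 1024 ** 3
CONTEXT = 131_072

gqa_gib = kv_cache_bytes(54, 8, 128, CONTEXT) / GIB    # 8 KV heads (GQA)
mha_gib = kv_cache_bytes(54, 64, 128, CONTEXT) / GIB   # if all 64 heads kept their own K/V

print(gqa_gib)  # 27.0
print(mha_gib)  # 216.0
```

Under these assumptions, sharing each key-value head across 8 query heads cuts the per-sequence cache at full context from 216 GiB to 27 GiB.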
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are available for ERNIE-4.5-VL-424B-A47B-Base.
Overall Rank: -
Coding Rank: -
Total Score: 72 / 100
ERNIE 4.5 VL 424B A47B demonstrates a high level of architectural and licensing transparency, providing a detailed technical report and a permissive Apache 2.0 license. The model is exemplary in its disclosure of MoE parameter density, clearly distinguishing between total and active parameters. However, it maintains significant opacity regarding the specific composition of its training datasets and the total compute resources consumed during its development.
Architectural Provenance
The model's architecture is extensively documented in the ERNIE 4.5 Technical Report (June 2025). It details a novel multimodal heterogeneous Mixture-of-Experts (MoE) framework, specifically identifying the use of a 630M parameter ViT encoder with adaptive resolution and 2D RoPE. The report describes the 'modality-isolated routing' technique and the integration of 128 experts (64 text, 64 vision) with 8 active experts per modality. It also specifies the use of Grouped Query Attention (GQA) and the PaddlePaddle framework for training. The base model is clearly a 'from-scratch' pre-trained foundation model, and the report defines how the vision encoder is initialized relative to the transformer backbone.
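For readers unfamiliar with 2D RoPE, one common formulation splits each head vector in half and rotates the first half by the patch's row index and the second half by its column index. The sketch below follows that formulation; whether ERNIE's ViT uses exactly this split, and the `theta` value of 10,000, are assumptions not stated on this page.

```python
import math

def rope_1d(vec, pos, theta=10_000.0):
    """Rotate consecutive pairs of `vec` by angles that depend on `pos`."""
    half = len(vec) // 2
    out = []
    for i in range(half):
        angle = pos * theta ** (-i / half)   # lower frequency for later pairs
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(angle) - y * math.sin(angle),
                x * math.sin(angle) + y * math.cos(angle)]
    return out

def rope_2d(vec, row, col, theta=10_000.0):
    """Encode a 2D patch position: rows on the first half, columns on the second."""
    mid = len(vec) // 2   # len(vec) must be divisible by 4
    return rope_1d(vec[:mid], row, theta) + rope_1d(vec[mid:], col, theta)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.1, 1.5, -0.4, 0.9, 0.2]
k = [1.1, 0.4, -0.6, 0.8, -0.3, 0.5, 0.2, -1.0]

# Same relative offset (2 rows, 3 columns) at two different absolute positions.
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
score_b = dot(rope_2d(q, 2, 3), rope_2d(k, 0, 0))
```

The two scores agree up to floating-point error, showing that attention scores depend only on the relative row and column offsets between patches. That property is what makes the scheme suitable for spatial grounding at variable resolutions.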
Dataset Composition
Documentation mentions the model was trained on 'trillions of tokens' and incorporates 'high-quality Chinese textual and visual data' alongside global web data. While it describes a 'human-model-in-the-loop' iterative cycle for data refinement and mentions general categories (STEM, Chinese visual knowledge, web), it lacks a precise percentage breakdown of the dataset (e.g., exact ratios of code, web, books). Specific data sources are not named individually, and the filtering methodology is described in high-level conceptual terms rather than reproducible technical specifications.
Tokenizer Integrity
The tokenizer is publicly accessible via the Hugging Face repository in both PaddlePaddle and PyTorch (Transformer-style) formats. The technical report and model cards confirm support for a 131,072 token context window. Vocabulary size and tokenization approach are verifiable through the provided `tokenizer_config.json` and `vocab.json` files in the official 'baidu/ERNIE-4.5-VL-424B-A47B-Base-PT' repository, ensuring alignment with the claimed multilingual (Chinese/English) support.
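A quick way to check a vocabulary file after downloading it might look like the following. The sketch assumes `vocab.json` is a flat token-to-id JSON mapping, which is an assumption about the file layout rather than something confirmed here; the helper name `vocab_stats` and the toy file are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def vocab_stats(vocab_path):
    """Return (entry_count, max_id + 1) for a token-to-id JSON mapping."""
    vocab = json.loads(Path(vocab_path).read_text(encoding="utf-8"))
    return len(vocab), max(vocab.values()) + 1

# Toy stand-in for a downloaded vocab.json, so the sketch runs offline.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "vocab.json"
    toy_path.write_text(json.dumps({"<pad>": 0, "a": 1, "b": 2}), encoding="utf-8")
    entries, id_space = vocab_stats(toy_path)

print(entries, id_space)  # 3 3
```

When run against the real file, either number may come out smaller than the listed vocabulary size of 103,424 if the embedding table is padded beyond the tokens actually defined.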
Parameter Density
Baidu provides exemplary transparency regarding parameter counts for this MoE model. It explicitly states a total of 424 billion parameters with 47 billion active parameters per token. The architectural breakdown is detailed, specifying 128 total experts with a 64/64 split between modalities and the activation of 8 experts per modality. This prevents the common 'parameter inflation' ambiguity seen in other MoE releases.
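The disclosed figures translate into two simple ratios, computed here from the numbers quoted above (variable names are illustrative):

```python
TOTAL_PARAMS_B = 424               # total parameters, in billions
ACTIVE_PARAMS_B = 47               # active parameters per token, in billions
EXPERTS_PER_MODALITY = 64
ACTIVE_EXPERTS_PER_MODALITY = 8

active_param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
active_expert_fraction = ACTIVE_EXPERTS_PER_MODALITY / EXPERTS_PER_MODALITY

print(f"{active_param_fraction:.1%}")   # 11.1%
print(f"{active_expert_fraction:.1%}")  # 12.5%
```

The two ratios need not match. Non-expert parameters such as attention and the dense layers are active for every token, while the experts of the other modality are never active for a given token.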
Training Compute
The technical report mentions the use of 'heterogeneous hybrid parallelism' and 'FP8 mixed-precision training' on the PaddlePaddle framework. It discloses a Model FLOPs Utilization (MFU) of 47% for the largest language model variant. However, it fails to provide the total number of GPU/TPU hours, the specific hardware cluster size (e.g., number of H100 nodes used for the full pre-training run), or a calculated carbon footprint, which are requirements for a high score in this category.
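For context on what an MFU figure means, the sketch below uses the common approximation of about 6 FLOPs per active parameter per trained token (forward plus backward, ignoring attention FLOPs). The cluster size and per-accelerator peak throughput are hypothetical placeholders, since the report does not disclose them.

```python
def mfu(tokens_per_second, active_params, num_accelerators, peak_flops_per_accelerator):
    """Model FLOPs Utilization under the ~6 * params * tokens approximation."""
    achieved_flops = 6 * active_params * tokens_per_second
    return achieved_flops / (num_accelerators * peak_flops_per_accelerator)

# Hypothetical cluster, for illustration only.
ACTIVE_PARAMS = 47e9
NUM_ACCELERATORS = 1_000
PEAK_FLOPS = 1e15   # FLOP/s per accelerator (assumed)

# Throughput such a cluster would need to sustain to reach 47% MFU.
tokens_per_second = 0.47 * NUM_ACCELERATORS * PEAK_FLOPS / (6 * ACTIVE_PARAMS)
example_mfu = mfu(tokens_per_second, ACTIVE_PARAMS, NUM_ACCELERATORS, PEAK_FLOPS)
```

Without the real cluster size and training duration, an MFU percentage alone cannot be turned into total compute, which is the gap this category penalizes.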
Benchmark Reproducibility
The model provides results on standard benchmarks like MMMU, MathVista, and CV-Bench, and the technical report includes some hyperparameters (e.g., 2 FPS, 480 max frames for video). While evaluation code is partially available through the ERNIEKit and PaddlePaddle repositories, and inference is supported in vLLM, the exact prompts and few-shot examples used to achieve the reported SOTA scores are not fully disclosed in a centralized, reproducible format.
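The video hyperparameters quoted above (2 FPS, 480 max frames) imply a sampling rule along these lines. How frames are chosen once a clip exceeds the cap is not stated here, so the uniform spreading below is an assumed policy, and the function name is hypothetical.

```python
def sample_frame_times(duration_s, sample_fps=2.0, max_frames=480):
    """Timestamps (in seconds) of sampled frames, capped at max_frames."""
    n = int(duration_s * sample_fps)
    if n <= max_frames:
        return [i / sample_fps for i in range(n)]
    # Clip too long for the cap: spread max_frames evenly over it (assumed policy).
    step = duration_s / max_frames
    return [i * step for i in range(max_frames)]

short_clip = sample_frame_times(10)    # 10 s clip
long_clip = sample_frame_times(600)    # 10 min clip

print(len(short_clip), len(long_clip))  # 20 480
```

At 2 FPS the 480-frame cap is reached at 240 seconds, so longer videos are sampled more sparsely than 2 FPS under this policy.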
Identity Consistency
The model exhibits high identity consistency, correctly identifying as part of the ERNIE 4.5 family. Documentation clearly distinguishes between the 'Base' (pre-trained) and 'Chat' (post-trained) variants, as well as the 'Thinking' and 'Non-thinking' operational modes. Versioning is clear across the 10 distinct variants in the family, and there is no evidence of the model claiming to be a competitor's product.
License Clarity
The model weights, code, and development toolkits are released under the highly permissive Apache License 2.0. This is explicitly stated in the technical report, the GitHub repository, and the Hugging Face model cards. The license allows for both research and commercial use without the restrictive 'acceptable use policies' or 'research-only' clauses often found in other 'open' weights releases.
Hardware Footprint
Hardware requirements are well-documented for various configurations. Official guides specify that ~945 GB of VRAM is needed for the 424B model, recommending 12x H100 (80GB) GPUs. They also provide clear guidance on 4-bit and 8-bit quantization (wint4/wint8) using the 'convolutional code quantization' algorithm, noting that 8x 80GB GPUs are required for quantized deployment. Context length memory scaling is addressed through the mention of GQA and FlashAttention-like optimizations.
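As a sanity check on these figures, weight memory alone can be estimated from the parameter count and the bits per weight. This back-of-the-envelope sketch counts weights only; activations, KV cache, and runtime overhead are excluded, and quantized formats usually carry some extra metadata that is ignored here.

```python
TOTAL_PARAMS = 424e9

def weight_gb(bits_per_param, params=TOTAL_PARAMS):
    """Weight storage in decimal gigabytes for a given precision."""
    return params * bits_per_param / 8 / 1e9

for name, bits in [("bf16", 16), ("wint8", 8), ("wint4", 4)]:
    print(name, weight_gb(bits))
# bf16 848.0
# wint8 424.0
# wint4 212.0

full_cluster_gb = 12 * 80       # 960 GB across 12 GPUs
quantized_cluster_gb = 8 * 80   # 640 GB across 8 GPUs
```

The 16-bit weights (848 GB) fit within the 960 GB of a 12-GPU node and account for most of the ~945 GB quoted, while both quantized variants fit within 640 GB with room to spare for cache and activations.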
Versioning Drift
Baidu uses a clear naming convention (e.g., 4.5-VL-424B-A47B-Base) and maintains a GitHub repository for the ERNIE 4.5 family. However, there is no public, detailed changelog or semantic versioning system for weight updates (e.g., v4.5.1). While the release is recent, the infrastructure for tracking silent weight updates or behavior drift over time is not yet as robust as that of established open-source projects.