Total Parameters
28B
Context Length
131,072 tokens
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
30 Jun 2025
Knowledge Cutoff
Nov 2024
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
20
Key-Value Heads
4
Attention Head Dimension
-
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
500,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
2,560
Number of Layers
28
FFN Intermediate Size (Dense)
12,288
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
103,424
Mixture of Experts
Total Expert Parameters
3.0B
Number of Experts
130
Active Experts
14
Shared Experts
2
FFN Intermediate Size (per Expert)
-
Dense Layers Before MoE
-
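As a rough illustration of what the RoPE Theta value in the table controls, the sketch below computes textbook rotary frequencies and applies them to a toy vector. It assumes a head dimension of 128 (hidden size 2,560 divided by 20 attention heads; the table lists the head dimension as "-") and the interleaved-pair formulation of 1D RoPE. The function names are hypothetical, and the model's actual multimodal position scheme may differ.

```python
import math

ROPE_THETA = 500_000        # from the table above
ASSUMED_HEAD_DIM = 128      # assumption: 2,560 hidden / 20 heads
CONTEXT_LENGTH = 131_072    # from the table above


def rope_inverse_frequencies(head_dim, theta):
    """Return theta^(-2i/d) for each of the d/2 rotation pairs."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def rope_rotate(vec, position, theta):
    """Rotate consecutive (even, odd) pairs of vec by position-dependent angles."""
    rotated = []
    freqs = rope_inverse_frequencies(len(vec), theta)
    for i, freq in enumerate(freqs):
        angle = position * freq
        x, y = vec[2 * i], vec[2 * i + 1]
        rotated.append(x * math.cos(angle) - y * math.sin(angle))
        rotated.append(x * math.sin(angle) + y * math.cos(angle))
    return rotated


freqs = rope_inverse_frequencies(ASSUMED_HEAD_DIM, ROPE_THETA)
# Wavelength (in tokens) of the slowest-rotating pair.
longest_wavelength = 2 * math.pi / freqs[-1]

toy_vec = [1.0, 0.0, 0.5, -0.5]
rotated_vec = rope_rotate(toy_vec, position=10, theta=ROPE_THETA)
```

Under these assumptions the slowest pair completes a full rotation only after far more positions than the context length, which is the usual motivation for choosing a large theta in long-context models.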
ERNIE-4.5-VL-28B-A3B-Base is a multimodal Mixture-of-Experts (MoE) foundation model developed by Baidu as part of the ERNIE 4.5 model family. Specifically engineered for sophisticated vision-language tasks, the model integrates 28 billion total parameters while activating only 3 billion parameters per token during inference. This sparse activation strategy allows the model to maintain the extensive knowledge capacity of a larger system while significantly reducing the computational overhead and latency typically associated with high-parameter models. It is designed to process and synthesize information across multiple modalities, including text, images, and video, supporting a substantial context length of up to 131,072 tokens.
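The effect of sparse activation can be made concrete with a little arithmetic. The sketch below uses the 28B total and 3B active figures quoted above and the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token. That rule is an approximation that ignores attention over long contexts and the vision encoder, and the helper names are hypothetical.

```python
TOTAL_PARAMS = 28e9    # total parameters, from the description above
ACTIVE_PARAMS = 3e9    # parameters activated per token, from the description above


def active_fraction(total, active):
    """Fraction of the model's parameters that participate in one token's forward pass."""
    if total <= 0 or not 0 <= active <= total:
        raise ValueError("expected 0 <= active <= total and total > 0")
    return active / total


def approx_forward_flops_per_token(params):
    """Rule-of-thumb estimate: about 2 FLOPs per participating parameter."""
    return 2 * params


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
sparse_flops = approx_forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = approx_forward_flops_per_token(TOTAL_PARAMS)

print(f"active fraction: {fraction:.1%}")
print(f"approx. FLOPs per token: {sparse_flops:.1e} (sparse) vs {dense_flops:.1e} (dense equivalent)")
```

Note that all 28B parameters still have to be held in memory; the saving applies to per-token compute, not to weight storage.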
The technical architecture of the ERNIE-4.5-VL series introduces a heterogeneous MoE structure that facilitates both parameter sharing across modalities and the use of dedicated parameters for individual modalities. Key innovations include modality-isolated routing, which prevents interference between textual and visual learning, as well as router orthogonal loss and multimodal token-balanced loss mechanisms to ensure stable expert utilization. The model employs Grouped-Query Attention (GQA) for efficient memory management and utilizes Rotary Position Embeddings (RoPE) to handle extended context windows. Training is conducted within the PaddlePaddle deep learning framework using advanced parallelization strategies, including intra-node expert parallelism and FP8 mixed-precision training.
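To see why Grouped-Query Attention matters for memory at long context, the following sketch estimates the key-value cache size from the listed configuration (28 layers, 20 attention heads, 4 key-value heads, 131,072-token context). The head dimension of 128 and the 2-byte (16-bit) cache element size are assumptions, since the spec table does not list them, and the function name is hypothetical.

```python
NUM_LAYERS = 28
NUM_ATTENTION_HEADS = 20
NUM_KV_HEADS = 4
ASSUMED_HEAD_DIM = 128      # assumption: 2,560 hidden / 20 heads
CONTEXT_LENGTH = 131_072
ASSUMED_BYTES_PER_VALUE = 2  # assumption: 16-bit cache entries


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    keys_and_values = 2  # one tensor for K, one for V
    return keys_and_values * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


gqa_bytes = kv_cache_bytes(
    NUM_LAYERS, NUM_KV_HEADS, ASSUMED_HEAD_DIM, CONTEXT_LENGTH, ASSUMED_BYTES_PER_VALUE
)
# Hypothetical comparison: the same model with one KV head per attention head.
mha_bytes = kv_cache_bytes(
    NUM_LAYERS, NUM_ATTENTION_HEADS, ASSUMED_HEAD_DIM, CONTEXT_LENGTH, ASSUMED_BYTES_PER_VALUE
)

GIB = 2 ** 30
print(f"GQA cache at full context: {gqa_bytes / GIB:.1f} GiB")
print(f"MHA cache at full context: {mha_bytes / GIB:.1f} GiB")
```

Under these assumptions, sharing each key-value head across five query heads shrinks the per-sequence cache by a factor of five relative to standard multi-head attention.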
In operation, the ERNIE-4.5-VL-28B-A3B-Base serves as a versatile backbone for applications requiring high-fidelity cross-modal reasoning. It supports distinct functional modes, including a "thinking" mode for enhanced logical reasoning and a "non-thinking" mode optimized for perceptual tasks such as document analysis, optical character recognition (OCR), and visual knowledge retrieval. Its capabilities extend to agentic interactions, where it can utilize external tools for fine-grained image zooming or search. The model is released with open weights under the Apache 2.0 license, providing a flexible resource for developers and researchers to deploy multimodal solutions across various hardware platforms.
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are available for ERNIE-4.5-VL-28B-A3B-Base.
Overall Rank
-
Coding Rank
-
Total Score
67 / 100
ERNIE-4.5-VL-28B-A3B-Base demonstrates strong transparency in its architectural design and licensing, particularly regarding its Mixture-of-Experts parameter density and its open-source Apache 2.0 status. However, it remains opaque concerning its specific training data sources and the total compute resources utilized during development. While technical documentation is available, the reproducibility of its benchmark claims relies heavily on vendor-provided tools without exhaustive public verification.
Architectural Provenance
The model is explicitly identified as a multimodal Mixture-of-Experts (MoE) transformer within the ERNIE 4.5 family. Baidu provides a technical report and GitHub documentation detailing a 'heterogeneous MoE' structure that uses modality-isolated routing to separate visual and textual processing. It specifies the use of Grouped-Query Attention (GQA), Rotary Position Embeddings (RoPE), and a variable-resolution Vision Transformer (ViT) encoder. While the high-level architecture is well-documented, specific layer-by-layer configurations and the exact pre-training data mixture are not fully disclosed.
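The idea behind modality-isolated routing can be sketched in a few lines: each token is scored only against the experts belonging to its own modality, and the top-k of those are selected. The example below is a minimal illustration under stated assumptions, not Baidu's implementation. The expert grouping, the toy logits, and the softmax-then-renormalize gating are all hypothetical choices, and shared experts (which would always be active) are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, modality, expert_groups, top_k):
    """Select top_k experts for one token, restricted to its modality's expert group.

    Returns a dict mapping expert index to a gate weight; weights sum to 1.
    """
    candidates = expert_groups[modality]
    probs = softmax([router_logits[e] for e in candidates])
    ranked = sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(p for _, p in chosen)
    return {expert: p / norm for expert, p in chosen}


# Hypothetical toy setup: 4 text experts and 4 vision experts.
expert_groups = {"text": [0, 1, 2, 3], "vision": [4, 5, 6, 7]}
router_logits = [0.1, 2.0, -1.0, 1.5, 3.0, 0.2, 0.0, 2.5]

text_gates = route_token(router_logits, "text", expert_groups, top_k=2)
vision_gates = route_token(router_logits, "vision", expert_groups, top_k=2)
```

In this toy setup the text token never reaches a vision expert even though experts 4 and 7 have the highest logits overall, which is the isolation property the description refers to.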
Dataset Composition
Baidu mentions the use of a 'vast and highly diverse corpus' of visual-language reasoning data and 'premium' datasets during a mid-training phase. However, there is no detailed breakdown of the data sources (e.g., specific web crawls, book datasets, or code repositories) or the exact proportions of each modality. The filtering and cleaning methodologies are described in general terms ('systematic data construction') without providing verifiable metrics or access to sample data.
Tokenizer Integrity
The model uses a tokenizer compatible with the PaddlePaddle and Transformers frameworks, with vocabulary and implementation details available through the official ERNIEKit and Hugging Face repositories. It supports a context length of 131,072 tokens. While the tokenizer's code is public, detailed documentation on its specific training data alignment and normalization procedures is less comprehensive than the architectural details.
Parameter Density
Baidu is highly transparent regarding the MoE parameter distribution, explicitly stating a total of 28 billion parameters with 3 billion active parameters per token during inference. The documentation distinguishes between shared experts and dedicated modality-specific experts. This level of detail regarding sparse activation is exemplary compared to many competitors who only disclose total counts.
Training Compute
Documentation confirms the use of the PaddlePaddle framework and mentions optimizations for NVIDIA Hopper (FP8) and Ampere (INT8) architectures. However, specific compute metrics such as total GPU/TPU hours, the number of chips used, training duration, and the estimated carbon footprint are conspicuously absent from public reports.
Benchmark Reproducibility
Baidu provides results for standard benchmarks like MathVista, ChartQA, and OCRBench, and includes some evaluation scripts within the ERNIEKit repository. However, the exact prompts, few-shot examples, and specific versions for all benchmarks are not consistently detailed. Independent third-party verification is limited, and some results remain vendor-published without full reproduction instructions.
Identity Consistency
The model consistently identifies as part of the ERNIE 4.5 family and maintains clear versioning between its 'Thinking' and 'Base' variants. It is transparent about its multimodal nature and its specific 'thinking' vs 'non-thinking' operational modes. There are no documented instances of the model claiming to be a competitor's product or misrepresenting its core identity.
License Clarity
The model weights and associated code are released under the Apache License 2.0, which is a standard, permissive open-source license. This allows for both commercial and non-commercial use with clear terms. The license is prominently displayed on Hugging Face, GitHub, and in the technical report, with no conflicting proprietary terms found in the primary documentation.
Hardware Footprint
Baidu provides specific VRAM requirements for different deployment scenarios, noting that 80GB is required for full FP16 inference on a single card, while quantization (WINT8) can reduce this to approximately 60GB. They also provide guidance for multi-GPU setups and vLLM integration. While helpful, more detailed scaling data for different context lengths and batch sizes would be required for a higher score.
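For orientation, a weights-only lower bound on memory can be computed directly from the parameter count and the storage precision. The sketch below is a back-of-the-envelope estimate, not a reproduction of Baidu's figures. The bytes-per-parameter table and the function name are assumptions, and the estimate leaves out the KV cache, activations, vision-encoder buffers, and runtime overhead, so practical requirements are expected to be higher than these numbers.

```python
TOTAL_PARAMS = 28e9  # total parameter count quoted for the model

# Assumed storage cost per parameter for each weight format.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Weights-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    if precision not in BYTES_PER_PARAM:
        raise ValueError(f"unknown precision: {precision}")
    return num_params * BYTES_PER_PARAM[precision] / 1e9


estimates = {p: weight_memory_gb(TOTAL_PARAMS, p) for p in BYTES_PER_PARAM}
for precision, gigabytes in estimates.items():
    print(f"{precision}: at least {gigabytes:.0f} GB for weights alone")
```

Because every expert must be resident in memory even though only a few are active per token, the estimate uses the total parameter count rather than the active count.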
Versioning Drift
The model follows a versioned release cycle (e.g., v1.0 to v1.5 of ERNIEKit) and maintains a changelog on GitHub. However, the documentation of 'silent' updates to the weights or changes in safety filtering is sparse. There is no formal system for tracking performance drift over time or a clear policy for accessing deprecated versions of the weights.