Total Parameters
28B
Context Length
131K
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
30 Jun 2025
Knowledge Cutoff
Dec 2024
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
20
Key-Value Heads
4
Attention Head Dimension
128
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
500,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
2,560
Number of Layers
28
FFN Intermediate Size (Dense)
12,288
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
103,424
Mixture of Experts
Active Parameters per Token
3.0B
Number of Experts
130
Active Experts
14
Shared Experts
2
FFN Intermediate Size (per Expert)
-
Dense Layers Before MoE
-
ERNIE-4.5-VL-28B-A3B is a multimodal Mixture-of-Experts (MoE) foundation model developed by Baidu to provide advanced vision-language understanding within an efficient computational envelope. This model variant is designed to bridge the gap between high-capacity reasoning and deployable inference by activating only a subset of its total parameters during any given forward pass. It supports sophisticated multimodal tasks including document and chart interpretation, fine-grained visual perception, and temporal analysis of video sequences. A distinguishing feature is its integration of a 'thinking' mode, which utilizes multi-step reasoning processes to address complex queries that require a deeper semantic alignment between visual and textual data.
Technically, the model is built upon a heterogeneous MoE architecture that facilitates joint pre-training on disparate modalities without interference. This is achieved through modality-isolated routing and the application of router orthogonal loss and multimodal token-balanced loss, ensuring that vision and language experts develop specialized representations while reinforcing mutual understanding. The visual component utilizes a variable-resolution Vision Transformer (ViT) encoder that projects visual features into a shared embedding space. The architecture incorporates Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE) to manage its extensive 131,072-token context length, while post-training optimizations such as Direct Preference Optimization (DPO) and Reinforcement Learning with Verifiable Rewards (RLVR) further refine its alignment and reasoning accuracy.
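To make the effect of Grouped-Query Attention on the 131,072-token context concrete, here is a rough sketch of the key-value cache size using the listed attention settings (28 layers, 20 query heads, 4 key-value heads, head dimension 128). It assumes a 2-byte (FP16/BF16) cache, a single sequence, and full attention in every layer; the function name `kv_cache_bytes` is illustrative and not part of any ERNIE tooling.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Values taken from the specification list above.
LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM = 28, 20, 4, 128
CONTEXT = 131_072

gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
# Hypothetical baseline: standard multi-head attention with one KV head per query head.
mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, CONTEXT)

print(f"GQA cache at full context: {gqa / 2**30:.1f} GiB")
print(f"Hypothetical MHA cache:    {mha / 2**30:.1f} GiB")
print(f"Reduction factor:          {mha // gqa}x")
```

Under these assumptions the cache is about 7 GiB at full context, versus about 35 GiB if every query head kept its own keys and values. Real deployments may differ, for example when the cache is quantized or several sequences are batched.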
From a performance and deployment perspective, ERNIE-4.5-VL-28B-A3B is engineered for high throughput and multi-hardware compatibility using the PaddlePaddle framework. It supports 4-bit and 2-bit quantization through convolutional code quantization, which Baidu describes as lossless, enabling efficient execution on hardware with limited memory. The model's reasoning capabilities are enhanced by 'Thinking with Images' functionality, which lets the system autonomously call tools such as image zooming or external search to resolve fine-grained details or long-tail visual knowledge. These attributes suit it to enterprise-grade multimodal agents, industrial visual grounding, and STEM-focused problem solving.
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are available for ERNIE-4.5-VL-28B-A3B.
Overall Rank
-
Coding Rank
-
Total Score
74 / 100
ERNIE-4.5-VL-28B-A3B exhibits high transparency in its architectural design and parameter density, providing clear distinctions between total and active parameters. While it offers excellent clarity on licensing and hardware requirements, it maintains significant gaps in training data provenance and specific compute resource disclosures. The model's overall profile is characterized by strong technical documentation paired with more opaque upstream data sourcing.
Architectural Provenance
The model is explicitly documented as a multimodal Mixture-of-Experts (MoE) architecture in the ERNIE 4.5 Technical Report (2025). It utilizes a heterogeneous MoE structure where text and vision inputs are routed to distinct sets of experts (modality-isolated routing) to prevent cross-modal interference. The architecture incorporates Grouped-Query Attention (GQA), Rotary Position Embeddings (RoPE), and a variable-resolution Vision Transformer (ViT) encoder. The report details specific architectural innovations such as router orthogonal loss and multimodal token-balanced loss used to stabilize training.
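The idea of modality-isolated routing can be illustrated with a toy top-k router. This is only a sketch under stated assumptions: the expert layout (experts 0-3 for text, 4-7 for vision), the `top_k=2` setting, and the names `EXPERT_GROUPS` and `route` are hypothetical, and shared experts and the auxiliary losses are omitted. It does not reproduce Baidu's implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy layout: experts 0-3 serve text tokens, experts 4-7 serve vision tokens.
EXPERT_GROUPS = {"text": [0, 1, 2, 3], "vision": [4, 5, 6, 7]}


def route(router_logits, modality, top_k=2):
    """Return [(expert_id, weight), ...] drawn only from the token's own modality group."""
    allowed = EXPERT_GROUPS[modality]
    # Modality isolation: logits belonging to the other group are never considered.
    probs = softmax([router_logits[i] for i in allowed])
    ranked = sorted(zip(allowed, probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    norm = sum(weight for _, weight in ranked)
    # Renormalize so the selected experts' weights sum to 1.
    return [(expert, weight / norm) for expert, weight in ranked]


logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.2, 1.5, -0.3]
text_route = route(logits, "text")
vision_route = route(logits, "vision")
print("text token   ->", text_route)
print("vision token ->", vision_route)
```

Note that the text token is never sent to expert 4, even though that expert has the highest logit overall. Restricting the candidate set per modality is what keeps the two groups of experts from interfering with each other.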
Dataset Composition
The technical report mentions a 'massive-scale' pre-training phase followed by a 1-trillion-token 'mid-training' phase focused on high-quality visual-language reasoning data, but it gives no percentage breakdown of the dataset composition (e.g., ratios of web, code, or academic data). The documentation describes the data collection methodology as a 'human-model-in-the-loop iterative cycle' and mentions the use of 'premium visual-language reasoning cases,' but does not disclose specific public or proprietary sources. Because only general data categories are named, without proportions, this criterion scores in the mid-range.
Tokenizer Integrity
The tokenizer has a documented vocabulary size of 103,424 tokens, which is consistent across official documentation and third-party platforms such as OpenRouter and Hugging Face. It is publicly accessible via the 'transformers' library and Baidu's PaddlePaddle framework, allowing direct verification of tokenization behavior and language support.
Parameter Density
Baidu provides exemplary transparency regarding parameter density. The model is clearly defined as having 28 billion total parameters, of which approximately 3 billion are active per token during inference. The technical report further specifies the architectural breakdown, noting that visual experts are designed to be one-third the size of textual experts to optimize efficiency. This level of detail for an MoE model exceeds standard industry disclosures.
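As a quick illustration of what the total-versus-active distinction means for compute, the snippet below derives the active fraction from the two figures quoted above. The per-token FLOPs estimate uses the common rule of thumb of about 2 FLOPs per active parameter for a forward pass; that rule ignores attention cost and is an assumption here, not a figure from Baidu.

```python
TOTAL_PARAMS = 28e9   # total parameters, as quoted above
ACTIVE_PARAMS = 3e9   # parameters active per token, as quoted above

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Rule-of-thumb estimate (assumption): ~2 FLOPs per active parameter per token.
forward_flops_per_token = 2 * ACTIVE_PARAMS
dense_equivalent_flops = 2 * TOTAL_PARAMS

print(f"Active fraction: {active_fraction:.1%}")
print(f"Approx. forward FLOPs per token: {forward_flops_per_token:.1e}")
print(f"Same estimate if all weights were used: {dense_equivalent_flops:.1e}")
```

Roughly one parameter in nine participates in any given token's forward pass, which is why per-token compute resembles that of a much smaller dense model even though all weights must still be stored.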
Training Compute
The technical report discloses that the largest model in the family was trained on 2,016 NVIDIA H800 GPUs and achieved 47% Model FLOPs Utilization (MFU). However, compute hours, total training duration, and carbon footprint data for the 28B variant are not detailed. Hardware types are named, but the absence of duration and energy figures for this variant limits the score.
Benchmark Reproducibility
Baidu provides detailed benchmark results on standard sets like ChartQA (87.1%), MathVista (82.5%), and OCRBench (858). The technical report and GitHub repository provide evaluation code and some prompt examples. However, full reproduction instructions and the exact few-shot examples for every benchmark are not as comprehensive as top-tier open-source projects, and third-party verification is still emerging.
Identity Consistency
The model consistently identifies itself as part of the ERNIE 4.5 family and distinguishes between its 'thinking' and 'non-thinking' modes. There is no evidence of identity confusion or claims of being a competitor's model. Versioning is clearly maintained through the '-PT' (PyTorch) and '-Paddle' suffixes, ensuring users know exactly which variant they are interacting with.
License Clarity
The model is released under the Apache License 2.0, which is a standard, permissive open-source license. The license is explicitly stated on Hugging Face, GitHub, and in the technical report, clearly allowing for both commercial and non-commercial use without conflicting terms or hidden restrictions.
Hardware Footprint
Hardware requirements are exceptionally well-documented. Official guides specify VRAM needs for FP16 (~56-64GB), 4-bit (~14GB), and 2-bit (~7GB) quantization. Documentation explicitly warns that while only 3B parameters are active, the full 28B weights must be loaded into memory, requiring an 80GB GPU (like A100/H100) for unquantized inference. This transparency prevents misleading efficiency claims.
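The weight-memory figures above follow directly from the parameter count. The sketch below reproduces the lower bounds, assuming 28 billion stored weights and counting weights only; activations, the KV cache, and quantization metadata such as scales are ignored, which is presumably why the quoted FP16 range extends to 64GB. The helper name `weight_gb` is illustrative.

```python
TOTAL_PARAMS = 28e9  # all weights must be resident, even though only ~3B are active per token


def weight_gb(total_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return total_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP16", 16), ("4-bit", 4), ("2-bit", 2)]:
    print(f"{name:>5}: {weight_gb(TOTAL_PARAMS, bits):.0f} GB")
```

This gives 56 GB, 14 GB, and 7 GB respectively, matching the documented estimates and showing why unquantized inference calls for an 80GB-class GPU.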
Versioning Drift
Baidu maintains a changelog via the ERNIEKit GitHub repository (e.g., v1.4, v1.5 updates) and uses semantic-style versioning for its toolkit. However, the model weights themselves do not have a granular public versioning history or a documented process for tracking silent behavioral drift over time beyond major release milestones.