ERNIE-4.5-VL-28B-A3B

Total Parameters

28B

Context Length

131K

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

30 Jun 2025

Knowledge Cutoff

Dec 2024

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

20

Key-Value Heads

4

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

500,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

3,584

Number of Layers

28

FFN Intermediate Size (Dense)

12,288

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

103,424

Mixture of Experts

Total Expert Parameters

3.0B

Number of Experts

130

Active Experts

14

Shared Experts

2

FFN Intermediate Size (per Expert)

-

Dense Layers Before MoE

-

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE; Hidden: 3.6k; Context: 131K; Vocab: 103.4k) → 28 × [RMSNorm (pre-attention) → Grouped-Query Attention (20 Q / 4 KV heads, head dim 128) → residual add → RMSNorm (pre-FFN) → Sparse MoE FFN (14/130 experts, SwiGLU) → residual add] → Final RMSNorm → Output Logits
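As a rough illustration of what the 20 Q / 4 KV head layout implies, the sketch below maps query heads to the key-value heads they share and estimates the KV-cache size from the figures in the specification list. The contiguous grouping of query heads, the 16-bit cache precision, and the function names are assumptions made for illustration, not details confirmed by Baidu.

```python
# Figures taken from the specification list above.
NUM_Q_HEADS = 20
NUM_KV_HEADS = 4
HEAD_DIM = 128
NUM_LAYERS = 28


def kv_head_for(q_head: int) -> int:
    """Map a query head to the KV head it shares.

    Assumption: query heads are grouped contiguously (0-4, 5-9, ...).
    """
    group_size = NUM_Q_HEADS // NUM_KV_HEADS  # 5 query heads per KV head
    return q_head // group_size


def kv_cache_bytes(num_tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size for one sequence.

    Counts keys and values (factor 2) for every layer and KV head.
    Assumption: 16-bit cache entries unless bytes_per_value is changed.
    """
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * num_tokens


# Build the grouping table: KV head -> list of query heads that read from it.
groups = {}
for q in range(NUM_Q_HEADS):
    groups.setdefault(kv_head_for(q), []).append(q)

if __name__ == "__main__":
    for kv, qs in groups.items():
        print(f"KV head {kv} serves query heads {qs}")
    full_context = 131_072
    gib = kv_cache_bytes(full_context) / 2**30
    print(f"KV cache at {full_context} tokens: {gib:.1f} GiB")
```

Under these assumptions the cache stores 4 rather than 20 key-value heads per layer, so it is one fifth the size that standard multi-head attention with 20 heads would need at the same context length.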

ERNIE-4.5-VL-28B-A3B

ERNIE-4.5-VL-28B-A3B is a multimodal Mixture-of-Experts (MoE) foundation model developed by Baidu for vision-language understanding at a modest inference cost. It activates only about 3 billion of its 28 billion parameters in any forward pass, so per-token compute stays well below that of a dense model of the same size. It supports document and chart interpretation, fine-grained visual perception, and temporal analysis of video sequences. A distinguishing feature is its 'thinking' mode, which applies multi-step reasoning to complex queries that require closer semantic alignment between visual and textual data.

The model is built on a heterogeneous MoE architecture that supports joint pre-training on text and vision while limiting cross-modal interference. It does this through modality-isolated routing, combined with a router orthogonal loss and a multimodal token-balanced loss, so that vision and language experts develop specialized representations while still reinforcing each other. The visual component uses a variable-resolution Vision Transformer (ViT) encoder that projects visual features into a shared embedding space. The architecture uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE) to handle its 131,072-token context length, and post-training with Direct Preference Optimization (DPO) and Reinforcement Learning with Verifiable Rewards (RLVR) further refines alignment and reasoning accuracy.
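The idea of modality-isolated routing can be sketched in a few lines: each token picks its top-k experts only from the pool belonging to its own modality, and shared experts are always applied. The toy pool sizes, the softmax over the selected logits, and all names below are hypothetical choices for illustration. This page does not state how the experts are split between text and vision, and the auxiliary losses are not shown.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, modality, pools, shared, k):
    """Select experts for one token under modality-isolated routing.

    logits   : router score per routed expert, indexed by expert id
    modality : "text" or "vision"
    pools    : dict mapping modality -> list of expert ids it may use
    shared   : expert ids applied to every token regardless of modality
    k        : number of routed experts to activate
    """
    pool = pools[modality]
    # Rank only experts in this token's modality pool; others are ignored.
    chosen = sorted(pool, key=lambda e: logits[e], reverse=True)[:k]
    # Assumption: gate weights are a softmax over the chosen logits.
    weights = softmax([logits[e] for e in chosen])
    return {"routed": dict(zip(chosen, weights)), "shared": list(shared)}


# Toy configuration (hypothetical sizes, far smaller than the real model).
POOLS = {"text": [0, 1, 2], "vision": [3, 4, 5]}
SHARED = [6, 7]

if __name__ == "__main__":
    # Vision experts score highest here, but a text token cannot select them.
    scores = [0.1, 2.0, 1.0, 9.0, 9.0, 9.0]
    print(route(scores, "text", POOLS, SHARED, k=2))
    print(route(scores, "vision", POOLS, SHARED, k=2))
```

Restricting the candidate set before ranking is what keeps a text token from landing on a vision expert even when that expert's router score is higher.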

For deployment, ERNIE-4.5-VL-28B-A3B targets high throughput and multi-hardware compatibility through the PaddlePaddle framework. It supports 4-bit and 2-bit quantization through convolutional code quantization, which Baidu describes as lossless, enabling execution on hardware with limited memory. Its 'Thinking with Images' functionality lets the model autonomously call tools such as image zooming or external search to resolve fine-grained details or long-tail visual knowledge. These attributes suit it to enterprise multimodal agents, industrial visual grounding, and STEM-focused problem solving.

About ERNIE 4.5

The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.


Evaluation Benchmarks

No evaluation benchmarks for ERNIE-4.5-VL-28B-A3B available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B+

74 / 100

ERNIE-4.5-VL-28B-A3B Model Integrity Report

Audit Note

ERNIE-4.5-VL-28B-A3B exhibits high transparency in its architectural design and parameter density, providing clear distinctions between total and active parameters. While it offers excellent clarity on licensing and hardware requirements, it maintains significant gaps in training data provenance and specific compute resource disclosures. The model's overall profile is characterized by strong technical documentation paired with more opaque upstream data sourcing.

Upstream

21.0 / 30

Architectural Provenance

8.0 / 10

The model is explicitly documented as a multimodal Mixture-of-Experts (MoE) architecture in the ERNIE 4.5 Technical Report (2025). It utilizes a heterogeneous MoE structure where text and vision inputs are routed to distinct sets of experts (modality-isolated routing) to prevent cross-modal interference. The architecture incorporates Grouped-Query Attention (GQA), Rotary Position Embeddings (RoPE), and a variable-resolution Vision Transformer (ViT) encoder. The report details specific architectural innovations such as router orthogonal loss and multimodal token-balanced loss used to stabilize training.
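For readers unfamiliar with RoPE, the following is a minimal sketch of the standard one-dimensional form, using the RoPE theta (500,000) and head dimension (128) from the specification list. The interleaved pairing of dimensions and the function names are assumptions. The actual model may pair dimensions differently or use a multi-dimensional variant for visual tokens, which this page does not specify.

```python
import math

ROPE_THETA = 500_000.0  # "RoPE Theta" from the specification list
HEAD_DIM = 128          # "Attention Head Dimension" from the specification list


def inv_frequencies(head_dim=HEAD_DIM, theta=ROPE_THETA):
    """Per-pair rotation frequencies: theta ** (-2i / d) for i in [0, d/2)."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def rotate(vec, position, theta=ROPE_THETA):
    """Apply RoPE to one head vector at a given token position.

    Assumption: dimensions are paired as (x[2i], x[2i+1]); each pair is
    rotated by the angle position * frequency_i.
    """
    out = list(vec)
    for i, freq in enumerate(inv_frequencies(len(vec), theta)):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.1]
    # The attention score depends only on the relative offset (here 2).
    print(dot(rotate(q, 5), rotate(k, 3)))
    print(dot(rotate(q, 2), rotate(k, 0)))
```

The two printed scores agree up to floating-point error, which is the property that makes RoPE a relative position encoding: shifting both positions by the same amount leaves the query-key dot product unchanged.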

Dataset Composition

4.5 / 10

While the technical report mentions a 'massive-scale' pre-training phase followed by a 1-trillion token 'mid-training' phase focused on high-quality visual-language reasoning data, it lacks a granular percentage breakdown of the dataset composition (e.g., specific ratios of web, code, or academic data). The documentation describes the data collection methodology as a 'human-model-in-the-loop iterative cycle' and mentions the use of 'premium visual-language reasoning cases,' but does not disclose specific public or proprietary sources, scoring it in the mid-range for general categories without proportions.

Tokenizer Integrity

8.5 / 10

The model's tokenizer has a documented vocabulary size of 103,424 tokens, matching the figure in the specifications above, and the audit reports it as consistent across official documentation and third-party model hubs such as OpenRouter and Hugging Face. The tokenizer is publicly accessible via the 'transformers' library and Baidu's PaddlePaddle framework, allowing direct verification of tokenization behavior and language support.

Model

29.0 / 40

Parameter Density

9.0 / 10

Baidu provides exemplary transparency regarding parameter density. The model is clearly defined as having 28 billion total parameters with exactly 3 billion active parameters per token during inference. The technical report further specifies the architectural breakdown, noting that visual experts are designed to be one-third the size of textual experts to optimize efficiency. This level of detail for an MoE model exceeds standard industry disclosures.

Training Compute

4.0 / 10

The technical report discloses that the largest model in the family was trained on 2,016 NVIDIA H800 GPUs and achieved 47% Model FLOPs Utilization (MFU). However, specific compute hours, total training duration, and carbon footprint data for the 28B variant are not explicitly detailed. While hardware types are mentioned, the lack of specific duration or total energy metrics for this specific variant limits the score.

Benchmark Reproducibility

6.5 / 10

Baidu provides detailed benchmark results on standard sets like ChartQA (87.1%), MathVista (82.5%), and OCRBench (858). The technical report and GitHub repository provide evaluation code and some prompt examples. However, full reproduction instructions and the exact few-shot examples for every benchmark are not as comprehensive as top-tier open-source projects, and third-party verification is still emerging.

Identity Consistency

9.5 / 10

The model consistently identifies itself as part of the ERNIE 4.5 family and distinguishes between its 'thinking' and 'non-thinking' modes. There is no evidence of identity confusion or claims of being a competitor's model. Versioning is clearly maintained through the '-PT' (PyTorch) and '-Paddle' suffixes, ensuring users know exactly which variant they are interacting with.

Downstream

24.0 / 30

License Clarity

10.0 / 10

The model is released under the Apache License 2.0, which is a standard, permissive open-source license. The license is explicitly stated on Hugging Face, GitHub, and in the technical report, clearly allowing for both commercial and non-commercial use without conflicting terms or hidden restrictions.

Hardware Footprint

8.5 / 10

Hardware requirements are exceptionally well-documented. Official guides specify VRAM needs for FP16 (~56-64GB), 4-bit (~14GB), and 2-bit (~7GB) quantization. Documentation explicitly warns that while only 3B parameters are active, the full 28B weights must be loaded into memory, requiring an 80GB GPU (like A100/H100) for unquantized inference. This transparency prevents misleading efficiency claims.
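The quoted VRAM figures follow from simple arithmetic on the total parameter count, as this back-of-the-envelope sketch shows. It counts weight storage only and assumes exactly 28 billion parameters and uniform precision across all weights. The upper end of the "~56-64GB" range presumably covers overhead such as activations and the KV cache, which is not modeled here.

```python
TOTAL_PARAMS = 28e9   # total parameters, per the audit text
ACTIVE_PARAMS = 3e9   # parameters active per token, per the audit text


def weight_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for label, bits in [("FP16", 16), ("4-bit", 4), ("2-bit", 2)]:
        print(f"{label:>5}: {weight_gb(TOTAL_PARAMS, bits):.0f} GB of weights")
    # All 28B weights must be resident even though only 3B are used per token.
    print(f"Active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

The active fraction affects compute per token, not memory: every expert must stay loaded because the router can select any of them for the next token.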

Versioning Drift

5.5 / 10

Baidu maintains a changelog via the ERNIEKit GitHub repository (e.g., v1.4, v1.5 updates) and uses semantic-style versioning for its toolkit. However, the model weights themselves do not have a granular public versioning history or a documented process for tracking silent behavioral drift over time beyond major release milestones.
