
ERNIE-4.5-VL-28B-A3B

Total Parameters

28B

Context Length

131,072 tokens (128K)

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

30 Jun 2025

Knowledge Cutoff

Dec 2024

Technical Specifications

Activated Parameters per Token

3.0B

Number of Experts

130

Active Experts

14

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

3584

Number of Layers

28

Attention Heads

20

Key-Value Heads

4

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

Rotary Position Embedding (RoPE)

ERNIE-4.5-VL-28B-A3B

ERNIE-4.5-VL-28B-A3B is a multimodal Mixture-of-Experts (MoE) foundation model developed by Baidu to provide advanced vision-language understanding within an efficient computational envelope. This model variant is designed to bridge the gap between high-capacity reasoning and deployable inference by activating only a subset of its total parameters during any given forward pass. It supports sophisticated multimodal tasks including document and chart interpretation, fine-grained visual perception, and temporal analysis of video sequences. A distinguishing feature is its integration of a 'thinking' mode, which utilizes multi-step reasoning processes to address complex queries that require a deeper semantic alignment between visual and textual data.

Technically, the model is built upon a heterogeneous MoE architecture that facilitates joint pre-training on disparate modalities without interference. This is achieved through modality-isolated routing and the application of router orthogonal loss and multimodal token-balanced loss, ensuring that vision and language experts develop specialized representations while reinforcing mutual understanding. The visual component utilizes a variable-resolution Vision Transformer (ViT) encoder that projects visual features into a shared embedding space. The architecture incorporates Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE) to manage its extensive 131,072-token context length, while post-training optimizations such as Direct Preference Optimization (DPO) and Reinforcement Learning with Verifiable Rewards (RLVR) further refine its alignment and reasoning accuracy.
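The modality-isolated routing described above can be sketched as a top-k gating step in which each token is scored only against the expert pool of its own modality. The expert counts, dimensions, and top-k value below are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

def route_tokens(tokens, gates_text, gates_vision, modality, top_k=2):
    """Toy modality-isolated MoE routing: each token is scored only
    against the experts of its own modality (an illustrative sketch,
    not ERNIE's actual router)."""
    outputs = []
    for tok, mod in zip(tokens, modality):
        gate = gates_text if mod == "text" else gates_vision
        scores = tok @ gate                        # (num_experts,) router logits
        top = np.argsort(scores)[-top_k:]          # indices of the top-k experts
        weights = np.exp(scores[top])
        weights /= weights.sum()                   # softmax over the selected experts
        outputs.append((mod, sorted(top.tolist()), weights))
    return outputs

rng = np.random.default_rng(0)
d, n_experts = 16, 4                               # illustrative sizes
gates_text = rng.normal(size=(d, n_experts))       # text-expert router weights
gates_vision = rng.normal(size=(d, n_experts))     # vision-expert router weights
tokens = rng.normal(size=(3, d))
routed = route_tokens(tokens, gates_text, gates_vision,
                      modality=["text", "vision", "text"])
for mod, experts, w in routed:
    print(mod, experts, round(float(w.sum()), 6))
```

Auxiliary objectives such as the router orthogonal loss and the multimodal token-balanced loss would then be applied on top of these routing decisions during training.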

From a performance and deployment perspective, ERNIE-4.5-VL-28B-A3B is engineered for high throughput and multi-hardware compatibility using the PaddlePaddle framework. It supports 4-bit and 2-bit lossless quantization through convolutional code quantization, enabling efficient execution on hardware with limited memory. The model's reasoning capabilities are enhanced by 'Thinking with Images' functionality, allowing the system to autonomously call tools such as image zooming or external searches to resolve fine-grained details or long-tail visual knowledge. These attributes make it particularly effective for enterprise-grade multimodal agents, industrial visual grounding, and STEM-focused problem-solving scenarios.
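As a rough back-of-envelope check on why low-bit quantization matters for a 28B-parameter model, weight memory scales linearly with bits per parameter. The figures below cover weights only and ignore activations, KV cache, and runtime overhead.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate weight-only memory footprint in GiB."""
    return num_params * bits_per_param / 8 / 2**30

total_params = 28e9                  # 28B total parameters
for bits in (16, 4, 2):             # bf16 baseline vs. 4-bit and 2-bit quantization
    print(f"{bits:2d}-bit weights: {weight_memory_gib(total_params, bits):6.1f} GiB")
```

At 4-bit precision the weights fit comfortably on a single high-memory consumer GPU, which is the practical motivation for the quantization schemes mentioned above.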

About ERNIE 4.5

The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.



Evaluation Benchmarks

No evaluation benchmarks are available for ERNIE-4.5-VL-28B-A3B.
