
ERNIE-4.5-VL-28B-A3B-Base

Total Parameters

28B

Context Length

131,072 tokens (128K)

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

30 Jun 2025

Knowledge Cutoff

Nov 2024

Technical Specifications

Active Parameters (per token)

3.0B

Number of Experts

130

Active Experts

14

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

-

Number of Layers

28

Attention Heads

20

Key-Value Heads

4

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

Rotary Position Embedding (RoPE)

ERNIE-4.5-VL-28B-A3B-Base

ERNIE-4.5-VL-28B-A3B-Base is a multimodal Mixture-of-Experts (MoE) foundation model developed by Baidu as part of the ERNIE 4.5 model family. Specifically engineered for sophisticated vision-language tasks, the model integrates 28 billion total parameters while activating only 3 billion parameters per token during inference. This sparse activation strategy allows the model to maintain the extensive knowledge capacity of a larger system while significantly reducing the computational overhead and latency typically associated with high-parameter models. It is designed to process and synthesize information across multiple modalities, including text, images, and video, supporting a substantial context length of up to 131,072 tokens.
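The sparse-activation idea described above can be sketched with a toy top-k router. This is a hypothetical illustration, not the model's actual routing code: the expert count, top-k value, and dimensions are made-up small numbers chosen only to show why just a fraction of the total parameters runs for each token.

```python
import numpy as np

# Toy sketch of sparse MoE routing: a router scores every expert for each
# token, but only the top-k experts are executed, so the active parameter
# count per token is a small fraction of the total. All sizes are
# illustrative, not the model's real configuration.

rng = np.random.default_rng(0)

N_EXPERTS = 8        # toy value; the real model routes over many more experts
TOP_K = 2            # experts actually executed per token
D_MODEL = 16

def route(token: np.ndarray, router_w: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k highest-scoring experts for one token."""
    scores = router_w @ token        # one routing logit per expert
    return np.argsort(scores)[-k:]   # top-k expert indices

router_w = rng.standard_normal((N_EXPERTS, D_MODEL))
token = rng.standard_normal(D_MODEL)

active = route(token, router_w, TOP_K)
print(sorted(active.tolist()))       # only TOP_K of N_EXPERTS experts fire

# Fraction of expert parameters active per token in this toy setup:
print(TOP_K / N_EXPERTS)             # 0.25
```

In the real model the same principle scales up: all 28B parameters hold knowledge, but the router activates only about 3B of them for any given token.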

The technical architecture of the ERNIE-4.5-VL series introduces a heterogeneous MoE structure that facilitates both parameter sharing across modalities and the use of dedicated parameters for individual modalities. Key innovations include modality-isolated routing, which prevents interference between textual and visual learning, as well as router orthogonal loss and multimodal token-balanced loss mechanisms to ensure stable expert utilization. The model employs Grouped-Query Attention (GQA) for efficient memory management and utilizes Rotary Position Embeddings (RoPE) to handle extended context windows. Training is conducted within the PaddlePaddle deep learning framework using advanced parallelization strategies, including intra-node expert parallelism and FP8 mixed-precision training.
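The memory saving from GQA can be shown with a minimal sketch using the head counts from the specification table above (20 attention heads, 4 key-value heads): each KV head is shared by a group of 5 query heads, shrinking the KV cache roughly 5x. The sequence length and head dimension below are toy values, and this is a simplified single-layer illustration rather than the model's implementation.

```python
import numpy as np

# Minimal Grouped-Query Attention (GQA) sketch. 20 query heads share
# 4 key-value heads, so each KV head serves a group of 5 query heads.
# SEQ and HEAD_DIM are toy values for illustration only.

N_Q_HEADS, N_KV_HEADS, HEAD_DIM, SEQ = 20, 4, 8, 6
GROUP = N_Q_HEADS // N_KV_HEADS      # 5 query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((N_Q_HEADS, SEQ, HEAD_DIM))
k = rng.standard_normal((N_KV_HEADS, SEQ, HEAD_DIM))  # 5x fewer K heads
v = rng.standard_normal((N_KV_HEADS, SEQ, HEAD_DIM))  # 5x fewer V heads

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

out = np.empty_like(q)
for h in range(N_Q_HEADS):
    kv = h // GROUP                  # query head h reads its group's KV head
    scores = q[h] @ k[kv].T / np.sqrt(HEAD_DIM)
    out[h] = softmax(scores) @ v[kv]

print(out.shape)                     # (20, 6, 8): full query capacity kept
```

The output keeps the full 20-head query capacity while the KV cache stores only 4 heads, which is the efficiency the model card attributes to GQA.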

In operation, ERNIE-4.5-VL-28B-A3B-Base serves as a versatile backbone for applications requiring high-fidelity cross-modal reasoning. It supports distinct functional modes, including a "thinking" mode for enhanced logical reasoning and a "non-thinking" mode optimized for perceptual tasks such as document analysis, optical character recognition (OCR), and visual knowledge retrieval. Its capabilities extend to agentic interactions, where it can invoke external tools for fine-grained image zooming or search. The model is released with open weights under the Apache 2.0 license, providing a flexible resource for developers and researchers to deploy multimodal solutions across various hardware platforms.

About ERNIE 4.5

The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
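The modality-isolated routing mentioned above can be sketched as keeping disjoint expert pools per modality, so that routing (and therefore gradient flow) for text tokens never touches vision experts and vice versa. The pool sizes and names below are purely hypothetical, chosen to illustrate the isolation property rather than the model's real expert layout.

```python
# Hypothetical sketch of modality-isolated routing: text tokens and vision
# tokens are routed over separate expert pools, preventing one modality's
# training signal from perturbing the other's experts. Pool sizes are
# illustrative only.

TEXT_EXPERTS = {f"text_{i}" for i in range(4)}
VISION_EXPERTS = {f"vision_{i}" for i in range(4)}

def candidate_pool(modality: str) -> set:
    """Restrict routing candidates to the pool matching the token's modality."""
    if modality == "text":
        return TEXT_EXPERTS
    if modality == "vision":
        return VISION_EXPERTS
    raise ValueError(f"unknown modality: {modality}")

# The pools are disjoint: no expert is reachable from both modalities.
assert candidate_pool("text").isdisjoint(candidate_pool("vision"))
print("expert pools are disjoint")
```

Dedicated pools capture modality-specific features, while shared (non-expert) parameters such as attention layers still let the modalities exchange information.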



Evaluation Benchmarks

No evaluation benchmarks for ERNIE-4.5-VL-28B-A3B-Base available.

Rankings

Overall Rank

-

Coding Rank

-
