ERNIE-4.5-VL-28B-A3B: Specifications and GPU VRAM Requirements

Open Source

Open Weights

Total Parameters

28B total (3B active per token)

Context Length

131,072 tokens (128K)

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

30 Jun 2025

Knowledge Cutoff

Technical Specifications

Active Parameters per Token

3.0B

Number of Experts

130

Active Experts

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

Number of Layers

Attention Heads

Key-Value Heads

Activation Function

Normalization

Position Embedding

Absolute Position Embedding

System Requirements

VRAM requirements for different quantization methods and context sizes

ERNIE-4.5-VL-28B-A3B

ERNIE-4.5-VL-28B-A3B is part of Baidu's ERNIE 4.5 model family, a recent collection of large-scale multimodal foundation models. This variant is a lightweight vision-language model that accepts both text and images as input. It targets multimodal understanding tasks such as image comprehension, text generation conditioned on visual context, and cross-modal reasoning. The model is designed to balance output quality against compute cost, which makes it a candidate for enterprise applications and other real-world deployments that need multimodal capability.

ERNIE-4.5-VL-28B-A3B is built on a fine-grained Mixture-of-Experts (MoE) backbone, the central design choice of the ERNIE 4.5 series. This heterogeneous MoE structure allows joint training on text and images. Routing is isolated per modality, and training adds a router orthogonal loss and a multimodal token-balanced loss so that the two modalities reinforce rather than interfere with each other. After pre-training, the model receives modality-specific post-training: supervised fine-tuning, direct preference optimization, and Reinforcement Learning with Verifiable Rewards (RLVR), all aimed at vision-language tasks. Visual inputs are encoded by a variable-resolution Vision Transformer (ViT), and an adapter projects the resulting representations into the shared embedding space.
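
The idea of modality-isolated routing can be illustrated with a small sketch. Everything here is an assumption for illustration: the toy sizes, the helper names (`route_token`, `make_router`, `route`), and the router form itself (softmax over linear logits followed by top-k selection, a common MoE pattern). It is not ERNIE's actual router, and the training losses mentioned above are omitted.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(hidden, router_weights, top_k):
    """Pick the top_k experts for one token; returns [(expert_index, gate), ...]."""
    # One logit per expert: dot product of the token with that expert's router row.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    # Renormalise so the selected gates sum to 1.
    return [(i, probs[i] / norm) for i in top]

def make_router(num_experts, dim, rng):
    """Random router matrix (hypothetical stand-in for learned weights)."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]

rng = random.Random(0)
DIM, EXPERTS_PER_MODALITY, TOP_K = 8, 4, 2  # toy sizes, not the model's real values

# One router per modality: this separation is what "modality-isolated" means here.
routers = {
    "text": make_router(EXPERTS_PER_MODALITY, DIM, rng),
    "vision": make_router(EXPERTS_PER_MODALITY, DIM, rng),
}

def route(hidden, modality):
    """Route a token using only the router and experts of its own modality."""
    picks = route_token(hidden, routers[modality], TOP_K)
    return [(f"{modality}_expert_{i}", gate) for i, gate in picks]

token = [rng.gauss(0, 1) for _ in range(DIM)]
print(route(token, "text"))
print(route(token, "vision"))
```

Because each modality has its own router, a text token can never be dispatched to a vision expert in this sketch, and vice versa. Only `TOP_K` experts run per token, which is why the active parameter count is much smaller than the total.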

ERNIE-4.5-VL-28B-A3B supports both "thinking" and "non-thinking" modes, so the reasoning style can be chosen per use case. It is described as proficient in visual perception, document and chart understanding, and visual knowledge tasks. For efficient inference, the ERNIE 4.5 stack uses multi-expert parallel collaboration and convolutional code quantization, which Baidu presents as enabling 4-bit/2-bit lossless quantization for deployment across a range of hardware platforms. The 131,072-token context window lets the model handle long-form text inputs, extended conversations, and reasoning that combines textual knowledge with visual perception.
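
To give a rough sense of what the quantization bit-widths mean for memory, the sketch below estimates the footprint of the weights alone. It assumes a nominal 28e9 total parameters (taken from the model name) and a uniform bit-width for every weight. It ignores the KV cache, activations, the vision encoder's runtime buffers, and any quantization metadata, so real VRAM use will be higher. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    bytes_needed = total_params * bits_per_weight / 8
    return bytes_needed / 2**30

# Assumption: nominal total parameter count from the model name (28B).
# In an MoE model every expert must be resident in memory, so the total
# count, not the ~3B active per token, drives the weight footprint.
TOTAL_PARAMS = 28e9

for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"{label:>6}: {weight_memory_gib(TOTAL_PARAMS, bits):6.1f} GiB")
```

Under these assumptions the BF16 weights come to roughly 52 GiB and the 4-bit weights to roughly 13 GiB. Treat these as lower bounds, not as deployment requirements.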

About ERNIE 4.5

The Baidu ERNIE 4.5 family consists of ten large-scale model variants. They use a heterogeneous Mixture-of-Experts (MoE) architecture that shares some parameters across modalities while dedicating others to a single modality, which supports efficient language and multimodal processing.

Evaluation Benchmarks

No evaluation benchmarks for ERNIE-4.5-VL-28B-A3B available.
