
ERNIE-4.5-VL-28B-A3B-Base

Total Parameters

28B

Context Length

131,072 tokens

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

30 Jun 2025

Knowledge Cutoff

Nov 2024

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

20

Key-Value Heads

4

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

500,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

2,560

Number of Layers

28

FFN Intermediate Size (Dense)

12,288

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

103,424

Mixture of Experts

Active Parameters (per Token)

3.0B

Number of Experts

130

Active Experts

14

Shared Experts

2

FFN Intermediate Size (per Expert)

-

Dense Layers Before MoE

-

Architecture Diagram

Input Tokens -> Token Embedding (position: RoPE; hidden: 2,560; context: 131,072; vocab: 103,424) -> 28 x [RMSNorm (pre-attention) -> Grouped-Query Attention (20 Q / 4 KV heads, head dim 128) -> residual add -> RMSNorm (pre-FFN) -> Sparse MoE FFN (14 of 130 experts, SwiGLU) -> residual add] -> Final RMSNorm -> Output Logits
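As a rough illustration of what the 4 key-value heads buy, the sketch below estimates the key-value cache for one full-length sequence from the figures listed above (28 layers, head dimension 128, 131,072-token context) and compares it with a hypothetical layout in which all 20 heads kept their own keys and values. The 2-byte element size (FP16/BF16) and the helper name `kv_cache_bytes` are assumptions made for the example, not values taken from the model card.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


GIB = 1024 ** 3

# Grouped-query layout from the spec table: 4 KV heads shared by 20 query heads.
gqa = kv_cache_bytes(layers=28, kv_heads=4, head_dim=128, tokens=131_072)
# Hypothetical multi-head layout: every one of the 20 heads has its own K/V.
mha = kv_cache_bytes(layers=28, kv_heads=20, head_dim=128, tokens=131_072)

print(f"GQA, 4 KV heads:  {gqa / GIB:.1f} GiB")
print(f"MHA, 20 KV heads: {mha / GIB:.1f} GiB")
```

Under these assumptions the cache shrinks by the ratio of query heads to key-value heads (20 / 4 = 5). The estimate ignores any cache used by the vision encoder.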

ERNIE-4.5-VL-28B-A3B-Base

ERNIE-4.5-VL-28B-A3B-Base is a multimodal Mixture-of-Experts (MoE) foundation model developed by Baidu as part of the ERNIE 4.5 model family. Built for vision-language tasks, it has 28 billion total parameters but activates only 3 billion per token during inference. This sparse activation retains the knowledge capacity of the full parameter set while reducing per-token compute and latency relative to a dense model of the same size. The model processes text, images, and video, and supports a context length of up to 131,072 tokens.
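To put the sparse-activation numbers in perspective, the following sketch computes the fraction of parameters used per token and a rough forward-pass cost. The "2 FLOPs per parameter per token" rule of thumb is an assumption of this example (it ignores the attention cost that grows with context length), and the function name is illustrative.

```python
TOTAL_PARAMS = 28e9   # total parameters, from the description above
ACTIVE_PARAMS = 3e9   # parameters activated per token, from the description above


def forward_flops_per_token(params):
    """Approximate forward-pass FLOPs per token for `params` weights."""
    # Assumed rule of thumb: one multiply and one add per weight per token.
    return 2 * params


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
sparse_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)

print(f"Active fraction: {active_fraction:.1%}")
print(f"Sparse forward pass: ~{sparse_flops / 1e9:.0f} GFLOPs per token")
print(f"Dense 28B equivalent: ~{dense_flops / 1e9:.0f} GFLOPs per token")
```

Only the compute scales with the active parameters; all 28 billion weights still have to be stored, since any expert may be selected for the next token.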

The ERNIE-4.5-VL series uses a heterogeneous MoE structure that shares some parameters across modalities and dedicates others to individual modalities. Modality-isolated routing keeps textual and visual learning from interfering with each other, while a router orthogonal loss and a multimodal token-balanced loss keep expert utilization stable. The model uses Grouped-Query Attention (GQA) to reduce key-value cache memory and Rotary Position Embeddings (RoPE) to handle long contexts. Training runs on the PaddlePaddle deep learning framework with intra-node expert parallelism and FP8 mixed-precision training.
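For readers unfamiliar with RoPE, here is a minimal, generic sketch of how rotary embeddings encode position, using the RoPE theta of 500,000 from the specification table. It is not Baidu's implementation: the interleaved pairing of dimensions, the tiny head dimension, and all function names are assumptions chosen to keep the example short.

```python
import math


def rope_inv_freq(head_dim, theta=500_000.0):
    """One rotation frequency per pair of dimensions."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, pos, inv_freq):
    """Rotate consecutive (even, odd) pairs of `vec` by the angle pos * inv_freq[i]."""
    out = []
    for i, freq in enumerate(inv_freq):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(pos * freq), math.sin(pos * freq)
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


inv_freq = rope_inv_freq(head_dim=8)
q = [0.3, -1.2, 0.7, 0.1, 2.0, -0.5, 0.9, 1.1]
k = [1.0, 0.4, -0.6, 0.8, -0.2, 1.5, 0.3, -0.7]

# The attention score depends only on the relative offset (2 in both cases).
near = dot(apply_rope(q, 5, inv_freq), apply_rope(k, 3, inv_freq))
far = dot(apply_rope(q, 102, inv_freq), apply_rope(k, 100, inv_freq))
print(math.isclose(near, far, abs_tol=1e-9))
```

A larger theta lowers the rotation frequencies of the later dimension pairs, so their angles wrap around more slowly as the position grows.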

In operation, ERNIE-4.5-VL-28B-A3B-Base serves as a backbone for applications that require cross-modal reasoning. It supports two modes: a "thinking" mode for enhanced logical reasoning and a "non-thinking" mode optimized for perceptual tasks such as document analysis, optical character recognition (OCR), and visual knowledge retrieval. It can also act as an agent, calling external tools for fine-grained image zooming or search. The weights are released under the Apache 2.0 license, so developers and researchers can deploy the model across a range of hardware platforms.

About ERNIE 4.5

The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
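The idea of combining shared parameters with modality-specific ones can be sketched as a toy router. Everything concrete here is an assumption for illustration: the pool sizes (4 text, 4 vision, 2 shared experts), the top-2 selection, the renormalized weights, and the fixed weight of 1.0 for shared experts are not the model's real configuration or Baidu's routing code.

```python
import math

# Toy expert layout (assumed): each modality has its own pool of routed experts.
EXPERT_POOLS = {"text": [0, 1, 2, 3], "vision": [4, 5, 6, 7]}
# Shared experts process every token regardless of modality.
SHARED_EXPERTS = [8, 9]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(modality, router_logits, top_k=2):
    """Return {expert_id: weight} for one token.

    `router_logits` holds one score per expert in the token's own modality
    pool, so a text token can never be sent to a vision expert or vice versa.
    """
    pool = EXPERT_POOLS[modality]
    probs = softmax(router_logits)
    ranked = sorted(range(len(pool)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in ranked)
    weights = {pool[i]: probs[i] / norm for i in ranked}
    for expert in SHARED_EXPERTS:
        weights[expert] = 1.0
    return weights


logits = [0.1, 2.0, -1.0, 1.5]
print(route("text", logits))
print(route("vision", logits))
```

The same router scores select different experts depending on the token's modality, while the shared experts appear in both results.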



Evaluation Benchmarks

No evaluation benchmarks for ERNIE-4.5-VL-28B-A3B-Base available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B

67 / 100

ERNIE-4.5-VL-28B-A3B-Base Model Integrity Report


Audit Note

ERNIE-4.5-VL-28B-A3B-Base demonstrates strong transparency in its architectural design and licensing, particularly regarding its Mixture-of-Experts parameter density and its open-source Apache 2.0 status. However, it remains opaque concerning its specific training data sources and the total compute resources utilized during development. While technical documentation is available, the reproducibility of its benchmark claims relies heavily on vendor-provided tools without exhaustive public verification.

Upstream

19.5 / 30

Architectural Provenance

7.5 / 10

The model is explicitly identified as a multimodal Mixture-of-Experts (MoE) transformer within the ERNIE 4.5 family. Baidu provides a technical report and GitHub documentation detailing a 'heterogeneous MoE' structure that uses modality-isolated routing to separate visual and textual processing. It specifies the use of Grouped-Query Attention (GQA), Rotary Position Embeddings (RoPE), and a variable-resolution Vision Transformer (ViT) encoder. While the high-level architecture is well-documented, specific layer-by-layer configurations and the exact pre-training data mixture are not fully disclosed.

Dataset Composition

4.0 / 10

Baidu mentions the use of a 'vast and highly diverse corpus' of visual-language reasoning data and 'premium' datasets during a mid-training phase. However, there is no detailed breakdown of the data sources (e.g., specific web crawls, book datasets, or code repositories) or the exact proportions of each modality. The filtering and cleaning methodologies are described in general terms ('systematic data construction') without providing verifiable metrics or access to sample data.

Tokenizer Integrity

8.0 / 10

The model uses a tokenizer compatible with the PaddlePaddle and Transformers frameworks, with vocabulary and implementation details available through the official ERNIEKit and Hugging Face repositories. It supports a context length of 131,072 tokens. While the tokenizer's code is public, detailed documentation on its specific training data alignment and normalization procedures is less comprehensive than the architectural details.

Model

26.0 / 40

Parameter Density

9.0 / 10

Baidu is highly transparent regarding the MoE parameter distribution, explicitly stating a total of 28 billion parameters with 3 billion active parameters per token during inference. The documentation distinguishes between shared experts and dedicated modality-specific experts. This level of detail regarding sparse activation is exemplary compared to many competitors who only disclose total counts.

Training Compute

3.0 / 10

Documentation confirms the use of the PaddlePaddle framework and mentions optimizations for NVIDIA Hopper (FP8) and Ampere (INT8) architectures. However, specific compute metrics such as total GPU/TPU hours, the number of chips used, training duration, and the estimated carbon footprint are conspicuously absent from public reports.

Benchmark Reproducibility

5.0 / 10

Baidu provides results for standard benchmarks like MathVista, ChartQA, and OCRBench, and includes some evaluation scripts within the ERNIEKit repository. However, the exact prompts, few-shot examples, and specific versions for all benchmarks are not consistently detailed. Independent third-party verification is limited, and some results remain vendor-published without full reproduction instructions.

Identity Consistency

9.0 / 10

The model consistently identifies as part of the ERNIE 4.5 family and maintains clear versioning between its 'Thinking' and 'Base' variants. It is transparent about its multimodal nature and its specific 'thinking' vs 'non-thinking' operational modes. There are no documented instances of the model claiming to be a competitor's product or misrepresenting its core identity.

Downstream

21.5 / 30

License Clarity

9.5 / 10

The model weights and associated code are released under the Apache License 2.0, which is a standard, permissive open-source license. This allows for both commercial and non-commercial use with clear terms. The license is prominently displayed on Hugging Face, GitHub, and in the technical report, with no conflicting proprietary terms found in the primary documentation.

Hardware Footprint

7.0 / 10

Baidu provides specific VRAM requirements for different deployment scenarios, noting that full FP16 inference on a single card requires 80 GB, while WINT8 quantization reduces this to approximately 60 GB. It also provides guidance for multi-GPU setups and vLLM integration. More detailed scaling data for different context lengths and batch sizes would be needed for a higher score.
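A back-of-the-envelope lower bound on weight memory can be computed from the parameter count alone. The sketch below assumes all 28 billion parameters are stored at a uniform precision and counts weights only. The figures quoted above are higher because they also cover activations, the key-value cache, and runtime overhead. The function name is illustrative.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# All experts must be resident, so the total count applies, not the 3B active.
TOTAL_PARAMS = 28e9

for name, bits in [("FP16", 16), ("INT8", 8)]:
    print(f"{name}: {weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB for weights")
```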

Versioning Drift

5.0 / 10

The model follows a versioned release cycle (e.g., v1.0 to v1.5 of ERNIEKit) and maintains a changelog on GitHub. However, the documentation of 'silent' updates to the weights or changes in safety filtering is sparse. There is no formal system for tracking performance drift over time or a clear policy for accessing deprecated versions of the weights.
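The category subtotals and the overall score in this report are plain sums of the criterion scores. The short check below reproduces them from the values listed above. The dictionary layout is only an illustration, and the mapping from 67 / 100 to the letter grade B is not described on this page, so it is not modeled.

```python
SCORES = {
    "Upstream": {
        "Architectural Provenance": 7.5,
        "Dataset Composition": 4.0,
        "Tokenizer Integrity": 8.0,
    },
    "Model": {
        "Parameter Density": 9.0,
        "Training Compute": 3.0,
        "Benchmark Reproducibility": 5.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 9.5,
        "Hardware Footprint": 7.0,
        "Versioning Drift": 5.0,
    },
}

# Each criterion is scored out of 10, so a category's maximum is 10 per criterion.
subtotals = {name: sum(items.values()) for name, items in SCORES.items()}
maxima = {name: 10 * len(items) for name, items in SCORES.items()}
total = sum(subtotals.values())

for name in SCORES:
    print(f"{name}: {subtotals[name]} / {maxima[name]}")
print(f"Total: {total} / {sum(maxima.values())}")
```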
