ERNIE-4.5-300B-A47B

Total Parameters

300B

Context Length

131K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

30 Jun 2025

Knowledge Cutoff

Mar 2025

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

64

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

500,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

Swish

Dimensions

Hidden Dimension Size

8,192

Number of Layers

54

FFN Intermediate Size (Dense)

3,584

Multi-Token Prediction Heads

1

Tokenizer

Vocabulary Size

103,424

Mixture of Experts

Active Parameters

47.0B

Number of Experts

64

Active Experts

8

Shared Experts

0

FFN Intermediate Size (per Expert)

3,584

Dense Layers Before MoE

3

Architecture Diagram

Input Tokens → Token Embedding (position: rotary; hidden: 8.2k; context: 131K; vocab: 103.4k) → 54 × [RMSNorm (pre-attention) → Grouped-Query Attention (64 Q / 8 KV heads, head dim 128) → residual add → RMSNorm (pre-FFN) → Sparse MoE FFN (8 of 64 experts, Swish, intermediate 3.6k) → residual add] → Final RMSNorm → Output Logits

ERNIE-4.5-300B-A47B

ERNIE-4.5-300B-A47B is a large-scale Mixture-of-Experts (MoE) foundation model developed by Baidu as a core component of the ERNIE 4.5 family. While the broader series encompasses multimodal capabilities, this specific variant is a text-focused model optimized for advanced natural language understanding, complex reasoning, and high-performance text generation in both English and Chinese. It serves as a high-capacity solution for knowledge-intensive tasks, balancing the expansive knowledge base of a 300-billion parameter system with the computational efficiency of sparse activation.

The technical architecture employs a novel heterogeneous MoE structure that facilitates parameter sharing while utilizing modality-isolated routing to prevent cross-modal interference during pre-training. It features 54 Transformer layers and 64 total experts, with 8 active experts per token, resulting in 47 billion active parameters during inference. The model utilizes Grouped Query Attention (GQA) with 64 query heads and 8 key-value heads to optimize memory bandwidth and throughput. Training was conducted using the PaddlePaddle deep learning framework, incorporating intra-node expert parallelism, memory-efficient pipeline scheduling, and FP8 mixed-precision training to achieve high hardware utilization.
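To make the "8 active experts out of 64" figure concrete, the following is a minimal sketch of generic top-k softmax gating for a single token. It is an illustration only: the function name `route_token`, the random router logits, and the choice to renormalize over the selected experts are assumptions, and the actual ERNIE 4.5 router (including its modality-isolated routing) may differ.

```python
import math
import random

NUM_EXPERTS = 64  # total experts per MoE layer (from the spec above)
TOP_K = 8         # experts activated per token (from the spec above)


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k experts.

    Generic top-k softmax gating; hypothetical, not ERNIE's actual router.
    """
    # Indices of the top_k largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:top_k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for one token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
assignment = route_token(logits)

for expert, weight in assignment:
    print(f"expert {expert:2d}  weight {weight:.3f}")
```

Only the selected experts' feed-forward blocks run for that token, which is why the per-token compute tracks the 47B active parameters rather than the full 300B.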

Operational efficiency is enhanced through support for near-lossless 4-bit and 2-bit quantization, enabling deployment on a variety of hardware configurations including single-card and multi-GPU setups. The model maintains a substantial context window of 131,072 tokens, allowing for the processing of long-form documents and maintaining coherence across extended dialogues. For post-training, the model undergoes Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Unified Preference Optimization (UPO) to align outputs with user instructions and ensure robust performance in production environments.
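A rough sense of why low-bit quantization matters for deployment comes from the weight storage alone. The sketch below assumes all 300B parameters are stored at a uniform bit width and ignores quantization scales, activations, and the KV cache, so real footprints will be somewhat higher; `weight_gb` is a hypothetical helper name.

```python
TOTAL_PARAMS = 300e9  # total parameter count from the spec above


def weight_gb(bits_per_weight, total_params=TOTAL_PARAMS):
    """Approximate weight storage in decimal GB (1 GB = 1e9 bytes).

    Assumes a uniform bit width and no quantization metadata.
    """
    return total_params * bits_per_weight / 8 / 1e9


for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"{label:>6}: {weight_gb(bits):6.1f} GB")
```

Under these assumptions the weights take roughly 600 GB at 16-bit, 150 GB at 4-bit, and 75 GB at 2-bit, which is the gap that separates multi-GPU deployment from single-card deployment.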

About ERNIE 4.5

The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.


Evaluation Benchmarks

No evaluation benchmarks for ERNIE-4.5-300B-A47B available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B+

72 / 100

ERNIE-4.5-300B-A47B Model Integrity Report


Audit Note

ERNIE 4.5 300B-A47B demonstrates a strong commitment to architectural and licensing transparency, providing a detailed technical report and a permissive Apache 2.0 license. While it excels in disclosing parameter density and hardware requirements, it remains vague regarding the specific composition of its training data and the total compute hours consumed. The model's transparency profile is robust for a large-scale MoE system, though further disclosure of evaluation prompts and environmental impact is needed.

Upstream

21.0 / 30

Architectural Provenance

8.0 / 10

The model's architecture is extensively documented in the ERNIE 4.5 Technical Report (2025). It details a novel 'multimodal heterogeneous MoE' structure with 54 Transformer layers and 64 total experts (8 active per token). The report specifies the use of Grouped-Query Attention (GQA) with 64 query heads and 8 KV heads, and modality-isolated routing to prevent cross-modal interference. Although the model was trained from scratch rather than fine-tuned from a known base, the architectural modifications and pre-training procedures are transparently described.

Dataset Composition

4.5 / 10

The technical report mentions a diverse corpus including web documents, books, academic papers, and conversational data, but lacks a specific percentage breakdown (e.g., exact ratios of code vs. web). It provides good detail on the 'human-model-in-the-loop' iterative refinement process and the use of over 10 specialized NLP models for quality scoring (toxicity, factuality), but the specific data sources remain generalized under 'public and in-house data' without naming specific datasets.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via the official GitHub (ERNIEKit) and Hugging Face repositories. It uses a SentencePiece-based approach with a documented vocabulary size of 103,424 tokens. The implementation details, including special tokens (e.g., mask tokens) and BPE pre-tokenization support, are verifiable through the provided source code and configuration files.

Model

28.5 / 40

Parameter Density

9.0 / 10

Baidu is highly transparent regarding parameter density, explicitly stating the 300B total parameter count and the 47B active parameter count for this MoE variant. The technical report further breaks down the expert structure (64 total, 8 active) and the heterogeneous allocation (vision experts being one-third the size of text experts), which is exemplary for MoE transparency.

Training Compute

4.0 / 10

The technical report discloses the training hardware (2,016 NVIDIA H800 GPUs) and reports 47% Model FLOPs Utilization (MFU). However, it does not provide the total training duration in hours or a calculated carbon footprint, both of which are required for a high score. It mentions 'optimal efficiency' and 'limited compute resources' but lacks the raw temporal data needed for independent cost and impact verification.

Benchmark Reproducibility

6.0 / 10

Baidu provides extensive benchmark results across 28 datasets (MMLU-Pro, IFEval, GSM8K, etc.) and specifies versioning (e.g., DeepSeek-V3-0324 comparison). While evaluation code is partially available through the ERNIEKit repository, the exact prompts and few-shot examples for every benchmark are not fully disclosed in a single reproducible manifest, and third-party verification is still pending for the most recent 4.5 claims.

Identity Consistency

9.5 / 10

The model exhibits high identity consistency, correctly identifying as ERNIE 4.5 across various interfaces. It maintains clear versioning within the family (e.g., distinguishing between the 300B-A47B and the 21B-A3B variants). There are no documented cases of the model claiming to be a competitor's system or misrepresenting its MoE nature.

Downstream

22.5 / 30

License Clarity

10.0 / 10

The model is released under a clear, standard Apache 2.0 license, which is explicitly stated in the technical report, GitHub repository, and Hugging Face model cards. This license permits both research and commercial use without the restrictive 'custom' terms often found in other large-scale model releases.

Hardware Footprint

7.5 / 10

Hardware requirements are well-documented, with specific VRAM guidance for different precisions (e.g., 4x 80GB GPUs for 4-bit, 1x 141GB for 2-bit). The documentation also details the impact of quantization (W4A8C8, WINT2) and provides deployment scripts via FastDeploy. Scaling for the 131k context window is mentioned but lacks a detailed memory-per-token scaling chart.
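In the absence of an official memory-per-token chart, a back-of-the-envelope KV-cache estimate can be derived from the listed attention settings. This sketch assumes a 16-bit cache, a head dimension of 128 (as shown in the architecture diagram), and a single sequence; it is not taken from Baidu's documentation, and an 8-bit cache such as the C8 setting would roughly halve these figures.

```python
NUM_LAYERS = 54       # from the spec above
NUM_KV_HEADS = 8      # GQA key-value heads, from the spec above
NUM_Q_HEADS = 64      # query heads, used only for the no-GQA comparison
HEAD_DIM = 128        # from the architecture diagram
BYTES_PER_VALUE = 2   # assumption: 16-bit KV cache


def kv_cache_bytes(tokens, kv_heads=NUM_KV_HEADS,
                   bytes_per_value=BYTES_PER_VALUE):
    """KV-cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * NUM_LAYERS * kv_heads * HEAD_DIM * bytes_per_value * tokens


print(f"per token: {kv_cache_bytes(1) / 1024:.0f} KiB")
for tokens in (1024, 65536, 131072):
    gqa = kv_cache_bytes(tokens) / 2**30
    mha = kv_cache_bytes(tokens, kv_heads=NUM_Q_HEADS) / 2**30
    print(f"{tokens:>7} tokens: {gqa:6.2f} GiB with 8 KV heads, "
          f"{mha:7.2f} GiB if all 64 heads were cached")
```

Under these assumptions the cache grows linearly at 216 KiB per token and reaches 27 GiB at the full 131,072-token context, an eighth of what caching all 64 heads would require.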

Versioning Drift

5.0 / 10

Baidu uses a clear naming convention for variants, but a formal, public-facing changelog tracking specific weight updates or 'silent' alignment shifts is not prominently maintained. While major versions are distinct, the granular tracking of model drift over time is less transparent than the initial release documentation.
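The category subtotals and the overall score can be checked directly against the individual sub-scores listed in this report. The snippet below simply re-adds the published numbers; the dictionary layout is an illustrative choice.

```python
# Sub-scores as listed in the report above (each out of 10).
scores = {
    "Upstream": {
        "Architectural Provenance": 8.0,
        "Dataset Composition": 4.5,
        "Tokenizer Integrity": 8.5,
    },
    "Model": {
        "Parameter Density": 9.0,
        "Training Compute": 4.0,
        "Benchmark Reproducibility": 6.0,
        "Identity Consistency": 9.5,
    },
    "Downstream": {
        "License Clarity": 10.0,
        "Hardware Footprint": 7.5,
        "Versioning Drift": 5.0,
    },
}

subtotals = {name: sum(parts.values()) for name, parts in scores.items()}
total = sum(subtotals.values())

for name, value in subtotals.items():
    print(f"{name}: {value} / {10 * len(scores[name])}")
print(f"Total: {total} / 100")
```

The sums reproduce the stated 21.0 / 30, 28.5 / 40, and 22.5 / 30 subtotals and the 72 / 100 total.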

