Total Parameters
300B
Context Length
131K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
30 Jun 2025
Knowledge Cutoff
Mar 2025
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
64
Key-Value Heads
8
Attention Head Dimension
-
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
500,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
Swish
Dimensions
Hidden Dimension Size
8,192
Number of Layers
54
FFN Intermediate Size (Dense)
3,584
Multi-Token Prediction Heads
1
Tokenizer
Vocabulary Size
103,424
Mixture of Experts
Active Parameters (per Token)
47.0B
Number of Experts
64
Active Experts
8
Shared Experts
0
FFN Intermediate Size (per Expert)
3,584
Dense Layers Before MoE
3
ERNIE-4.5-300B-A47B is a large-scale Mixture-of-Experts (MoE) foundation model developed by Baidu as part of the ERNIE 4.5 family. While the broader series includes multimodal models, this variant is text-only and targets natural language understanding, complex reasoning, and text generation in both English and Chinese. It is aimed at knowledge-intensive tasks, pairing the capacity of 300 billion total parameters with the efficiency of sparse activation: only about 47 billion parameters are active for any given token.
The technical architecture employs a novel heterogeneous MoE structure that facilitates parameter sharing while utilizing modality-isolated routing to prevent cross-modal interference during pre-training. It features 54 Transformer layers and 64 total experts, with 8 active experts per token, resulting in 47 billion active parameters during inference. The model utilizes Grouped Query Attention (GQA) with 64 query heads and 8 key-value heads to optimize memory bandwidth and throughput. Training was conducted using the PaddlePaddle deep learning framework, incorporating intra-node expert parallelism, memory-efficient pipeline scheduling, and FP8 mixed-precision training to achieve high hardware utilization.
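The memory benefit of Grouped-Query Attention can be made concrete with a rough key-value cache estimate. The sketch below uses the layer and head counts listed above. The head dimension is not listed on this page, so it is assumed here to be hidden size divided by query heads (8,192 / 64 = 128), and a 2-byte (BF16) cache is also an assumption.

```python
# Rough KV-cache sizing for one sequence. Layer and head counts come from the
# spec table; HEAD_DIM and the 2-byte cache precision are assumptions.
HIDDEN_SIZE = 8192
QUERY_HEADS = 64
HEAD_DIM = HIDDEN_SIZE // QUERY_HEADS  # assumption: 128, not stated on the page


def kv_cache_bytes(tokens, layers=54, kv_heads=8, head_dim=HEAD_DIM, bytes_per_value=2):
    """Bytes of key/value cache held for a single sequence of `tokens` tokens."""
    # Two tensors (K and V) per layer, each storing kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


per_token = kv_cache_bytes(1)
full_context_gqa = kv_cache_bytes(131_072)               # 8 KV heads (GQA)
full_context_mha = kv_cache_bytes(131_072, kv_heads=64)  # hypothetical: one KV head per query head

print(f"per token: {per_token} bytes")
print(f"131,072 tokens, GQA: {full_context_gqa / 2**30:.1f} GiB")
print(f"131,072 tokens, full multi-head: {full_context_mha / 2**30:.1f} GiB")
```

Under these assumptions the cache costs 221,184 bytes per token, or 27 GiB for a full 131,072-token sequence, which is one eighth of what 64 key-value heads would need. Actual deployments may differ, for example when the cache itself is quantized.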
Operational efficiency is enhanced through support for near-lossless 4-bit and 2-bit quantization, enabling deployment on a variety of hardware configurations including single-card and multi-GPU setups. The model maintains a substantial context window of 131,072 tokens, allowing for the processing of long-form documents and maintaining coherence across extended dialogues. For post-training, the model undergoes Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Unified Preference Optimization (UPO) to align outputs with user instructions and ensure robust performance in production environments.
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are available for ERNIE-4.5-300B-A47B.
Overall Rank
-
Coding Rank
-
Total Score
72 / 100
ERNIE-4.5-300B-A47B shows a strong commitment to architectural and licensing transparency, with a detailed technical report and a permissive Apache 2.0 license. It is clear about parameter density and hardware requirements, but vague about the composition of its training data and the total compute hours consumed. The transparency profile is robust for a large-scale MoE system, though evaluation prompts and environmental impact still need fuller disclosure.
Architectural Provenance
The model's architecture is extensively documented in the ERNIE 4.5 Technical Report (2025). It details a novel 'multimodal heterogeneous MoE' structure with 54 Transformer layers and 64 total experts (8 active per token). The report specifies the use of Grouped-Query Attention (GQA) with 64 query heads and 8 KV heads, and modality-isolated routing to prevent cross-modal interference. The model was trained from scratch rather than fine-tuned from a known base, and the architectural modifications and pre-training procedures are transparently described.
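To illustrate what "8 active experts out of 64" means per token, here is a minimal sketch of generic top-k softmax gating. It is not taken from the ERNIE 4.5 report: the function name `route_top_k`, the random router scores, and the choice to normalize weights over only the selected experts are all assumptions, and modality-isolated routing is not modeled.

```python
import math
import random


def route_top_k(logits, k=8):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Generic top-k gating for illustration only; not confirmed to match
    the router used in ERNIE 4.5.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
router_logits = [random.gauss(0.0, 1.0) for _ in range(64)]  # hypothetical scores for one token
chosen = route_top_k(router_logits, k=8)

for expert_id, weight in chosen:
    print(f"expert {expert_id:2d} weight {weight:.3f}")
```

Only the 8 selected experts run their feed-forward block for that token, which is why the active parameter count stays far below the total.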
Dataset Composition
The technical report mentions a diverse corpus including web documents, books, academic papers, and conversational data, but lacks a specific percentage breakdown (e.g., exact ratios of code vs. web). It provides good detail on the 'human-model-in-the-loop' iterative refinement process and the use of over 10 specialized NLP models for quality scoring (toxicity, factuality), but the specific data sources remain generalized under 'public and in-house data' without naming specific datasets.
Tokenizer Integrity
The tokenizer is publicly accessible via the official GitHub (ERNIEKit) and Hugging Face repositories. It uses a SentencePiece-based approach with a documented vocabulary size of 103,424 tokens. The implementation details, including special tokens (e.g., mask tokens) and BPE pre-tokenization support, are verifiable through the provided source code and configuration files.
Parameter Density
Baidu is highly transparent regarding parameter density, explicitly stating the 300B total parameter count and the 47B active parameter count for this MoE variant. The technical report further breaks down the expert structure (64 total, 8 active) and the heterogeneous allocation (vision experts being one-third the size of text experts), which is exemplary for MoE transparency.
Training Compute
The technical report discloses the hardware used (2,016 NVIDIA H800 GPUs) and reports a Model FLOPs Utilization (MFU) of 47%. However, it does not provide the total training duration in hours or a calculated carbon footprint, both of which are required for a high score. It mentions 'optimal efficiency' and 'limited compute resources' but lacks the raw temporal data needed for independent cost and impact verification.
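The relationship between the disclosed figures and the missing ones can be written out as a worked equation. This is a sketch using the common approximation of 6 FLOPs per active parameter per training token, which ignores attention FLOPs. Here $G$ is the GPU count, $P_{\text{peak}}$ the per-GPU peak throughput in FLOP/s, $D$ the number of training tokens, and $T$ the wall-clock training time in seconds. $D$, $T$, and $P_{\text{peak}}$ are not given on this page.

```latex
\[
\mathrm{MFU} = \frac{F_{\text{model}}}{G \cdot P_{\text{peak}}},
\qquad
F_{\text{model}} \approx \frac{6\, N_{\text{active}}\, D}{T}
\]
\[
\text{GPU-hours} = \frac{G \cdot T}{3600}
\approx \frac{6\, N_{\text{active}}\, D}{3600 \cdot \mathrm{MFU} \cdot P_{\text{peak}}}
\]
```

With $\mathrm{MFU} = 0.47$ and $N_{\text{active}} \approx 47 \times 10^{9}$ known, GPU-hours could be estimated once the token count and the applicable peak throughput are disclosed. Without them the estimate cannot be completed, which is the gap this category penalizes.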
Benchmark Reproducibility
Baidu provides extensive benchmark results across 28 datasets (MMLU-Pro, IFEval, GSM8K, etc.) and specifies the versions of the models it compares against (e.g., DeepSeek-V3-0324). While evaluation code is partially available through the ERNIEKit repository, the exact prompts and few-shot examples for every benchmark are not disclosed in a single reproducible manifest, and third-party verification of the most recent 4.5 claims is still pending.
Identity Consistency
The model exhibits high identity consistency, correctly identifying as ERNIE 4.5 across various interfaces. It maintains clear versioning within the family (e.g., distinguishing between the 300B-A47B and the 21B-A3B variants). There are no documented cases of the model claiming to be a competitor's system or misrepresenting its MoE nature.
License Clarity
The model is released under a clear, standard Apache 2.0 license, which is explicitly stated in the technical report, GitHub repository, and Hugging Face model cards. This license permits both research and commercial use without the restrictive 'custom' terms often found in other large-scale model releases.
Hardware Footprint
Hardware requirements are well documented, with specific VRAM guidance for different precisions (e.g., 4x 80GB GPUs for 4-bit, 1x 141GB for 2-bit). The documentation also details the impact of quantization (W4A8C8, WINT2) and provides deployment scripts via FastDeploy. Scaling for the 131K context window is mentioned, but no detailed memory-per-token scaling chart is provided.
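The VRAM guidance above can be sanity-checked with back-of-the-envelope arithmetic on weight storage alone. The sketch below assumes every one of the 300B parameters is stored at the named bit width and uses decimal gigabytes. It ignores quantization scales, activations, and the KV cache, so real requirements will be higher.

```python
# Approximate weight-only memory at several precisions. Assumes a uniform
# bit width across all parameters, which real quantization schemes may not use.
GB = 10**9
TOTAL_PARAMS = 300 * 10**9


def weight_gb(params, bits):
    """Decimal gigabytes needed to store `params` weights at `bits` bits each."""
    return params * bits / 8 / GB


for name, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"{name:>5}: {weight_gb(TOTAL_PARAMS, bits):6.1f} GB")

# Compare against the GPU configurations quoted in the text above.
fits_4bit = weight_gb(TOTAL_PARAMS, 4) < 4 * 80   # 4x 80GB GPUs
fits_2bit = weight_gb(TOTAL_PARAMS, 2) < 1 * 141  # 1x 141GB GPU
print(fits_4bit, fits_2bit)
```

By this estimate the 4-bit weights take roughly 150 GB and the 2-bit weights roughly 75 GB, which is consistent with the quoted configurations leaving headroom for the cache and activations.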
Versioning Drift
Baidu uses a clear naming convention for variants, but a formal, public-facing changelog tracking specific weight updates or 'silent' alignment shifts is not prominently maintained. While major versions are distinct, the granular tracking of model drift over time is less transparent than the initial release documentation.