
Hunyuan Large

Total Parameters

389B

Active Parameters

52B

Context Length

28K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

Tencent Hunyuan Community License

Release Date

5 Nov 2024

Knowledge Cutoff

Sep 2024

Technical Specifications

Attention

Attention Structure

Multi-Head Attention

Attention Heads

64

Key-Value Heads

64

Attention Head Dimension

-

Position Embedding

Absolute Position Embedding

RoPE Theta

-

Sliding Window Attention

-

Sliding Window Size

-

Normalization

Layer Normalization

Activation Function

GELU

Dimensions

Hidden Dimension Size

4,096

Number of Layers

60

FFN Intermediate Size (Dense)

-

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

-

Mixture of Experts

Total Expert Parameters

52.0B

Number of Experts

32

Active Experts

2

Shared Experts

-

FFN Intermediate Size (per Expert)

-

Dense Layers Before MoE

-

Architecture Diagram

[Diagram: Input Tokens → Token Embedding (position: absolute; hidden size: 4,096; context: 28K) → 60 layers, each consisting of pre-attention LayerNorm → Multi-Head Attention (64 query / 64 KV heads, head dimension 64) → residual add → pre-FFN LayerNorm → Sparse MoE FFN (2 of 32 experts, GELU) → residual add → Final LayerNorm → Output Logits]

Hunyuan Large

Hunyuan-DiT is a large-scale Mixture-of-Experts (MoE) diffusion transformer designed for high-fidelity image generation. Developed by Tencent, it applies a transformer architecture directly in the latent space and synthesizes diverse, high-quality images from text prompts for content creation and visual design. Its modular architecture allows efficient scaling and inference.

Hunyuan-DiT employs a diffusion transformer architecture with a Mixture-of-Experts (MoE) design. The architecture partitions the model's parameters into multiple "experts", and only a subset of them (2 of 32, per the specifications above) is activated for each input token during inference. This lets the model reach a total of approximately 389 billion parameters while keeping approximately 52 billion active per token, which lowers the compute cost of each forward pass. The model has 60 transformer layers with 64 attention heads and uses GELU activation and Layer Normalization. Its design supports flexible image resolutions and uses absolute positional embeddings, integrating Rotary Positional Encoding for enhanced performance. It also combines a bilingual CLIP encoder with a multilingual T5 encoder for robust understanding of text prompts.
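The sparse activation described above can be illustrated with a minimal sketch of generic top-k expert routing. This is not Tencent's implementation: the softmax gate, the renormalization of the selected weights, and the toy scaling "experts" are all assumptions made for illustration, and shared experts are not modeled. Only the counts (32 experts, 2 active, 389B total and 52B active parameters) come from this page.

```python
import math
import random

NUM_EXPERTS = 32  # "Number of Experts" from the specifications
TOP_K = 2         # "Active Experts" from the specifications


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token.

    Returns [(expert_index, weight), ...] ordered by descending weight,
    with the weights renormalized so they sum to 1 (an assumed convention).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, experts, router_logits):
    """Weighted sum of the selected experts' outputs for one token vector x."""
    out = [0.0] * len(x)
    for idx, weight in route_token(router_logits):
        y = experts[idx](x)  # only the selected experts are ever evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


def make_expert(scale):
    """Hypothetical stand-in for an expert FFN: scales its input."""
    return lambda x: [scale * v for v in x]


experts = [make_expert(float(i + 1)) for i in range(NUM_EXPERTS)]
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output

routing = route_token(logits)
output = moe_forward([1.0, 2.0], experts, logits)

# Share of parameters active per token, using the figures quoted on this page.
active_fraction = 52 / 389
print(routing, output, f"{active_fraction:.1%}")
```

With the quoted figures, roughly 13% of the parameters participate in any single token's forward pass, which is where the efficiency claim comes from.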

Hunyuan-DiT is engineered to generate high-resolution, visually consistent images at resolutions up to 4096x4096. Its MoE architecture scales efficiently, which suits deployments that need high output quality at a bounded compute cost. Primary use cases include creative content generation, visual asset production, and text-to-image synthesis for advertising, digital art, and virtual environment design. It also supports multi-turn multimodal dialogue, so users can refine an image iteratively.

About Hunyuan

Tencent Hunyuan large language models with various capabilities.


Evaluation Benchmarks

Rank

#100

Category          Benchmark      Score   Rank

Web Development   WebDev Arena   1326    70

General Text      Text Arena     1326    78

Rankings

Overall Rank

#100

Coding Rank

#78

Model Integrity

Total Score

B-

62 / 100

Hunyuan Large Model Integrity Report

Audit Note

Hunyuan-DiT exhibits strong transparency regarding its Mixture-of-Experts architecture and hardware requirements, providing clear distinctions between total and active parameters. However, it is significantly hampered by a highly restrictive and geographically limited license and a near-total lack of disclosure regarding training compute and specific data sources. While technically well-documented, its legal and environmental transparency profiles remain weak.

Upstream

20.0 / 30

Architectural Provenance

7.5 / 10

The model's architecture is extensively documented in the official technical report and GitHub repository. It is a Mixture-of-Experts (MoE) Diffusion Transformer (DiT) utilizing 60 transformer layers and 64 attention heads. The report details the integration of a bilingual CLIP and a multilingual T5 encoder for text understanding. While the base components are well-described, the specific pre-training methodology for the DiT backbone itself is less detailed than the post-training and dialogue fine-tuning procedures.

Dataset Composition

4.5 / 10

Tencent provides a general overview of the data pipeline, including the use of a Multimodal Large Language Model (MLLM) for caption refinement and a category-balanced dataset (subject, style, scene). However, specific dataset sources, exact proportions of data types, and the total volume of the image-text pairs used for the DiT training are not disclosed. The documentation focuses more on the 'how' of the pipeline rather than the 'what' of the data itself.

Tokenizer Integrity

8.0 / 10

The model uses a combination of standard tokenizers: a bilingual CLIP tokenizer and an mT5 tokenizer (specifically t5-v1_1-xxl). Both are publicly available, and their vocabulary sizes are verifiable through the model files provided on Hugging Face (the related Hunyuan-Large LLM uses a 128K vocabulary, whereas DiT uses the standard CLIP and T5 vocabularies). The approach to handling bilingual prompts is clearly documented.

Model

24.5 / 40

Parameter Density

8.5 / 10

The model is transparent about its MoE structure, explicitly stating a total parameter count of 389 billion with approximately 52 billion active parameters. The architectural breakdown (60 layers, 64 heads) and the expert configuration (shared vs. specialized experts) are clearly provided in the technical documentation, allowing for a precise understanding of the model's density and inference efficiency.

Training Compute

2.0 / 10

There is almost no specific information regarding the total compute resources used for training. While the technical report mentions testing on NVIDIA V100 and A100 GPUs, it fails to disclose the total GPU hours, the specific hardware cluster size used for the full training run, the training duration, or the environmental impact/carbon footprint.

Benchmark Reproducibility

5.0 / 10

Tencent provides a custom evaluation protocol involving more than 50 human evaluators and some automated metrics. It names the benchmarks and reports some comparison results against DALL-E 3 and Midjourney, but the published prompts and evaluation code are less comprehensive than those of top-tier open-source projects. Third-party verification is limited primarily to community-run leaderboards.

Identity Consistency

9.0 / 10

The model and its documentation maintain a consistent identity as 'Hunyuan-DiT' or 'Hunyuan-Large' (depending on the variant). It does not attempt to masquerade as a competitor's model and is transparent about its nature as a Tencent-developed AI. Versioning (v1.1, v1.2) is clearly communicated in the repository and model cards.

Downstream

17.0 / 30

License Clarity

4.0 / 10

The model uses the 'Tencent Hunyuan Community License'. While the terms for commercial use (requiring a separate license if exceeding 100M MAU) are stated, the license contains highly restrictive and unusual geographic clauses. Specifically, it explicitly excludes the European Union, UK, and South Korea from the 'Territory' where the model can be used or its outputs displayed, creating significant legal ambiguity for global users.

Hardware Footprint

7.0 / 10

Hardware requirements are well-documented for various use cases. The repository provides specific VRAM estimates for standard inference (32GB recommended, 11GB minimum) and offers a 'low-VRAM' version capable of running on 6GB. Quantization options (8-bit T5) and acceleration libraries (TensorRT) are also documented with their respective performance impacts.

Versioning Drift

6.0 / 10

The project maintains a clear version history (v1.0, v1.1, v1.2) with a changelog on GitHub. Updates include specific bug fixes (e.g., mitigating oversaturation) and feature additions (ControlNet, IP-Adapter). However, the documentation of performance drift between versions is qualitative rather than quantitative, and older versions are not always easily accessible for comparative testing.
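The category and total scores in this report are consistent with plain sums of the sub-scores. The short check below reproduces that arithmetic from the figures listed above. The aggregation rule itself (simple addition, then rounding half up to a whole number) is an assumption inferred from the numbers, not a documented scoring method.

```python
import math

# Sub-scores as listed in the integrity report (each out of 10).
scores = {
    "Upstream": {
        "Architectural Provenance": 7.5,
        "Dataset Composition": 4.5,
        "Tokenizer Integrity": 8.0,
    },
    "Model": {
        "Parameter Density": 8.5,
        "Training Compute": 2.0,
        "Benchmark Reproducibility": 5.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 4.0,
        "Hardware Footprint": 7.0,
        "Versioning Drift": 6.0,
    },
}

# Assumed rule: a category score is the sum of its sub-scores.
category_totals = {name: sum(parts.values()) for name, parts in scores.items()}

# Assumed rule: the total is the sum of the categories, rounded half up.
total = sum(category_totals.values())
rounded_total = math.floor(total + 0.5)

for name, value in category_totals.items():
    print(f"{name}: {value} / {10 * len(scores[name])}")
print(f"Total: {total} -> {rounded_total} / 100")
```

Under these assumptions the categories come to 20.0, 24.5, and 17.0, and the unrounded total of 61.5 matches the reported 62 / 100.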
