Parameters: 52B
Context Length: 256K
Modality: Text
Architecture: Hybrid Transformer-Mamba MoE
License: -
Release Date: 16 Jul 2025
Knowledge Cutoff: Dec 2024
Attention Structure: Grouped-Query Attention
Hidden Dimension Size: -
Number of Layers: 128
Attention Heads: -
Key-Value Heads: -
Activation Function: -
Normalization: -
Position Embedding: Absolute Position Embedding
Tencent Hunyuan TurboS represents a significant advancement in large language models, engineered to deliver both rapid response times and robust reasoning capabilities. The model adopts a dual cognitive approach, analogous to human "fast thinking" and "slow thinking": it returns near-instantaneous replies for the broad run of queries while reserving deeper, step-by-step reasoning for harder ones. Its design prioritizes efficiency and responsiveness, making it suitable for applications that demand quick, high-quality interactions, and it balances that speed with the capacity to address complex informational and analytical tasks.
Architecturally, Hunyuan TurboS is a novel hybrid Transformer-Mamba Mixture of Experts (MoE) model. This fusion combines the strengths of Mamba2 layers, which process long sequences efficiently with a reduced KV-cache memory footprint, with the Transformer's established capacity for deep contextual understanding. The model comprises 128 layers in total: 57 Mamba2 layers, 7 Attention layers, and 64 Feed-Forward Network (FFN) layers. The FFN layers use an MoE structure with 32 experts, of which each token activates 1 shared and 2 specialized experts, improving computational efficiency. The model also employs Grouped-Query Attention (GQA) to reduce memory usage and computational overhead during inference.
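As a rough illustration of the layer budget and expert-activation pattern just described, here is a minimal Python sketch. It is not Tencent's implementation; the dataclass and all identifiers are hypothetical, and only the numbers stated above are taken from the description.

```python
# Minimal sketch of the published layer counts and MoE activation pattern.
# Purely illustrative; all identifiers are hypothetical.
from dataclasses import dataclass

@dataclass
class TurboSLayout:
    total_layers: int = 128
    mamba2_layers: int = 57          # efficient long-sequence layers
    attention_layers: int = 7        # Transformer attention (GQA) layers
    ffn_moe_layers: int = 64         # feed-forward layers with MoE routing
    experts_per_ffn: int = 32        # experts available in each MoE FFN layer
    shared_experts_active: int = 1   # always-on expert per token
    routed_experts_active: int = 2   # specialized experts selected per token

    def active_experts_per_token(self) -> int:
        # Each token passes through 1 shared + 2 routed experts per MoE layer.
        return self.shared_experts_active + self.routed_experts_active

layout = TurboSLayout()
assert (layout.mamba2_layers + layout.attention_layers
        + layout.ffn_moe_layers) == layout.total_layers
print(f"{layout.active_experts_per_token()} of {layout.experts_per_ffn} "
      "experts active per token in each MoE FFN layer")
```

The check simply confirms that the three layer types account for all 128 layers, and that only 3 of the 32 experts do work for any given token in an MoE FFN layer.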
Hunyuan TurboS is designed to handle extensive information, supporting an ultra-long context length of 256,000 tokens. This capability allows the model to maintain performance across lengthy documents and extended dialogues. Its post-training strategy includes supervised fine-tuning and adaptive long-short Chain-of-Thought (CoT) fusion, enabling dynamic switching between rapid responses for simple queries and more analytical, step-by-step processing for intricate problems. The model is deployed for various applications requiring efficient, high-performance AI, such as advanced conversational agents, content generation, and sophisticated analytical systems.
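The long-short CoT fusion itself is learned during post-training, but its external behavior can be pictured with a toy dispatcher like the one below. The heuristic, the prompts, and the `generate` callable are invented for illustration and are not part of Hunyuan TurboS.

```python
# Toy illustration of adaptive long/short chain-of-thought routing.
# In the real model this switching is learned, not rule-based.

def answer(query: str, generate) -> str:
    """Route a query to a fast reply or to step-by-step reasoning.

    `generate` stands in for a call to the model: any callable that takes a
    prompt string and returns generated text (hypothetical interface).
    """
    reasoning_cues = ("prove", "derive", "step by step", "analyze", "compare")
    needs_long_cot = (len(query.split()) > 40
                      or any(cue in query.lower() for cue in reasoning_cues))

    if needs_long_cot:
        # Complex request: ask for explicit intermediate reasoning (long CoT).
        return generate(f"Think through this step by step, then answer:\n{query}")
    # Simple request: answer directly for minimal latency (short CoT).
    return generate(f"Answer concisely:\n{query}")

# Example usage with a stub in place of a real model call:
print(answer("What is the capital of France?", lambda p: f"[model reply to: {p!r}]"))
```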
Hunyuan TurboS belongs to Tencent's Hunyuan family of large language models. No evaluation benchmark results are available for Hunyuan TurboS yet, so no overall or coding rank is listed.