
Qwen2.5-32B

Parameters

32B

Context Length

131,072 tokens

Modality

Text

Architecture

Dense

License

Apache 2.0

Release Date

19 Sept 2024

Knowledge Cutoff

Mar 2024

Technical Specifications

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

5120

Number of Layers

64

Attention Heads

40

Key-Value Heads

8

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

RoPE

Qwen2.5-32B

The Qwen2.5-32B model is a significant component of the Qwen2.5 series of large language models, developed by the Qwen team at Alibaba Cloud. This iteration builds upon its predecessors by offering enhanced capabilities for a broad spectrum of natural language processing tasks. Its design prioritizes robust instruction following, effective long-text generation, and sophisticated comprehension and production of structured data, including JSON formats. The model also demonstrates improved stability when confronted with diverse system prompts, which is advantageous for developing conversational agents and setting specific dialogue conditions. Furthermore, it provides comprehensive multilingual support across more than 29 languages, expanding its applicability in global contexts.

Architecturally, Qwen2.5-32B is a dense, decoder-only transformer model. It integrates several advanced components to optimize performance and efficiency. These include Rotary Position Embeddings (RoPE) for effective positional encoding, SwiGLU as the activation function for enhanced non-linearity, and RMSNorm for stable training and improved convergence. To optimize inference speed and Key-Value cache utilization, the model employs Grouped Query Attention (GQA). The underlying training regimen involved a massive dataset, expanded to approximately 18 trillion tokens, which contributed to its enriched knowledge base, particularly in domains such as coding, mathematics, and various languages.
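Two of the components named above, RMSNorm and the SwiGLU feed-forward block, are compact enough to sketch directly. The following is a minimal NumPy illustration of both, not the production implementation; the toy dimensions are placeholders (the real model uses hidden size 5120 and a much larger intermediate size).

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the reciprocal root-mean-square; unlike LayerNorm,
    # no mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: swish(x @ W_gate) gates (x @ W_up),
    # then the result is projected back to the hidden size.
    gate = x @ w_gate
    swish = gate * (1.0 / (1.0 + np.exp(-gate)))  # SiLU / swish activation
    return (swish * (x @ w_up)) @ w_down

# Toy dimensions for illustration only.
rng = np.random.default_rng(0)
hidden, inter = 8, 16
x = rng.standard_normal((2, hidden))
y = swiglu_ffn(rms_norm(x, np.ones(hidden)),
               rng.standard_normal((hidden, inter)),
               rng.standard_normal((hidden, inter)),
               rng.standard_normal((inter, hidden)))
print(y.shape)  # (2, 8)
```

The gating structure is why SwiGLU layers carry three weight matrices (gate, up, down) rather than the two of a classic MLP block.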

The operational characteristics of Qwen2.5-32B demonstrate notable performance across various complex tasks. This model variant is adept at handling extended contexts, supporting sequences up to 131,072 tokens. Its ability to generate long texts, with outputs extending up to 8,192 tokens, makes it suitable for applications requiring detailed responses or extensive content creation. While the base model is general-purpose, the architectural foundations of Qwen2.5 have also been utilized in specialized variants, such as those optimized for coding or multimodal vision-language tasks, underscoring the versatility of the Qwen2.5 framework.
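The benefit of GQA at this context length is easy to quantify. The sketch below estimates the fp16 KV-cache footprint at the full 131,072-token context, using the geometry given in the architectural provenance notes later on this page (64 layers, 8 KV heads, head dimension 128 = 5120 / 40 query heads); the comparison against hypothetical full multi-head attention is an illustrative assumption, not an official figure.

```python
# KV-cache footprint at full context, fp16 (2 bytes per value).
layers, head_dim = 64, 5120 // 40  # head_dim = 128
seq_len, bytes_per = 131_072, 2

def kv_cache_gib(n_kv_heads):
    # Factor of 2 covers both keys and values.
    return 2 * layers * n_kv_heads * head_dim * seq_len * bytes_per / 2**30

gqa = kv_cache_gib(8)    # grouped-query attention: 8 KV heads
mha = kv_cache_gib(40)   # hypothetical full MHA: one KV head per query head
print(f"GQA: {gqa:.0f} GiB, MHA: {mha:.0f} GiB")  # GQA: 32 GiB, MHA: 160 GiB
```

Sharing each KV head across five query heads cuts the cache by 5x, which is what makes 131K-token serving practical on realistic hardware.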

About Qwen2.5

Qwen2.5 by Alibaba is a family of dense, decoder-only language models available in various sizes, with some variants utilizing Mixture-of-Experts. These models are pretrained on large-scale datasets, supporting extended context lengths and multilingual communication. The family includes specialized models for coding, mathematics, and multimodal tasks, such as vision and audio processing.



Evaluation Benchmarks

Rank

#91

Benchmark: MMLU (General Knowledge)

Score: 0.83

Rank: #15

Rankings

Overall Rank

#91

Coding Rank

-

Model Transparency

Total Score

65 / 100 (B)

Qwen2.5-32B Transparency Report


Audit Note

Qwen2.5-32B exhibits strong transparency in its architectural specifications, licensing, and tokenizer implementation. However, it remains opaque regarding its specific training data sources and the massive compute resources utilized for its development. While the model is highly accessible for local deployment with clear hardware guidance, concerns regarding the integrity of its benchmark performance necessitate a cautious approach to its reported capabilities.

Upstream

20.0 / 30

Architectural Provenance

8.0 / 10

Qwen2.5-32B is explicitly documented as a dense, decoder-only transformer model. The technical report and model cards specify the use of Rotary Position Embeddings (RoPE), SwiGLU activation, RMSNorm, and Grouped Query Attention (GQA) with 40 query heads and 8 KV heads. It is clearly identified as an evolution of the Qwen2 architecture, with specific architectural scaling details (64 layers, hidden size of 5120) provided in official documentation.

Dataset Composition

3.0 / 10

While the total token count is disclosed (18 trillion tokens for the general series, 5.5 trillion for the Coder variant), the specific composition of the pre-training data remains vague. Documentation mentions 'large-scale multilingual and multimodal data' and 'web-scale corpora' but lacks a detailed percentage breakdown by source (e.g., specific web crawls, books, or code repositories). Filtering and cleaning methodologies are mentioned as 'meticulous' but lack public, reproducible technical specifications.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly accessible via the Hugging Face repository and the official Qwen GitHub. It uses byte-level Byte Pair Encoding (BPE) with a clearly stated vocabulary size of 151,643 (or 151,936, depending on the specific config version). The tokenizer's support for 29+ languages is verifiable through the provided vocabulary and configuration files, and it is integrated into standard libraries such as Hugging Face Transformers.
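The "byte-level" property is what guarantees the multilingual claim is structurally sound: every Unicode string decomposes into bytes from a fixed 256-symbol base alphabet, so no input is ever out-of-vocabulary. A minimal sketch of that base step (the learned merges on top of it are what produce the ~151k vocabulary):

```python
def to_byte_symbols(text):
    # Byte-level BPE's starting point: the UTF-8 bytes of the input,
    # all drawn from the fixed 256-value base alphabet.
    return list(text.encode("utf-8"))

ascii_ids = to_byte_symbols("Qwen")    # 1 byte per ASCII character
cjk_ids = to_byte_symbols("通义千问")   # 3 UTF-8 bytes per CJK character
print(len(ascii_ids), len(cjk_ids))    # 4 12
```

Languages with multi-byte scripts simply start from longer byte sequences; the BPE merges then compress frequent sequences back into single tokens.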

Model

23.5 / 40

Parameter Density

8.5 / 10

The model's parameter counts are precisely disclosed: 32.5 billion total parameters and 31.0 billion non-embedding parameters. As a dense model, all parameters are active during inference, which is explicitly stated. The architectural breakdown (layers, heads, hidden dimensions) is fully documented in the technical report, providing high transparency regarding its density.
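The disclosed non-embedding count can be sanity-checked from the published geometry. The sketch below assumes an intermediate MLP size of 27,648 (taken from the public model config, not stated on this page) and ignores small terms such as norm weights and attention biases.

```python
# Back-of-the-envelope estimate of non-embedding parameters,
# assuming: 64 layers, hidden 5120, 40 query / 8 KV heads, head_dim 128,
# and intermediate size 27648 (assumption from the public config).
hidden, layers = 5120, 64
q_heads, kv_heads, head_dim = 40, 8, 128
inter = 27_648

attn = hidden * (q_heads * head_dim)        # Q projection
attn += 2 * hidden * (kv_heads * head_dim)  # K and V projections (GQA-shrunk)
attn += (q_heads * head_dim) * hidden       # output projection
mlp = 3 * hidden * inter                    # gate, up, down (SwiGLU)

non_embedding = layers * (attn + mlp)
print(f"{non_embedding / 1e9:.1f}B")  # 31.2B
```

The estimate lands within about 1% of the disclosed 31.0 billion non-embedding parameters, consistent with the documented architecture.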

Training Compute

2.0 / 10

Information regarding the training compute is extremely limited. While the scale of the dataset is known, the specific hardware (e.g., number of H100/A100 GPUs), total training hours, energy consumption, and carbon footprint are not publicly disclosed in the technical reports or model cards. Claims of 'significant resources' are made without verifiable metrics.

Benchmark Reproducibility

4.0 / 10

Alibaba provides extensive benchmark results across standard sets (MMLU, HumanEval, MATH), but the full evaluation code and exact prompt templates used for all official scores are not consistently centralized or fully public. While some third-party verification exists on leaderboards, the lack of a comprehensive, one-click reproduction suite for all claimed metrics limits transparency. (Score adjusted for discovered benchmark integrity concerns).

Identity Consistency

9.0 / 10

The model consistently identifies itself as Qwen, developed by Alibaba Cloud, across various deployment platforms (Ollama, Hugging Face, API). It maintains a clear versioning identity (2.5) and does not exhibit the identity confusion seen in some other models. It is transparent about its nature as an AI and its specific variant (e.g., Instruct vs. Coder).

Downstream

21.5 / 30

License Clarity

9.0 / 10

The Qwen2.5-32B model is released under the Apache 2.0 license, which is a standard, permissive open-source license. The terms are clearly stated in the repository, allowing for commercial use, modification, and distribution. This is a high level of transparency compared to proprietary or 'open-weights' licenses with restrictive commercial clauses.

Hardware Footprint

7.5 / 10

VRAM requirements for various precisions (FP16, INT8, INT4) are well-documented by both the official team and the community. Documentation specifies that ~80GB is needed for FP16 inference, while 4-bit quantization (GGUF/EXL2) allows it to fit on consumer hardware like a single 24GB RTX 3090/4090. Context length scaling and its impact on memory are also addressed in deployment guides.
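Those figures follow directly from the parameter count. A rough weight-memory estimate per precision, assuming 32.5B total parameters; real deployments need additional headroom for the KV cache and activations, which is why the official FP16 guidance is ~80 GB rather than the raw weight size:

```python
# Weight memory per precision for a 32.5B-parameter dense model.
params = 32.5e9

def weights_gib(bits):
    return params * bits / 8 / 2**30

fp16 = weights_gib(16)  # ~60.5 GiB raw weights; ~80 GB advised with overhead
int8 = weights_gib(8)   # ~30.3 GiB
int4 = weights_gib(4)   # ~15.1 GiB -> fits a single 24 GB consumer GPU
print(f"fp16 {fp16:.1f} GiB, int8 {int8:.1f} GiB, int4 {int4:.1f} GiB")
```

The 4-bit figure is what makes single-card deployment on an RTX 3090/4090 feasible, leaving several GiB free for the context window.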

Versioning Drift

5.0 / 10

The model uses a versioning system (2.5), and major updates are announced via blog posts. However, there is no granular public changelog for minor weight updates or silent 'alignment' tweaks. While the Hugging Face commit history provides some visibility, it lacks the formal semantic versioning and deprecation notices required for a higher score.
