
Qwen2.5-3B

Parameters

3B

Context Length

32K (32,768 tokens)

Modality

Text

Architecture

Dense

License

Qwen Research License Agreement

Release Date

19 Sept 2024

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

16

Key-Value Heads

2

Attention Head Dimension

128

Position Embedding

RoPE

RoPE Theta

1,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

2,048

Number of Layers

36

FFN Intermediate Size (Dense)

11,008

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

151,936

Architecture Diagram

Input Tokens -> Token Embedding (Position: RoPE · Hidden: 2,048 · Context: 32K · Vocab: 151.9k) -> 36 layers of [RMSNorm (pre-attention) -> Grouped-Query Attention (16 Q / 2 KV heads, head dim 128) -> residual add -> RMSNorm (pre-FFN) -> Feed-Forward Network (SwiGLU, intermediate 11,008) -> residual add] -> Final RMSNorm -> Output Logits

Qwen2.5-3B

Qwen2.5-3B is a base (pretrained) large language model developed by Alibaba Cloud as part of the Qwen2.5 series. It is intended as a general-purpose foundation for natural language processing that can be fine-tuned for specific applications. Its core purpose is to process and generate text; specialized variants in the Qwen2.5 family extend the series to programming and mathematical problem-solving.

Qwen2.5-3B is a dense, decoder-only Transformer. It uses Rotary Position Embedding (RoPE) to encode token positions, SwiGLU as the feed-forward activation function, and RMSNorm for normalization. Attention is Grouped-Query Attention (GQA) with 16 query heads and 2 key-value heads, which shrinks the key and value caches during generation and so reduces inference memory. The model has 36 layers and 3.09 billion parameters in total.
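The memory saving from GQA can be estimated with simple arithmetic. The sketch below compares the key-value cache for 2 KV heads against a hypothetical 16-KV-head (full multi-head) layout. It assumes 16-bit cache values and a head dimension of 128, taken as the 2,048 embedding dimension quoted in the integrity report divided by 16 query heads; the function name is illustrative, not part of any library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: every layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


NUM_LAYERS = 36
NUM_QUERY_HEADS = 16
NUM_KV_HEADS = 2
HEAD_DIM = 128       # assumption: 2,048 / 16 query heads
CONTEXT = 32_768     # full context length from the spec sheet

gqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, CONTEXT)
# Hypothetical baseline: one KV head per query head (standard multi-head attention).
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, CONTEXT)

print(f"GQA cache: {gqa_bytes / 2**30:.3f} GiB")
print(f"MHA cache: {mha_bytes / 2**30:.3f} GiB")
print(f"Reduction: {mha_bytes // gqa_bytes}x")
```

Under these assumptions the cache at full context is about 1.1 GiB with GQA, an 8x reduction over the multi-head baseline, since the cache scales linearly with the number of KV heads.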

Qwen2.5-3B supports a context length of up to 32,768 tokens. The 128K-token context advertised for the Qwen2.5 series applies to the larger models in the family (7B and above), not to the 3B model. The series emphasizes instruction following and the generation of structured outputs such as JSON, and offers multilingual support covering more than 29 languages, which makes it suitable for applications that need understanding and generation across languages.
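To illustrate how RoPE with the listed theta of 1,000,000 relates to this context length, the following sketch computes the rotation frequencies and applies a rotation to one pair of values. It is a generic RoPE illustration, not the Qwen implementation: the head dimension of 128 is an assumption, and real implementations differ in how they pair up dimensions.

```python
import math


def rope_inverse_frequencies(head_dim, theta):
    """One rotation frequency per pair of dimensions: theta^(-2i / head_dim)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def rotate_pair(x0, x1, position, inv_freq):
    """Rotate a 2-D slice of a query or key by the angle position * inv_freq."""
    angle = position * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c


HEAD_DIM = 128        # assumption, not stated in the spec table
THETA = 1_000_000.0   # RoPE theta from the spec table
CONTEXT = 32_768

inv_freqs = rope_inverse_frequencies(HEAD_DIM, THETA)
# Wavelength (in tokens) of the slowest-rotating pair.
longest_wavelength = 2 * math.pi / inv_freqs[-1]
print(f"slowest pair repeats every {longest_wavelength:,.0f} tokens")

# Rotation changes direction only, so vector length is preserved.
y0, y1 = rotate_pair(3.0, 4.0, position=1000, inv_freq=inv_freqs[10])
print(math.hypot(y0, y1))
```

With these values the slowest-rotating pair has a wavelength in the millions of tokens, far longer than the 32,768-token context, so positions within the window remain distinguishable in the low-frequency dimensions.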

About Qwen2.5

Qwen2.5 by Alibaba is a family of decoder-only language models available in various sizes. Most are dense, and some variants use Mixture-of-Experts. The models are pretrained on large-scale datasets and support extended context lengths and multilingual communication. The family includes specialized models for coding, mathematics, and multimodal tasks such as vision and audio processing.


Evaluation Benchmarks

No evaluation benchmarks for Qwen2.5-3B available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B

65 / 100

Qwen2.5-3B Model Integrity Report

Audit Note

Qwen2.5-3B exhibits strong transparency in its architectural specifications and tokenizer design, providing clear technical details for implementation. However, it suffers from significant opacity regarding its training data sources and compute resources. While the model is highly accessible, the use of a non-standard research license and unresolved concerns regarding benchmark integrity limit its overall transparency profile.

Upstream

21.5 / 30

Architectural Provenance

8.0 / 10

The Qwen2.5-3B architecture is comprehensively documented in the official technical report and Hugging Face model cards. It is a dense, decoder-only Transformer utilizing Grouped-Query Attention (GQA) with 16 query heads and 2 KV heads, SwiGLU activation, RMSNorm, and Rotary Positional Embeddings (RoPE). The model specifies 36 layers and an embedding dimension of 2048. While the training methodology (pre-training followed by SFT and RLHF/GRPO) is described, the specific hyperparameters for the 3B variant's training run are less detailed than the flagship 72B model.

Dataset Composition

4.5 / 10

Alibaba discloses that the model was trained on 18 trillion tokens, a significant increase from previous versions. However, the exact composition is described only in general categories: high-quality web data, code, and mathematics. While they mention filtering and the use of synthetic data generated by larger Qwen models for math and code, they do not provide a precise percentage breakdown (e.g., web: X%, code: Y%) or name specific data sources, citing quality curation processes instead of providing a full provenance.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly available via the 'qwen.tiktoken' and Hugging Face 'tokenization_qwen2.py' files. It uses Byte-Level Byte Pair Encoding (BBPE) with a large vocabulary of 151,643 regular tokens. Documentation explicitly states its efficiency for multilingual support (29+ languages) and provides compression rate comparisons. The vocabulary is consistent across the entire Qwen2.5 family, and the approach to handling control tokens is well-documented.
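As a rough illustration of byte-level BPE, the sketch below starts from raw UTF-8 bytes (so any text in any language is representable with 256 base tokens) and performs a single merge step. This is a toy example of the general technique, not the Qwen tokenizer; the function names and the sample text are made up for illustration.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids, or None if too short."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


# Base vocabulary: the 256 possible byte values.
ids = list("low lower lowest".encode("utf-8"))
pair = most_frequent_pair(ids)
merged = merge_pair(ids, pair, new_id=256)  # first learned token gets id 256
print(len(ids), "->", len(merged), "tokens after one merge")

# Non-ASCII text needs no special handling: it is just more bytes.
print(list("你好".encode("utf-8")))
```

Repeating the merge step many times over a large corpus is what grows the vocabulary from 256 byte tokens to a size in the hundreds of thousands.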

Model

24.5 / 40

Parameter Density

8.5 / 10

The parameter count is precisely disclosed as 3.09 billion total parameters, with 2.77 billion non-embedding parameters. As a dense model, all parameters are active during inference, which is clearly stated. The architectural breakdown (layers, heads, dimensions) is fully provided in the model configuration files and technical report, leaving no ambiguity regarding its density or structure.
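The disclosed counts can be approximately reproduced from the architectural figures. The sketch below uses the 2,048 embedding dimension, 36 layers, 16 query heads, 2 KV heads, 11,008 FFN intermediate size, and 151,936 vocabulary size given on this page. It additionally assumes a head dimension of 128, bias terms on the query, key, and value projections only, and input and output embeddings that share weights; these assumptions are not stated here and should be checked against the model configuration.

```python
HIDDEN = 2048
LAYERS = 36
Q_HEADS = 16
KV_HEADS = 2
HEAD_DIM = 128          # assumption: HIDDEN / Q_HEADS
FFN = 11_008
VOCAB = 151_936

q_dim = Q_HEADS * HEAD_DIM
kv_dim = KV_HEADS * HEAD_DIM

# Attention: Q, K, V projections with bias (assumption), output projection without.
attention = (HIDDEN * q_dim + q_dim) + 2 * (HIDDEN * kv_dim + kv_dim) + q_dim * HIDDEN
# SwiGLU feed-forward: gate, up, and down projections, no bias (assumption).
feed_forward = 3 * HIDDEN * FFN
# Two RMSNorm weight vectors per layer.
norms = 2 * HIDDEN

non_embedding = LAYERS * (attention + feed_forward + norms) + HIDDEN  # + final norm
embedding = VOCAB * HIDDEN  # counted once: tied input/output embeddings (assumption)
total = non_embedding + embedding

print(f"non-embedding: {non_embedding / 1e9:.2f}B")
print(f"total:         {total / 1e9:.2f}B")
```

Under these assumptions the totals round to 2.77B non-embedding and 3.09B overall, matching the disclosed figures.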

Training Compute

3.0 / 10

Information regarding the specific compute resources used to train the 3B variant is largely absent. While the technical report mentions the use of large-scale GPU clusters for the series, it does not disclose the specific GPU hours, hardware type (e.g., H100 vs A100), or the carbon footprint associated with the 3B model's training. This is a significant gap compared to Western counterparts like Llama 3.1.

Benchmark Reproducibility

4.0 / 10

While Alibaba provides extensive benchmark results across standard sets (MMLU, HumanEval, MATH), they do not provide the exact evaluation code or the specific prompts/few-shot templates used for the 3B variant. Third-party researchers have raised significant concerns regarding data contamination in the Qwen2.5 series, particularly in mathematical benchmarks, which Alibaba has not addressed with a public audit or contamination analysis for this specific model.

Identity Consistency

9.0 / 10

The model consistently identifies itself as part of the Qwen series and is transparent about its versioning (2.5). It does not exhibit the identity confusion seen in some other models (e.g., claiming to be GPT-4). The model card and system prompts are designed to maintain a clear identity, and the model is generally aware of its capabilities and limitations as a 3B parameter model.

Downstream

18.5 / 30

License Clarity

6.0 / 10

The model is released under the 'Qwen Research License Agreement'. While the terms are publicly accessible, it is not a standard Open Source license like Apache 2.0 (which is used for other sizes in the same family). The license includes restrictions on commercial use (requiring a separate request for a commercial license) and contains 'Materials' definitions that can be legally complex, creating more friction than standard permissive licenses.

Hardware Footprint

7.5 / 10

VRAM requirements are well documented by both the provider and the community. Official documentation states the supported context length (32,768 tokens for this model) and gives guidance on memory scaling. Quantization support (GPTQ, AWQ, GGUF) is extensively documented, with performance and memory trade-offs provided in the technical report and community benchmarks, making it easy for users to estimate hardware needs.
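A first-order estimate of weight memory follows directly from the parameter count and the bits stored per weight. The sketch below is a lower bound under simplifying assumptions: it ignores the KV cache, activations, framework overhead, and the per-group scale metadata that real GPTQ, AWQ, and GGUF formats add. The precision labels are illustrative.

```python
PARAMS = 3.09e9  # total parameters, as disclosed

# Nominal bits per weight; real quantized formats store extra scales and zero points.
BITS_PER_WEIGHT = {"fp16/bf16": 16, "int8": 8, "4-bit": 4}


def weight_gib(params, bits):
    """Memory for the weights alone, in GiB."""
    return params * bits / 8 / 2**30


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>10}: {weight_gib(PARAMS, bits):.2f} GiB")
```

Actual VRAM use will be higher once the cache and runtime overhead are included, and it grows with context length.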

Versioning Drift

5.0 / 10

Alibaba uses a versioning system (Qwen1.5, Qwen2, Qwen2.5), but detailed changelogs for minor updates or weight refreshes are often missing. There is no formal mechanism for tracking 'silent' updates to the weights on Hugging Face, and while the major versions are distinct, the lack of a granular versioning history for the 3B variant makes it difficult to track behavioral drift over time.
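The category totals in this report can be cross-checked against the individual sub-scores. The sketch below simply re-adds the published figures; rounding the overall score half up to reach the displayed 65 is an assumption about how the page presents it.

```python
import math

SCORES = {
    "Upstream": {
        "Architectural Provenance": 8.0,
        "Dataset Composition": 4.5,
        "Tokenizer Integrity": 9.0,
    },
    "Model": {
        "Parameter Density": 8.5,
        "Training Compute": 3.0,
        "Benchmark Reproducibility": 4.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 6.0,
        "Hardware Footprint": 7.5,
        "Versioning Drift": 5.0,
    },
}

category_totals = {name: sum(items.values()) for name, items in SCORES.items()}
overall = sum(category_totals.values())
# Round half up explicitly; Python's built-in round() rounds 64.5 down to 64.
displayed = math.floor(overall + 0.5)

for name, value in category_totals.items():
    print(f"{name}: {value}")
print(f"Overall: {overall} (displayed as {displayed} / 100)")
```

The sub-scores add up to 21.5, 24.5, and 18.5, matching the published category totals, and to 64.5 overall.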
