
Qwen2.5-0.5B

Parameters

500M

Context Length

32,768 tokens

Modality

Text

Architecture

Dense

License

Apache 2.0

Release Date

19 Sept 2024

Knowledge Cutoff

-

Technical Specifications

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

896

Number of Layers

24

Attention Heads

14

Key-Value Heads

2

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

RoPE

Qwen2.5-0.5B

Qwen2.5-0.5B is a foundational large language model developed by the Qwen team at Alibaba Cloud. It is part of the Qwen2.5 series, which advances language-model capabilities with improvements in knowledge acquisition, coding proficiency, and mathematical reasoning. This variant, with approximately 0.49 billion parameters, serves as a compact base model intended primarily for subsequent fine-tuning toward specialized applications. Its architecture is engineered to handle complex language tasks efficiently across multiple languages.

Architecturally, Qwen2.5-0.5B is a dense, decoder-only Transformer model. It incorporates Rotary Position Embedding (RoPE) for effective positional encoding, SwiGLU as its activation function, and RMSNorm for normalization. The attention mechanism utilizes Grouped Query Attention (GQA), specifically configured with 14 query heads and 2 key-value heads for this model size. The model is structured with 24 layers, contributing to its depth and capacity for learning intricate patterns in language data.
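
The documented layer and head counts can be checked directly against the published configuration files. A minimal sketch, assuming the Hugging Face `transformers` library and the repo id `Qwen/Qwen2.5-0.5B`; attribute names follow the standard Qwen2 configuration schema:

```python
from transformers import AutoConfig

# Fetch the published configuration for the 0.5B base model
# (repo id assumed to be Qwen/Qwen2.5-0.5B on Hugging Face).
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B")

print(config.num_hidden_layers)    # expected: 24 decoder layers
print(config.num_attention_heads)  # expected: 14 query heads
print(config.num_key_value_heads)  # expected: 2 key-value heads (GQA)
print(config.hidden_size)          # hidden dimension of the model
```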

As a causal language model, Qwen2.5-0.5B is suitable for a range of downstream applications following post-training processes such as supervised fine-tuning or reinforcement learning from human feedback. Its capabilities include instruction following, generating extended text sequences, and processing structured data formats like JSON. The model supports a full context length of 32,768 tokens, with the broader Qwen2.5 series capable of handling contexts up to 128,000 tokens and generating outputs up to 8,000 tokens. It offers multilingual support, encompassing over 29 languages.
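
As a base (non-instruct) checkpoint, it is typically used for plain text continuation rather than chat. A minimal generation sketch, assuming `transformers` and the `Qwen/Qwen2.5-0.5B` repo id (the prompt text is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Base models continue text; they are not tuned to follow instructions.
inputs = tokenizer("The Qwen2.5 series of language models", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```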

About Qwen2.5

Qwen2.5 by Alibaba is a family of dense, decoder-only language models available in various sizes, with some variants utilizing Mixture-of-Experts. These models are pretrained on large-scale datasets, supporting extended context lengths and multilingual communication. The family includes specialized models for coding, mathematics, and multimodal tasks, such as vision and audio processing.



Evaluation Benchmarks

No evaluation benchmarks are available for Qwen2.5-0.5B.

Rankings

Overall Rank

-

Coding Rank

-

Model Transparency

Qwen2.5-0.5B Transparency Report

Total Score

67 / 100 (B)

Audit Note

Qwen2.5-0.5B demonstrates strong transparency in its architectural specifications, licensing, and tokenizer implementation, providing clear technical details for developers. However, it significantly lacks disclosure regarding training compute resources and granular dataset composition. While benchmark results are provided, concerns regarding their reproducibility and the lack of environmental impact data represent notable gaps in its transparency profile.

Upstream

21.0 / 30

Architectural Provenance

8.0 / 10

The model's architecture is extensively documented in the Qwen2.5 technical report and official GitHub repository. It is a dense, decoder-only Transformer utilizing Rotary Position Embedding (RoPE), SwiGLU activation, and RMSNorm. Specifically for the 0.5B variant, the Grouped Query Attention (GQA) configuration is detailed with 14 query heads and 2 key-value heads across 24 layers. The transition from the Qwen2 base is clearly explained, and the model weights are publicly accessible on Hugging Face with clear configuration files.

Dataset Composition

4.0 / 10

While the total token count is disclosed (expanded from 7 trillion in Qwen2 to 18 trillion in Qwen2.5), the specific composition breakdown (e.g., percentages of web, code, and math data) is not provided for the general 0.5B base model. Documentation mentions 'massive high-quality domain-balanced training sets' and 'expertly curated' data but lacks a granular public breakdown of sources or specific filtering thresholds, relying on high-level descriptions of data types.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly available and fully documented. It uses Byte-level Byte Pair Encoding (BBPE) with a vocabulary size of 151,643 regular tokens and 3 control tokens, ensuring no 'unknown' words. The vocabulary is shared across all Qwen2.5 model sizes, and the compression rates and multilingual efficiency are verified in the technical report. Tokenizer configuration files (tokenizer.json, vocab.json) are accessible in the official repositories.
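
The quoted vocabulary figures can be inspected from the released tokenizer files. A short sketch, again assuming the `Qwen/Qwen2.5-0.5B` repo id and the standard `transformers` tokenizer API:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Vocabulary size as reported by the released tokenizer files.
print(tokenizer.vocab_size)  # base byte-level BPE vocabulary
print(len(tokenizer))        # total entries, including added control tokens

# Byte-level BPE decomposes any input into known byte sequences,
# so ordinary text never maps to an 'unknown' token.
print(tokenizer.tokenize("多语言 tokenization test"))
```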

Model

23.5 / 40

Parameter Density

8.5 / 10

The parameter count is precisely stated as 0.49 billion total, with a further breakdown of 0.36 billion non-embedding parameters. As a dense model, all parameters are active during inference, which is explicitly confirmed in the technical documentation. The architectural specifications (layers, hidden dimensions, and attention heads) are clearly mapped to the parameter count.
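
The stated totals can be approximated by summing parameter tensors after loading the weights. A sketch under the assumption that the input embedding matrix is what the report excludes from its "non-embedding" figure:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

total = sum(p.numel() for p in model.parameters())

# Subtract the input embedding matrix to approximate the
# "non-embedding" count quoted in the technical report.
embedding = model.get_input_embeddings().weight.numel()
non_embedding = total - embedding

print(f"total parameters:         {total / 1e9:.2f}B")          # ~0.49B expected
print(f"non-embedding parameters: {non_embedding / 1e9:.2f}B")  # ~0.36B expected
```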

Training Compute

2.0 / 10

Information regarding training compute is extremely limited. While the hardware type (NVIDIA A100/H100) is implied by the scale of the project and mentioned in inference benchmarks, the specific GPU hours, total compute budget, and carbon footprint for the 0.5B variant are not disclosed. The technical report focuses on performance metrics rather than resource expenditure.

Benchmark Reproducibility

4.0 / 10

The model provides scores for standard benchmarks (MMLU, MATH, HumanEval) in its technical report. However, while evaluation code is available on GitHub, the exact few-shot prompts and specific versions for all benchmarks are not consistently detailed for the 0.5B variant. Independent researchers have noted significant performance drops on 'clean' versions of benchmarks released after the model's training cutoff, suggesting potential issues with the reported scores' generalizability.

Identity Consistency

9.0 / 10

The model consistently identifies itself as part of the Qwen series developed by Alibaba Cloud. It maintains clear versioning (Qwen2.5-0.5B) and distinguishes between its base and instruction-tuned variants. There are no reported instances of the model claiming to be a competitor's product or misrepresenting its foundational architecture.

Downstream

22.5 / 30

License Clarity

9.5 / 10

The model is released under the Apache 2.0 license, which is a standard, permissive open-source license allowing for commercial use, modification, and distribution. The license is clearly stated on the official GitHub, Hugging Face repository, and in the technical report, with no conflicting proprietary terms found for this specific variant.

Hardware Footprint

8.0 / 10

VRAM requirements are well-documented for various precisions (BF16, INT8, INT4). Official documentation and third-party tools provide specific memory footprints (e.g., ~0.97GB for BF16 at 1k context) and scaling data for context lengths up to 32,768 tokens. Quantization tradeoffs are also addressed with speed and memory benchmarks provided for GPTQ and AWQ formats.
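
As a rough cross-check of those figures, weight-only memory can be estimated from the parameter count and the bits used per weight. This back-of-the-envelope sketch deliberately ignores the KV cache, activations, and runtime overhead, which grow with context length:

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

params = 0.49e9  # Qwen2.5-0.5B total parameter count

for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    # Real usage is higher: add KV cache, activations, and framework overhead.
    print(f"{name}: ~{weight_memory_gb(params, bits):.2f} GiB for weights alone")
```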

Versioning Drift

5.0 / 10

The model uses a clear versioning scheme (2.5), and a changelog is maintained on the official GitHub. However, there is limited documentation regarding long-term drift or specific weight updates within the 2.5 release cycle. While the transition from 2.0 to 2.5 is documented, the granularity of updates for the 0.5B variant specifically is moderate.
