
Qwen2-0.5B

Parameters

500M

Context Length

32,768 tokens

Modality

Text

Architecture

Dense

License

Apache 2.0

Release Date

7 Jun 2024

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

14

Key-Value Heads

2

Attention Head Dimension

64

Position Embedding

RoPE

RoPE Theta

1,000,000

Sliding Window Attention

No

Sliding Window Size

131,072

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

896

Number of Layers

24

FFN Intermediate Size (Dense)

4,864

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

151,936

Architecture Diagram

[Diagram: Input Tokens → Token Embedding (position: RoPE; hidden: 896; context: 32.8k; vocab: 151.9k) → 24 layers of [RMSNorm (pre-attention) → Grouped-Query Attention (14 Q / 2 KV heads, head dim: 64) → residual add → RMSNorm (pre-FFN) → Feed-Forward Network (SwiGLU, intermediate: 4.9k) → residual add] → Final RMSNorm → Output Logits]

Qwen2-0.5B

Qwen2-0.5B is a compact model in the Qwen2 series of large language models developed by the Qwen team at Alibaba. It provides foundational language processing capabilities and is suited to deployment where compute and memory are constrained. As a base language model, it is intended as a starting point for further specialization through post-training, such as supervised fine-tuning or reinforcement learning from human feedback, rather than for direct use in chat.

About Qwen2

The Alibaba Qwen2 model family comprises large language models built upon the Transformer architecture. It includes both dense and Mixture-of-Experts (MoE) variants, designed for diverse language tasks. Technical features include Grouped Query Attention, which shrinks the key-value cache and so reduces the inference memory footprint, and support for extended context lengths of up to 131,072 tokens in some variants.
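
To make the memory effect of Grouped Query Attention concrete, here is a minimal sketch of how query heads share key-value heads. The head counts (14 query, 2 key-value) are the figures the Qwen2 technical report gives for the 0.5B variant. The helper names `kv_head_for_query` and `kv_cache_ratio` are hypothetical, and the contiguous grouping is an assumption that mirrors common GQA implementations rather than a confirmed detail of Qwen2.

```python
# Illustrative only: maps each query head to the key-value (KV) head it reads from.
NUM_QUERY_HEADS = 14  # assumed: Qwen2-0.5B value from the technical report
NUM_KV_HEADS = 2      # assumed: Qwen2-0.5B value from the technical report


def kv_head_for_query(q_head: int,
                      n_q: int = NUM_QUERY_HEADS,
                      n_kv: int = NUM_KV_HEADS) -> int:
    """Return the index of the KV head shared by query head `q_head`."""
    if n_q % n_kv != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    if not 0 <= q_head < n_q:
        raise IndexError("query head out of range")
    group_size = n_q // n_kv  # number of query heads per KV head
    return q_head // group_size


def kv_cache_ratio(n_q: int = NUM_QUERY_HEADS,
                   n_kv: int = NUM_KV_HEADS) -> float:
    """KV-cache size under GQA relative to multi-head attention,
    where every query head would have its own KV head."""
    return n_kv / n_q


if __name__ == "__main__":
    mapping = [kv_head_for_query(h) for h in range(NUM_QUERY_HEADS)]
    print("query head -> KV head:", mapping)
    print(f"KV cache is {kv_cache_ratio():.1%} of the multi-head equivalent")
```

With 14 query heads and 2 key-value heads, each key-value head serves a group of 7 query heads, so the cache holds one seventh of the keys and values that full multi-head attention would store.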


Evaluation Benchmarks

No evaluation benchmarks for Qwen2-0.5B available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B

64 / 100

Qwen2-0.5B Model Integrity Report

Audit Note

Qwen2-0.5B demonstrates strong transparency regarding its architecture and licensing, providing clear technical specifications and a permissive Apache 2.0 license. However, it suffers from significant opacity in its training data composition and compute resources, relying on vague descriptions of 'high-quality' data without specific source disclosure. While highly accessible for deployment, the lack of verifiable environmental impact data and granular versioning for weight updates limits its overall transparency profile.

Upstream

19.5 / 30

Architectural Provenance

7.5 / 10

The model is explicitly identified as a dense, decoder-only Transformer. The technical report and official documentation detail the use of SwiGLU activation, Rotary Position Embeddings (RoPE), and RMSNorm, and specify Grouped Query Attention (GQA) with 14 query heads and 2 key-value heads for this variant. Pre-training is described as next-token prediction, followed by post-training (SFT and DPO). The hyper-parameter tables in the technical report document how the 0.5B configuration differs from the larger variants.

Dataset Composition

3.5 / 10

Alibaba discloses that Qwen2-0.5B was pre-trained on a 12-trillion-token dataset, obtained by relaxing the quality threshold applied to the 7-trillion-token set used for the larger models. However, the composition breakdown (e.g., percentage of web, code, books) is not provided. The sources are described only as 'large-scale high-quality multilingual' data, without naming specific datasets or providing a verifiable distribution. Filtering and cleaning are called 'meticulous', but no public technical details are given that would allow reproduction.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly available via the Hugging Face 'transformers' library and GitHub. It uses byte-level Byte-Pair Encoding (BBPE) with a vocabulary of 151,646 tokens, shared across the Qwen2 family. (The 151,936 figure listed under Vocabulary Size in the specifications is the size of the embedding matrix, which is padded beyond the tokenizer vocabulary.) Documentation confirms it is designed for multilingual support (29+ languages) and code, with specific control tokens for chat and tool use. The vocabulary size and pre-tokenization rules are explicitly stated in the technical report and model configuration files.

Model

24.0 / 40

Parameter Density

9.0 / 10

The model's parameter count is precisely disclosed as 0.49 billion total parameters, with 0.36 billion non-embedding parameters. As a dense model, all parameters are active during inference, which is clearly stated to distinguish it from the MoE variants in the same family. The architectural breakdown, including the number of layers (24) and hidden dimension size (896), is fully transparent in the technical report.
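
The disclosed totals can be cross-checked from the published dimensions. The sketch below is a back-of-the-envelope count, not the developer's own accounting. It assumes 14 query heads, 2 key-value heads, and a head dimension of 64 (the technical-report values), bias terms on the query, key, and value projections only, and an output head that shares weights with the token embedding. The function name `count_params` is hypothetical.

```python
# Back-of-the-envelope parameter count for a Qwen2-style dense decoder.
VOCAB = 151_936   # embedding rows, from the specifications
HIDDEN = 896      # hidden dimension size
LAYERS = 24       # number of decoder layers
FFN = 4_864       # feed-forward intermediate size
Q_HEADS = 14      # assumed: technical-report value
KV_HEADS = 2      # assumed: technical-report value
HEAD_DIM = 64     # assumed: HIDDEN // Q_HEADS


def count_params() -> dict:
    """Return embedding, non-embedding, and total parameter counts."""
    q_dim = Q_HEADS * HEAD_DIM
    kv_dim = KV_HEADS * HEAD_DIM
    attn = (
        HIDDEN * q_dim + q_dim             # query projection: weight + bias (assumed)
        + 2 * (HIDDEN * kv_dim + kv_dim)   # key and value projections, each with bias
        + q_dim * HIDDEN                   # output projection, no bias (assumed)
    )
    mlp = 3 * HIDDEN * FFN                 # gate, up, and down projections (SwiGLU)
    norms = 2 * HIDDEN                     # two RMSNorm weight vectors per layer
    non_embedding = LAYERS * (attn + mlp + norms) + HIDDEN  # plus final RMSNorm
    embedding = VOCAB * HIDDEN             # assumed tied with the output head
    return {
        "embedding": embedding,
        "non_embedding": non_embedding,
        "total": embedding + non_embedding,
    }


if __name__ == "__main__":
    for name, value in count_params().items():
        print(f"{name:>14}: {value:,} ({value / 1e9:.2f}B)")
```

Under these assumptions the count comes to about 0.49 billion parameters in total and about 0.36 billion excluding the embedding, in line with the disclosed figures.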

Training Compute

2.0 / 10

There is no public disclosure of the specific GPU/TPU hours, hardware cluster specifications, or total energy consumption used to train the 0.5B variant. While the technical report mentions general training stability techniques and batch sizes, it lacks the verifiable compute metrics required for a high score. No carbon footprint calculations or estimated training costs are provided by the developer.

Benchmark Reproducibility

5.0 / 10

Alibaba provides results for standard benchmarks (MMLU, HumanEval, GSM8K) with specified shot counts (e.g., 5-shot for MMLU). However, the exact evaluation prompts and full reproduction code for the base model's results are less detailed than those for the instruction-tuned variants. Some evaluation code is on GitHub, but the lack of a comprehensive, one-click reproduction suite for the 0.5B base model limits its score.

Identity Consistency

8.0 / 10

The model consistently identifies as part of the Qwen2 family and is transparent about its status as a base model not intended for direct chat without fine-tuning. It does not exhibit the identity confusion seen in some other open-weights models that claim to be GPT-4. Versioning is clear, distinguishing it from the later Qwen2.5-0.5B release.

Downstream

20.0 / 30

License Clarity

9.0 / 10

The model is released under the Apache 2.0 license, which is a standard, permissive open-source license. This is explicitly stated on the Hugging Face repository, the official blog, and the GitHub repository. The terms for commercial use and derivative works are clear and follow the standard Apache 2.0 terms, without the more restrictive 'Qwen License' applied to the Qwen2-72B variant.

Hardware Footprint

7.0 / 10

VRAM requirements are well-documented by both the official team and third-party communities. The model requires approximately 1 GB for weights (FP16) and roughly 2 GB total for inference. Documentation on how context length affects memory is available (this model supports up to 32K tokens), though official quantization-accuracy tradeoff curves for the 0.5B variant specifically are less detailed than for the 7B+ models.
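
As a rough illustration of where these numbers come from, the sketch below estimates weight memory and key-value cache size. It is an approximation that ignores activations and framework overhead, and it assumes 2 key-value heads with a head dimension of 64 (the technical-report values) and 0.49 billion parameters. The function names are hypothetical.

```python
# Rough inference-memory estimate: weights plus KV cache only.
# Activations, framework buffers, and fragmentation are ignored.
LAYERS = 24            # number of decoder layers
KV_HEADS = 2           # assumed: technical-report value
HEAD_DIM = 64          # assumed: technical-report value
TOTAL_PARAMS = 0.49e9  # disclosed total parameter count


def weight_bytes(bytes_per_param: float = 2.0) -> float:
    """Memory for the weights; 2 bytes per parameter corresponds to FP16."""
    return TOTAL_PARAMS * bytes_per_param


def kv_cache_bytes(context_tokens: int,
                   bytes_per_value: float = 2.0,
                   batch: int = 1) -> float:
    """Memory for cached keys and values at a given context length."""
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value
    return batch * context_tokens * per_token


def to_gib(n_bytes: float) -> float:
    return n_bytes / 2**30


if __name__ == "__main__":
    for ctx in (1_024, 16_384, 32_768):
        w, kv = weight_bytes(), kv_cache_bytes(ctx)
        print(f"{ctx:>6} tokens: weights {to_gib(w):.2f} GiB, "
              f"KV cache {to_gib(kv):.3f} GiB, total {to_gib(w + kv):.2f} GiB")
```

Under these assumptions the FP16 cache costs 12 KiB per token, so a full 32,768-token context adds 384 MiB on top of the weights. The remainder of the roughly 2 GB inference figure would come from activations and runtime overhead, which this sketch does not model.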

Versioning Drift

4.0 / 10

While the model uses clear naming (Qwen2-0.5B), there is no detailed public changelog for weight updates or a formal system for tracking silent drift. The transition to Qwen2.5 is documented, but intermediate updates to the Qwen2 weights lack granular versioning. Users have reported non-deterministic behavior and performance changes in related variants without clear documentation from the provider.
