Parameters
4B
Context Length
32,768 tokens
Modality
Text
Architecture
Dense
License
Apache 2.0
Release Date
29 Apr 2025
Knowledge Cutoff
Mar 2025
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
32
Key-Value Heads
8
Attention Head Dimension
128
Position Embedding
RoPE
RoPE Theta
1,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
Swish
Dimensions
Hidden Dimension Size
2,560
Number of Layers
36
FFN Intermediate Size (Dense)
9,728
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
151,936
Qwen3-4B is a 4-billion-parameter dense causal language model developed by Alibaba and part of the third generation of the Qwen series. Its main innovation is a unified architecture that supports dual-mode operation, switching dynamically between 'thinking' and 'non-thinking' states. In thinking mode, the model performs extended, multi-step reasoning similar to chain-of-thought processing, which suits complex mathematical problems and intricate code generation. Non-thinking mode is optimized for low-latency, direct responses in general conversation, where speed matters more than depth of reasoning.
Technically, the model is a transformer with 36 layers and 4.0 billion total parameters. It uses Grouped-Query Attention (GQA) with 32 query heads and 8 key-value heads, so four query heads share each key-value head, which shrinks the key-value cache and reduces memory traffic during inference. The model employs Rotary Position Embeddings (RoPE) and is natively trained on a 32,768-token context window, which can be extended up to 131,072 tokens using YaRN scaling. Pre-training follows a three-stage pipeline over 36 trillion tokens across 119 languages, with a data mix weighted toward high-quality STEM, coding, and multilingual content.
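The head counts and context lengths quoted above imply a few ratios that are easy to make explicit. The sketch below derives them from the figures in this section only; the `rope_scaling` dictionary uses key names in the style of Hugging Face model configs and is shown as an assumption, not as the official setting.

```python
# Figures quoted in the paragraph above.
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 8
NATIVE_CONTEXT = 32_768
EXTENDED_CONTEXT = 131_072

# In grouped-query attention, several query heads share one key/value head.
group_size = NUM_QUERY_HEADS // NUM_KV_HEADS


def kv_head_for(query_head: int) -> int:
    """Return the index of the KV head that a given query head reads from."""
    return query_head // group_size


# KV-cache width per token per layer, relative to full multi-head attention.
kv_cache_ratio = NUM_KV_HEADS / NUM_QUERY_HEADS

# YaRN factor needed to stretch the native window to the extended one.
yarn_factor = EXTENDED_CONTEXT / NATIVE_CONTEXT

# Assumed config fragment (key names are illustrative).
rope_scaling = {
    "rope_type": "yarn",
    "factor": yarn_factor,
    "original_max_position_embeddings": NATIVE_CONTEXT,
}

print(group_size, kv_cache_ratio, yarn_factor)
```

With these numbers each key-value head serves four query heads, the KV cache is a quarter of what full multi-head attention would need, and the YaRN factor for the extended window is 4.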
Qwen3-4B is designed for versatile deployment, particularly where sophisticated reasoning is needed within a compact parameter footprint. Its native thinking mode lets it act as a reasoning engine for complex instruction following and agentic workflows without a separate specialized model. SwiGLU activations and RMSNorm are used to support stable training, while tied input and output embeddings, applied in the smaller variants such as the 4B model, reduce parameter count and memory use. The model is well suited to cross-lingual tasks, tool-based interactions, and structured output generation across a wide variety of domains.
The Alibaba Qwen 3 model family comprises dense and Mixture-of-Experts (MoE) architectures, with parameter counts from 0.6B to 235B. Key innovations include a hybrid reasoning system, offering 'thinking' and 'non-thinking' modes for adaptive processing, and support for extensive context windows, enhancing efficiency and scalability.
Rank
#57
| Benchmark | Score | Rank |
|---|---|---|
| General Knowledge MMLU | 0.815 | 20 |
Overall Rank
#57
Coding Rank
-
Total Score
76 / 100
Qwen3-4B exhibits a high level of transparency regarding its architecture and licensing, supported by a detailed technical report and a permissive Apache 2.0 license. The model provides clear documentation on its unique dual-mode reasoning capabilities and hardware requirements for various quantization levels. However, transparency is significantly limited regarding the specific compute resources and environmental impact of its training, as well as the precise composition of its 36-trillion-token dataset.
Architectural Provenance
The model's architecture is extensively documented in the Qwen3 Technical Report (arXiv:2505.09388). It is a dense causal language model with 36 layers, utilizing Grouped Query Attention (GQA) with 32 query heads and 8 KV heads. Key technical components such as SwiGLU activations, RMSNorm with pre-normalization, and Rotary Position Embeddings (RoPE) are explicitly detailed. A significant innovation, the 'dual-mode' thinking/non-thinking architecture, is described as a unified framework that allows dynamic switching between reasoning states. The training methodology is clearly outlined as a three-stage pipeline (General, Reasoning, and Long Context stages).
Dataset Composition
Alibaba discloses that the model was trained on a massive 36 trillion token dataset covering 119 languages. The documentation provides a high-level breakdown of data types, including web data, books, STEM, coding, and synthetic data generated by previous Qwen models (Qwen2.5-Math/Coder). The three-stage data curriculum is well-defined, showing how the data mix evolved from general knowledge to reasoning-intensive content. However, specific percentage breakdowns of the 36T tokens (e.g., exact ratios of web vs. books) and the specific identities of the non-web datasets are not fully disclosed, which is a common gap in large-scale model transparency.
Tokenizer Integrity
The tokenizer is publicly available via the official GitHub repository and Hugging Face. It uses byte-level byte-pair encoding (BBPE) with a vocabulary of 151,669 tokens, which is consistent across the Qwen series; the larger 151,936 figure listed under Vocabulary Size likely reflects the padded embedding-table size rather than the token count. The documentation confirms support for 119 languages, and the tokenizer's alignment with this claim is verifiable through the provided configuration files. The use of 'tied embeddings' for the 4B variant is also explicitly documented as a memory optimization.
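Byte-level BPE starts from UTF-8 bytes rather than Unicode characters, which is why any input string can be encoded without an unknown-token fallback. The snippet below illustrates only that base step; the helper names are hypothetical and it does not apply the merge rules of the actual Qwen tokenizer.

```python
def to_base_symbols(text: str) -> list[int]:
    """Map text to its UTF-8 bytes, the 256-symbol base alphabet of byte-level BPE."""
    return list(text.encode("utf-8"))


def from_base_symbols(symbols: list[int]) -> str:
    """Invert the mapping; the byte-level step is lossless."""
    return bytes(symbols).decode("utf-8")


for sample in ["Qwen", "é", "你好"]:
    symbols = to_base_symbols(sample)
    # Every symbol fits in one byte, whatever the script.
    assert all(0 <= s < 256 for s in symbols)
    assert from_base_symbols(symbols) == sample
    print(sample, symbols)
```

A real tokenizer then merges frequent byte sequences into larger vocabulary entries, so common words in well-covered languages cost far fewer tokens than their raw byte count.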
Parameter Density
The parameter count is precisely stated as 4.0 billion total parameters, with a non-embedding parameter count of 3.6 billion. As a dense model, all parameters are active during inference, which is clearly distinguished from the MoE variants in the same family (e.g., Qwen3-30B-A3B). The architectural breakdown, including the number of layers (36) and attention head configurations, is fully transparent in the technical report.
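The 4.0B and 3.6B figures can be checked with a back-of-the-envelope count. The sketch below is an estimate under stated assumptions, not the report's own accounting: a hidden size of 2,560 (taken from the public model config), bias-free attention projections, a three-matrix gated FFN, per-head query/key norms, and tied input/output embeddings counted once.

```python
# Layers, heads, head dimension, FFN size, and vocabulary as listed on this page.
# HIDDEN is an assumption taken from the public model config.
LAYERS = 36
HIDDEN = 2_560
Q_HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128
FFN = 9_728
VOCAB = 151_936

# Attention projections (assumed bias-free): Q and O use the full query width,
# K and V use the narrower grouped-query width.
q_width = Q_HEADS * HEAD_DIM
kv_width = KV_HEADS * HEAD_DIM
attention = 2 * HIDDEN * q_width + 2 * HIDDEN * kv_width

# Gated FFN: gate, up, and down projections.
ffn = 3 * HIDDEN * FFN

# Two RMSNorm weight vectors per layer, plus per-head query/key norms (assumed).
norms = 2 * HIDDEN + 2 * HEAD_DIM

non_embedding = LAYERS * (attention + ffn + norms) + HIDDEN  # plus final norm
embedding = VOCAB * HIDDEN  # counted once because the embeddings are tied
total = non_embedding + embedding

print(f"non-embedding: {non_embedding / 1e9:.2f}B")
print(f"embedding:     {embedding / 1e9:.2f}B")
print(f"total:         {total / 1e9:.2f}B")
```

Under these assumptions the estimate comes to roughly 3.63B non-embedding and 4.02B total parameters, in line with the stated figures. Untying the embeddings would add a second 0.39B matrix, which is the memory saving attributed to tied embeddings.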
Training Compute
While the technical report describes the training stages and the scale of the data (36T tokens), it conspicuously lacks specific details regarding the compute resources used. There is no disclosure of total GPU/TPU hours, the specific hardware clusters utilized for the 4B variant, or the estimated carbon footprint. This information is treated as proprietary, which is a significant transparency gap according to the scoring guidelines.
Benchmark Reproducibility
The model is evaluated on a wide array of standard benchmarks (MMLU, GSM8K, HumanEval, etc.) with results published in the technical report. Evaluation settings, such as the use of 'thinking mode' for reasoning tasks and specific sampling parameters (Temperature=0.6, TopP=0.95), are provided. The team recommends standardized prompts for benchmarking, such as 'Please reason step by step'. Although no one-command reproduction script is provided, the documented prompts and modes are more detailed than is typical for model releases.
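To make the quoted sampling parameters concrete, the following is a minimal pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering. The function name and the example logits are hypothetical, and inference libraries implement this step with their own details.

```python
import math


def nucleus_distribution(logits, temperature=0.6, top_p=0.95):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    probs = [e / norm for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the kept tokens.
    return {i: probs[i] / mass for i in kept}


dist = nucleus_distribution([2.0, 1.0, 0.5, -3.0])
print(dist)
```

For these example logits the least likely token falls outside the 0.95 nucleus and is never sampled. A token can then be drawn from the result with `random.choices(list(dist), weights=list(dist.values()))`.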
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as part of the Qwen3 family. It is transparent about its dual-mode capabilities, and the system is designed to explicitly signal its state (e.g., using <think> tags). There are no documented instances of the model claiming to be a competitor's product or misrepresenting its 4B scale.
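Because the reasoning state is signaled with think tags, downstream code can separate the reasoning trace from the final answer. The sketch below assumes the output contains at most one `<think>...</think>` block followed by the answer; the helper name is hypothetical.

```python
import re

# Assumed output format: an optional <think>...</think> block, then the answer.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when no think block is present."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Drop the think block and keep whatever surrounds it as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
reasoning, answer = split_reasoning(raw)
print(repr(reasoning), repr(answer))
```

Responses produced in non-thinking mode pass through unchanged with an empty reasoning string.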
License Clarity
The model weights and associated code are released under the Apache 2.0 license, which is a highly permissive, standard open-source license. This allows for commercial use, modification, and distribution without the 'open-weights but restricted' ambiguity found in some other model families. The licensing terms are clearly stated on Hugging Face and the official GitHub repository.
Hardware Footprint
Hardware requirements are well-documented for various deployment scenarios. Official documentation and third-party sources provide VRAM estimates for FP16 (approx. 8-10GB) and quantized versions (4-bit requiring ~2-4GB). The impact of context length on memory is addressed, with native support for 32K tokens and scaling up to 131K via YaRN. Quantization support is verified through the availability of GGUF and other formats.
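The VRAM ranges above can be roughly reproduced from first principles. The sketch below assumes about 4.0e9 parameters, 36 layers, 8 key-value heads with head dimension 128, and a 16-bit KV cache; it ignores activations and runtime overhead, so real usage will be higher.

```python
# Assumptions: ~4.0e9 parameters, 36 layers, 8 KV heads, head dimension 128,
# and a 16-bit KV cache. Activations and runtime buffers are ignored.
PARAMS = 4.0e9
LAYERS = 36
KV_HEADS = 8
HEAD_DIM = 128
GIB = 1024 ** 3


def weight_gib(bits_per_weight: float) -> float:
    """Memory for the weights alone at a given precision."""
    return PARAMS * bits_per_weight / 8 / GIB


def kv_cache_gib(context_tokens: int, bytes_per_value: int = 2) -> float:
    """Memory for keys and values across all layers for one sequence."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * context_tokens / GIB


for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_gib(bits):.1f} GiB")
for ctx in (1_024, 32_768, 131_072):
    print(f"KV cache at {ctx} tokens: {kv_cache_gib(ctx):.2f} GiB")
```

These weight-only figures sit at the low end of the quoted ranges; the rest is runtime overhead plus the KV cache, which grows linearly with context length and becomes the dominant cost at long contexts.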
Versioning Drift
The model follows a clear versioning path (Qwen3-4B vs. updated versions like Qwen3-4B-Instruct-2507). A GitHub repository is maintained to track issues and updates. However, although versions are clearly named, there is no granular changelog documenting weight or performance changes between minor iterations.