Parameters: 8B
Context Length: 131,072 tokens
Modality: Text
Architecture: Dense
License: Apache 2.0
Release Date: 29 Apr 2025
Knowledge Cutoff: -
Attention Structure: Grouped-Query Attention
Hidden Dimension Size: 4096
Number of Layers: 40
Attention Heads: 64
Key-Value Heads: 8
Activation Function: SwiGLU
Normalization: RMSNorm
Position Embedding: RoPE (Rotary Position Embedding)
Qwen3-8B is a dense causal language model developed by Alibaba, part of the broader Qwen3 series. It consists of approximately 8.2 billion parameters and is engineered for efficient performance across a spectrum of natural language processing tasks. A distinctive feature within the Qwen3 family is the integration of a "thinking" mode for complex logical reasoning, mathematics, and coding, alongside a "non-thinking" mode optimized for general-purpose dialogue. This design facilitates dynamic adaptation of the model's operational characteristics based on task demands without requiring a switch between distinct models.
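To make the mode switch concrete, here is a minimal sketch using the Hugging Face `transformers` chat template and the `enable_thinking` flag described in the Qwen3 model card; the checkpoint name is the official `Qwen/Qwen3-8B` repository, while the prompt and generation settings are illustrative assumptions rather than tuned recommendations.

```python
# Minimal sketch: toggling Qwen3's thinking / non-thinking modes via the chat template.
# Assumes the Hugging Face `transformers` library and the Qwen/Qwen3-8B checkpoint;
# the generation settings below are illustrative, not official recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Solve: what is 37 * 43?"}]

# enable_thinking=True lets the model emit an internal reasoning block before its answer;
# set it to False for a faster, dialogue-style reply.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```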
The architectural foundation of Qwen3-8B is a decoder-only transformer, incorporating refinements such as QK-Norm for enhanced training stability and Grouped Query Attention (GQA), which shares each Key/Value head among multiple Query heads to improve inference speed and memory utilization. Pre-training proceeds in three stages: the first stage (S1) builds broad language proficiency and general knowledge from over 36 trillion tokens spanning 119 languages; the second stage (S2) strengthens reasoning by increasing the proportion of STEM, coding, and reasoning data; and the third stage targets long-context comprehension by extending training sequence lengths to a native 32,768 tokens. The context length can be further extended to 131,072 tokens via the YaRN method.
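The following self-contained sketch illustrates the GQA head-sharing pattern using the head counts listed in the specification above (64 query heads, 8 key/value heads, hidden size 4096); it is a didactic approximation, not the model's actual attention implementation, and omits RoPE, QK-Norm, and the KV cache.

```python
# Minimal sketch of Grouped Query Attention head sharing (illustrative only).
# Uses the spec above: hidden size 4096, 64 query heads, 8 key/value heads,
# so each K/V head is shared by 64 / 8 = 8 query heads.
import torch
import torch.nn.functional as F

hidden, n_q_heads, n_kv_heads, seq_len = 4096, 64, 8, 16
head_dim = hidden // n_q_heads          # 64
group_size = n_q_heads // n_kv_heads    # 8 query heads per K/V head

x = torch.randn(1, seq_len, hidden)
q_proj = torch.nn.Linear(hidden, n_q_heads * head_dim, bias=False)
k_proj = torch.nn.Linear(hidden, n_kv_heads * head_dim, bias=False)
v_proj = torch.nn.Linear(hidden, n_kv_heads * head_dim, bias=False)

q = q_proj(x).view(1, seq_len, n_q_heads, head_dim).transpose(1, 2)   # (1, 64, T, 64)
k = k_proj(x).view(1, seq_len, n_kv_heads, head_dim).transpose(1, 2)  # (1, 8, T, 64)
v = v_proj(x).view(1, seq_len, n_kv_heads, head_dim).transpose(1, 2)

# Broadcast each K/V head to its group of query heads before standard causal attention.
k = k.repeat_interleave(group_size, dim=1)  # (1, 64, T, 64)
v = v.repeat_interleave(group_size, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 64, 16, 64])
```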
Qwen3-8B exhibits enhanced reasoning capabilities and superior human preference alignment, making it effective for applications requiring creative writing, role-playing, multi-turn dialogues, and precise instruction following. Furthermore, it includes agent capabilities, supporting integration with external tools for complex agent-based tasks. The model's comprehensive multilingual support extends to over 100 languages and dialects, facilitating multilingual instruction following and translation.
The Qwen3 model family from Alibaba comprises dense and Mixture-of-Experts (MoE) architectures with parameter counts ranging from 0.6B to 235B. Key innovations include a hybrid reasoning system, offering 'thinking' and 'non-thinking' modes for adaptive processing, and support for long context windows.
Rankings are relative to other local LLMs. No evaluation benchmarks are currently available for Qwen3-8B.
Overall Rank: -
Coding Rank: -
VRAM requirements depend on the chosen weight quantization method and the context size.
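In lieu of an interactive calculator, the sketch below gives a rough, first-order VRAM estimate as quantized weight size plus an FP16 KV cache, using the layer and head figures from the specification table above; the bits-per-weight values and the ~10% runtime overhead are assumptions, and actual usage varies by inference runtime.

```python
# Rough VRAM estimate: quantized weights + FP16 KV cache, plus an assumed overhead.
# Parameter count, layer count, and KV-head/head-dim figures follow the spec table above;
# bits-per-weight per quantization and the 10% overhead factor are illustrative assumptions.
PARAMS = 8.2e9
N_LAYERS = 40
N_KV_HEADS = 8
HEAD_DIM = 4096 // 64          # per-head dimension derived from the spec above
KV_BYTES_PER_ELEM = 2          # FP16 cache

BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "Q4": 4}

def estimate_vram_gib(quant: str, context_tokens: int) -> float:
    weights = PARAMS * BITS_PER_WEIGHT[quant] / 8
    # KV cache: 2 tensors (K and V) per layer, each context * kv_heads * head_dim elements.
    kv_cache = 2 * N_LAYERS * context_tokens * N_KV_HEADS * HEAD_DIM * KV_BYTES_PER_ELEM
    return (weights + kv_cache) * 1.10 / 1024**3   # assumed ~10% runtime overhead

for quant in BITS_PER_WEIGHT:
    for ctx in (1024, 32768, 131072):
        print(f"{quant:>5} @ {ctx:>6} tokens: ~{estimate_vram_gib(quant, ctx):.1f} GiB")
```

Note how the grouped K/V heads keep the cache term small relative to the weights at short contexts; at the full 131,072-token context the cache becomes a substantial share of the total.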