Attention structure: Grouped-Query Attention (GQA)
Hidden dimension size: 4096
Layers: 36
Attention heads: 32 (Q)
Key/Value heads: 8
Activation function: SwiGLU
Normalization: RMSNorm
Position embedding: RoPE
VRAM Requirements by Quantization Method and Context Size
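The original table does not survive extraction, but a rough figure can be reconstructed from the specifications above. The helper below is a hypothetical back-of-envelope calculator, not a measured benchmark: the per-parameter byte counts are approximations, and real deployments add framework and activation overhead on top.

```python
def estimate_vram_gib(params_b=8.2, bytes_per_param=2.0,
                      layers=36, kv_heads=8, head_dim=128,
                      context=32768, kv_bytes=2):
    """Rough VRAM estimate: quantized weights plus an FP16 KV cache.

    bytes_per_param: ~2.0 for FP16/BF16, ~1.0 for INT8, ~0.5 for 4-bit.
    KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * kv_bytes.
    """
    weights = params_b * 1e9 * bytes_per_param
    kv_cache = 2 * layers * kv_heads * head_dim * kv_bytes * context
    return (weights + kv_cache) / 1024**3

for name, bpp in [("FP16", 2.0), ("INT8", 1.0), ("Q4", 0.5)]:
    print(f"{name}: ~{estimate_vram_gib(bytes_per_param=bpp):.1f} GiB at 32K context")
```

By this estimate, FP16 weights plus a full 32K-token KV cache land on the order of 20 GiB, dropping to roughly 8-9 GiB with 4-bit weight quantization; shorter contexts shrink only the KV-cache term.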
Qwen3-8B is a dense causal language model developed by Alibaba as part of the broader Qwen3 series. It comprises approximately 8.2 billion parameters and is engineered for efficient performance across a wide range of natural language processing tasks. A distinctive feature of the Qwen3 family is the integration of a "thinking" mode for complex logical reasoning, mathematics, and coding alongside a "non-thinking" mode optimized for general-purpose dialogue, letting the model adapt its behavior to task demands without switching between distinct models.
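As a concrete illustration, the Qwen3 chat template exposes this switch through an `enable_thinking` flag. The minimal sketch below follows the usage shown on the Qwen3 model card with the Hugging Face `transformers` API; the prompt and generation settings are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "What is 17 * 24?"}]

# enable_thinking=True (the default) lets the template permit a
# <think>...</think> reasoning block before the final answer;
# enable_thinking=False switches the model to direct, non-thinking replies.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[-1]:],
                       skip_special_tokens=True))
```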
The architectural foundation of Qwen3-8B is a decoder-only transformer, incorporating refinements such as QK-Norm (normalization applied to queries and keys) for enhanced training stability and Grouped Query Attention (GQA), which shares each Key/Value head among multiple Query heads to speed up inference and reduce memory use. Training follows a three-stage process: an initial stage (S1) of extensive pre-training on over 36 trillion tokens across 119 languages builds broad language proficiency and general knowledge; a second stage (S2) increases the proportion of STEM, coding, and reasoning data to strengthen reasoning skills; and a third stage extends training sequence lengths to 32,768 tokens to establish native long-context comprehension. The context length can be further extended to 131,072 tokens via the YaRN method.
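The K/V-head sharing at the heart of GQA can be sketched in a few lines. The snippet below is illustrative only: it omits RoPE, QK-Norm, and the linear projections, and simply shows how 32 query heads can attend against 8 shared Key/Value heads, with dimensions matching the spec table above.

```python
import torch

def grouped_query_attention(q, k, v):
    """Minimal GQA sketch: n_q_heads query heads share n_kv_heads K/V heads.

    q: (batch, n_q_heads, seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim)
    """
    b, n_q_heads, s, d = q.shape
    n_kv_heads = k.shape[1]
    group = n_q_heads // n_kv_heads  # query heads per shared K/V head

    # Expand K/V so each group of query heads reads the same K/V head.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    scores = (q @ k.transpose(-2, -1)) / d**0.5
    # Causal mask: each position attends only to itself and earlier positions.
    mask = torch.triu(torch.ones(s, s, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Hidden size 4096 = 32 heads x head_dim 128; only 8 K/V heads are stored.
q = torch.randn(1, 32, 16, 128)
k = torch.randn(1, 8, 16, 128)
v = torch.randn(1, 8, 16, 128)
out = grouped_query_attention(q, k, v)  # shape (1, 32, 16, 128)
```

Because only 8 K/V heads per layer must be cached instead of 32, the KV cache shrinks by a factor of four, which is the main source of GQA's inference-time memory savings.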
Qwen3-8B exhibits enhanced reasoning capabilities and strong alignment with human preferences, making it effective for applications requiring creative writing, role-playing, multi-turn dialogue, and precise instruction following. It also offers agent capabilities, supporting integration with external tools for complex, tool-driven workflows. The model's comprehensive multilingual support extends to over 100 languages and dialects, facilitating multilingual instruction following and translation.
The Alibaba Qwen 3 model family comprises dense and Mixture-of-Experts (MoE) architectures, with parameter counts from 0.6B to 235B. Key innovations include a hybrid reasoning system, offering 'thinking' and 'non-thinking' modes for adaptive processing, and support for extensive context windows, enhancing efficiency and scalability.
No evaluation benchmarks are available for Qwen3-8B.