Active Parameters
3.3B
Context Length
131K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
29 Apr 2025
Knowledge Cutoff
Mar 2025
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
32
Key-Value Heads
4
Attention Head Dimension
128
Position Embedding
RoPE
RoPE Theta
1,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
2,048
Number of Layers
48
FFN Intermediate Size (Dense)
768
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
151,936
Mixture of Experts
Total Expert Parameters
3.0B
Number of Experts
128
Active Experts
8
Shared Experts
-
FFN Intermediate Size (per Expert)
768
Dense Layers Before MoE
-
The Qwen3-30B-A3B model is a Mixture-of-Experts (MoE) language model developed by Alibaba, engineered to deliver high-performance inference with reduced computational costs. It features a total of 30.5 billion parameters, but employs a sparse activation strategy where only approximately 3.3 billion parameters are engaged per token. This design allows the model to maintain the broad knowledge and capabilities of a larger system while operating with the latency and resource profile of a significantly smaller dense architecture. It serves as a middle-tier solution within the Qwen3 family, balancing sophistication with operational efficiency.
The model has 48 transformer layers and uses Grouped Query Attention (GQA) with 32 query heads and 4 key-value heads, which reduces memory bandwidth and improves inference speed. The MoE component consists of 128 experts, of which a routing mechanism selects 8 for each token. A notable feature is the hybrid reasoning system, which supports both a thinking mode for complex mathematical and logic tasks and a non-thinking mode for streamlined, general-purpose conversation. The model was trained on a 36-trillion-token corpus spanning 119 languages, and its architecture incorporates Rotary Position Embedding (RoPE) and SwiGLU activation.
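The expert selection described above can be illustrated with a minimal top-k routing sketch. The source only states that 8 of 128 experts are chosen per token; the softmax over router scores and the renormalization over the selected experts are assumptions made for illustration, and `route_token` is a hypothetical helper, not part of any Qwen library.

```python
import math
import random

NUM_EXPERTS = 128  # experts per MoE layer, from the spec above
TOP_K = 8          # experts activated per token, from the spec above


def route_token(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: softmax over all experts, then renormalize over the top-k.
    """
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Random scores stand in for the output of a learned router layer.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
```

Each token's output would then be the weighted sum of the 8 selected experts' outputs, which is why only a fraction of the expert parameters is exercised per token.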
Designed for versatile deployment, Qwen3-30B-A3B is suited to instruction following, code generation, and complex agentic workflows in which it integrates with external tools. The model supports a native context window of 32,768 tokens, which can be extended to 131,072 tokens using the YaRN (Yet another RoPE extensioN) scaling method; later iterations have pushed this limit to 256,000 tokens. Its multilingual foundation and expert routing make it suitable for downstream applications ranging from technical reasoning to creative content generation in professional environments.
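The relationship between the native and extended context windows reduces to a single scaling factor. The sketch below derives it from the two figures quoted above; the `rope_scaling` dictionary is a hypothetical configuration fragment whose key names are assumed to follow Hugging Face-style configs and should be checked against the official model card before use.

```python
NATIVE_CONTEXT = 32_768     # native window, from the text above
EXTENDED_CONTEXT = 131_072  # YaRN-extended window, from the text above


def yarn_factor(target_len, native_len=NATIVE_CONTEXT):
    """Scaling factor needed to stretch the native window to target_len."""
    return target_len / native_len


factor = yarn_factor(EXTENDED_CONTEXT)

# Hypothetical config fragment; key names are an assumption.
rope_scaling = {
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": NATIVE_CONTEXT,
}
```

Under these figures the factor works out to 4.0, since 131,072 is exactly four times 32,768.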
The Alibaba Qwen 3 model family comprises dense and Mixture-of-Experts (MoE) architectures, with parameter counts from 0.6B to 235B. Key innovations include a hybrid reasoning system, offering 'thinking' and 'non-thinking' modes for adaptive processing, and support for extensive context windows, enhancing efficiency and scalability.
Rank
#144
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| General Knowledge | MMLU | 0.876 | 9 |
| Mathematics | LiveBench Mathematics | 0.65 | 45 |
| Web Development | WebDev Arena | 1384 | 45 |
| Data Analysis | LiveBench Data Analysis | 0.45 | 52 |
| Coding | LiveBench Coding | 0.49 | 55 |
| Reasoning | LiveBench Reasoning | 0.37 | 57 |
| Agentic Coding | LiveBench Agentic | 0.02 | 58 |
| General Text | Text Arena | 1327 | 76 |
Overall Rank
#144
Coding Rank
#141
Total Score
75 / 100
The model exhibits a high level of transparency regarding its architectural design and parameter density, particularly in its clear disclosure of active versus total parameters for its Mixture-of-Experts structure. It is backed by a permissive Apache 2.0 license and detailed technical reporting on its unique hybrid reasoning modes. However, transparency is more limited regarding the specific composition of its 36-trillion-token training set and the total compute resources expended during its development.
Architectural Provenance
The model's architecture is extensively documented in the Qwen3 Technical Report (arXiv:2505.09388). It is a Mixture-of-Experts (MoE) transformer with 48 layers, utilizing Grouped Query Attention (GQA) with 32 query heads and 4 KV heads. The MoE design features 128 experts with 8 active per token. Key technical components like SwiGLU activation, RoPE (Rotary Position Embeddings), and RMSNorm are explicitly detailed. The report also describes a unique 'thinking mode' hybrid system and a three-stage pre-training methodology (General, Reasoning, and Long-context).
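To see why the 32-to-4 head ratio matters, the following sketch estimates the key-value cache size per token from the figures above and the head dimension of 128 listed in the spec table. The 16-bit cache precision is an assumption; real deployments may use other formats or cache layouts.

```python
NUM_LAYERS = 48
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 4
HEAD_DIM = 128       # attention head dimension from the spec table
BYTES_PER_VALUE = 2  # assumption: 16-bit (e.g. bf16) cache entries


def kv_cache_bytes_per_token(num_kv_heads):
    """Bytes cached per token: one key and one value vector per head per layer."""
    return 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM * BYTES_PER_VALUE


gqa_bytes = kv_cache_bytes_per_token(NUM_KV_HEADS)     # grouped-query attention
mha_bytes = kv_cache_bytes_per_token(NUM_QUERY_HEADS)  # hypothetical full multi-head
gqa_gib_at_32k = gqa_bytes * 32_768 / 2**30            # cache at the native window
```

Under these assumptions the cache costs 96 KiB per token, one eighth of what full multi-head attention with 32 key-value heads would need, or 3 GiB for a single 32,768-token sequence.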
Dataset Composition
Alibaba discloses that the model was trained on a 36 trillion token corpus spanning 119 languages. While the general categories of data are mentioned—including web data, books, PDFs, and synthetic data generated by previous Qwen models (Qwen2.5-VL for extraction, Qwen2.5-Math/Coder for synthetic generation)—there is no precise percentage breakdown of the dataset composition (e.g., exact ratios of code vs. web vs. books). The filtering and cleaning methodologies are described at a high level but lack granular technical specifics.
Tokenizer Integrity
The tokenizer is publicly available via the Hugging Face repository and official Qwen GitHub. It uses Byte Pair Encoding (BPE) with a vocabulary size of 151,936 tokens. It supports 119 languages and dialects, which is verified by the model's extensive multilingual benchmark performance. Documentation provides clear instructions for handling special tokens, including the <think> tags used in reasoning mode.
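As an illustration of handling the reasoning tags, the sketch below separates a generated string into its reasoning and its final answer. It assumes the reasoning is wrapped in a `<think>` ... `</think>` pair; `split_thinking` is a hypothetical helper operating on decoded text, not a function from the Qwen tooling.

```python
def split_thinking(text, open_tag="<think>", close_tag="</think>"):
    """Split generated text into (reasoning, answer).

    Returns an empty reasoning string when no complete tag pair is found.
    """
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1 or end < start:
        return "", text.strip()
    reasoning = text[start + len(open_tag):end].strip()
    answer = text[end + len(close_tag):].strip()
    return reasoning, answer


reasoning, answer = split_thinking("<think>2 + 2 is 4.</think>The answer is 4.")
```

Output produced in non-thinking mode, which carries no tags, passes through unchanged as the answer.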
Parameter Density
The model provides exemplary transparency regarding its parameter count. It explicitly states a total of 30.5 billion parameters, with a non-embedding parameter count of 29.9 billion. Crucially for an MoE model, it clearly discloses that only 3.3 billion parameters are active per token during inference. The architectural breakdown of 128 total experts and 8 activated experts is consistently reported across all official documentation.
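The disclosed figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
TOTAL_PARAMS_B = 30.5          # total parameters, in billions
NON_EMBEDDING_PARAMS_B = 29.9  # non-embedding parameters, in billions
ACTIVE_PARAMS_B = 3.3          # parameters active per token, in billions

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B             # about 0.108
embedding_params_b = TOTAL_PARAMS_B - NON_EMBEDDING_PARAMS_B   # about 0.6
expert_fraction = 8 / 128                                      # 0.0625
```

The active share of roughly 10.8% is higher than the 6.25% share of experts selected per token, because attention layers, embeddings, and routers run for every token regardless of which experts are chosen.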
Training Compute
While the technical report mentions the use of scaling laws to tune hyperparameters and the scale of the training (36T tokens), it lacks specific details on the total GPU/TPU hours consumed, the specific hardware clusters used for the full training run, or the estimated carbon footprint. The information provided is limited to the scale of the data rather than the specific compute resources utilized.
Benchmark Reproducibility
The technical report provides results for standard benchmarks such as MMLU (81.38), GSM8K, MATH, and LiveCodeBench. Evaluation settings, including sampling parameters (Temperature=0.6, TopP=0.95 for thinking mode) and prompt templates, are documented. However, the full evaluation code and the exact internal test sets used for 'thinking mode' validation are not fully public, and third-party verification for the newest Qwen3 variants is still emerging.
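For readers unfamiliar with the sampling parameters quoted above, here is a minimal sketch of temperature scaling followed by nucleus (top-p) sampling over a single logit vector. It is a generic textbook formulation under the stated defaults, not the evaluation harness used in the report, and `sample_top_p` is a hypothetical name.

```python
import math
import random


def sample_top_p(logits, temperature=0.6, top_p=0.95, rng=random):
    """Sample one token index using temperature scaling and nucleus sampling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices renormalizes the weights of the kept tokens.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


token = sample_top_p([5.0, 1.0, 0.5, -2.0], rng=random.Random(0))
```

In this toy example the first logit dominates after temperature scaling, so the nucleus contains only index 0 and the sample is deterministic; flatter distributions keep more candidates.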
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as a Qwen3 series model in both its system prompts and documentation. It maintains clear versioning between the base, instruct, and 'thinking' variants. There are no documented cases of the model claiming to be a competitor's product, and it is transparent about its dual-mode (thinking vs. non-thinking) capabilities.
License Clarity
The model is released under the Apache 2.0 license, which is a standard, permissive open-source license. This allows for commercial use, modification, and distribution without the restrictive 'custom' terms often found in other 'open' weights models. The licensing is consistent across the weights on Hugging Face and the official GitHub repository.
Hardware Footprint
Hardware requirements are well-documented for various deployment scenarios. Official documentation specifies VRAM needs for standard inference and provides guidance for ultra-long context (up to 1M tokens requiring ~240GB VRAM). It also notes the memory savings (approx. 10GB) when disabling specific multimodal components. Quantization support is mentioned for frameworks like llama.cpp and vLLM, though detailed accuracy-tradeoff curves for all quantization levels are not fully provided.
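A rough lower bound on weight memory at different quantization levels follows from the total parameter count alone. The sketch below is a back-of-the-envelope estimate that ignores the KV cache, activations, and per-format quantization overhead, so it should not be read as the official VRAM guidance.

```python
TOTAL_PARAMS = 30.5e9  # total parameters, from the disclosed figure


def weight_memory_gib(bits_per_weight, num_params=TOTAL_PARAMS):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Estimates for 16-bit, 8-bit, and 4-bit weights.
estimates = {bits: round(weight_memory_gib(bits), 1) for bits in (16, 8, 4)}
```

Note that the estimate uses the total rather than the active parameter count: all 128 experts must be resident in memory even though only 8 are used for any given token.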
Versioning Drift
Alibaba uses a date-based versioning system (e.g., 2507 for July 2025 updates) and maintains a clear distinction between 'Base', 'Instruct', and 'Thinking' versions. While changelogs are provided on GitHub and Hugging Face, the frequency of 'silent' updates to the underlying API endpoints without corresponding weight version bumps remains a minor concern for long-term reproducibility.
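A date-based tag of this kind can be decoded mechanically, which helps when pinning a specific release for reproducibility. The sketch below assumes a four-digit YYMM layout, inferred from the single "2507" example above; `parse_version_tag` is a hypothetical helper.

```python
def parse_version_tag(tag):
    """Decode a YYMM tag such as "2507" into (year, month).

    Assumption: tags are four digits, two for the year and two for the month.
    """
    if len(tag) != 4 or not tag.isdigit():
        raise ValueError(f"expected a four-digit YYMM tag, got {tag!r}")
    year = 2000 + int(tag[:2])
    month = int(tag[2:])
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in tag {tag!r}")
    return year, month


release = parse_version_tag("2507")
```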