
Qwen3-30B-A3B

Total Parameters

30.5B (about 3.3B active per token)

Context Length

131K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

29 Apr 2025

Knowledge Cutoff

Mar 2025

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

32

Key-Value Heads

4

Attention Head Dimension

128

Position Embedding

RoPE

RoPE Theta

1,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMSNorm

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

2,048

Number of Layers

48

FFN Intermediate Size (Dense)

768

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

151,936

Mixture of Experts

Active Parameters (per token)

3.3B

Number of Experts

128

Active Experts

8

Shared Experts

-

FFN Intermediate Size (per Expert)

768

Dense Layers Before MoE

-

Architecture Diagram

Input Tokens → Token Embedding (position: RoPE; hidden: 2,048; context: 131K; vocab: 151.9k)

Repeated × 48 layers: RMSNorm (pre-attention) → Grouped-Query Attention (32 Q / 4 KV heads, head dim 128) → residual add → RMSNorm (pre-FFN) → Sparse MoE FFN (8/128 experts, SwiGLU, intermediate 768) → residual add

Final RMSNorm → Output Logits

Qwen3-30B-A3B

The Qwen3-30B-A3B model is a Mixture-of-Experts (MoE) language model developed by Alibaba, engineered to deliver high-performance inference with reduced computational costs. It features a total of 30.5 billion parameters, but employs a sparse activation strategy where only approximately 3.3 billion parameters are engaged per token. This design allows the model to maintain the broad knowledge and capabilities of a larger system while operating with the latency and resource profile of a significantly smaller dense architecture. It serves as a middle-tier solution within the Qwen3 family, balancing sophistication with operational efficiency.

Technically, the model has 48 transformer layers and uses Grouped Query Attention (GQA) with 32 query heads and 4 key-value heads to reduce memory bandwidth and speed up inference. The MoE component consists of 128 experts, of which a router selects 8 for each token. A notable feature is the hybrid system that supports both a reasoning-heavy thinking mode for complex mathematical and logic tasks and a non-thinking mode for streamlined, general-purpose conversation. The model was trained on a 36 trillion token corpus spanning 119 languages, and its architecture incorporates Rotary Position Embedding (RoPE) and SwiGLU activation.
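The expert selection described above can be illustrated with a minimal sketch. It assumes a softmax over all 128 router logits followed by renormalisation over the 8 selected experts. The text states only the expert counts, so the exact gating rule, the function name `route_token`, and the random logits are illustrative assumptions.

```python
import math
import random

NUM_EXPERTS = 128  # experts per MoE layer (from the text)
TOP_K = 8          # experts selected per token (from the text)


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_id, weight) pairs for the top_k experts of one token.

    Assumption: softmax over all experts, then renormalise over the chosen ones.
    """
    # Numerically stable softmax over all expert logits.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the top_k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalise so the selected weights sum to 1.
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


# Hypothetical router output for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selection = route_token(logits)
print(len(selection), "experts selected")
```

Only the selected experts run their feed-forward computation for that token, which is why the active parameter count stays far below the total.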

Designed for versatile deployment, Qwen3-30B-A3B excels in instruction following, code generation, and complex agentic workflows where it can integrate with external tools. The model supports a native context window of 32,768 tokens, which can be extended to 131,072 tokens using the YaRN (Yet another RoPE extensioN) scaling method, and later iterations of the model have pushed this limit to 256,000 tokens. Its robust multilingual foundation and optimized expert routing make it suitable for downstream applications ranging from technical reasoning to creative content generation in professional environments.
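The YaRN extension amounts to a scaling factor equal to the ratio of the extended window to the native one. The sketch below derives that factor from the two figures in the text and packs it into a `rope_scaling`-style dictionary. The key names follow a convention common in Hugging Face-style config files and are an assumption here, so check them against the model card before use. The helper `yarn_scaling_config` is hypothetical.

```python
NATIVE_CONTEXT = 32_768     # native window (from the text)
EXTENDED_CONTEXT = 131_072  # YaRN-extended window (from the text)


def yarn_scaling_config(native, target):
    """Build a rope_scaling-style dict for stretching `native` to `target` tokens."""
    if target <= native:
        raise ValueError("target context must exceed the native context")
    return {
        "rope_type": "yarn",                         # assumed key name
        "factor": target / native,                   # 131,072 / 32,768
        "original_max_position_embeddings": native,  # assumed key name
    }


config = yarn_scaling_config(NATIVE_CONTEXT, EXTENDED_CONTEXT)
print(config["factor"])
```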

About Qwen 3

The Alibaba Qwen 3 model family comprises dense and Mixture-of-Experts (MoE) architectures, with parameter counts from 0.6B to 235B. Key innovations include a hybrid reasoning system, offering 'thinking' and 'non-thinking' modes for adaptive processing, and support for extensive context windows, enhancing efficiency and scalability.



Evaluation Benchmarks

Rank

#144

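Benchmark: Score (Rank)

General Knowledge

MMLU: 0.876 (rank 9)

Unlabeled benchmark: 0.65 (rank 45)

Web Development

WebDev Arena: 1384 (rank 45)

Unlabeled benchmark: 0.45 (rank 52)

Unlabeled benchmark: 0.49 (rank 55)

Unlabeled benchmark: 0.37 (rank 57)

Agentic Coding

LiveBench Agentic: 0.02 (rank 58)

General Text

Text Arena: 1327 (rank 76)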

Rankings

Overall Rank

#144

Coding Rank

#141

Model Integrity

Total Score

B+

75 / 100

Qwen3-30B-A3B Model Integrity Report


Audit Note

The model exhibits a high level of transparency regarding its architectural design and parameter density, particularly in its clear disclosure of active versus total parameters for its Mixture-of-Experts structure. It is backed by a permissive Apache 2.0 license and detailed technical reporting on its unique hybrid reasoning modes. However, transparency is more limited regarding the specific composition of its 36-trillion-token training set and the total compute resources expended during its development.

Upstream

22.0 / 30

Architectural Provenance

8.0 / 10

The model's architecture is extensively documented in the Qwen3 Technical Report (arXiv:2505.09388). It is a Mixture-of-Experts (MoE) transformer with 48 layers, utilizing Grouped Query Attention (GQA) with 32 query heads and 4 KV heads. The MoE design features 128 experts with 8 active per token. Key technical components like SwiGLU activation, RoPE (Rotary Position Embeddings), and RMSNorm are explicitly detailed. The report also describes a unique 'thinking mode' hybrid system and a three-stage pre-training methodology (General, Reasoning, and Long-context).
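With 32 query heads and 4 KV heads, each key-value head is shared by a group of 8 query heads. A minimal sketch of that mapping follows. It assumes contiguous grouping (query heads 0-7 share KV head 0, and so on), which is a common implementation choice but is not stated in the text. The function name `kv_head_for` is illustrative.

```python
NUM_QUERY_HEADS = 32  # from the text
NUM_KV_HEADS = 4      # from the text


def kv_head_for(query_head, num_query_heads=NUM_QUERY_HEADS, num_kv_heads=NUM_KV_HEADS):
    """Return the KV head index that a given query head attends with."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly among KV heads")
    if not 0 <= query_head < num_query_heads:
        raise IndexError("query head index out of range")
    group_size = num_query_heads // num_kv_heads  # 8 query heads per KV head
    return query_head // group_size  # assumption: contiguous groups


# Collect the query heads served by each KV head.
groups = {}
for q in range(NUM_QUERY_HEADS):
    groups.setdefault(kv_head_for(q), []).append(q)

for kv, qs in groups.items():
    print(f"KV head {kv}: query heads {qs[0]}-{qs[-1]}")
```

Because only 4 sets of keys and values are cached per layer instead of 32, the KV cache is one eighth the size it would be under standard multi-head attention.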

Dataset Composition

5.0 / 10

Alibaba discloses that the model was trained on a 36 trillion token corpus spanning 119 languages. While the general categories of data are mentioned—including web data, books, PDFs, and synthetic data generated by previous Qwen models (Qwen2.5-VL for extraction, Qwen2.5-Math/Coder for synthetic generation)—there is no precise percentage breakdown of the dataset composition (e.g., exact ratios of code vs. web vs. books). The filtering and cleaning methodologies are described at a high level but lack granular technical specifics.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly available via the Hugging Face repository and official Qwen GitHub. It uses Byte Pair Encoding (BPE) with a vocabulary size of 151,936 tokens. It supports 119 languages and dialects, which is verified by the model's extensive multilingual benchmark performance. Documentation provides clear instructions for handling special tokens, including the <think> tags used in reasoning mode.
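When reasoning mode is on, the decoded output carries the reasoning between the think tags and the final answer after them. The sketch below separates the two at the string level. It assumes the tags appear literally as `<think>` and `</think>` in the decoded text. The helper `split_thinking` and the sample completion are hypothetical.

```python
def split_thinking(text, open_tag="<think>", close_tag="</think>"):
    """Split a decoded completion into (reasoning, answer)."""
    head, sep, tail = text.partition(close_tag)
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", text.strip()
    # Drop the opening tag (if present) and surrounding whitespace.
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


# Hypothetical completion in reasoning mode.
sample = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
reasoning, answer = split_thinking(sample)
print(repr(reasoning))
print(repr(answer))
```

Splitting on the closing tag alone keeps the sketch tolerant of outputs where the opening tag was supplied by the prompt template and does not appear in the completion.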

Model

29.0 / 40

Parameter Density

9.5 / 10

The model provides exemplary transparency regarding its parameter count. It explicitly states a total of 30.5 billion parameters, with a non-embedding parameter count of 29.9 billion. Crucially for an MoE model, it clearly discloses that only 3.3 billion parameters are active per token during inference. The architectural breakdown of 128 total experts and 8 activated experts is consistently reported across all official documentation.
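These figures can be roughly reproduced from the architecture numbers. The sketch below counts weight-matrix parameters only. It assumes a hidden size of 2,048 (not stated in the prose), a head dimension of 128 and a per-expert intermediate size of 768 as listed in the specification sheet, three projection matrices per SwiGLU expert, and untied input and output embeddings. Normalization weights are ignored, and the function name `count_params` is illustrative.

```python
HIDDEN = 2_048       # assumption: not stated in the prose
LAYERS = 48          # from the text
Q_HEADS = 32         # from the text
KV_HEADS = 4         # from the text
HEAD_DIM = 128       # from the specification sheet
EXPERTS = 128        # from the text
ACTIVE_EXPERTS = 8   # from the text
EXPERT_FFN = 768     # per-expert intermediate size, from the specification sheet
VOCAB = 151_936      # from the text


def count_params(experts_per_layer):
    """Return (non_embedding, total) parameter counts for a given expert count."""
    # Q and O projections use all query heads; K and V use the KV heads.
    attention = HIDDEN * HEAD_DIM * (2 * Q_HEADS + 2 * KV_HEADS)
    # SwiGLU expert: gate, up, and down projections.
    expert = 3 * HIDDEN * EXPERT_FFN
    # The router scores every expert, even though only some run.
    router = HIDDEN * EXPERTS
    non_embedding = LAYERS * (attention + experts_per_layer * expert + router)
    # Assumption: separate input embedding and output head.
    embedding = 2 * VOCAB * HIDDEN
    return non_embedding, non_embedding + embedding


non_embedding, total = count_params(EXPERTS)
_, active = count_params(ACTIVE_EXPERTS)
print(f"total: {total / 1e9:.2f}B")
print(f"non-embedding: {non_embedding / 1e9:.2f}B")
print(f"active per token: {active / 1e9:.2f}B")
```

Under these assumptions the counts come out at roughly 30.5B total and 29.9B non-embedding, with a little under 3.4B active per token, in line with the reported figures.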

Training Compute

4.0 / 10

While the technical report mentions the use of scaling laws to tune hyperparameters and the scale of the training (36T tokens), it lacks specific details on the total GPU/TPU hours consumed, the specific hardware clusters used for the full training run, or the estimated carbon footprint. The information provided is limited to the scale of the data rather than the specific compute resources utilized.

Benchmark Reproducibility

6.5 / 10

The technical report provides results for standard benchmarks such as MMLU (81.38), GSM8K, MATH, and LiveCodeBench. Evaluation settings, including sampling parameters (Temperature=0.6, TopP=0.95 for thinking mode) and prompt templates, are documented. However, the full evaluation code and the exact internal test sets used for 'thinking mode' validation are not fully public, and third-party verification for the newest Qwen3 variants is still emerging.
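To make the sampling settings concrete, here is a minimal sketch of temperature scaling followed by nucleus (top-p) sampling over a list of logits, using the thinking-mode defaults quoted above. It is a generic illustration of the technique, not the evaluation harness used in the report. The function name `sample_top_p` and the example logits are assumptions.

```python
import math
import random


def sample_top_p(logits, temperature=0.6, top_p=0.95, rng=random):
    """Sample one token index using temperature scaling and nucleus filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature-scaled, numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample within the nucleus, weighted by the surviving probabilities.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


# Hypothetical logits for a four-token vocabulary.
token = sample_top_p([2.0, 1.0, 0.5, -1.0], rng=random.Random(0))
print(token)
```

A temperature below 1 sharpens the distribution before filtering, so fewer tokens are needed to reach the 0.95 cumulative threshold.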

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, correctly identifying itself as a Qwen3 series model in both its system prompts and documentation. It maintains clear versioning between the base, instruct, and 'thinking' variants. There are no documented cases of the model claiming to be a competitor's product, and it is transparent about its dual-mode (thinking vs. non-thinking) capabilities.

Downstream

23.5 / 30

License Clarity

10.0 / 10

The model is released under the Apache 2.0 license, which is a standard, permissive open-source license. This allows for commercial use, modification, and distribution without the restrictive 'custom' terms often found in other 'open' weights models. The licensing is consistent across the weights on Hugging Face and the official GitHub repository.

Hardware Footprint

7.5 / 10

Hardware requirements are well-documented for various deployment scenarios. Official documentation specifies VRAM needs for standard inference and provides guidance for ultra-long context (up to 1M tokens requiring ~240GB VRAM). It also notes the memory savings (approx. 10GB) when disabling specific multimodal components. Quantization support is mentioned for frameworks like llama.cpp and vLLM, though detailed accuracy-tradeoff curves for all quantization levels are not fully provided.
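A rough sense of where the memory goes can be had from two terms: the weights and the KV cache. The sketch below estimates both from figures in this document (30.5B total parameters, 48 layers, 4 KV heads, head dimension 128). It assumes a 16-bit KV cache and ignores activations, framework overhead, and quantization metadata, so treat the results as lower bounds rather than deployment guidance. The function names are illustrative.

```python
LAYERS = 48            # from the text
KV_HEADS = 4           # from the text
HEAD_DIM = 128         # from the specification sheet
TOTAL_PARAMS = 30.5e9  # from the text

GIB = 1024 ** 3


def kv_cache_bytes(tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor 2: one key vector and one value vector per KV head per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value * tokens


def weight_bytes(bits_per_weight):
    """Bytes needed to hold all weights at a given precision."""
    return TOTAL_PARAMS * bits_per_weight / 8


for tokens in (1_024, 65_536, 131_072):
    print(f"{tokens:>7} tokens: KV cache ~ {kv_cache_bytes(tokens) / GIB:.2f} GiB")

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights ~ {weight_bytes(bits) / 1e9:.1f} GB")
```

Note that all 30.5B parameters must be resident even though only a fraction is active per token, so the MoE design saves compute and bandwidth per token rather than weight memory.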

Versioning Drift

6.0 / 10

Alibaba uses a date-based versioning system (e.g., 2507 for July 2025 updates) and maintains a clear distinction between 'Base', 'Instruct', and 'Thinking' versions. While changelogs are provided on GitHub and Hugging Face, the frequency of 'silent' updates to the underlying API endpoints without corresponding weight version bumps remains a minor concern for long-term reproducibility.
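Taken together, the sub-scores in this report roll up into the category and overall totals as follows. The sketch assumes the overall score is the plain sum of the three categories rounded half up to an integer. The report does not state its rounding rule, so that step is an assumption.

```python
import math

# Sub-scores as listed in the report (each out of 10).
SUBSCORES = {
    "Upstream": {
        "Architectural Provenance": 8.0,
        "Dataset Composition": 5.0,
        "Tokenizer Integrity": 9.0,
    },
    "Model": {
        "Parameter Density": 9.5,
        "Training Compute": 4.0,
        "Benchmark Reproducibility": 6.5,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 10.0,
        "Hardware Footprint": 7.5,
        "Versioning Drift": 6.0,
    },
}


def category_totals(subscores):
    """Sum the sub-scores within each category."""
    return {name: sum(items.values()) for name, items in subscores.items()}


totals = category_totals(SUBSCORES)
raw_total = sum(totals.values())
# Assumption: round half up (Python's built-in round() would round 74.5 down).
overall = math.floor(raw_total + 0.5)

print(totals)
print(raw_total, "->", overall)
```

The category sums match the reported 22.0, 29.0, and 23.5, and the raw total of 74.5 reaches the reported 75 only under half-up rounding.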


Qwen3-30B-A3B: Specifications and GPU VRAM Requirements