| Attribute | Value |
|---|---|
| Total Parameters | 235B |
| Context Length | 262,144 tokens |
| Modality | Reasoning |
| Architecture | Mixture of Experts (MoE) |
| License | Apache 2.0 |
| Release Date | 25 Jul 2025 |
| Knowledge Cutoff | - |
| Active Parameters | 22.0B |
| Number of Experts | 128 |
| Active Experts | 8 |
| Attention Structure | Grouped-Query Attention (GQA) |
| Hidden Dimension Size | - |
| Number of Layers | 94 |
| Attention Heads | 64 |
| Key-Value Heads | 4 |
| Activation Function | SwiGLU |
| Normalization | RMS Normalization |
| Position Embedding | Rotary Position Embedding (RoPE) |
The Qwen3-235B-A22B-Thinking model is a specialized reasoning variant within Alibaba's Qwen3 family, engineered for demanding cognitive tasks. Operating as a causal language model, it is purpose-built for multi-step logical deduction, intricate mathematical proofs, and advanced scientific analysis. Unlike the hybrid Qwen3 models that offer a choice between response modes, this 'Thinking' variant is permanently optimized for a reasoning-first approach: it generates internal chain-of-thought traces, encapsulated within a system-defined thinking block, to maintain transparency and maximize accuracy in complex problem-solving environments.
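Because the reasoning trace is delimited in the raw output, downstream code usually needs to separate it from the user-facing answer. The following is a minimal Python sketch assuming the `<think>...</think>` tag convention used by Qwen3 thinking variants; the helper name and the example string are illustrative, and some serving stacks strip or relocate these tags.

```python
import re

def split_thinking(completion: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning_trace, final_answer).

    Assumes the chain-of-thought is wrapped in <think>...</think>;
    everything after the closing tag is treated as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        # No thinking block present: return the whole completion as the answer.
        return "", completion.strip()
    return match.group(1).strip(), completion[match.end():].strip()

# Hypothetical completion string for illustration:
raw = "<think>262144 = 2**18, a power of two.</think>Yes: the window is 2^18 tokens."
trace, answer = split_thinking(raw)
print(answer)  # -> "Yes: the window is 2^18 tokens."
```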
Architecturally, the model uses a sparse Mixture-of-Experts (MoE) transformer with 128 total experts. During each inference pass, the routing mechanism dynamically selects and activates 8 experts per token, so approximately 22 billion of the 235 billion total parameters are active at once. This design gives the model the representational capacity of a massive parameter space while keeping the computational profile and latency close to those of a much smaller dense model. The architecture further incorporates Grouped-Query Attention (GQA) with a 64:4 query-to-key-value head ratio across 94 transformer layers, balancing high-throughput inference with robust long-range dependency modeling.
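To make the routing concrete, here is a small PyTorch sketch of generic top-k expert selection with the numbers above (128 experts, 8 active per token). The `moe_route` helper is hypothetical; the actual Qwen3 router (its load-balancing losses, normalization order, and capacity handling) may differ.

```python
import torch
import torch.nn.functional as F

def moe_route(hidden: torch.Tensor, router: torch.nn.Linear, top_k: int = 8):
    """Per token: score all experts, keep the top_k, renormalize their weights."""
    logits = router(hidden)                       # (tokens, num_experts)
    weights, expert_ids = logits.topk(top_k, dim=-1)
    weights = F.softmax(weights, dim=-1)          # mixture weights over the chosen experts
    return weights, expert_ids

# Toy usage: 4 tokens, hidden size 32, 128 experts.
router = torch.nn.Linear(32, 128, bias=False)
weights, ids = moe_route(torch.randn(4, 32), router)
print(ids.shape)  # torch.Size([4, 8]) -> 8 active experts per token
```

Only the selected experts' feed-forward weights participate in each token's computation, which is why roughly 22B of the 235B parameters are active per pass. The 64:4 GQA ratio, in turn, means 16 query heads share each key-value head, shrinking the KV cache to 1/16 of what full multi-head attention would need.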
The model offers a native context window of 262,144 tokens, enabling it to process extremely long documents and sustain complex agentic workflows. For stability and efficiency in large-scale deployments, it employs RMSNorm for normalization and the SwiGLU activation function, and it encodes positions with Rotary Positional Embeddings (RoPE), which generalize well across varying sequence lengths. This 2507 iteration is an enhanced version of the original Qwen3 reasoning architecture, featuring refined training on step-by-step analytical datasets to further improve performance in coding, STEM, and strategic planning domains.
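For readers unfamiliar with these components, the PyTorch sketch below shows minimal versions of RMSNorm, the SwiGLU gate, and an interleaved-pair RoPE rotation. The dimensions, epsilon, and base frequency of 10,000 are illustrative defaults, not values taken from the Qwen3 configuration.

```python
import torch

class RMSNorm(torch.nn.Module):
    """RMSNorm: rescale by the inverse root-mean-square; no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

def swiglu(x, w_gate, w_up):
    """SwiGLU feed-forward gate: SiLU(x @ W_gate) * (x @ W_up)."""
    return torch.nn.functional.silu(x @ w_gate) * (x @ w_up)

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate interleaved channel pairs of x by position-dependent angles."""
    seq, dim = x.shape[-2], x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * inv_freq  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because the rotation angle depends only on channel frequency and absolute position, dot products between RoPE-rotated queries and keys depend on their relative distance, which is what lets the scheme generalize across sequence lengths.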
The Alibaba Qwen3 model family comprises dense and Mixture-of-Experts (MoE) architectures, with parameter counts ranging from 0.6B to 235B. Key innovations include a hybrid reasoning system offering 'thinking' and 'non-thinking' modes for adaptive processing, and support for extensive context windows, enhancing efficiency and scalability.
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| General Knowledge | MMLU | 0.91 | 🥇 1 |
| Data Analysis | LiveBench Data Analysis | 0.75 | 🥉 3 |
| Professional Knowledge | MMLU Pro | 0.84 | 4 |
| Graduate-Level QA | GPQA | 0.81 | 13 |
| Reasoning | LiveBench Reasoning | 0.59 | 23 |
| Mathematics | LiveBench Mathematics | 0.73 | 23 |
| Coding | LiveBench Coding | 0.69 | 29 |
| Agentic Coding | LiveBench Agentic | 0.07 | 39 |
Overall Rank: #33
Coding Rank: #62