
Qwen3 235B A22B Thinking

Total Parameters

235B

Context Length

262,144 tokens

Modality

Text

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

25 Jul 2025

Knowledge Cutoff

-

Technical Specifications

Active Parameters

22.0B

Number of Experts

128

Active Experts

8

Attention Structure

Grouped-Query Attention (GQA)

Hidden Dimension Size

-

Number of Layers

94

Attention Heads

64

Key-Value Heads

4

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

Rotary Position Embedding (RoPE)

Qwen3 235B A22B Thinking

The Qwen3-235B-A22B-Thinking model is a specialized reasoning variant within Alibaba's Qwen3 family, engineered specifically for high-stakes cognitive tasks. Operating as a causal language model, it is purpose-built to execute multi-step logical deduction, intricate mathematical proofs, and advanced scientific analysis. Unlike general-purpose models that provide a choice between response modes, this 'Thinking' variant is permanently optimized for a reasoning-first approach. It generates internal chain-of-thought traces, often encapsulated within a system-defined thinking block, to maintain transparency and maximize accuracy in complex problem-solving environments.
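As a minimal sketch of this interaction pattern (assuming the Hugging Face transformers API and the public Qwen/Qwen3-235B-A22B-Thinking-2507 checkpoint), the reasoning trace can be separated from the final answer by splitting on the closing thinking tag:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only: adjust device/quantization settings to your hardware.
model_id = "Qwen/Qwen3-235B-A22B-Thinking-2507"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(inputs, max_new_tokens=4096)
text = tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True)

# The chat template opens the thinking block for the model, so the generated
# text typically contains only the closing "</think>" tag; everything before
# it is the chain-of-thought trace, everything after is the answer.
reasoning, _, answer = text.partition("</think>")
print(answer.strip())
```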

Architecturally, the model utilizes a sparse Mixture-of-Experts (MoE) transformer framework, consisting of 128 total experts. During any single inference pass, the routing mechanism dynamically selects and activates 8 experts per token, resulting in approximately 22 billion active parameters from a total pool of 235 billion. This design ensures that the model provides the representational capacity of a massive parameter space while maintaining the computational profile and latency of a significantly smaller dense model. The system further incorporates Grouped-Query Attention (GQA) with a 64:4 head ratio and 94 transformer layers, balancing high-throughput inference with robust long-range dependency modeling.
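A toy sketch of this routing step (random weights and illustrative dimensions, not the actual model internals) shows why only a fraction of the parameters participate per token:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 128, 8

router_w = rng.normal(size=(d_model, n_experts))            # router projection
experts = [rng.normal(size=(d_model, d_model)) * 0.02       # one toy "expert" each
           for _ in range(n_experts)]

def moe_layer(x):
    """Route a single token vector x through its top-k experts."""
    logits = x @ router_w                                   # (n_experts,)
    top = np.argsort(logits)[-top_k:]                       # indices of the 8 chosen
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                                # softmax over the selected 8
    # Weighted sum of the chosen experts' outputs; the other 120 experts
    # (and their parameters) are never touched for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (64,)
```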

Technical performance is characterized by a native context window of 262,144 tokens, facilitating the processing of extremely long documents and complex agentic workflows. To ensure stability and efficiency during large-scale deployments, the model employs RMSNorm for normalization and the SwiGLU activation function. For position encoding, it utilizes Rotary Positional Embeddings (RoPE), which provide improved generalization to varying sequence lengths. This specific 2507 iteration represents an enhanced version of the original Qwen3 reasoning architecture, featuring refined training on step-by-step analytical datasets to further improve performance in coding, STEM, and strategic planning domains.
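The three components named above admit compact definitions; the following toy numpy sketch (illustrative dimensions, not the model's actual sizes) shows the computations involved:

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square instead of mean and variance.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * g

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: a SiLU-gated linear unit.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def rope(x, pos, base=10000.0):
    # RoPE: rotate consecutive dimension pairs by a position-dependent angle,
    # so relative offsets are encoded directly in query-key dot products.
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = pos * freqs
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(angles) - x2 * np.sin(angles)
    out[..., 1::2] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return out

x = np.random.default_rng(1).normal(size=8)
print(rope(x, pos=5).shape)  # (8,)
```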

About Qwen 3

The Alibaba Qwen 3 model family comprises dense and Mixture-of-Experts (MoE) architectures, with parameter counts from 0.6B to 235B. Key innovations include a hybrid reasoning system, offering 'thinking' and 'non-thinking' modes for adaptive processing, and support for extensive context windows, enhancing efficiency and scalability.
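As an illustration of the hybrid system, the published Qwen3 chat template exposes an enable_thinking flag (sketch below; note the dedicated Thinking variant described on this page is fixed to thinking mode and does not use this toggle):

```python
from transformers import AutoTokenizer

# Uses a hybrid Qwen3 checkpoint, e.g. Qwen/Qwen3-32B, for illustration.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
messages = [{"role": "user", "content": "Summarize RMSNorm in one sentence."}]

fast = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,   # 'non-thinking' mode: answer directly
)
deliberate = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=True,    # 'thinking' mode: emit a reasoning trace first
)
```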



Evaluation Benchmarks

Category                 Benchmark            Score   Rank
General Knowledge        MMLU                 0.91    1 🥇
General Knowledge        —                    0.75    3 🥉
Professional Knowledge   MMLU Pro             0.84    4
Graduate-Level QA        GPQA                 0.81    13
Graduate-Level QA        —                    0.59    23
Graduate-Level QA        —                    0.73    23
Graduate-Level QA        —                    0.69    29
Agentic Coding           LiveBench Agentic    0.07    39

Rankings

Overall Rank

#33

Coding Rank

#62

GPU Requirements

VRAM requirements depend on the weight quantization method and the context size, from 1k up to the full 256k tokens.
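A rough back-of-envelope estimate (assuming a head dimension of 128, which is not listed in the specifications above, and ignoring activation and framework overhead):

```python
# Weight memory scales with quantization; KV-cache memory with context length.
TOTAL_PARAMS = 235e9          # all experts must be resident, not just the 22B active
N_LAYERS, KV_HEADS, HEAD_DIM = 94, 4, 128   # HEAD_DIM=128 is an assumption

def vram_gb(bytes_per_weight, context_tokens, kv_bytes=2):
    weights = TOTAL_PARAMS * bytes_per_weight
    # KV cache: 2 tensors (K and V) per layer, kv_heads * head_dim each, per token.
    kv_cache = 2 * N_LAYERS * KV_HEADS * HEAD_DIM * context_tokens * kv_bytes
    return (weights + kv_cache) / 1e9

for label, bpw in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{label}: ~{vram_gb(bpw, 262_144):.0f} GB at the full 262,144-token context")
```

At FP16 this works out to roughly 520 GB (470 GB of weights plus about 49 GB of KV cache), dropping to roughly 170 GB at 4-bit quantization under these assumptions.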