Total Parameters
235B
Context Length
131,072 tokens
Modality
Text
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
29 Apr 2025
Knowledge Cutoff
-
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
64
Key-Value Heads
4
Attention Head Dimension
128
Position Embedding
RoPE
RoPE Theta
1,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
4,096
Number of Layers
94
FFN Intermediate Size (Dense)
1,536
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
151,936
Mixture of Experts
Activated Parameters per Token
22.0B
Number of Experts
128
Active Experts
8
Shared Experts
-
FFN Intermediate Size (per Expert)
1,536
Dense Layers Before MoE
-
Qwen3-235B-A22B is the flagship Mixture-of-Experts (MoE) large language model in Alibaba Cloud's Qwen3 series. It targets tasks that demand strong reasoning and broad knowledge, such as code generation, mathematical problem-solving, and multi-step logical deduction. It is also designed for processing long documents, managing multi-turn conversations, and analyzing enterprise-scale datasets.
Qwen3-235B-A22B combines a 'thinking mode' and a 'non-thinking mode' in a single model. Thinking mode handles complex, multi-step reasoning by emitting its intermediate reasoning explicitly, while non-thinking mode returns fast, direct responses. Switching between the two according to task complexity or user request lets inference compute scale with the difficulty of the query. The MoE layers are sparsely activated: a router sends each input token to the 8 most relevant of 128 experts. As a result, only about 22 billion of the 235 billion total parameters are used for any given token. The model was pre-trained on approximately 36 trillion tokens covering 119 languages and dialects. Other architectural components include Grouped-Query Attention (GQA), in which several query heads share each key-value head, Rotary Positional Embedding (RoPE) for position encoding, and Flash Attention for faster attention computation. Normalization uses pre-norm RMSNorm, and the activation function is SwiGLU.
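The expert routing step can be illustrated with a minimal sketch. It uses the 128 experts and 8 active experts listed in the specification; the random router logits and the choice to normalize the selected scores with a softmax are assumptions made for illustration, not the model's confirmed implementation.

```python
import math
import random

NUM_EXPERTS = 128  # "Number of Experts" in the specification
TOP_K = 8          # "Active Experts" in the specification


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k highest-scoring experts."""
    # Indices of the top_k largest router logits, best first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over the selected logits only, so the weights sum to 1.
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for one token (random, for illustration only).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
assignment = route_token(logits)

print(len(assignment))                          # number of experts chosen
print(round(sum(w for _, w in assignment), 6))  # weights sum to 1
```

Only the selected experts run their feed-forward computation for that token; their outputs would then be combined using the returned weights.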
Qwen3-235B-A22B is positioned for instruction following, logical reasoning, text understanding, and mathematics, science, and coding tasks. Because the MoE design activates only a fraction of the parameters for each token, the compute required per inference step is much lower than for a dense model of the same total size, which in turn reduces energy consumption and operating cost. The model's long context window helps it maintain coherence and retrieve relevant information over long sequences. The weights are publicly available under the Apache 2.0 license, which permits broad adoption and further research. They can be deployed across a range of frameworks and platforms, including local tools such as Ollama, LM Studio, and llama.cpp.
The Alibaba Qwen3 model family comprises dense and Mixture-of-Experts (MoE) architectures, with parameter counts from 0.6B to 235B. Key features include a hybrid reasoning system with 'thinking' and 'non-thinking' modes for adaptive processing, and support for long context windows.
Rank
#95
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| General Knowledge | MMLU | 0.878 | 7 |
| Coding | Aider Coding | 0.60 | 15 |
| Web Development | WebDev Arena | 1422 | 16 |
| Professional Knowledge | MMLU Pro | 0.84 | 22 |
| Graduate-Level QA | GPQA | 0.775 | 32 |
| Coding | LiveBench Coding | 0.70 | 37 |
| Reasoning | LiveBench Reasoning | 0.58 | 40 |
| Mathematics | LiveBench Mathematics | 0.68 | 41 |
| Agentic Coding | LiveBench Agentic | 0.13 | 52 |
| Data Analysis | LiveBench Data Analysis | 0.45 | 53 |
Overall Rank
#95
Coding Rank
#42
Total Score
73 / 100
Qwen3-235B-A22B exhibits a strong transparency profile regarding its technical architecture and licensing, supported by a detailed technical report and a permissive Apache 2.0 license. However, significant opacity remains concerning the specific composition of its 36-trillion-token training dataset and the total compute resources consumed during training. While hardware requirements and versioning are well documented, the lack of detailed data provenance and environmental impact metrics is the primary weakness in its disclosure.
Architectural Provenance
The model's architecture is extensively documented in the Qwen3 Technical Report (arXiv:2505.09388). It is a sparse Mixture-of-Experts (MoE) transformer with 235B total parameters and 22B active parameters. Key components are explicitly named: 94 transformer layers, 128 experts with 8 activated per token, Grouped-Query Attention (GQA) with 64 query heads and 4 key-value heads, Rotary Positional Embedding (RoPE), and SwiGLU activation. The report details the pre-training and post-training stages, including the integration of the 'thinking' and 'non-thinking' modes.
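To show why the GQA head counts matter in practice, here is a rough KV-cache estimate. It uses the 94 layers and 4 key-value heads named above and the attention head dimension of 128 from the specification; storing each cached value in 16 bits is an assumption, and the function name is hypothetical.

```python
def kv_cache_bytes(tokens, layers=94, kv_heads=4, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions.

    bytes_per_value=2 assumes 16-bit cache entries (an assumption, not a documented setting).
    """
    # The factor 2 covers one key vector and one value vector per KV head per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(131_072)

print(per_token)             # 192512 bytes per cached token
print(full_context / 2**30)  # 23.5 GiB at 131,072 tokens

# With one KV head per query head (64 instead of 4), the cache would be 16x larger.
print(kv_cache_bytes(1, kv_heads=64) // per_token)  # 16
```

The estimate covers a single sequence and ignores any framework overhead, so it is best read as a lower bound.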
Dataset Composition
While the scale of the pre-training corpus is disclosed (36 trillion tokens) and its multilingual breadth is stated (119 languages), the specific composition breakdown (e.g., percentage of web, code, books) is not provided. The documentation mentions 'diverse web sources and documents' and the use of Qwen2.5-VL for data extraction, but lacks a detailed categorical distribution or public access to the training data samples.
Tokenizer Integrity
The tokenizer is publicly available via Hugging Face and documented in the official 'Read the Docs'. It uses Byte Pair Encoding (BPE) with a documented tokenizer vocabulary of 151,646 tokens; the larger figure of 151,936 listed in the specification appears to be the size of the embedding table, which is padded beyond the tokenizer vocabulary. The documentation explicitly describes the subword tokenization method and provides code snippets for implementation. It supports the claimed 119 languages and includes specific control tokens for the 'thinking' mode (<think> and </think>).
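As an illustration of how the <think> and </think> control tokens might be handled downstream, the sketch below separates the reasoning from the final answer in decoded text. The function name and sample string are hypothetical; the sketch also assumes the opening tag may be absent from the generated text, in which case everything before </think> is treated as reasoning.

```python
def split_thinking(text, open_tag="<think>", close_tag="</think>"):
    """Split decoded model output into (reasoning, answer)."""
    end = text.find(close_tag)
    if end == -1:
        # No closing tag: treat the whole output as the answer.
        return "", text.strip()
    start = text.find(open_tag)
    # Start after the opening tag if it appears before the closing tag,
    # otherwise from the beginning of the text.
    begin = start + len(open_tag) if 0 <= start < end else 0
    reasoning = text[begin:end].strip()
    answer = text[end + len(close_tag):].strip()
    return reasoning, answer


sample = "<think>2 + 2 is 4.</think>The answer is 4."
reasoning, answer = split_thinking(sample)
print(reasoning)  # 2 + 2 is 4.
print(answer)     # The answer is 4.
```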
Parameter Density
Alibaba provides clear and consistent parameter counts: 235B total and 22B active. The architectural breakdown is detailed, specifying 94 layers and the expert routing mechanism (8 of 128 experts activated per token). The distinction between total and active parameters is maintained across all official documentation, including the technical report and model cards.
Training Compute
Information regarding training compute is minimal. While the technical report mentions the use of 'extensive training' and 'reinforcement learning compute', it does not disclose the specific number of GPU/TPU hours, the hardware cluster size used for the flagship 235B model, or the total carbon footprint. Environmental impact data and specific training costs are conspicuously absent.
Benchmark Reproducibility
The technical report provides results for standard benchmarks (AIME, LiveCodeBench, MMLU-Pro) and specifies the versions used. Evaluation settings, such as temperature (0.6) and top_p (0.95) for thinking mode, are documented. However, the full evaluation code and the exact prompts for all benchmarks are not fully public, and third-party verification is limited to early leaderboard entries.
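For readers reproducing these settings, the following is a minimal sketch of what temperature and top_p control during decoding. It is a generic nucleus-sampling illustration with the documented values as defaults, not the evaluation harness used in the report; the function name and toy logits are hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=0.6, top_p=0.95, rng=random):
    """Sample one token index using temperature scaling and nucleus (top-p) filtering."""
    # Temperature-scaled softmax (lower temperature sharpens the distribution).
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample among the kept tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# One dominant logit: the nucleus contains only token 0.
print(sample_top_p([10.0, 0.0, 0.0]))  # 0
```

Because sampling is stochastic, reproducing a reported score also depends on the seed and the number of samples per question, which this sketch does not address.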
Identity Consistency
The model demonstrates high identity consistency. It is clearly versioned (e.g., Qwen3-235B-A22B-Instruct-2507) and correctly identifies its capabilities, such as the dual 'thinking' and 'non-thinking' modes. Documentation explicitly guides users on how the model identifies itself through chat templates and system prompts.
License Clarity
The model weights and code are released under the Apache 2.0 license, which is a standard, permissive open-source license. This is consistently stated across Hugging Face, GitHub, and the technical report. The license allows for commercial use, modification, and distribution without conflicting proprietary terms.
Hardware Footprint
Hardware requirements are well-documented for various quantization levels. Official and community sources (e.g., localai.computer, vllm-ascend) provide VRAM estimates: ~460GB for FP16, ~230GB for Q8, and ~115GB for Q4. The impact of context length on memory (YaRN for 128K) is also documented, providing clear guidance for deployment on both consumer and datacenter hardware.
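The quantization figures above can be approximated with simple arithmetic: total parameters times bits per weight. The sketch below assumes 235 billion parameters, decimal gigabytes, and exactly 16, 8, and 4 bits per weight; real quantization formats add some overhead for scales and metadata, and the KV cache and activations are not included.

```python
def weight_memory_gb(total_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 235e9  # total parameter count stated in the document

for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(name, weight_memory_gb(TOTAL_PARAMS, bits))
# FP16 470.0
# Q8 235.0
# Q4 117.5
```

These results land within a few percent of the quoted estimates, which suggests the published figures follow the same weights-only reasoning.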
Versioning Drift
Alibaba uses a clear versioning scheme (e.g., the '2507' suffix for July 2025 updates). Changelogs and blog posts document significant shifts, such as the decision to separate 'Instruct' and 'Thinking' models in the July update. While semantic versioning is not strictly followed in the traditional software sense, the date-based versioning provides a verifiable history of model iterations.
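A date-based suffix of this kind is easy to handle programmatically. The sketch below assumes the suffix is a two-digit year followed by a two-digit month, as the '2507' example for July 2025 suggests; the helper name is hypothetical.

```python
import re


def parse_release_suffix(model_name):
    """Return (year, month) from a trailing YYMM suffix, or None if there is none."""
    match = re.search(r"-(\d{2})(\d{2})$", model_name)
    if match is None:
        return None
    year = 2000 + int(match.group(1))
    month = int(match.group(2))
    # Reject four-digit suffixes that cannot be a month.
    if not 1 <= month <= 12:
        return None
    return year, month


print(parse_release_suffix("Qwen3-235B-A22B-Instruct-2507"))  # (2025, 7)
print(parse_release_suffix("Qwen3-235B-A22B"))                # None
```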