Total Parameters
480B
Context Length
262,144 tokens
Modality
Text
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
22 Jul 2025
Knowledge Cutoff
Dec 2024
Active Parameters
35B
Number of Experts
160
Active Experts
8
Attention Structure
Grouped Query Attention (GQA)
Hidden Dimension Size
6144
Number of Layers
62
Attention Heads
96
Key-Value Heads
8
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
Rotary Position Embedding (RoPE)
Qwen3 Coder 480B A35B is Alibaba's advanced agentic artificial intelligence model, specifically engineered for high-performance software development and autonomous coding workflows. As a specialized variant of the Qwen 3 family, it is designed to manage complex multi-turn programming tasks, including comprehensive repository analysis, cross-file reasoning, and automated pull request generation. The model serves as the primary engine for autonomous software engineering, enabling deep integration with developer tools and terminal-based agents like Qwen Code.
Architecturally, the model is a sparse Mixture-of-Experts (MoE) decoder-only transformer. It comprises 480 billion total parameters but activates only 35 billion per token: of its 160 experts, 8 are selected by a gating mechanism for each token. The network stacks 62 transformer layers and uses Grouped Query Attention (GQA) with 96 query heads and 8 key-value heads to reduce memory bandwidth and accelerate inference. It employs Rotary Position Embeddings (RoPE), extended through techniques such as YaRN, supporting a native context window of 262,144 tokens that can be stretched up to one million.
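The top-k gating described above can be illustrated with a toy sketch. This is not the model's actual router (the real gate is a learned layer inside each MoE block); the function name and the random stand-in scores are hypothetical, chosen only to show how 8 of 160 experts are picked and their weights renormalised per token.

```python
import math
import random

def moe_route(logits, k=8):
    """Toy top-k gating: keep the k highest-scoring experts and
    renormalise their scores with a softmax over just those k."""
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in topk)                 # subtract max for stability
    exps = [math.exp(logits[i] - m) for i in topk]
    total = sum(exps)
    return topk, [e / total for e in exps]

random.seed(0)
n_experts = 160                                      # expert count from the model card
logits = [random.gauss(0, 1) for _ in range(n_experts)]  # stand-in router scores

experts, weights = moe_route(logits, k=8)
print(len(experts), round(sum(weights), 6))          # 8 experts, weights sum to 1.0
```

Because only the 8 selected experts run their feed-forward pass, most expert parameters stay idle for any given token, which is what keeps the active parameter count at 35B despite 480B total.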
The model is trained on a massive dataset of 7.5 trillion tokens, with a 70% concentration on source code and technical content across multiple programming languages including Python, JavaScript, C++, and Rust. Its post-training phase leverages long-horizon reinforcement learning, specifically Agent RL and Code RL, to improve multi-step planning and interaction with external tools such as browsers and CLI environments. This specialization allows the model to function as a sophisticated coding agent capable of executing complex engineering tasks and managing entire codebases with high precision.
The Alibaba Qwen 3 model family comprises dense and Mixture-of-Experts (MoE) architectures, with parameter counts from 0.6B to 235B. Key innovations include a hybrid reasoning system, offering 'thinking' and 'non-thinking' modes for adaptive processing, and support for extensive context windows, enhancing efficiency and scalability.
| Benchmark | Score | Rank |
|---|---|---|
| WebDev Arena (Web Development) | 1386 | #22 |
Overall Rank
#26
Coding Rank
#31
Total Score
68 / 100
Qwen3-Coder 480B A35B demonstrates high transparency in its architectural design and licensing, providing clear distinctions between total and active parameters in its Mixture-of-Experts framework. However, the profile is weakened by a lack of disclosure regarding training compute resources and environmental impact. While the model is accessible and well-documented for deployment, reproducibility concerns regarding certain benchmark claims suggest a need for more granular evaluation transparency.
Architectural Provenance
The model's architecture is extensively documented in the Qwen3 technical report (arXiv:2505.09388) and official blog posts. It is a sparse Mixture-of-Experts (MoE) decoder-only transformer with 480B total and 35B active parameters. Specific details include 62 layers, Grouped Query Attention (GQA) with 96 query and 8 KV heads, and the use of Rotary Position Embeddings (RoPE) with YaRN for context extension. The training methodology, including the use of Agent RL and Code RL for post-training, is clearly described.
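The memory benefit of the documented GQA layout (96 query heads, 8 KV heads) can be estimated with back-of-envelope arithmetic. This sketch assumes a per-head dimension of 6144 / 96 = 64 and an fp16/bf16 cache; these are plausible inferences from the published dimensions, not officially stated figures.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Bytes for keys + values across all layers (fp16/bf16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

layers, q_heads, kv_heads = 62, 96, 8      # from the model card
head_dim = 6144 // q_heads                 # 64, assuming head_dim = hidden / q_heads
ctx = 262_144                              # native context window

gqa = kv_cache_bytes(layers, kv_heads, head_dim, ctx)
mha = kv_cache_bytes(layers, q_heads, head_dim, ctx)
print(f"GQA cache: {gqa / 2**30:.0f} GiB ({mha // gqa}x smaller than full MHA)")
# → GQA cache: 31 GiB (12x smaller than full MHA)
```

Sharing each KV head across 12 query heads shrinks the cache twelvefold, which is what makes the 262K-token native window practical at serving time.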
Dataset Composition
Alibaba discloses the total token count (7.5 trillion) and a high-level composition breakdown (70% code, 30% general text/math). While it names specific programming languages (Python, JavaScript, C++, Rust) and mentions the use of synthetic data cleaned by Qwen2.5-Coder, it lacks a granular breakdown of specific web or book sources. The filtering and cleaning methodologies are mentioned but not provided in exhaustive detail.
Tokenizer Integrity
The model uses the Qwen3 tokenizer with a vocabulary size of 151,936. The tokenizer is publicly available via the official GitHub repository and Hugging Face. Documentation specifies the use of ChatML templates and a new tool parser (qwen3coder_tool_parser.py) to maintain consistency with the Qwen3 family's agentic capabilities. Tokenization alignment with claimed language support is verifiable through the provided code snippets.
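The ChatML template family mentioned above wraps each turn in `<|im_start|>role ... <|im_end|>` markers. In practice one should use `tokenizer.apply_chat_template` from the `transformers` library; the hand-rolled formatter below is only an illustrative sketch of the wire format, and its function name is hypothetical.

```python
def to_chatml(messages):
    """Render a message list in ChatML: each turn is wrapped in
    <|im_start|>role ... <|im_end|> markers, then the assistant is cued."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")   # generation starts here
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a coding assistant."},
    {"role": "user", "content": "Write a binary search in Python."},
])
print(prompt)
```

The special markers are single tokens in the Qwen vocabulary, so the template's cost in context length is minimal.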
Parameter Density
The model provides exemplary clarity regarding its MoE structure, explicitly stating 480B total parameters and 35B active parameters per token. It further details the expert configuration (160 total experts, 8 active per forward pass). Architectural dimensions like hidden size (6144) and intermediate sizes are publicly documented in the model card and technical reports.
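The disclosed figures let anyone check the sparsity ratios directly. A quick calculation also shows why the two ratios differ: the dense components (attention, embeddings, the gate itself) run for every token, so the active-parameter fraction sits above the active-expert fraction.

```python
total_params, active_params = 480e9, 35e9   # from the model card
n_experts, active_experts = 160, 8

param_frac = active_params / total_params
expert_frac = active_experts / n_experts

print(f"params active per token:  {param_frac:.1%}")   # ~7.3%
print(f"experts active per token: {expert_frac:.0%}")  # 5%
# The gap between the two ratios is the always-on dense share of the network.
```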
Training Compute
Information regarding training compute is largely absent. While the technical report mentions the scale of the pre-training (7.5T tokens), it does not disclose specific GPU/TPU hours, hardware quantities used for the full training run, or the total carbon footprint. Some third-party sources speculate on the massive resource requirements, but official verifiable metrics are missing.
Benchmark Reproducibility
While the model provides scores for standard benchmarks like SWE-Bench (69.6%) and HumanEval, independent researchers have reported difficulties reproducing certain flagship claims (notably ARC-AGI-1). Evaluation code for the 'Qwen Code' CLI is public, but the exact prompts and few-shot examples used for all reported benchmarks are not fully transparent, leading to skepticism in the research community.
Identity Consistency
The model exhibits strong identity consistency, correctly identifying itself as a Qwen3-Coder variant in official documentation and API responses. It is transparent about its 'non-thinking' mode (unlike the Qwen3-Max-Thinking variant) and its specific optimization for agentic coding tasks. Versioning is clear across the Qwen3 suite.
License Clarity
The model and its weights are released under the Apache 2.0 license, which is a highly permissive, standard open-source license. The terms for commercial use, modification, and distribution are explicitly clear and consistent across GitHub, Hugging Face, and official blog posts. There are no conflicting proprietary 'look-but-don't-touch' clauses.
Hardware Footprint
VRAM requirements are well-documented for the full model and common quantizations (e.g., 2-bit GGUF for 1M context). The model card provides guidance on using the latest 'transformers' library to avoid architectural errors. However, detailed memory scaling charts for varying batch sizes and the specific accuracy tradeoffs for extreme quantizations (like 2-bit) are primarily community-driven rather than officially documented.
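A rough weight-only memory estimate follows directly from the parameter count and bits per weight. The ~10% overhead pad below is a hypothetical allowance for quantization scales and runtime buffers, not an official figure, and real deployments also need KV-cache and activation memory on top.

```python
def weight_vram_gib(n_params, bits_per_weight, overhead=1.1):
    """Weight-only memory estimate in GiB; `overhead` is an assumed
    ~10% pad for scales/zero-points and buffers, not an official figure."""
    return n_params * bits_per_weight / 8 / 2**30 * overhead

for name, bits in [("fp16/bf16", 16), ("int8", 8), ("int4", 4), ("2-bit GGUF", 2)]:
    print(f"{name:>11}: ~{weight_vram_gib(480e9, bits):,.0f} GiB")
```

Even at 2 bits per weight, the 480B parameter count implies on the order of 120+ GiB for weights alone, which is why multi-GPU or high-memory workstation setups dominate community deployment reports.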
Versioning Drift
The model uses clear naming conventions (e.g., Qwen3-Coder-480B-A35B-Instruct), but a formal, detailed changelog for weight updates or silent 'alignment' patches is not readily accessible. While major releases are well-publicized, the tracking of minor iterative drifts in model behavior over time lacks a centralized, transparent repository.