Qwen3 Next 80B A3B

Active Parameters

3B (80B total)

Context Length

66K

Modality

Reasoning

Architecture

Mixture of Experts (MoE)

License

Apache-2.0

Release Date

1 Feb 2026

Knowledge Cutoff

Jun 2025

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

16

Key-Value Heads

2

Attention Head Dimension

256

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

10,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

2,048

Number of Layers

48

FFN Intermediate Size (Dense)

512

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

151,936

Mixture of Experts

Total Expert Parameters

79.0B

Number of Experts

512

Active Experts

10

Shared Experts

1

FFN Intermediate Size (per Expert)

512

Dense Layers Before MoE

-

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE · Hidden: 2k · Context: 66K · Vocab: 151.9k) → 48 layers × [RMSNorm (Pre-Attention) → Grouped-Query Attention (16 Q / 2 KV heads, head dim: 256) → + → RMSNorm (Pre-FFN) → Sparse MoE FFN (10/512 experts, SwiGLU, intermediate: 512) → +] → Final RMSNorm → Output Logits

Qwen3 Next 80B A3B

Qwen3-Next-80B-A3B is a sparse Mixture-of-Experts (MoE) foundation model developed by Alibaba's Qwen team. It belongs to the Qwen3-Next series, which is designed for long-context sequence modeling and parameter efficiency at scale. The model uses a hybrid attention mechanism that combines Gated DeltaNet with Gated Attention, which lets it maintain performance over long token sequences while avoiding much of the quadratic cost of standard Transformer attention.

The technical architecture employs a high-sparsity MoE layout consisting of 48 layers with a hidden dimension of 2048. While the model contains 80 billion total parameters, its gating mechanism activates only approximately 3 billion parameters per token during inference. This sparse activation strategy, combined with a total of 512 experts and a multi-token prediction (MTP) objective, facilitates improved throughput and reduced FLOPs per token. The model also incorporates stability-focused architectural refinements, such as zero-centered and weight-decayed layer normalization, to ensure robust convergence during both pre-training on 15 trillion tokens and subsequent reinforcement learning stages.
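To make the sparse activation concrete, here is a minimal sketch of top-k expert routing with 512 experts and 10 selected per token. The source does not describe the actual router, so the softmax over the selected logits, the `route` helper, and the random logits are illustrative assumptions only.

```python
import math
import random

NUM_EXPERTS = 512  # from the spec sheet
TOP_K = 10         # active (routed) experts per token, from the spec sheet


def route(logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
# Hypothetical router scores for a single token.
router_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route(router_logits)
print(len(selected), "of", NUM_EXPERTS, "experts run for this token")
```

Only the selected experts' feed-forward blocks are evaluated for the token, which is why per-token compute tracks the active parameter count rather than the total.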

Optimized for complex reasoning and agentic workflows, Qwen3-Next-80B-A3B is capable of processing a native context window of 262,144 tokens, which can be extended to over 1 million tokens using specialized scaling techniques like YaRN. Its primary use cases include multi-step logical analysis, mathematical proofs, and code synthesis. By separating the 'Thinking' variant, which outputs structured reasoning traces, from the standard 'Instruct' variant, the model provides specialized paths for either high-efficiency general-purpose interaction or intensive, transparent problem-solving tasks.
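The relationship between the native window and the extended window can be checked with simple arithmetic. The sketch below assumes an integer scaling factor applied to the native context length; YaRN factors need not be integers, and `yarn_factor` is a hypothetical helper rather than part of any library.

```python
import math

NATIVE_CONTEXT = 262_144  # native context window stated above


def yarn_factor(target_tokens, native_tokens=NATIVE_CONTEXT):
    """Smallest integer scaling factor whose extended window covers target_tokens."""
    return math.ceil(target_tokens / native_tokens)


factor = yarn_factor(1_000_000)
extended = factor * NATIVE_CONTEXT
print(factor, extended)  # 4 1048576
```

Under this assumption, a factor of 4 is enough to reach just over 1 million tokens.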

About Qwen 3

The Alibaba Qwen 3 model family comprises dense and Mixture-of-Experts (MoE) architectures, with parameter counts from 0.6B to 235B. Key innovations include a hybrid reasoning system, offering 'thinking' and 'non-thinking' modes for adaptive processing, and support for extensive context windows, enhancing efficiency and scalability.


Evaluation Benchmarks

Rank

#132

| Benchmark | Score | Rank |
|---|---|---|
| - | 0.74 | 31 |
| Graduate-Level QA (GPQA) | 0.772 | 33 |
| Web Development (WebDev Arena) | 1402 | 35 |
| - | 0.50 | 36 |
| - | 0.68 | 41 |
| - | 0.58 | 42 |
| General Text (Text Arena) | 1402 | 51 |
| Agentic Coding (LiveBench Agentic) | 0.10 | 53 |
| Professional Knowledge (MMLU Pro) | 0.83 | 56 |

Rankings

Overall Rank

#132

Coding Rank

#77

Model Integrity

Total Score

B+

72 / 100

Qwen3 Next 80B A3B Model Integrity Report

Audit Note

Qwen3-Next-80B-A3B exhibits strong transparency in its architectural design and parameter density, providing clear distinctions between total and active parameters. Its use of a standard Apache 2.0 license and detailed hardware requirements facilitates accessibility for developers. However, significant gaps remain regarding the specific composition of its 15-trillion-token training set and the disclosure of absolute compute resources and environmental impact data.

Upstream

21.5 / 30

Architectural Provenance

8.0 / 10

The model's architecture is extensively documented in the Qwen3 Technical Report (arXiv:2505.09388) and official model cards. It is a hybrid architecture (Qwen3-Next) that combines Gated DeltaNet (linear attention) and Gated Attention (standard attention) in a 3:1 ratio across 48 layers. The pre-training methodology, including the use of Multi-Token Prediction (MTP) and stability optimizations like zero-centered RMSNorm, is clearly described. While the transition from the standard Qwen3 architecture is well-explained, some internal routing logic for the 512 experts remains proprietary.
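As a rough illustration of the 3:1 ratio, the sketch below lays out 48 layers so that every fourth layer uses Gated Attention. The exact interleaving is an assumption made for illustration, and `layer_schedule` is a hypothetical helper; only the ratio and the layer count come from the text.

```python
from collections import Counter

NUM_LAYERS = 48      # from the spec sheet
LINEAR_PER_FULL = 3  # 3:1 ratio of Gated DeltaNet to Gated Attention layers


def layer_schedule(num_layers=NUM_LAYERS, linear_per_full=LINEAR_PER_FULL):
    """Assumed layout: each block of four layers ends with a Gated Attention layer."""
    period = linear_per_full + 1
    return [
        "gated_attention" if (i + 1) % period == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]


schedule = layer_schedule()
counts = Counter(schedule)
print(counts["gated_deltanet"], counts["gated_attention"])  # 36 12
```

With this layout, 36 layers use linear attention and only 12 pay the quadratic cost of standard attention.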

Dataset Composition

4.5 / 10

Alibaba discloses that the model was trained on a 15 trillion token subset of the 36 trillion token Qwen3 corpus. While the total token count and the 'carefully curated' nature of the subset are mentioned, there is a lack of granular breakdown regarding specific data sources (e.g., exact percentages of web, code, or books). The data collection and filtering methodologies are described in general terms rather than with reproducible specifics, and no sample data or detailed source lists are provided.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly accessible via the Hugging Face 'transformers' library and official repositories. It features a vocabulary size of 151,936 tokens, which is explicitly stated in the technical documentation and configuration files. The tokenization approach is consistent with the Qwen family's multilingual support (119 languages), and the vocabulary is verified to align with the training data requirements for high-efficiency long-context modeling.

Model

27.5 / 40

Parameter Density

8.5 / 10

The model provides exemplary transparency regarding its sparse MoE architecture. It clearly distinguishes between the 80 billion total parameters and the ~3 billion active parameters per token (an activation ratio of approximately 3.7%). The architectural breakdown (48 layers, 512 experts, 10 routed + 1 shared expert) is precisely documented. This prevents the common 'parameter inflation' marketing trap by being upfront about actual inference compute requirements.
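The activation ratio follows directly from the rounded figures quoted above. This small sketch uses the rounded 80B and 3B values, so the result is approximate; the expert ratio counts routed experts only.

```python
TOTAL_PARAMS = 80e9   # total parameters (rounded, from the text)
ACTIVE_PARAMS = 3e9   # active parameters per token (rounded, from the text)
NUM_EXPERTS = 512
ROUTED_ACTIVE = 10    # routed experts selected per token

param_ratio = ACTIVE_PARAMS / TOTAL_PARAMS
expert_ratio = ROUTED_ACTIVE / NUM_EXPERTS

print(f"parameters active per token: {param_ratio:.2%}")   # 3.75%
print(f"routed experts active per token: {expert_ratio:.2%}")  # 1.95%
```

The share of active parameters is higher than the share of active experts because attention, embedding, and other non-expert weights run for every token.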

Training Compute

3.5 / 10

While Alibaba provides relative compute metrics—stating the model required less than 10% of the training cost of the dense Qwen3-32B and 80% of the GPU hours of Qwen3-30B-A3B—it fails to disclose absolute hardware hours (e.g., H100 hours), specific hardware cluster configurations, or the total carbon footprint. The lack of concrete environmental impact data and specific energy consumption figures results in a lower score.

Benchmark Reproducibility

6.0 / 10

Alibaba provides a wide array of benchmark results (MMLU-Pro, GPQA, AIME25, etc.) and specifies the use of GPT-4.1 as an evaluator for win rates to aid reproducibility. However, the exact prompts and few-shot examples used for all evaluations are not fully disclosed in a centralized, reproducible repository. While some evaluation code is available via 'EvalScope', the full pipeline for third-party verification of all claimed scores is missing.

Identity Consistency

9.5 / 10

The model demonstrates high identity consistency, correctly identifying itself as part of the Qwen3-Next series. It maintains a clear distinction between its 'Instruct' and 'Thinking' variants, with the latter explicitly outputting structured reasoning traces. There are no reported instances of the model claiming to be a competitor's product or misrepresenting its versioning during standard interaction.

Downstream

22.5 / 30

License Clarity

10.0 / 10

The model is released under the Apache 2.0 license, which is a standard, highly permissive open-source license. The terms for commercial use, modification, and distribution are clear and lack the restrictive 'monthly active user' or 'proprietary platform' clauses often found in other 'open' weights releases. Documentation across Hugging Face, ModelScope, and GitHub consistently cites this license.

Hardware Footprint

7.5 / 10

Hardware requirements are well-documented for various precisions. Official documentation and community guides (e.g., on Hugging Face and NVIDIA NIM) provide specific VRAM requirements for FP16 (~160GB), FP8 (~76-80GB), and INT4 (~48-50GB). The impact of context length on VRAM (KV cache scaling) is also addressed, though some users report stability issues at maximum context lengths that aren't fully detailed in the primary documentation.
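A back-of-the-envelope estimate shows where these figures come from. The sketch below counts raw weight storage only and, separately, a KV cache size. It assumes that only the 12 Gated Attention layers implied by the 3:1 layout keep a KV cache, that cache entries are stored in 16 bits, and that the Gated DeltaNet state is negligible; none of these assumptions is confirmed by the source.

```python
TOTAL_PARAMS = 80e9  # total parameters (rounded)
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(precision):
    """Raw weight storage in GB, ignoring quantization scales and runtime overhead."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[precision] / 1e9


def kv_cache_gib(context_tokens, kv_layers=12, kv_heads=2, head_dim=256, bytes_per_value=2):
    """KV cache size in GiB; kv_layers=12 is an assumption (see lead-in)."""
    per_token = 2 * kv_layers * kv_heads * head_dim * bytes_per_value  # keys and values
    return context_tokens * per_token / 2**30


for precision in BYTES_PER_PARAM:
    print(precision, weight_gb(precision), "GB")

print(kv_cache_gib(262_144), "GiB of KV cache at the native context length")
```

The FP16 and FP8 estimates line up with the documented figures, while the INT4 estimate comes out lower than the documented range, presumably because the documented numbers include overhead that this sketch ignores.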

Versioning Drift

5.0 / 10

The model uses a versioning system (Qwen3-Next-80B-A3B), but there is no public, detailed changelog tracking minor weight updates or 'silent' alignment tuning. While the release date and variant types are clear, the infrastructure for tracking performance drift over time or accessing specific historical checkpoints beyond the major releases is not robustly implemented.
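The category subtotals in the report can be cross-checked against the individual scores. The sketch below simply re-adds the figures listed above; the `SCORES` dictionary is a transcription of the report, not an official data format.

```python
SCORES = {
    "Upstream": {
        "Architectural Provenance": 8.0,
        "Dataset Composition": 4.5,
        "Tokenizer Integrity": 9.0,
    },
    "Model": {
        "Parameter Density": 8.5,
        "Training Compute": 3.5,
        "Benchmark Reproducibility": 6.0,
        "Identity Consistency": 9.5,
    },
    "Downstream": {
        "License Clarity": 10.0,
        "Hardware Footprint": 7.5,
        "Versioning Drift": 5.0,
    },
}

subtotals = {group: sum(items.values()) for group, items in SCORES.items()}
total = sum(subtotals.values())

print(subtotals)  # {'Upstream': 21.5, 'Model': 27.5, 'Downstream': 22.5}
print(total)      # 71.5
```

The subtotals match the report (21.5 / 30, 27.5 / 40, 22.5 / 30), and the sum of 71.5 is consistent with the displayed total of 72 / 100 after rounding.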
