Active Parameters
3.5B
Context Length
1,000K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
NVIDIA Open Model License
Release Date
15 Dec 2025
Knowledge Cutoff
Nov 2025
Total Expert Parameters
30.0B
Number of Experts
129
Active Experts
6
Attention Structure
Grouped-Query Attention (GQA)
Hidden Dimension Size
2688
Number of Layers
52
Attention Heads
32
Key-Value Heads
2
Activation Function
ReLU2
Normalization
RMS Normalization
Position Embedding
Absolute Position Embedding
NVIDIA Nemotron 3 Nano 30B-A3B is a large language model developed by NVIDIA that combines a hybrid Mixture-of-Experts (MoE) architecture with both Mamba-2 state-space model layers and Transformer attention layers. This design addresses the computational trade-offs traditionally associated with long-context processing while maintaining high accuracy across diverse tasks. The model aims to provide a unified solution for both explicit reasoning and general non-reasoning applications, with configurable reasoning depth that adapts to task requirements.
Architecturally, Nemotron 3 Nano 30B-A3B comprises 52 layers in total: 23 Mamba-2 layers, which are particularly adept at efficient sequential processing and managing extended contexts, and 23 Mixture-of-Experts layers. Each MoE layer is structured with 128 routed experts augmented by 1 shared expert, and activates 6 routed experts per token to limit the compute used per forward pass. The remaining 6 layers use Grouped-Query Attention (GQA), which shares key-value heads across query heads to reduce attention memory. The model uses a hidden dimension of 2688, squared ReLU (ReLU2) as its activation function, and RMSNorm for normalization stability.
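The per-token expert activation described above can be sketched as a standard top-k router. The exact routing function Nemotron uses (softmax placement, load-balancing losses, capacity limits) is not specified here, so this is an illustrative sketch only:

```python
import math

def route_token(router_logits, k=6):
    """Select the top-k routed experts for one token and compute
    softmax weights over the selected experts (top-k-then-softmax,
    one common MoE routing variant; the actual Nemotron router
    may differ). Returns {expert_index: weight}."""
    topk = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in topk]
    z = sum(exps)
    return {i: e / z for i, e in zip(topk, exps)}

# 128 routed experts; the 1 shared expert is always applied in addition.
logits = [0.0] * 128
logits[3], logits[17], logits[42] = 2.0, 1.5, 1.0
weights = route_token(logits, k=6)
```

Each token's output is then the weighted sum of the 6 selected experts' outputs plus the shared expert's output, so only a small slice of the expert parameters is exercised per token.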
Designed for versatile deployment and robust performance, Nemotron 3 Nano 30B-A3B supports a substantial context length of up to 1 million tokens, enabling it to process extensive inputs for complex multi-step workflows, agentic systems, and retrieval-augmented generation (RAG) applications. The model is trained on an extensive corpus of approximately 25 trillion tokens, supporting multilingual interactions across English, Spanish, French, German, Italian, and Japanese, alongside numerous programming languages. This foundation positions the model as a capable component for building specialized AI agents, chatbots, and systems requiring efficient, accurate, and scalable language understanding and generation capabilities.
Nemotron 3 is NVIDIA's family of open models delivering leading efficiency and accuracy for agentic AI applications. Utilizing hybrid Mamba-Transformer MoE architecture with Latent MoE design, the models support up to 1M token context and feature Multi-Token Prediction for improved generation efficiency. The Nano variant outperforms comparable models while maintaining extreme cost-efficiency.
| Benchmark | Score | Rank |
|---|---|---|
| Professional Knowledge (MMLU Pro) | 0.78 | 15 |
| Web Development (WebDev Arena) | 1317 | 38 |
Overall Rank
#65
Coding Rank
#53
Total Score
77 / 100
Nemotron 3 Nano 30B-A3B sets a high standard for transparency in the MoE category, particularly through its detailed architectural disclosures and the provision of a complete reproducibility SDK for benchmarks. The model's clear distinction between active and total parameters is exemplary. However, transparency is slightly hampered by the use of a custom proprietary license and a lack of detailed training compute and environmental impact data.
Architectural Provenance
NVIDIA provides exemplary documentation for the Nemotron 3 Nano architecture. The model is explicitly described as a hybrid Mamba-2 and Transformer Mixture-of-Experts (MoE) model. Technical reports and white papers detail the specific layer composition (23 Mamba-2 layers, 23 MoE layers, and 6 GQA layers). The training methodology, including the use of the Warmup-Stable-Decay (WSD) learning rate schedule and the two-phase pre-training curriculum (diversity phase followed by high-quality phase), is thoroughly documented. The transition from previous generations (Nemotron-H and Nemotron 2 Nano) is also clearly explained.
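The Warmup-Stable-Decay (WSD) schedule mentioned above can be sketched as follows. The warmup and decay fractions, peak learning rate, and linear decay shape are illustrative assumptions, not NVIDIA's published values:

```python
def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.01, decay_frac=0.2):
    """Warmup-Stable-Decay sketch: linear warmup to peak_lr, a long
    flat plateau, then a linear decay to zero over the final portion
    of training. All hyperparameters here are placeholders."""
    warmup = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:
        return peak_lr * step / max(warmup, 1)
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / max(total_steps - decay_start, 1)
```

The appeal of WSD over cosine schedules is that the stable phase lets training be extended (or a decay branch forked off) without committing to a total step count up front.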
Dataset Composition
The model was trained on a massive 25 trillion token corpus. NVIDIA discloses major data categories including web (Common Crawl), code (GitHub), math, and science. Specific datasets like Nemotron-CC-Code-v1 (427.92B tokens) and the InfiniByte cross-domain dataset are named. While exact percentage breakdowns for all 141 datasets are not provided in a single table, the technical report describes the curation process ('efficient data' paradigm) and the use of synthetic data generated via Lynx pipelines and LLM-based filtration (e.g., using Qwen3-30B-A3B for QA pair generation).
Tokenizer Integrity
The tokenizer is publicly available via Hugging Face and the NeMo framework. It supports 20 languages and 43 programming languages, aligning with the model's claimed capabilities. Documentation specifies the use of special tokens for reasoning (<think> and </think> with IDs 12 and 13). While the exact vocabulary size is not prominently featured in marketing summaries, it is verifiable through the provided AutoTokenizer implementation and model configuration files.
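Because outputs can contain a reasoning trace wrapped in these special tokens, downstream code often needs to separate the trace from the final answer. A minimal string-level helper (the function name and approach are my own; in practice one might filter token IDs 12 and 13 directly during decoding):

```python
def strip_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Remove the reasoning trace delimited by <think>...</think>
    (the special tokens noted in the tokenizer documentation),
    returning only the final answer text."""
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1:
        return text.strip()  # no complete trace present
    return (text[:start] + text[end + len(close_tag):]).strip()

answer = strip_reasoning("<think>step 1... step 2...</think>The answer is 42.")
```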
Parameter Density
NVIDIA is highly transparent regarding parameter density, explicitly distinguishing between total and active parameters. The model has 31.6B total parameters with ~3.2B active per forward pass (3.6B including embeddings). The MoE structure is detailed as having 128 routed experts plus 1 shared expert per layer, with a routing mechanism that activates 6 experts per token. This level of detail prevents the common 'parameter inflation' seen in other MoE disclosures.
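The active-vs-total distinction can be made concrete with simple arithmetic. Assuming all experts in a layer are equally sized (a simplification; real layers also carry always-active attention, Mamba, and embedding weights), the fraction of expert parameters touched per token is:

```python
def active_expert_fraction(n_routed=128, n_shared=1, k_active=6):
    """Fraction of a layer's expert parameters used per token:
    the k routed experts selected by the router plus the
    always-active shared expert, over all experts. Assumes
    equally sized experts (an illustrative simplification)."""
    return (k_active + n_shared) / (n_routed + n_shared)

frac = active_expert_fraction()  # 7/129 ≈ 0.054
```

This ~5% activation rate over the expert weights is what drives the gap between the ~30B total and ~3B active parameter counts.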
Training Compute
While NVIDIA specifies the hardware used for inference (H100, A100, H200) and the software framework (Megatron-LM, NeMo), it provides limited information on the total training compute budget in terms of GPU-hours. The training duration (September to December 2025) is known, and the batch size (3072) is disclosed, but a formal carbon footprint calculation or total cost estimate is conspicuously absent from the public technical reports.
Benchmark Reproducibility
Reproducibility is a core focus of the Nemotron 3 release. NVIDIA provides the NeMo Evaluator SDK, which includes the exact YAML configurations, prompt templates, and sampling parameters used for the model card benchmarks. Evaluation results are compared against peers (Qwen3, GPT-OSS) with specified versions. The disclosure of 'Reasoning ON/OFF' modes and their impact on benchmark scores (e.g., AIME 2025 with and without tools) demonstrates a high commitment to verifiable performance claims.
Identity Consistency
The model exhibits strong identity consistency, correctly identifying its version and capabilities (such as its 1M token context window and reasoning traces). It does not attempt to mimic competitor identities and is transparent about its 'Thinking' budget and configurable reasoning depth. The distinction between the 'Base' and 'Instruct' (post-trained) versions is clearly maintained across all documentation.
License Clarity
The model is released under the 'NVIDIA Open Model License'. This is a custom permissive license that explicitly allows commercial use and the creation of derivative works. However, it is not a standard OSI-approved license like Apache 2.0 or MIT, and it contains specific clauses regarding NVIDIA's 'Trustworthy AI' terms. While clear, the use of a proprietary license instead of a standard open-source one introduces some legal complexity for users.
Hardware Footprint
Hardware requirements are well-documented for various configurations. NVIDIA provides VRAM estimates for FP16, INT8, and INT4 (e.g., ~62GB for FP16 at 1k context). It explicitly warns about the memory scaling of the 1M token context window, noting that 24GB consumer cards (like the RTX 4090) can run the model but may crash if the context is set to the full 1M limit without significant quantization. Support for FP8 and its accuracy trade-offs (99% recovery) is also documented.
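The weight-only memory figures can be reproduced with simple arithmetic. This rough estimator counts only the weights and ignores KV cache, Mamba state, and activation memory, which dominate at long context lengths:

```python
def weight_memory_gb(total_params_billions, bits_per_weight):
    """Approximate decimal gigabytes needed to hold the model
    weights alone at a given quantization width. Excludes KV
    cache, Mamba state, and activations."""
    total_bytes = total_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

fp16_gb = weight_memory_gb(31.6, 16)  # ≈ 63.2 GB, near the documented ~62 GB
int4_gb = weight_memory_gb(31.6, 4)   # ≈ 15.8 GB, within a 24 GB consumer card
```

The INT4 figure shows why a 24 GB card can load the weights at all, and why the remaining headroom is quickly consumed by context-dependent state as the window grows toward 1M tokens.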
Versioning Drift
NVIDIA uses a clear naming convention (Nemotron 3 Nano 30B-A3B) and provides data cutoff dates (June 2025 for pre-training, November 2025 for post-training). While a formal semantic versioning changelog for weight updates is not as robust as software versioning, the release of specific checkpoints (BF16, FP8) and the integration with the NeMo framework versioning (e.g., NeMo 25.11.01) provides a reasonable track for developers.