Active Parameters
3.5B
Context Length
1,000K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
NVIDIA Open Model License
Release Date
15 Dec 2025
Knowledge Cutoff
Nov 2025
Total Expert Parameters
30.0B
Number of Experts
129
Active Experts
6
Attention Structure
Grouped-Query Attention (GQA)
Hidden Dimension Size
2688
Number of Layers
52
Attention Heads
32
Key-Value Heads
2
Activation Function
ReLU2
Normalization
RMS Normalization
Position Embedding
Absolute Position Embedding
NVIDIA Nemotron 3 Nano 30B-A3B is a large language model developed by NVIDIA that combines a hybrid Mixture-of-Experts (MoE) architecture with both Mamba-2 state-space model layers and Transformer attention layers. This design addresses the computational trade-offs traditionally associated with long-context processing while maintaining high accuracy across diverse tasks. The model aims to provide a unified solution for both explicit reasoning and general non-reasoning applications, with configurable reasoning depth that adapts to task requirements.
Architecturally, Nemotron 3 Nano 30B-A3B comprises 52 layers in total: 23 Mamba-2 layers, which are particularly adept at efficient sequential processing and managing extended contexts, and 23 Mixture-of-Experts layers. Each MoE layer is structured with 128 routed experts augmented by 1 shared expert, and activates 6 routed experts per token to limit the compute used per forward pass. The remaining 6 layers use Grouped-Query Attention (GQA), which shares key-value heads across query heads to reduce attention memory. The model uses a hidden dimension of 2688, squared ReLU (ReLU2) as its activation function, and RMSNorm for normalization stability.
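The per-token expert activation described above can be sketched as a standard top-k router. The exact routing function Nemotron uses (softmax placement, load-balancing losses, capacity limits) is not specified here, so this is an illustrative sketch only:

```python
import math

def route_token(router_logits, k=6):
    """Select the top-k routed experts for one token and compute
    softmax weights over the selected experts (top-k-then-softmax,
    one common MoE routing variant; the actual Nemotron router
    may differ). Returns {expert_index: weight}."""
    topk = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in topk]
    z = sum(exps)
    return {i: e / z for i, e in zip(topk, exps)}

# 128 routed experts; the 1 shared expert is always applied in addition.
logits = [0.0] * 128
logits[3], logits[17], logits[42] = 2.0, 1.5, 1.0
weights = route_token(logits, k=6)
```

Each token's output is then the weighted sum of the 6 selected experts' outputs plus the shared expert's output, so only a small slice of the expert parameters is exercised per token.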
Designed for versatile deployment and robust performance, Nemotron 3 Nano 30B-A3B supports a substantial context length of up to 1 million tokens, enabling it to process extensive inputs for complex multi-step workflows, agentic systems, and retrieval-augmented generation (RAG) applications. The model is trained on an extensive corpus of approximately 25 trillion tokens, supporting multilingual interactions across English, Spanish, French, German, Italian, and Japanese, alongside numerous programming languages. This foundation positions the model as a capable component for building specialized AI agents, chatbots, and systems requiring efficient, accurate, and scalable language understanding and generation capabilities.
Nemotron 3 is NVIDIA's family of open models delivering leading efficiency and accuracy for agentic AI applications. Utilizing hybrid Mamba-Transformer MoE architecture with Latent MoE design, the models support up to 1M token context and feature Multi-Token Prediction for improved generation efficiency. The Nano variant outperforms comparable models while maintaining extreme cost-efficiency.
| Benchmark | Score | Rank |
|---|---|---|
| Professional Knowledge (MMLU Pro) | 0.78 | 15 |
| Web Development (WebDev Arena) | 1317 | 38 |
Overall Rank
#65
Coding Rank
#53
Total Score
77 / 100
Nemotron 3 Nano 30B-A3B sets a high standard for transparency in the MoE category, particularly through its detailed architectural disclosures and the provision of a complete reproducibility SDK for benchmarks. The model's clear distinction between active and total parameters is exemplary. However, transparency is slightly hampered by the use of a custom proprietary license and a lack of detailed training compute and environmental impact data.
Architectural Provenance
NVIDIA provides exemplary documentation for the Nemotron 3 Nano architecture. The model is explicitly described as a hybrid Mamba-2 and Transformer Mixture-of-Experts (MoE) model. Technical reports and white papers detail the specific layer composition (23 Mamba-2 layers, 23 MoE layers, and 6 GQA layers). The training methodology, including the use of the Warmup-Stable-Decay (WSD) learning rate schedule and the two-phase pre-training curriculum (diversity phase followed by high-quality phase), is thoroughly documented. The transition from previous generations (Nemotron-H and Nemotron 2 Nano) is also clearly explained.
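The Warmup-Stable-Decay (WSD) schedule mentioned above can be sketched as follows. The warmup and decay fractions, peak learning rate, and linear decay shape are illustrative assumptions, not NVIDIA's published values:

```python
def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.01, decay_frac=0.2):
    """Warmup-Stable-Decay sketch: linear warmup to peak_lr, a long
    flat plateau, then a linear decay to zero over the final portion
    of training. All hyperparameters here are placeholders."""
    warmup = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:
        return peak_lr * step / max(warmup, 1)
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / max(total_steps - decay_start, 1)
```

The appeal of WSD over cosine schedules is that the stable phase lets training be extended (or a decay branch forked off) without committing to a total step count up front.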
Dataset Composition
The model was trained on a massive 25 trillion token corpus. NVIDIA discloses major data categories including web (Common Crawl), code (GitHub), math, and science. Specific datasets like Nemotron-CC-Code-v1 (427.92B tokens) and the InfiniByte cross-domain dataset are named. While exact percentage breakdowns for all 141 datasets are not provided in a single table, the technical report describes the curation process ('efficient data' paradigm) and the use of synthetic data generated via Lynx pipelines and LLM-based filtration (e.g., using Qwen3-30B-A3B for QA pair generation).
Tokenizer Integrity
The tokenizer is publicly available via Hugging Face and the NeMo framework. It supports 20 languages and 43 programming languages, aligning with the model's claimed capabilities. Documentation specifies the use of special tokens for reasoning (<think> and </think> with IDs 12 and 13). While the exact vocabulary size is not prominently featured in marketing summaries, it is verifiable through the provided AutoTokenizer implementation and model configuration files.
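Because outputs can contain a reasoning trace wrapped in these special tokens, downstream code often needs to separate the trace from the final answer. A minimal string-level helper (the function name and approach are my own; in practice one might filter token IDs 12 and 13 directly during decoding):

```python
def strip_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Remove the reasoning trace delimited by <think>...</think>
    (the special tokens noted in the tokenizer documentation),
    returning only the final answer text."""
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1:
        return text.strip()  # no complete trace present
    return (text[:start] + text[end + len(close_tag):]).strip()

answer = strip_reasoning("<think>step 1... step 2...</think>The answer is 42.")
```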
Parameter Density
NVIDIA is highly transparent regarding parameter density, explicitly distinguishing between total and active parameters. The model has 31.6B total parameters with ~3.2B active per forward pass (3.6B including embeddings). The MoE structure is detailed as having 128 routed experts plus 1 shared expert per layer, with a routing mechanism that activates 6 experts per token. This level of detail prevents the common 'parameter inflation' seen in other MoE disclosures.
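The active-vs-total distinction can be made concrete with simple arithmetic. Assuming all experts in a layer are equally sized (a simplification; real layers also carry always-active attention, Mamba, and embedding weights), the fraction of expert parameters touched per token is:

```python
def active_expert_fraction(n_routed=128, n_shared=1, k_active=6):
    """Fraction of a layer's expert parameters used per token:
    the k routed experts selected by the router plus the
    always-active shared expert, over all experts. Assumes
    equally sized experts (an illustrative simplification)."""
    return (k_active + n_shared) / (n_routed + n_shared)

frac = active_expert_fraction()  # 7/129 ≈ 0.054
```

This ~5% activation rate over the expert weights is what drives the gap between the ~30B total and ~3B active parameter counts.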
Training Compute
While NVIDIA specifies the hardware used for inference (H100, A100, H200) and the software framework (Megatron-LM, NeMo), it provides limited information on the total training compute budget in terms of GPU-hours. The training duration (September to December 2025) is known, and the batch size (3072) is disclosed, but a formal carbon footprint calculation or total cost estimate is conspicuously absent from the public technical reports.
Benchmark Reproducibility
Reproducibility is a core focus of the Nemotron 3 release. NVIDIA provides the NeMo Evaluator SDK, which includes the exact YAML configurations, prompt templates, and sampling parameters used for the model card benchmarks. Evaluation results are compared against peers (Qwen3, GPT-OSS) with specified versions. The disclosure of 'Reasoning ON/OFF' modes and their impact on benchmark scores (e.g., AIME 2025 with and without tools) demonstrates a high commitment to verifiable performance claims.
Identity Consistency
The model exhibits strong identity consistency, correctly identifying its version and capabilities (such as its 1M token context window and reasoning traces). It does not attempt to mimic competitor identities and is transparent about its 'Thinking' budget and configurable reasoning depth. The distinction between the 'Base' and 'Instruct' (post-trained) versions is clearly maintained across all documentation.
License Clarity
The model is released under the 'NVIDIA Open Model License'. This is a custom permissive license that explicitly allows commercial use and the creation of derivative works. However, it is not a standard OSI-approved license like Apache 2.0 or MIT, and it contains specific clauses regarding NVIDIA's 'Trustworthy AI' terms. While clear, the use of a proprietary license instead of a standard open-source one introduces some legal complexity for users.
Hardware Footprint
Hardware requirements are well-documented for various configurations. NVIDIA provides VRAM estimates for FP16, INT8, and INT4 (e.g., ~62GB for FP16 at 1k context). It explicitly warns about the memory scaling of the 1M token context window, noting that 24GB consumer cards (like the RTX 4090) can run the model but may crash if the context is set to the full 1M limit without significant quantization. Support for FP8 and its accuracy trade-offs (99% recovery) is also documented.
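The weight-only memory figures can be reproduced with simple arithmetic. This rough estimator counts only the weights and ignores KV cache, Mamba state, and activation memory, which dominate at long context lengths:

```python
def weight_memory_gb(total_params_billions, bits_per_weight):
    """Approximate decimal gigabytes needed to hold the model
    weights alone at a given quantization width. Excludes KV
    cache, Mamba state, and activations."""
    total_bytes = total_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

fp16_gb = weight_memory_gb(31.6, 16)  # ≈ 63.2 GB, near the documented ~62 GB
int4_gb = weight_memory_gb(31.6, 4)   # ≈ 15.8 GB, within a 24 GB consumer card
```

The INT4 figure shows why a 24 GB card can load the weights at all, and why the remaining headroom is quickly consumed by context-dependent state as the window grows toward 1M tokens.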
Versioning Drift
NVIDIA uses a clear naming convention (Nemotron 3 Nano 30B-A3B) and provides data cutoff dates (June 2025 for pre-training, November 2025 for post-training). While a formal semantic versioning changelog for weight updates is not as robust as software versioning, the release of specific checkpoints (BF16, FP8) and the integration with the NeMo framework versioning (e.g., NeMo 25.11.01) provides a reasonable track for developers.