Parameters
3B
Context Length
128K
Modality
Text
Architecture
Dense
License
Apache 2.0
Release Date
8 Jul 2025
Knowledge Cutoff
-
Attention
Attention Structure
Grouped Query Attention
Attention Heads
16
Key-Value Heads
4
Attention Head Dimension
-
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
5,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
Swish
Dimensions
Hidden Dimension Size
2,048
Number of Layers
36
FFN Intermediate Size (Dense)
11,008
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
128,256
SmolLM3-3B, developed by Hugging Face, is a compact large language model (LLM) in the 'Smol' family, engineered for efficiency in resource-constrained environments. This pretrained, open-weights base model combines multilingual understanding, extended context processing, and dual-mode reasoning within a 3-billion-parameter footprint. It is designed to run on edge devices, in mobile applications, and on systems with limited computational resources, and is part of a broader initiative to make capable language understanding and generation more accessible through lightweight models.
Architecturally, SmolLM3-3B is a decoder-only Transformer that builds on the Llama design with several targeted modifications. It adopts Grouped Query Attention (GQA), in which the 16 query heads share 4 key-value heads; this reduces the KV cache during inference relative to standard multi-head attention, with the aim of preserving model quality. It also uses No Positional Encoding (NoPE) layers: rotary position embeddings (RoPE) are removed from every fourth layer, which is intended to improve long-context performance. The model comprises 36 hidden layers with a hidden dimension of 2,048 and 16 attention heads. Input and output embeddings are tied to further reduce the memory footprint.
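To make the KV cache saving from GQA concrete, the following is a minimal sketch that compares cache size with 4 key-value heads against a hypothetical multi-head layout with 16. The head dimension is not listed in the specification above, so the value used here (hidden size divided by the number of attention heads, giving 128) is an assumption, as are the 2-byte FP16 cache entries and the helper name `kv_cache_bytes`.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # The factor 2 accounts for the separate key and value tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


NUM_LAYERS = 36          # from the specification above
NUM_QUERY_HEADS = 16     # from the specification above
NUM_KV_HEADS = 4         # from the specification above
HEAD_DIM = 2048 // NUM_QUERY_HEADS  # assumption: hidden size / attention heads
SEQ_LEN = 128_000        # extended context length

gqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)
# Hypothetical baseline: every query head keeps its own key-value head.
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)

print(f"GQA cache: {gqa_bytes / 1e9:.2f} GB")
print(f"MHA cache: {mha_bytes / 1e9:.2f} GB")
print(f"Reduction: {mha_bytes / gqa_bytes:.0f}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to key-value heads, a factor of 4, regardless of sequence length.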
The training regimen for SmolLM3-3B used a three-stage curriculum over 11.2 trillion tokens drawn from public datasets covering web content, code, mathematics, and reasoning data, which establishes its multilingual and general-purpose capabilities. The model natively supports a context length of 64,000 tokens, extended to 128,000 tokens through YaRN extrapolation. SmolLM3-3B supports tool calling using structured schemas (XML and Python tools), enabling its integration into agent workflows. Its design targets competitive performance in reasoning, knowledge, and multilingual tasks for applications that need efficient language processing across a range of platforms.
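The context extension described above implies a YaRN scaling factor equal to the ratio of the extended length to the native length. The sketch below derives that factor and places it in an illustrative settings dictionary; the key names follow the `rope_scaling` convention used in Hugging Face Transformers configs, but the exact values shipped with the model are not given here and should be treated as assumptions.

```python
NATIVE_CONTEXT = 64_000     # native context length stated above
EXTENDED_CONTEXT = 128_000  # context length after YaRN extrapolation

# YaRN scales positions by the ratio of target length to native length.
yarn_factor = EXTENDED_CONTEXT / NATIVE_CONTEXT

# Illustrative settings only; not copied from the released model config.
rope_scaling = {
    "rope_type": "yarn",
    "factor": yarn_factor,
    "original_max_position_embeddings": NATIVE_CONTEXT,
}

print(rope_scaling)
```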
SmolLM open-weight language models (e.g. SmolLM3)
No evaluation benchmarks for SmolLM3 3B available.
Overall Rank
#71
Coding Rank
-
Total Score
83 / 100
SmolLM3-3B demonstrates a high standard of transparency, particularly regarding its architectural modifications and the composition of its 11.2-trillion-token training corpus. The model's openness is bolstered by the permissive Apache 2.0 license and the disclosure of specific training compute resources. While benchmark reproducibility could be streamlined with more explicit prompt documentation, the overall profile is exemplary for an open-weights release.
Architectural Provenance
The model architecture is extensively documented as a decoder-only Transformer based on the Llama design with specific, well-defined modifications. These include Grouped Query Attention (GQA) with 4 key-value heads and a 'No Positional Encoding' (NoPE) approach applied in a 3:1 layer ratio (three RoPE layers for every NoPE layer). Technical specifications such as 36 hidden layers, a hidden dimension of 2,048, and tied embeddings are publicly available. The training methodology is described as a three-stage curriculum (Stable, Mid-training, and Post-training) with clear objectives for each phase.
Dataset Composition
Hugging Face has provided a high level of transparency regarding the 11.2 trillion token training corpus. The data mixture is broken down by stage (e.g., Stage 1: 85% web, 12% multilingual) and specific public datasets are named, including FineWeb-Edu, DCLM, FineWeb2, and The Stack. The transition between stages and the inclusion of specific reasoning datasets like OpenMathReasoning and synthetic data from Qwen3-32B are documented. While the exact per-file filtering code isn't fully public, the methodology and ratios are exemplary for the industry.
Tokenizer Integrity
The tokenizer is fully accessible via the Hugging Face library and the 'smollm' GitHub repository. It features a vocabulary size of 128,256 tokens, matching the specification listed above. Documentation confirms support for six primary languages (English, French, Spanish, German, Italian, Portuguese) and includes the chat template logic for dual-mode reasoning, which is verifiable through the public 'tokenizer_config.json'.
Parameter Density
The model is a dense architecture with a clearly stated 3.0 billion parameters. Detailed architectural breakdowns are provided, including the hidden size (2048), intermediate size (11008), and the specific configuration of attention heads (16 query, 4 KV). Because it is a dense model, there is no ambiguity regarding active vs. total parameters, and the impact of architectural choices like tied embeddings on the parameter count is explicitly mentioned.
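As a rough cross-check of the stated parameter count, the listed dimensions can be plugged into a standard Llama-style layer layout. This is a sketch under several assumptions that are not confirmed above: a gated feed-forward block with three projection matrices, no bias terms, two RMSNorm weight vectors per layer plus a final one, and a head dimension of 128 (hidden size divided by attention heads). The function name `dense_decoder_params` is hypothetical.

```python
def dense_decoder_params(vocab, hidden, layers, q_heads, kv_heads, head_dim, ffn, tied=True):
    """Estimate parameters of a Llama-style dense decoder (no biases assumed)."""
    embed = vocab * hidden
    attn = (
        hidden * q_heads * head_dim         # query projection
        + 2 * hidden * kv_heads * head_dim  # key and value projections
        + q_heads * head_dim * hidden       # output projection
    )
    mlp = 3 * hidden * ffn                  # gate, up, and down projections (assumed gated FFN)
    norms = 2 * hidden                      # two RMSNorm weight vectors per layer
    total = embed + layers * (attn + mlp + norms) + hidden  # plus final norm
    if not tied:
        total += embed                      # separate output embedding matrix
    return total


estimate = dense_decoder_params(
    vocab=128_256, hidden=2048, layers=36,
    q_heads=16, kv_heads=4, head_dim=128, ffn=11_008, tied=True,
)
print(f"{estimate:,} parameters (~{estimate / 1e9:.2f}B)")
```

Under these assumptions the estimate comes to roughly 3.08 billion parameters, in line with the stated 3B scale; untying the embeddings would add about 0.26 billion more.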
Training Compute
Hugging Face disclosed the specific hardware used (384 H100 GPUs) and the training duration (24 days), totaling approximately 220,000 GPU hours. The training framework (nanotron) and data processing tools (datatrove) are also public. While a specific carbon footprint calculation or exact dollar cost was not provided in the primary model card, the hardware and time metrics allow for high-fidelity third-party estimation.
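The GPU-hour figure follows directly from the disclosed hardware and duration, as the short calculation below shows. It assumes all 384 GPUs ran continuously for the full 24 days, which is a simplification.

```python
NUM_GPUS = 384       # H100 GPUs, as disclosed
TRAINING_DAYS = 24   # training duration, as disclosed
HOURS_PER_DAY = 24

# Assumes every GPU was busy for the entire run.
gpu_hours = NUM_GPUS * TRAINING_DAYS * HOURS_PER_DAY
print(f"{gpu_hours:,} GPU hours")
```

The result, 221,184 GPU hours, is consistent with the approximate figure of 220,000 quoted above.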
Benchmark Reproducibility
The model release includes results for a wide range of standard benchmarks (HellaSwag, ARC, MMLU-Pro, etc.) and specifies the use of the 'lighteval' framework for evaluation. However, while the evaluation datasets are listed in a public collection, the exact prompt versions and few-shot configurations for every single reported score are not consolidated in a single 'reproducibility' file, requiring some effort to reconstruct from the lighteval configurations.
Identity Consistency
SmolLM3-3B exhibits high identity consistency, correctly identifying its version and origin in official documentation and through its specialized chat template. The model is transparent about its dual-mode reasoning capabilities (think/no-think) and its limitations as a 3B parameter model. There are no documented instances of the model claiming to be a competitor's product or misrepresenting its scale.
License Clarity
The model is released under the Apache 2.0 license, which is a standard, highly permissive open-source license. There are no conflicting terms, and commercial use, modification, and distribution are explicitly permitted. The license applies clearly to both the model weights and the associated code in the GitHub repository.
Hardware Footprint
Official documentation provides clear guidance on hardware requirements, noting that the model can run on devices with as little as 4 GB to 8 GB of RAM. VRAM usage for inference is well-understood given the 3B parameter count (~6 GB in FP16), and the model card explicitly mentions support for quantization (4-bit/8-bit) via bitsandbytes and llama.cpp, with community-verified benchmarks for these formats.
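The FP16 figure above can be reproduced, and extended to the quantized formats, with a simple weights-only estimate. This sketch uses a nominal 3 billion parameters and ignores quantization metadata, activations, and the KV cache, so real usage will be higher; the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 3_000_000_000  # nominal 3B parameter count

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```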
Versioning Drift
The model follows a clear versioning lineage (SmolLM -> SmolLM2 -> SmolLM3). Changes between versions, such as the increase in context length and the addition of reasoning modes, are well-documented in blog posts and commit histories. While it lacks a formal 'semantic versioning' changelog for minor weight updates, the major architectural and data shifts are transparently communicated.