
SmolLM3 3B

Parameters

3B

Context Length

128K

Modality

Text

Architecture

Dense

License

Apache 2.0

Release Date

8 Jul 2025

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

16

Key-Value Heads

4

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

5,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

Swish

Dimensions

Hidden Dimension Size

2,048

Number of Layers

36

FFN Intermediate Size (Dense)

11,008

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

128,256

Architecture Diagram

[Architecture diagram: Input Tokens → Token Embedding (Position: RoPE · Hidden: 2k · Context: 128k · Vocab: 128.3k) → 36 layers of (Pre-Attention RMSNorm → Grouped-Query Attention, 16 Q / 4 KV heads, head dim 128 → residual add → Pre-FFN RMSNorm → Feed-Forward Network, Swish, intermediate 11k → residual add) → Final RMSNorm → Output Logits]

SmolLM3 3B

SmolLM3-3B, developed by Hugging Face, is a compact large language model (LLM) in the 'Smol' family, built for efficiency in resource-constrained environments. This pretrained, open-weights model combines multilingual understanding, long-context processing, and dual-mode reasoning in a 3-billion-parameter footprint. It targets edge devices, mobile applications, and other systems with limited compute, as part of a broader effort to make capable language models lightweight and widely accessible.

Architecturally, SmolLM3-3B is a decoder-only Transformer that builds on the Llama design with targeted modifications. It uses Grouped Query Attention (GQA), in which 16 query heads share 4 key-value heads, shrinking the KV cache during inference relative to standard multi-head attention with little reported loss in quality. It also uses No Positional Encoding (NoPE) layers: rotary position embeddings (RoPE) are removed from every fourth layer, a change intended to improve long-context performance. The model has 36 hidden layers, a hidden dimension of 2048, and 16 attention heads. Input and output embeddings are tied to further reduce the memory footprint.
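As a rough illustration of the KV cache saving described above, the sketch below compares GQA (4 key-value heads) with a hypothetical multi-head baseline (16 key-value heads) using the figures on this page. It assumes a head dimension of 128 (2048 / 16), an FP16 cache (2 bytes per value), a single sequence, and that "128K" means 131,072 tokens; the function name is illustrative only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor of 2 covers the separate key and value tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


LAYERS, Q_HEADS, KV_HEADS = 36, 16, 4
HEAD_DIM = 2048 // Q_HEADS   # 128, assuming hidden size is split evenly across heads
SEQ_LEN = 128 * 1024         # assumption: "128K" context = 131,072 tokens

gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN)

print(f"GQA cache: {gqa / 2**30:.1f} GiB")
print(f"MHA cache: {mha / 2**30:.1f} GiB")
print(f"Reduction: {mha // gqa}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to key-value heads, which is why the saving matters most at long context lengths.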

SmolLM3-3B was trained with a three-stage curriculum on 11.2 trillion tokens drawn from public datasets covering web content, code, mathematics, and reasoning data. This pretraining underpins its multilingual and general-purpose capabilities. The native context length is 64,000 tokens, extended to 128,000 tokens through YaRN extrapolation. The model also supports tool calling with structured schemas (XML and Python tools), which allows it to be integrated into agent workflows. Its design targets competitive performance in reasoning, knowledge retention, and multilingual tasks for applications that need efficient language processing across a range of platforms.
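The context extension above implies a scaling factor of 2. The sketch below derives it from the two context lengths quoted in the text and places it in a hypothetical configuration fragment. The dictionary keys follow the naming convention commonly seen in Hugging Face `transformers` RoPE-scaling configs, but they are an assumption here and should be checked against the model card before use.

```python
NATIVE_CONTEXT = 64_000     # native context length quoted in the text
EXTENDED_CONTEXT = 128_000  # context length after YaRN extrapolation

# Ratio between the extended and the native context window.
yarn_factor = EXTENDED_CONTEXT / NATIVE_CONTEXT

# Hypothetical config fragment (key names are an assumption, not taken from this page).
rope_scaling = {
    "rope_type": "yarn",
    "factor": yarn_factor,
    "original_max_position_embeddings": NATIVE_CONTEXT,
}

print(rope_scaling)
```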

About SmolLM Family

SmolLM open-weight language models (e.g. SmolLM3)



Evaluation Benchmarks

Rank

#71

No evaluation benchmarks for SmolLM3 3B available.

Rankings

Overall Rank

#71

Coding Rank

-

Model Integrity

Total Score

B+

83 / 100

SmolLM3 3B Model Integrity Report

Audit Note

SmolLM3-3B demonstrates a high standard of transparency, particularly regarding its architectural modifications and the composition of its 11.2-trillion-token training corpus. Its openness is reinforced by the permissive Apache 2.0 license and the disclosure of training compute resources. Benchmark reproducibility could be streamlined with more explicit prompt documentation, but the overall profile is exemplary for an open-weights release.

Upstream

25.5 / 30

Architectural Provenance

8.5 / 10

The model architecture is extensively documented as a decoder-only Transformer based on the Llama design with specific, well-defined modifications. These include Grouped Query Attention (GQA) with 4 key-value heads and a unique 'No Positional Encoding' (NoPE) approach applied in a 3:1 layer ratio. Technical specifications such as 36 hidden layers, a hidden dimension of 2048, and tied embeddings are publicly available. The training methodology is described as a three-stage curriculum (Stable, Mid-training, and Post-training) with clear objectives for each phase.
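The 3:1 ratio follows directly from removing RoPE in every fourth of the 36 layers. The sketch below enumerates one possible layout; which layer in each group of four drops RoPE (here, the last one, using 0-based indices 3, 7, 11, ...) is an assumption made for illustration.

```python
NUM_LAYERS = 36
NOPE_INTERVAL = 4  # RoPE is removed from every fourth layer

# Assumption: the fourth layer of each group of four is the NoPE layer.
uses_rope = [(i + 1) % NOPE_INTERVAL != 0 for i in range(NUM_LAYERS)]

rope_layers = sum(uses_rope)
nope_layers = NUM_LAYERS - rope_layers
nope_indices = [i for i, flag in enumerate(uses_rope) if not flag]

print(f"RoPE layers: {rope_layers}, NoPE layers: {nope_layers}")
print(f"NoPE layer indices: {nope_indices}")
```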

Dataset Composition

8.0 / 10

Hugging Face has provided a high level of transparency regarding the 11.2 trillion token training corpus. The data mixture is broken down by stage (e.g., Stage 1: 85% web, 12% multilingual) and specific public datasets are named, including FineWeb-Edu, DCLM, FineWeb2, and The Stack. The transition between stages and the inclusion of specific reasoning datasets like OpenMathReasoning and synthetic data from Qwen3-32B are documented. While the exact per-file filtering code isn't fully public, the methodology and ratios are exemplary for the industry.

Tokenizer Integrity

9.0 / 10

The tokenizer is fully accessible via the Hugging Face library and the 'smollm' GitHub repository. It has a vocabulary of 128,256 tokens, consistent with the Vocabulary Size listed in the technical specifications. Documentation confirms support for six primary languages (English, French, Spanish, German, Italian, Portuguese) and includes the chat template logic for dual-mode reasoning, which is verifiable through the public 'tokenizer_config.json'.

Model

32.5 / 40

Parameter Density

9.5 / 10

The model is a dense architecture with a clearly stated 3.0 billion parameters. Detailed architectural breakdowns are provided, including the hidden size (2048), intermediate size (11008), and the specific configuration of attention heads (16 query, 4 KV). Because it is a dense model, there is no ambiguity regarding active vs. total parameters, and the impact of architectural choices like tied embeddings on the parameter count is explicitly mentioned.
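The stated parameter count can be sanity-checked from the dimensions on this page. The sketch below assumes a Llama-style layer layout: bias-free projections, a gated feed-forward block with three weight matrices (gate, up, down), two RMSNorm weight vectors per layer plus a final one, a head dimension of 128, and a vocabulary of 128,256 as listed in the specifications. Tied embeddings are counted once.

```python
HIDDEN = 2048
LAYERS = 36
Q_HEADS, KV_HEADS = 16, 4
HEAD_DIM = HIDDEN // Q_HEADS  # 128
FFN = 11_008
VOCAB = 128_256

# Attention: query and output projections are full width; key and value are narrower under GQA.
attn = 2 * HIDDEN * (Q_HEADS * HEAD_DIM) + 2 * HIDDEN * (KV_HEADS * HEAD_DIM)
# Gated feed-forward block: gate, up, and down projections (assumed Llama-style).
ffn = 3 * HIDDEN * FFN
# Two RMSNorm weight vectors per layer.
norms = 2 * HIDDEN

per_layer = attn + ffn + norms
embeddings = VOCAB * HIDDEN  # tied input/output embeddings, counted once
total = LAYERS * per_layer + embeddings + HIDDEN  # + final RMSNorm

print(f"Per layer:  {per_layer:,}")
print(f"Embeddings: {embeddings:,}")
print(f"Total:      {total:,} (~{total / 1e9:.2f}B)")
```

Under these assumptions the estimate lands slightly above 3 billion parameters, with the tied embedding matrix accounting for a little under a tenth of the total.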

Training Compute

7.5 / 10

Hugging Face disclosed the specific hardware used (384 H100 GPUs) and the training duration (24 days), totaling approximately 220,000 GPU hours. The training framework (nanotron) and data processing tools (datatrove) are also public. While a specific carbon footprint calculation or exact dollar cost was not provided in the primary model card, the hardware and time metrics allow for high-fidelity third-party estimation.
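The GPU-hour figure follows from the disclosed hardware and duration, as this short check shows. It assumes all 384 GPUs ran continuously for the full 24 days.

```python
GPUS = 384
DAYS = 24
HOURS_PER_DAY = 24

# Total GPU hours, assuming continuous use of every GPU for the whole run.
gpu_hours = GPUS * DAYS * HOURS_PER_DAY

print(f"{gpu_hours:,} GPU hours")
```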

Benchmark Reproducibility

6.5 / 10

The model release includes results for a wide range of standard benchmarks (HellaSwag, ARC, MMLU-Pro, etc.) and specifies the use of the 'lighteval' framework for evaluation. However, while the evaluation datasets are listed in a public collection, the exact prompt versions and few-shot configurations for every single reported score are not consolidated in a single 'reproducibility' file, requiring some effort to reconstruct from the lighteval configurations.

Identity Consistency

9.0 / 10

SmolLM3-3B exhibits high identity consistency, correctly identifying its version and origin in official documentation and through its specialized chat template. The model is transparent about its dual-mode reasoning capabilities (think/no-think) and its limitations as a 3B parameter model. There are no documented instances of the model claiming to be a competitor's product or misrepresenting its scale.

Downstream

25.0 / 30

License Clarity

10.0 / 10

The model is released under the Apache 2.0 license, which is a standard, highly permissive open-source license. There are no conflicting terms, and commercial use, modification, and distribution are explicitly permitted. The license applies clearly to both the model weights and the associated code in the GitHub repository.

Hardware Footprint

8.0 / 10

Official documentation provides clear guidance on hardware requirements, noting that the model can run on devices with as little as 4GB-8GB of RAM. VRAM usage for inference is well-understood given the 3B parameter count (~6GB in FP16), and the model card explicitly mentions support for quantization (4-bit/8-bit) via bitsandbytes and llama.cpp, with community-verified benchmarks for these formats.
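The weight-memory figures above can be approximated from the parameter count alone. The sketch below assumes exactly 3.0 billion parameters and counts only the weights; the KV cache, activations, and quantization overhead (scales, zero points) are excluded, so real usage will be higher. The helper name is illustrative.

```python
PARAMS = 3.0e9  # assumption: nominal 3B parameter count

# Approximate storage cost per parameter for each format.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params, bytes_per_param):
    """Memory for model weights only, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


footprint = {fmt: weight_memory_gb(PARAMS, b) for fmt, b in BYTES_PER_PARAM.items()}

for fmt, gb in footprint.items():
    print(f"{fmt}: ~{gb:.1f} GB")
```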

Versioning Drift

7.0 / 10

The model follows a clear versioning lineage (SmolLM -> SmolLM2 -> SmolLM3). Changes between versions, such as the increase in context length and the addition of reasoning modes, are well-documented in blog posts and commit histories. While it lacks a formal 'semantic versioning' changelog for minor weight updates, the major architectural and data shifts are transparently communicated.
