Llama 3.2 3B

Parameters: 3B
Context Length: 128K
Modality: Text
Architecture: Dense
License: Llama 3.2 Community License
Release Date: 25 Sept 2024
Knowledge Cutoff: Dec 2023

Technical Specifications

Attention

Attention Structure: Grouped-Query Attention
Attention Heads: 24
Key-Value Heads: 8
Attention Head Dimension: 128
Position Embedding: RoPE
RoPE Theta: 500,000
Sliding Window Attention: No
Sliding Window Size: -
Normalization: RMS Normalization
Activation Function: Swish

Dimensions

Hidden Dimension Size: 3,072
Number of Layers: 28
FFN Intermediate Size (Dense): 8,192
Multi-Token Prediction Heads: -

Tokenizer

Vocabulary Size: 128,256

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE · Hidden: 3,072 · Context: 128K · Vocab: 128,256)
→ 28 × [RMSNorm (pre-attention) → Grouped-Query Attention (24 Q / 8 KV heads, head dim 128) → residual add → RMSNorm (pre-FFN) → Feed-Forward Network (Swish, intermediate 8,192) → residual add]
→ Final RMSNorm → Output Logits
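The grouped-query attention step in the diagram can be illustrated with a minimal pure-Python sketch for a single query position. The function name `gqa_attention` and the list-based data layout are hypothetical choices for illustration, not Meta's implementation; masking, RoPE, and the output projection are omitted.

```python
import math


def gqa_attention(q_heads, k_heads, v_heads):
    """Attention for one query position where several query heads share a KV head.

    q_heads: one query vector per query head
    k_heads: for each KV head, a list of key vectors (one per cached token)
    v_heads: for each KV head, a list of value vectors (one per cached token)
    """
    n_q, n_kv = len(q_heads), len(k_heads)
    if n_q % n_kv != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    group = n_q // n_kv  # number of query heads served by each KV head

    outputs = []
    for h, q in enumerate(q_heads):
        kv = h // group  # consecutive query heads map to the same KV head
        keys, values = k_heads[kv], v_heads[kv]
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
        )
    return outputs
```

With 24 query heads and 8 key-value heads, `group` would be 3, so only a third as many key and value vectors need to be cached as with one KV head per query head.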

Llama 3.2 3B

Llama 3.2 3B is a compact, text-only generative language model developed by Meta, available in pre-trained and instruction-tuned variants. It belongs to the Llama 3.2 model family, which also includes 1-billion-parameter text models and larger multimodal variants. The model is designed for efficient deployment in resource-constrained environments such as edge and mobile devices. It targets assistant and agentic applications, with capabilities for summarization, instruction following, rewriting, and knowledge retrieval. Eight languages are officially supported: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Llama 3.2 3B is an auto-regressive transformer. It uses Grouped-Query Attention (GQA), in which groups of query heads share a single key-value head; this shrinks the key-value cache and improves inference throughput without a proportional increase in hardware demands. Training involved knowledge distillation from larger Llama variants, specifically the Llama 3.1 8B and 70B models, whose output logits served as token-level targets during pre-training to recover performance after pruning. Post-training alignment of the instruction-tuned versions uses supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Quantized variants apply 4-bit groupwise quantization to transformer block weights and 8-bit per-token dynamic quantization to activations, targeting runtimes such as PyTorch's ExecuTorch framework.
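The two quantization schemes named above can be sketched in a few lines of plain Python. This is a simplified illustration, not Meta's or ExecuTorch's implementation: it assumes symmetric quantization (no zero point), and the function names and the group size used in the usage note are hypothetical.

```python
def quantize_groupwise_4bit(weights, group_size):
    """Symmetric 4-bit quantization with one scale per group of weights."""
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of group_size")
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to code 7 (4-bit range is -8..7).
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size):
    """Recover approximate weights from 4-bit codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def quantize_per_token_8bit(activations):
    """Symmetric 8-bit quantization with a scale computed per token at run time."""
    rows, scales = [], []
    for token in activations:
        max_abs = max(abs(x) for x in token)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        scales.append(scale)
        rows.append([max(-128, min(127, round(x / scale))) for x in token])
    return rows, scales
```

For example, `quantize_groupwise_4bit([0.7, -0.3, 0.1, 0.0, 0.02, -0.01, 0.005, 0.0], group_size=4)` yields one scale for the large-magnitude group and a much smaller scale for the second group, which is why groupwise scaling loses less precision than a single scale per tensor. "Dynamic" in the activation scheme means the scale is derived from each token's values at inference time rather than fixed in advance.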

Llama 3.2 3B is built for on-device use, balancing computational efficiency with output quality. Its 128,000-token context window allows long inputs, for example in document summarization and extended conversations. The full-precision models support this context length, whereas quantized versions are typically configured for an 8,000-token context. The design prioritizes low-latency inference, which suits applications that need rapid responses on limited hardware, such as mobile writing assistants and customer service tools. The pre-trained variants also serve as a base for further fine-tuning on natural language generation tasks.
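The gap between the 128,000-token and 8,000-token configurations is easier to appreciate as key-value cache memory. The sketch below is a rough estimate only: it assumes 28 layers, 8 key-value heads, a head dimension of 128, and 16-bit cache entries, and the helper name `kv_cache_bytes` is hypothetical. Substitute other values as needed.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading factor of 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed configuration values (not taken from a measured deployment).
LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM = 28, 24, 8, 128

full_context = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 128_000)
short_context = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 8_000)
# Hypothetical comparison: one KV head per query head (no grouping).
without_gqa = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, 128_000)

for label, size in (
    ("128,000 tokens, GQA", full_context),
    ("8,000 tokens, GQA", short_context),
    ("128,000 tokens, no grouping", without_gqa),
):
    print(f"{label}: {size / 2**30:.2f} GiB")
```

Under these assumptions the cache grows linearly with context length, so the 8,000-token configuration needs one sixteenth of the cache of the full window, and grouping 24 query heads onto 8 key-value heads cuts the cache to a third of the ungrouped size.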

About Llama 3.2

Meta's Llama 3.2 family introduces vision models, integrating image encoders with language models for multimodal text and image processing. It also includes lightweight variants optimized for efficient on-device deployment, supporting an extended 128K token context length.


Evaluation Benchmarks

Rank: #142

Category            Benchmark      Score   Rank
General Knowledge   MMLU           0.634   34
Web Development     WebDev Arena   1166    103
General Text        Text Arena     1166    104

Rankings

Overall Rank: #142
Coding Rank: #123

Model Integrity

Llama 3.2 3B Model Integrity Report

Total Score: 72 / 100 (B+)

Audit Note

Llama 3.2 3B exhibits strong transparency in its architectural origins and training compute, providing specific hardware hours and environmental impact data. However, it remains opaque regarding the specific composition of its 9-trillion-token training set and utilizes a custom community license that falls short of true open-source standards. While the model's identity and hardware requirements are clearly communicated, the lack of a detailed changelog for tracking behavioral drift remains a notable weakness.

Upstream

21.0 / 30

Architectural Provenance

8.0 / 10

Meta provides high-quality documentation for the Llama 3.2 3B architecture, identifying it as an auto-regressive transformer. Key technical details, such as the use of Grouped-Query Attention (GQA) and the training methodology (knowledge distillation from the Llama 3.1 8B and 70B models), are explicitly documented. The model card and associated technical blog posts detail the pruning and recovery process used to reach the 3B parameter count from larger base models.
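Using teacher logits as token-level targets can be sketched as follows. The source does not state the exact loss Meta used; this example assumes a KL divergence between teacher and student softmax distributions with an optional temperature, which is one common formulation of logit distillation. All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean KL(teacher || student) over token positions.

    Each argument is a list of per-token logit vectors over the vocabulary.
    """
    loss = 0.0
    for teacher_row, student_row in zip(teacher_logits, student_logits):
        p = softmax(teacher_row, temperature)
        q = softmax(student_row, temperature)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return loss / len(teacher_logits)
```

The loss is zero when the student reproduces the teacher's distribution at every position and grows as the two diverge, so minimizing it pushes the pruned model back toward the larger model's behavior.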

Dataset Composition

4.0 / 10

While Meta discloses that the model was trained on 'up to 9 trillion tokens' from 'publicly available sources,' it fails to provide a detailed breakdown of the dataset composition (e.g., specific percentages of web, code, or books). The description of data cleaning and filtering is limited to high-level summaries of 'new mix' and 'high-quality' data without providing specific methodologies or access to sample data for verification.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly accessible via the official GitHub repository and Hugging Face. It uses a tiktoken-based implementation with a clearly stated vocabulary size of 128,256 tokens. Documentation explains the shift from SentencePiece (used in Llama 2) and provides details on the improved compression ratios for multilingual text, which aligns with the claimed support for eight official languages.

Model

30.5 / 40

Parameter Density

7.0 / 10

The model's parameter count is clearly stated as 3.21 billion total parameters. As a dense model, all parameters are active during inference, which is transparently communicated. However, while the total count is precise, a detailed breakdown of how parameters are distributed between attention layers and feed-forward networks is not as readily available in the primary documentation as it is for the larger Llama 3.1 405B model.
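A breakdown of this kind can be approximated from configuration values. The sketch below assumes a hidden size of 3,072, 28 layers, 24 query heads, 8 key-value heads, a head dimension of 128, an FFN intermediate size of 8,192, and a vocabulary of 128,256. It further assumes bias-free projections, a gated feed-forward block with three weight matrices, and an output head tied to the input embedding. These are assumptions chosen because they reproduce the stated 3.21 billion total; the function name is hypothetical.

```python
def llama_param_breakdown(vocab, hidden, layers, q_heads, kv_heads, head_dim, ffn,
                          tied_embeddings=True):
    """Approximate parameter counts per component of a Llama-style dense model."""
    attention = hidden * q_heads * head_dim        # query projection
    attention += 2 * hidden * kv_heads * head_dim  # key and value projections
    attention += q_heads * head_dim * hidden       # output projection
    feed_forward = 3 * hidden * ffn                # gate, up, and down projections
    norms_per_layer = 2 * hidden                   # pre-attention and pre-FFN RMSNorm
    embedding = vocab * hidden
    return {
        "embedding": embedding,
        "attention": layers * attention,
        "feed_forward": layers * feed_forward,
        "norms": layers * norms_per_layer + hidden,  # plus the final RMSNorm
        "output_head": 0 if tied_embeddings else embedding,
    }


breakdown = llama_param_breakdown(
    vocab=128_256, hidden=3_072, layers=28,
    q_heads=24, kv_heads=8, head_dim=128, ffn=8_192,
)
total = sum(breakdown.values())
for name, count in breakdown.items():
    print(f"{name:>12}: {count:>13,} ({count / total:.1%})")
print(f"{'total':>12}: {total:>13,}")
```

Under these assumptions the feed-forward blocks hold roughly two thirds of the weights, attention a little over a fifth, and the embedding table most of the remainder.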

Training Compute

8.5 / 10

Meta provides specific compute metrics, stating that the Llama 3.2 1B and 3B models utilized a cumulative 916k GPU hours on NVIDIA H100-80GB hardware. Environmental impact data is also disclosed, including an estimated 240 tons of CO2eq for the training run, along with a description of their carbon offset methodology. This level of detail is exemplary compared to industry peers.

Benchmark Reproducibility

6.0 / 10

Meta provides a comprehensive list of benchmark results (MMLU, ARC, GSM8K, etc.) and has released an 'evals' dataset on Hugging Face containing detailed result logs. However, the exact prompts and few-shot examples used to generate the official scores are often described generally rather than provided as a complete, one-click reproduction suite. Third-party verification is available through various leaderboards, but discrepancies in 'zero-shot' vs 'few-shot' reporting across different sources persist.

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, correctly identifying itself as a Meta Llama model in standard interactions. It maintains awareness of its versioning (Llama 3.2) and its status as a smaller, 3B parameter model. There are no widespread reports of the model claiming to be a competitor's product or misrepresenting its core identity.

Downstream

20.0 / 30

License Clarity

7.0 / 10

The model is released under the 'Llama 3.2 Community License.' While the terms are publicly available and explicitly allow for commercial use, it is a custom license rather than a standard OSI-approved open-source license. It includes a restrictive clause for entities with over 700 million monthly active users, which creates a 'pseudo-open' status that requires careful legal review for large-scale enterprise use.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented for various precisions. Meta and partners provide clear guidance on VRAM needs (e.g., ~3.4 GB for 4-bit quantized versions with an 8k context). The documentation also distinguishes between the 128k context window supported by full-precision weights and the 8k limit typically seen in standard quantized deployments, providing a realistic view of hardware trade-offs.
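As a rough cross-check on figures like these, the memory taken by the weights alone follows directly from the parameter count and the bit width. The sketch below uses the 3.21 billion parameter figure and a hypothetical helper `weight_bytes`; it ignores quantization scales, the key-value cache, activations, and runtime overhead, all of which add to the real footprint.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights at a given precision."""
    return num_params * bits_per_weight // 8


PARAMS = 3_210_000_000  # 3.21 billion parameters

for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label}: {weight_bytes(PARAMS, bits) / 1e9:.2f} GB")
```

The 4-bit weights by themselves come to well under the quoted ~3.4 GB, which is consistent with the remainder being taken up by the cache for the 8k context and other runtime buffers, although the exact split depends on the runtime.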

Versioning Drift

5.0 / 10

Meta uses a versioning system (Llama 3.2), but the frequency and nature of 'silent' updates to the weights or safety filters are not transparently tracked in a public changelog. While the initial release is well-documented, there is no clear mechanism for users to track or opt out of behavioral drift caused by subsequent fine-tuning or alignment adjustments made by the provider.
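The category and total scores in this report can be re-derived from the individual criteria. The short sketch below simply sums the scores listed above; the assumption that the headline figure is the rounded sum is an inference, not something the report states.

```python
# Criterion scores as listed in the report, grouped by category.
scores = {
    "Upstream": {
        "Architectural Provenance": 8.0,
        "Dataset Composition": 4.0,
        "Tokenizer Integrity": 9.0,
    },
    "Model": {
        "Parameter Density": 7.0,
        "Training Compute": 8.5,
        "Benchmark Reproducibility": 6.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 7.0,
        "Hardware Footprint": 8.0,
        "Versioning Drift": 5.0,
    },
}

category_totals = {name: sum(items.values()) for name, items in scores.items()}
total = sum(category_totals.values())

for name, value in category_totals.items():
    print(f"{name}: {value}")
print(f"Sum of criteria: {total}")
```

The category sums match the reported 21.0, 30.5, and 20.0, and the overall sum of 71.5 is presumably rounded up to the reported 72 / 100.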
