
Llama 4 Scout

Active Parameters

17B

Context Length

10M (10,000K)

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Llama 4 Community License Agreement

Release Date

5 Apr 2025

Knowledge Cutoff

Aug 2024

Technical Specifications

Total Expert Parameters

-

Number of Experts

16

Active Experts

2

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

8192

Number of Layers

80

Attention Heads

64

Key-Value Heads

8

Activation Function

-

Normalization

-

Position Embedding

iRoPE

Llama 4 Scout

Llama 4 Scout is a key offering within Meta's Llama 4 family of models, released on April 5, 2025. It is designed to provide robust artificial intelligence capabilities for researchers and organizations while operating within practical hardware constraints. A general-purpose, natively multimodal model, Llama 4 Scout processes both text and image inputs. Its applications span a wide array of tasks, including complex conversational interactions, detailed image analysis, and advanced code generation. The model's design focuses on executing these tasks efficiently across diverse computational environments.

Architecturally, Llama 4 Scout employs a Mixture-of-Experts (MoE) configuration, incorporating 109 billion total parameters, with 17 billion active parameters engaged per token across 16 experts. A significant innovation in its design is an industry-leading context window, supporting up to 10 million tokens, which represents a substantial increase over prior iterations. The model integrates an early fusion approach for its native multimodality, which unifies text and vision tokens within its foundational structure. Optimized for efficient deployment, Llama 4 Scout can run on a single NVIDIA H100 GPU when leveraging Int4 quantization. Furthermore, its architecture incorporates interleaved attention layers, specifically iRoPE, to enhance generalization capabilities across extended sequences.
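The MoE mechanism described above can be sketched in a few lines: a router scores each token against every expert, only the top-scoring experts run, and their outputs are mixed by softmax weights. Everything below except the 16-expert count is a toy assumption (a 64-dim hidden size, random weights, top-2 selection per the spec table), not the model's real dimensions or routing policy.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 16   # per the Scout spec
TOP_K = 2          # experts activated per token (per the spec table)
HIDDEN = 64        # toy hidden size for illustration (real model is far larger)
TOKENS = 4

# Router: a linear layer scoring each token against every expert.
router_w = rng.normal(size=(HIDDEN, NUM_EXPERTS))
x = rng.normal(size=(TOKENS, HIDDEN))

logits = x @ router_w                            # (TOKENS, NUM_EXPERTS)
topk = np.argsort(logits, axis=-1)[:, -TOP_K:]   # indices of the K best experts

# Softmax over just the selected experts' logits gives the mixing weights.
sel = np.take_along_axis(logits, topk, axis=-1)
weights = np.exp(sel) / np.exp(sel).sum(axis=-1, keepdims=True)

# Toy expert FFNs: one weight matrix each; only the selected experts run,
# which is why the active parameter count is far below the total.
experts = rng.normal(size=(NUM_EXPERTS, HIDDEN, HIDDEN)) * 0.1
out = np.zeros_like(x)
for t in range(TOKENS):
    for slot in range(TOP_K):
        e = topk[t, slot]
        out[t] += weights[t, slot] * (x[t] @ experts[e])
```

Because only TOP_K of the NUM_EXPERTS FFNs execute per token, compute scales with the active parameters (17B) even though all 109B must be stored.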

Llama 4 Scout is well-suited for applications demanding the processing and analysis of extensive information volumes. Its primary use cases include multi-document summarization, detailed analysis of user activity for personalization, and reasoning over substantial codebases. The model demonstrates strong performance in tasks requiring document question-answering, precise information retrieval, and reliable source attribution, making it particularly valuable for professional document analysis. Its design for efficiency on a single GPU facilitates accessibility for organizations with varying computing infrastructure. The model also supports multilingual tasks, having been trained on data from 200 languages, with fine-tuning capabilities for 12 specific languages.

About Llama 4

Meta's Llama 4 model family implements a Mixture-of-Experts (MoE) architecture for efficient scaling. It features native multimodality through early fusion of text, images, and video. This iteration also supports significantly extended context lengths, with models capable of processing up to 10 million tokens.
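As an illustration of the early-fusion approach mentioned above, the sketch below projects image patches and text tokens into one shared embedding space and concatenates them into a single sequence for a common transformer stack. All dimensions and weights here are made-up toy values, not Llama 4's actual vision encoder or embedding sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 32     # toy embedding width (illustrative only)
VOCAB = 1000     # toy vocabulary size
PATCH_DIM = 48   # toy flattened image-patch size

# Text path: ordinary token-embedding lookup.
tok_embed = rng.normal(size=(VOCAB, D_MODEL))
text_ids = np.array([5, 17, 256])
text_tokens = tok_embed[text_ids]                # (3, D_MODEL)

# Vision path: linearly project image patches into the same space.
patch_proj = rng.normal(size=(PATCH_DIM, D_MODEL))
patches = rng.normal(size=(4, PATCH_DIM))        # 4 patches from one image
vision_tokens = patches @ patch_proj             # (4, D_MODEL)

# Early fusion: one mixed sequence feeds the shared transformer backbone,
# rather than running a separate vision model and merging late.
fused = np.concatenate([vision_tokens, text_tokens], axis=0)  # (7, D_MODEL)
```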



Evaluation Benchmarks

Rank

#92

Benchmark | Score | Rank

MMLU Pro (Professional Knowledge) | 0.74 | #17

Rankings

Overall Rank

#92

Coding Rank

-

Model Transparency

Total Score

C+

59 / 100

Llama 4 Scout Transparency Report


Audit Note

Llama 4 Scout presents a bifurcated transparency profile, offering high clarity on its Mixture-of-Experts architecture and hardware requirements while remaining notably opaque regarding its training data and compute resources. The model's industry-leading context window and native multimodality are well-documented, but the restrictive, geographically fenced license and lack of reproducible benchmark code significantly limit its standing as a truly open research tool.

Upstream

18.0 / 30

Architectural Provenance

7.0 / 10

Meta provides a clear architectural name and high-level description for Llama 4 Scout, identifying it as a Mixture-of-Experts (MoE) model with 16 experts and 109B total parameters. Documentation specifies the use of 'early fusion' for native multimodality and 'iRoPE' (interleaved Rotary Positional Embeddings) for length generalization. While the model is described as being distilled from a larger 'Behemoth' teacher model, the full pretraining methodology and specific architectural modifications for the 10M context window are described in blog posts and model cards rather than a formal peer-reviewed technical paper, leaving some technical implementation details opaque.

Dataset Composition

3.0 / 10

Disclosure regarding training data is limited to high-level generalities. Meta states the model was trained on ~40 trillion tokens of 'multimodal data from a mix of publicly available, licensed data and information from Meta's products and services,' including posts from Instagram and Facebook. However, there is no public breakdown of dataset percentages (e.g., code vs. web vs. books), no detailed documentation on filtering or cleaning methodologies, and no sample data provided for verification. This falls under the 'minimal information' category with significant gaps.

Tokenizer Integrity

8.0 / 10

The tokenizer is publicly accessible via the Hugging Face repository and official GitHub. It uses a combination of BPE and WordPiece with a stated vocabulary size of approximately 128,000 tokens. Documentation includes details on special tokens for multimodal content and control. The tokenizer's support for 12 primary languages is well-documented, though its performance on the broader claimed 200 languages is less verifiable without extensive third-party testing.

Model

25.0 / 40

Parameter Density

8.0 / 10

Meta is transparent about the MoE structure, explicitly stating the model has 109 billion total parameters with 17 billion active parameters per token. The distribution across 16 experts is clearly defined. While a full architectural breakdown of attention vs. FFN parameter ratios is not provided in the standard model card, the active vs. total parameter distinction is handled with high clarity, avoiding the common pitfall of advertising only the total count.

Training Compute

4.0 / 10

Information on training compute is sparse. While some third-party reports (e.g., Azure/HPCwire) estimate the training required millions of H100 hours and provide carbon footprint estimates (approx. 1,999 tons CO2e), Meta's official documentation lacks a comprehensive compute report. There is no official disclosure of the exact hardware hours, total energy consumption, or detailed cost breakdown in the primary model documentation.

Benchmark Reproducibility

4.0 / 10

While Meta provides benchmark scores (e.g., MMLU-Pro, ChartQA) in its model cards, independent researchers have reported difficulty reproducing these results, particularly in coding and long-context tasks. Evaluation code is not fully public in a way that allows for one-click verification of the official scores. The lack of detailed prompting strategies and exact few-shot examples in official documentation further hinders reproducibility.

Identity Consistency

9.0 / 10

Llama 4 Scout demonstrates high identity consistency, correctly identifying its version and family in standard interactions. It maintains a clear distinction from its larger sibling, Maverick, and the teacher model, Behemoth. There are no documented cases of the model claiming to be a competitor's product or denying its nature as an AI developed by Meta.

Downstream

16.0 / 30

License Clarity

4.0 / 10

The 'Llama 4 Community License Agreement' is a custom, restrictive license that is not OSI-compliant. While the terms are publicly accessible, they contain significant geographic restrictions (specifically excluding EU-based entities from using multimodal features) and commercial usage caps (requiring a separate license for entities with >700M monthly active users). These 'open-weights' but not 'open-source' terms create a complex legal landscape that is less transparent than standard permissive licenses like Apache 2.0.

Hardware Footprint

7.0 / 10

Hardware requirements are well-documented for various configurations. Meta and partners (NVIDIA, Unsloth) provide specific VRAM estimates for FP16 (~218GB) and Int4 (~55GB), confirming it can run on a single H100 with quantization. However, the memory scaling for the 10M context window is less transparent; while the theoretical limit is stated, practical VRAM requirements for the KV cache at extreme lengths are only available through third-party estimates rather than official scaling tables.
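The ~218 GB FP16 and ~55 GB Int4 figures follow directly from a weight-only back-of-envelope estimate (total parameters × bytes per parameter). The sketch below applies the same arithmetic to the KV cache using the spec-table figures (80 layers, 8 KV heads, head dim 8192/64 = 128); it deliberately ignores activations, framework overhead, and any cache compression, so treat the numbers as rough bounds, not official requirements.

```python
def weight_vram_gb(total_params: float, bytes_per_param: float) -> float:
    """Weight-only footprint; ignores activations and framework overhead.

    All 109B parameters must be resident even though only 17B are
    active per token, so quantization targets the total count.
    """
    return total_params * bytes_per_param / 1e9


TOTAL_PARAMS = 109e9
fp16_gb = weight_vram_gb(TOTAL_PARAMS, 2.0)   # FP16: 2 bytes/param -> ~218 GB
int4_gb = weight_vram_gb(TOTAL_PARAMS, 0.5)   # Int4: 0.5 bytes/param -> ~55 GB


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_val: float = 2.0) -> float:
    """K and V caches across all layers at a given context length (FP16)."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_val / 1e9


# Spec-table figures: 80 layers, 8 KV heads, 8192 / 64 heads = 128 head dim.
# GQA (8 KV heads instead of 64) already shrinks this cache 8x.
kv_full_window_gb = kv_cache_gb(80, 8, 128, 10_000_000)
```

The KV-cache term dwarfs the weights at the full 10M-token window (multiple terabytes under these assumptions), which is why practical long-context deployments depend on shorter windows, quantized caches, or offloading.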

Versioning Drift

5.0 / 10

Meta uses a versioning system (e.g., Llama-4-Scout-17B-16E-Instruct), but the changelog and history of updates are not maintained with the rigor of a software project. While major releases are documented, there is limited transparency regarding 'silent' updates or behavioral drift in the hosted versions of the model. The lack of a public, detailed version history for weight checkpoints reduces the score.
