
Llama 4 Scout

Active Parameters

17B

Context Length

10M (10,000K)

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Llama 4 Community License Agreement

Release Date

5 Apr 2025

Knowledge Cutoff

Aug 2024

Technical Specifications

Total Expert Parameters

-

Number of Experts

16

Active Experts

2

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

8192

Number of Layers

80

Attention Heads

64

Key-Value Heads

8

Activation Function

-

Normalization

-

Position Embedding

iRoPE

Llama 4 Scout

Llama 4 Scout is a key offering within Meta's Llama 4 family of models, released on April 5, 2025. It is designed to provide robust artificial intelligence capabilities for researchers and organizations while operating within practical hardware constraints. A general-purpose, natively multimodal model, Llama 4 Scout processes both text and image inputs. Its applications span a wide array of tasks, including complex conversational interactions, detailed image analysis, and advanced code generation. The model's design focuses on executing these tasks efficiently across diverse computational environments.

Architecturally, Llama 4 Scout employs a Mixture-of-Experts (MoE) configuration, incorporating 109 billion total parameters, with 17 billion active parameters engaged per token across 16 experts. A significant innovation in its design is an industry-leading context window, supporting up to 10 million tokens, which represents a substantial increase over prior iterations. The model integrates an early fusion approach for its native multimodality, which unifies text and vision tokens within its foundational structure. Optimized for efficient deployment, Llama 4 Scout can run on a single NVIDIA H100 GPU when leveraging Int4 quantization. Furthermore, its architecture incorporates interleaved attention layers, specifically iRoPE, to enhance generalization capabilities across extended sequences.
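The MoE mechanism described above can be sketched in a few lines: a router scores each token against every expert, only the top-scoring experts run, and their outputs are mixed by softmax weights. Everything below except the 16-expert count is a toy assumption (a 64-dim hidden size, random weights, top-2 selection per the spec table), not the model's real dimensions or routing policy.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 16   # per the Scout spec
TOP_K = 2          # experts activated per token (per the spec table)
HIDDEN = 64        # toy hidden size for illustration (real model is far larger)
TOKENS = 4

# Router: a linear layer scoring each token against every expert.
router_w = rng.normal(size=(HIDDEN, NUM_EXPERTS))
x = rng.normal(size=(TOKENS, HIDDEN))

logits = x @ router_w                            # (TOKENS, NUM_EXPERTS)
topk = np.argsort(logits, axis=-1)[:, -TOP_K:]   # indices of the K best experts

# Softmax over just the selected experts' logits gives the mixing weights.
sel = np.take_along_axis(logits, topk, axis=-1)
weights = np.exp(sel) / np.exp(sel).sum(axis=-1, keepdims=True)

# Toy expert FFNs: one weight matrix each; only the selected experts run,
# which is why the active parameter count is far below the total.
experts = rng.normal(size=(NUM_EXPERTS, HIDDEN, HIDDEN)) * 0.1
out = np.zeros_like(x)
for t in range(TOKENS):
    for slot in range(TOP_K):
        e = topk[t, slot]
        out[t] += weights[t, slot] * (x[t] @ experts[e])
```

Because only TOP_K of the NUM_EXPERTS FFNs execute per token, compute scales with the active parameters (17B) even though all 109B must be stored.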

Llama 4 Scout is well-suited for applications demanding the processing and analysis of extensive information volumes. Its primary use cases include multi-document summarization, detailed analysis of user activity for personalization, and reasoning over substantial codebases. The model demonstrates strong performance in tasks requiring document question-answering, precise information retrieval, and reliable source attribution, making it particularly valuable for professional document analysis. Its design for efficiency on a single GPU facilitates accessibility for organizations with varying computing infrastructure. The model also supports multilingual tasks, having been trained on data from 200 languages, with fine-tuning capabilities for 12 specific languages.

About Llama 4

Meta's Llama 4 model family implements a Mixture-of-Experts (MoE) architecture for efficient scaling. It features native multimodality through early fusion of text, images, and video. This iteration also supports significantly extended context lengths, with models capable of processing up to 10 million tokens.
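As an illustration of the early-fusion approach mentioned above, the sketch below projects image patches and text tokens into one shared embedding space and concatenates them into a single sequence for a common transformer stack. All dimensions and weights here are made-up toy values, not Llama 4's actual vision encoder or embedding sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 32     # toy embedding width (illustrative only)
VOCAB = 1000     # toy vocabulary size
PATCH_DIM = 48   # toy flattened image-patch size

# Text path: ordinary token-embedding lookup.
tok_embed = rng.normal(size=(VOCAB, D_MODEL))
text_ids = np.array([5, 17, 256])
text_tokens = tok_embed[text_ids]                # (3, D_MODEL)

# Vision path: linearly project image patches into the same space.
patch_proj = rng.normal(size=(PATCH_DIM, D_MODEL))
patches = rng.normal(size=(4, PATCH_DIM))        # 4 patches from one image
vision_tokens = patches @ patch_proj             # (4, D_MODEL)

# Early fusion: one mixed sequence feeds the shared transformer backbone,
# rather than running a separate vision model and merging late.
fused = np.concatenate([vision_tokens, text_tokens], axis=0)  # (7, D_MODEL)
```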



Evaluation Benchmarks

Rank

#92

Benchmark | Score | Rank

MMLU Pro (Professional Knowledge) | 0.74 | #17

Rankings

Overall Rank

#92

Coding Rank

-

Model Transparency

Total Score

C+

59 / 100

Llama 4 Scout Transparency Report


Audit Note

Llama 4 Scout presents a bifurcated transparency profile, offering high clarity on its Mixture-of-Experts architecture and hardware requirements while remaining notably opaque regarding its training data and compute resources. The model's industry-leading context window and native multimodality are well-documented, but the restrictive, geographically fenced license and lack of reproducible benchmark code significantly limit its standing as a truly open research tool.

Upstream

18.0 / 30

Architectural Provenance

7.0 / 10

Meta provides a clear architectural name and high-level description for Llama 4 Scout, identifying it as a Mixture-of-Experts (MoE) model with 16 experts and 109B total parameters. Documentation specifies the use of 'early fusion' for native multimodality and 'iRoPE' (interleaved Rotary Positional Embeddings) for length generalization. While the model is described as being distilled from a larger 'Behemoth' teacher model, the full pretraining methodology and specific architectural modifications for the 10M context window are described in blog posts and model cards rather than a formal peer-reviewed technical paper, leaving some technical implementation details opaque.

Dataset Composition

3.0 / 10

Disclosure regarding training data is limited to high-level generalities. Meta states the model was trained on ~40 trillion tokens of 'multimodal data from a mix of publicly available, licensed data and information from Meta's products and services,' including posts from Instagram and Facebook. However, there is no public breakdown of dataset percentages (e.g., code vs. web vs. books), no detailed documentation on filtering or cleaning methodologies, and no sample data provided for verification. This falls under the 'minimal information' category with significant gaps.

Tokenizer Integrity

8.0 / 10

The tokenizer is publicly accessible via the Hugging Face repository and official GitHub. It uses a combination of BPE and WordPiece with a stated vocabulary size of approximately 128,000 tokens. Documentation includes details on special tokens for multimodal content and control. The tokenizer's support for 12 primary languages is well-documented, though its performance on the broader claimed 200 languages is less verifiable without extensive third-party testing.

Model

25.0 / 40

Parameter Density

8.0 / 10

Meta is transparent about the MoE structure, explicitly stating the model has 109 billion total parameters with 17 billion active parameters per token. The distribution across 16 experts is clearly defined. While a full architectural breakdown of attention vs. FFN parameter ratios is not provided in the standard model card, the active vs. total parameter distinction is handled with high clarity, avoiding the common pitfall of advertising only the total count.

Training Compute

4.0 / 10

Information on training compute is sparse. While some third-party reports (e.g., Azure/HPCwire) estimate the training required millions of H100 hours and provide carbon footprint estimates (approx. 1,999 tons CO2e), Meta's official documentation lacks a comprehensive compute report. There is no official disclosure of the exact hardware hours, total energy consumption, or detailed cost breakdown in the primary model documentation.

Benchmark Reproducibility

4.0 / 10

While Meta provides benchmark scores (e.g., MMLU-Pro, ChartQA) in its model cards, independent researchers have reported difficulty reproducing these results, particularly in coding and long-context tasks. Evaluation code is not fully public in a way that allows for one-click verification of the official scores. The lack of detailed prompting strategies and exact few-shot examples in official documentation further hinders reproducibility.

Identity Consistency

9.0 / 10

Llama 4 Scout demonstrates high identity consistency, correctly identifying its version and family in standard interactions. It maintains a clear distinction from its larger sibling, Maverick, and the teacher model, Behemoth. There are no documented cases of the model claiming to be a competitor's product or denying its nature as an AI developed by Meta.

Downstream

16.0 / 30

License Clarity

4.0 / 10

The 'Llama 4 Community License Agreement' is a custom, restrictive license that is not OSI-compliant. While the terms are publicly accessible, they contain significant geographic restrictions (specifically excluding EU-based entities from using multimodal features) and commercial usage caps (requiring a separate license for entities with >700M monthly active users). These 'open-weights' but not 'open-source' terms create a complex legal landscape that is less transparent than standard permissive licenses like Apache 2.0.

Hardware Footprint

7.0 / 10

Hardware requirements are well-documented for various configurations. Meta and partners (NVIDIA, Unsloth) provide specific VRAM estimates for FP16 (~218GB) and Int4 (~55GB), confirming it can run on a single H100 with quantization. However, the memory scaling for the 10M context window is less transparent; while the theoretical limit is stated, practical VRAM requirements for the KV cache at extreme lengths are only available through third-party estimates rather than official scaling tables.
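The ~218 GB FP16 and ~55 GB Int4 figures follow directly from a weight-only back-of-envelope estimate (total parameters × bytes per parameter). The sketch below applies the same arithmetic to the KV cache using the spec-table figures (80 layers, 8 KV heads, head dim 8192/64 = 128); it deliberately ignores activations, framework overhead, and any cache compression, so treat the numbers as rough bounds, not official requirements.

```python
def weight_vram_gb(total_params: float, bytes_per_param: float) -> float:
    """Weight-only footprint; ignores activations and framework overhead.

    All 109B parameters must be resident even though only 17B are
    active per token, so quantization targets the total count.
    """
    return total_params * bytes_per_param / 1e9


TOTAL_PARAMS = 109e9
fp16_gb = weight_vram_gb(TOTAL_PARAMS, 2.0)   # FP16: 2 bytes/param -> ~218 GB
int4_gb = weight_vram_gb(TOTAL_PARAMS, 0.5)   # Int4: 0.5 bytes/param -> ~55 GB


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_val: float = 2.0) -> float:
    """K and V caches across all layers at a given context length (FP16)."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_val / 1e9


# Spec-table figures: 80 layers, 8 KV heads, 8192 / 64 heads = 128 head dim.
# GQA (8 KV heads instead of 64) already shrinks this cache 8x.
kv_full_window_gb = kv_cache_gb(80, 8, 128, 10_000_000)
```

The KV-cache term dwarfs the weights at the full 10M-token window (multiple terabytes under these assumptions), which is why practical long-context deployments depend on shorter windows, quantized caches, or offloading.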

Versioning Drift

5.0 / 10

Meta uses a versioning system (e.g., Llama-4-Scout-17B-16E-Instruct), but the changelog and history of updates are not maintained with the rigor of a software project. While major releases are documented, there is limited transparency regarding 'silent' updates or behavioral drift in the hosted versions of the model. The lack of a public, detailed version history for weight checkpoints reduces the score.
