Llama 3 8B

Parameters

8B

Context Length

8,192 tokens

Modality

Text

Architecture

Dense

License

Meta Llama 3 Community License Agreement

Release Date

18 Apr 2024

Knowledge Cutoff

Mar 2023

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

32

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

RoPE

RoPE Theta

500,000

Sliding Window Attention

-

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

4,096

Number of Layers

32

FFN Intermediate Size (Dense)

-

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

128,256

Architecture Diagram

[Diagram] Input Tokens → Token Embedding (Position: RoPE; Hidden: 4.1k; Context: 8.2k) → 32 layers of [RMSNorm (pre-attention) → Grouped-Query Attention (32 Q / 8 KV heads, head dim 128) → residual add → RMSNorm (pre-FFN) → Feed-Forward Network (SwiGLU) → residual add] → Final RMSNorm → Output Logits

Llama 3 8B

Meta Llama 3 is a family of foundation large language models developed by Meta AI for text and code generation. It was released in 8 billion and 70 billion parameter sizes, each in pre-trained and instruction-tuned forms. Intended uses range from assistant-style conversational agents to natural language processing research.

The model is a decoder-only transformer with several changes relative to its predecessors. Its tokenizer has a vocabulary of about 128K tokens (128,256 entries), which encodes text more efficiently. Both the 8 billion and 70 billion parameter versions use Grouped-Query Attention (GQA) to improve inference efficiency. For training stability, Llama 3 applies Root Mean Square Normalization (RMSNorm) as pre-normalization and uses the SwiGLU activation function. Position information is encoded with Rotary Positional Embeddings (RoPE).
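The head grouping implied by the specification (32 query heads, 8 key-value heads, hidden size 4,096) can be sketched as follows. This is a minimal illustration, not Meta's implementation: it assumes the common layout in which consecutive query heads share one key-value head, and the function names are hypothetical.

```python
# Values taken from the specification table on this page.
HIDDEN_SIZE = 4096
N_QUERY_HEADS = 32
N_KV_HEADS = 8

# Each head works on an equal slice of the hidden dimension.
HEAD_DIM = HIDDEN_SIZE // N_QUERY_HEADS


def kv_head_for(query_head: int) -> int:
    """Return the key-value head shared by a given query head.

    Assumption: consecutive query heads are grouped together,
    so heads 0..3 use KV head 0, heads 4..7 use KV head 1, and so on.
    """
    group_size = N_QUERY_HEADS // N_KV_HEADS
    return query_head // group_size


def kv_projection_width(n_kv_heads: int) -> int:
    """Output width of the key (or value) projection for a given KV head count."""
    return n_kv_heads * HEAD_DIM


if __name__ == "__main__":
    print("head dim:", HEAD_DIM)
    print("query head 5 uses KV head:", kv_head_for(5))
    print("K width with GQA:", kv_projection_width(N_KV_HEADS))
    print("K width with full multi-head attention:", kv_projection_width(N_QUERY_HEADS))
```

With 8 key-value heads instead of 32, the key and value projections (and the cached keys and values at inference time) are a quarter of the size they would be under standard multi-head attention.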

Llama 3 8B was pre-trained on more than 15 trillion tokens from publicly available sources, a substantial increase over prior Llama iterations. It supports a context length of 8,192 tokens. The model generates coherent text, assists with code completion, and handles conversational tasks. Multilingual support and tool use arrive in later iterations (Llama 3.1).

About Llama 3

Meta's Llama 3 is a series of large language models utilizing a decoder-only transformer architecture. It incorporates a 128K token vocabulary and Grouped Query Attention for efficient processing. Models are trained on substantial public datasets, supporting various parameter scales and extended context lengths.


Evaluation Benchmarks

Rank

#142

WebDev Arena (Web Development): score 1223, rank 76

Rankings

Overall Rank

#142

Coding Rank

#95

Model Integrity

Total Score

B

69 / 100

Llama 3 8B Model Integrity Report

Audit Note

Llama 3 8B exhibits high transparency in its architectural design and compute resource disclosure, providing a level of technical detail that sets a strong industry standard. However, the model's transparency is hindered by the use of a restrictive custom license and a lack of granular detail regarding the specific sources within its 15-trillion-token training corpus. While the model's identity and hardware requirements are well-defined, improvements in benchmark reproducibility and more detailed dataset disclosures are necessary for a top-tier transparency rating.

Upstream

21.5 / 30

Architectural Provenance

8.0 / 10

Meta provides comprehensive documentation for the Llama 3 architecture in their official technical report and model cards. The 8B variant is explicitly defined as a dense, decoder-only transformer with 32 layers, a hidden dimension of 4096, and 32 attention heads. Key technical modifications like Grouped-Query Attention (GQA) with 8 KV heads, SwiGLU activation, and Rotary Positional Embeddings (RoPE) with a base frequency of 500,000 are clearly documented. The training methodology, including the use of RMSNorm for pre-normalization and the specific AdamW optimizer hyperparameters, is publicly available.

Dataset Composition

4.5 / 10

While Meta discloses the scale of the pre-training data (15T+ tokens) and provides a high-level categorical breakdown (e.g., 5% non-English across 30+ languages, with specific mentions of code and mathematics), they do not release the actual dataset or provide a granular percentage-based composition of sources. Documentation mentions 'publicly available online data' and describes filtering/cleaning steps (PII removal, deduplication, and quality filtering), but the lack of specific source naming or a detailed data mixture prevents a higher score.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly accessible via the official GitHub repository and Hugging Face. It uses a TikToken-based BPE approach with a clearly stated vocabulary size of 128,256 tokens. This expanded vocabulary is documented as a key improvement for encoding efficiency across diverse domains. The tokenizer's behavior, including special tokens like <|begin_of_text|> and <|eot_id|>, is well-documented for both pre-trained and instruction-tuned variants, allowing for full verification and local testing.
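To show how the special tokens fit together, here is a minimal sketch of a prompt builder for the instruction-tuned variant. The `<|begin_of_text|>` and `<|eot_id|>` tokens come from the text above; the `<|start_header_id|>` and `<|end_header_id|>` tokens and the exact whitespace are assumptions based on the published Llama 3 chat format and should be checked against the official chat template. The function name is hypothetical.

```python
from typing import Dict, List


def format_llama3_chat(messages: List[Dict[str, str]]) -> str:
    """Build a Llama 3 style chat prompt from role/content messages.

    Assumption: each turn is wrapped as
    <|start_header_id|>ROLE<|end_header_id|>\n\nCONTENT<|eot_id|>
    and the prompt ends with an open assistant header.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    # Leave the assistant turn open so the model writes the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    prompt = format_llama3_chat(
        [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "What is grouped-query attention?"},
        ]
    )
    print(prompt)
```

In practice the tokenizer's own chat template is the safer route; the sketch is only meant to make the token layout visible for local testing.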

Model

29.0 / 40

Parameter Density

8.5 / 10

The model is explicitly identified as a dense architecture with 8.03 billion total parameters. Meta provides a detailed breakdown of parameter allocation, such as the 1.05 billion parameters dedicated to the token embedding and language modeling head (about 13% of the total). There is no ambiguity regarding active vs. total parameters, as it is not a Mixture-of-Experts (MoE) model. The impact of the large vocabulary on parameter density is clearly explained in technical documentation.
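The parameter total can be reproduced from the architecture figures on this page. The sketch below is a back-of-the-envelope count, not an official breakdown. It assumes an FFN intermediate size of 14,336 (not listed in the specification table above), untied input and output embeddings, and no bias terms; the function name is hypothetical.

```python
def count_llama3_8b_params(
    vocab_size: int = 128_256,
    hidden: int = 4096,
    n_layers: int = 32,
    n_heads: int = 32,
    n_kv_heads: int = 8,
    intermediate: int = 14_336,  # assumption: not stated on this page
) -> dict:
    """Count parameters for a Llama-style dense decoder without biases."""
    head_dim = hidden // n_heads
    kv_width = n_kv_heads * head_dim

    # Attention: Q and O are hidden x hidden, K and V are hidden x kv_width.
    attention = 2 * hidden * hidden + 2 * hidden * kv_width
    # SwiGLU feed-forward: gate, up, and down projections.
    ffn = 3 * hidden * intermediate
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden
    per_layer = attention + ffn + norms

    embedding = vocab_size * hidden
    lm_head = vocab_size * hidden  # assumption: not tied to the embedding
    final_norm = hidden

    total = n_layers * per_layer + embedding + lm_head + final_norm
    return {
        "per_layer": per_layer,
        "embedding_plus_head": embedding + lm_head,
        "total": total,
    }


if __name__ == "__main__":
    counts = count_llama3_8b_params()
    share = counts["embedding_plus_head"] / counts["total"]
    print(f"total parameters: {counts['total']:,}")
    print(f"embedding + head: {counts['embedding_plus_head']:,} ({share:.1%})")
```

Under these assumptions the total is 8,030,261,248 parameters, of which the embedding and output head account for 1,050,673,152, roughly 13%.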

Training Compute

7.5 / 10

Meta provides specific details regarding the compute resources used for Llama 3 8B. Pre-training utilized approximately 1.3 million GPU hours on NVIDIA H100-80GB hardware. The report includes estimated carbon emissions (390 tCO2eq) and power consumption metrics. While the exact cost is not stated, the hardware specifications and duration allow for accurate third-party estimation. The use of custom training libraries and the Research SuperCluster (RSC) infrastructure is also documented.
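As an illustration of the third-party estimation mentioned above, the sketch below converts the reported GPU hours into energy and derives the carbon intensity implied by the reported emissions. The 700 W per-GPU power draw is an assumption (the H100 TDP figure) and ignores datacenter overhead; the function names are hypothetical.

```python
REPORTED_GPU_HOURS = 1.3e6      # from the text above
REPORTED_TCO2EQ = 390.0         # from the text above
ASSUMED_GPU_POWER_WATTS = 700   # assumption: per-GPU draw at TDP


def energy_mwh(gpu_hours: float, watts_per_gpu: float) -> float:
    """Energy in megawatt-hours for a given number of GPU hours."""
    return gpu_hours * watts_per_gpu / 1e6


def implied_intensity_kg_per_kwh(tco2eq: float, mwh: float) -> float:
    """Carbon intensity (kg CO2eq per kWh) implied by emissions and energy."""
    return (tco2eq * 1000) / (mwh * 1000)


if __name__ == "__main__":
    mwh = energy_mwh(REPORTED_GPU_HOURS, ASSUMED_GPU_POWER_WATTS)
    intensity = implied_intensity_kg_per_kwh(REPORTED_TCO2EQ, mwh)
    print(f"estimated energy: {mwh:.0f} MWh")
    print(f"implied intensity: {intensity:.3f} kg CO2eq/kWh")
```

A cost estimate follows the same pattern: multiply the GPU hours by whatever hourly rate is assumed for the hardware.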

Benchmark Reproducibility

4.0 / 10

Meta reports scores across standard benchmarks (MMLU, GSM8K, HumanEval) and provides some evaluation details, such as the number of shots and prompt styles (e.g., 8-shot CoT for GSM8K). However, the full evaluation code and exact prompt templates were not initially released in a centralized, easily reproducible format. Independent researchers have noted difficulties in matching reported scores exactly due to subtle differences in prompting and parsing, though Meta has since released some 'eval_details' on GitHub to mitigate this.

Identity Consistency

9.0 / 10

The instruction-tuned variant of Llama 3 8B demonstrates high identity consistency, correctly identifying itself as a model trained by Meta. It maintains a clear versioning identity (Llama 3 vs 3.1) and is transparent about its status as an AI. The model's self-recognition capabilities are documented as an emergent behavior reinforced during the RLHF and alignment phases.

Downstream

18.5 / 30

License Clarity

6.0 / 10

The model is released under the 'Meta Llama 3 Community License Agreement.' While the license is public and allows for commercial use, it is not a standard OSI-approved open-source license. It contains significant restrictions, including a requirement for a separate license if the user has more than 700 million monthly active users and a non-compete clause regarding the use of Llama to improve other models. These custom terms create legal complexity compared to Apache 2.0 or MIT licenses.

Hardware Footprint

7.5 / 10

VRAM requirements are well-documented by both Meta and the community. For the standard BF16 precision, the model requires approximately 15-16GB of VRAM, while 4-bit quantized versions (Q4_K_M) are documented to run on ~5-6GB. Meta provides guidance on context length memory scaling (8k default) and the impact of GQA on inference efficiency. Quantization tradeoffs are widely discussed in community documentation and official model cards.
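The figures above can be approximated from first principles. The sketch below estimates weight memory and KV-cache memory separately, using the layer count, key-value head count, and head dimension listed on this page. It is a rough lower bound that ignores activations and framework overhead, and the 4.8 bits-per-weight figure used for the quantized case is an illustrative assumption, not a documented value. Function names are hypothetical.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Memory needed to hold the model weights."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(
    context_tokens: int,
    n_layers: int = 32,
    n_kv_heads: int = 8,
    head_dim: int = 128,
    bytes_per_value: int = 2,  # 16-bit cache entries
) -> int:
    """Memory for cached keys and values of one sequence.

    The leading factor of 2 covers keys plus values.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


if __name__ == "__main__":
    n_params = 8.03e9
    gib = 1024 ** 3
    for label, bits in [("BF16", 16), ("~4-bit (assumed 4.8 bpw)", 4.8)]:
        weights = weight_bytes(n_params, bits)
        cache = kv_cache_bytes(8192)
        print(f"{label}: weights {weights / gib:.1f} GiB, "
              f"KV cache at 8k {cache / gib:.1f} GiB")
```

With grouped-query attention the cache costs 128 KiB per token, so a full 8,192-token context adds exactly 1 GiB on top of the weights.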

Versioning Drift

5.0 / 10

Meta uses a versioning system (Llama 3, 3.1, 3.2) and maintains a basic changelog on GitHub. However, the transition from Llama 3 to 3.1 involved significant changes in capabilities (e.g., context window expansion to 128k) that were not always clearly communicated as distinct from the base 8B model in early marketing. There have been reports of behavioral drift in instruction-following performance across minor weight updates, with limited public documentation on the specific delta between these iterations.
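The category subtotals and the overall score in this report are simple sums of the per-criterion scores. The short check below reproduces them from the values listed above; the dictionary layout is just one convenient way to hold the numbers.

```python
# Per-criterion scores as listed in this report (each out of 10).
SCORES = {
    "Upstream": {
        "Architectural Provenance": 8.0,
        "Dataset Composition": 4.5,
        "Tokenizer Integrity": 9.0,
    },
    "Model": {
        "Parameter Density": 8.5,
        "Training Compute": 7.5,
        "Benchmark Reproducibility": 4.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 6.0,
        "Hardware Footprint": 7.5,
        "Versioning Drift": 5.0,
    },
}


def subtotals(scores: dict) -> dict:
    """Sum the criterion scores within each category."""
    return {category: sum(items.values()) for category, items in scores.items()}


if __name__ == "__main__":
    totals = subtotals(SCORES)
    for category, value in totals.items():
        maximum = 10 * len(SCORES[category])
        print(f"{category}: {value} / {maximum}")
    print(f"Total: {sum(totals.values())} / 100")
```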
