Llama 3.1 8B: Specifications and GPU VRAM Requirements

Llama 3.1 8B

Open Source

Open Weights

Parameters

Context Length

131.072K

Modality

Text

Architecture

Dense

License

Llama 3.1 Community License

Release Date

23 Jul 2024

Knowledge Cutoff

Dec 2023

Technical Specifications

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

4096

Number of Layers

Attention Heads

Key-Value Heads

Activation Function

Normalization

RMS Normalization

Position Embedding

ROPE

System Requirements

VRAM requirements for different quantization methods and context sizes

Llama 3.1 8B

The Llama 3.1 8B model is a component of the Meta Llama 3.1 series, a collection of large language models developed by Meta. This model variant, featuring 8 billion parameters, is engineered to serve a range of natural language understanding and generation tasks. Its design prioritizes efficiency and responsiveness, making it suitable for deployment in environments with computational constraints. The model is optimized for dialogue applications and is designed to adhere to complex instructions, supporting its utility in conversational agents and virtual assistant systems.

Architecturally, Llama 3.1 8B is built upon an optimized transformer framework, employing a dense network configuration. A notable innovation is the integration of Grouped-Query Attention (GQA), which enhances inference scalability. The internal mechanics of the model incorporate the SiLU (Swish) activation function and RMSNorm for effective normalization across its layers. Positional encodings are managed through Rotary Position Embedding (RoPE), and the architecture leverages Flash Attention to improve processing speed. The model's training involved a substantial dataset of approximately 15 trillion tokens from publicly available sources, augmented with supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align its outputs with desired helpfulness and safety criteria. A significant enhancement in this iteration is the expanded context length, which now extends to 128,000 tokens.

Regarding its capabilities and applications, the Llama 3.1 8B model is proficient in tasks such as text summarization, text classification, and sentiment analysis, particularly in scenarios demanding low-latency inference. Its multilingual support extends to eight languages, including English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai, facilitating its application in diverse linguistic contexts. The model also supports advanced workflows, including long-form text summarization, and can be utilized in processes such as synthetic data generation and model distillation to refine smaller language models.

About Llama 3.1

Llama 3.1 is Meta's advanced large language model family, building upon Llama 3. It features an optimized decoder-only transformer architecture, available in 8B, 70B, and 405B parameter versions. Significant enhancements include an expanded 128K token context window and improved multilingual capabilities across eight languages, refined through data and post-training procedures.

Other Llama 3.1 Models

Evaluation Benchmarks

Ranking is for Local LLMs.

Rank

#58

Benchmark	Score	Rank
Professional Knowledge MMLU Pro	0.48	🥉 3
Graduate-Level QA GPQA	0.54	11
Refactoring Aider Refactoring	0.38	15
StackEval ProLLM Stack Eval	0.5	15
Coding Aider Coding	0.38	20
Summarization ProLLM Summarization	0.49	20
Coding LiveBench Coding	0.11	32
Data Analysis LiveBench Data Analysis	0.33	32
Reasoning LiveBench Reasoning	0.15	33
Mathematics LiveBench Mathematics	0.15	34
General Knowledge MMLU	0.30	38

Rankings

Overall Rank

#58

Coding Rank

#48

GPU Requirements

Full Calculator

Quantization

Choose the quantization method for model weights

Context Size: 1,024 tokens

64k

128k

VRAM Required:

Recommended GPUs

Resources

Official Documentation Release Notes Read the Paper Download Weights Source Code