
Llama 3 70B

Parameters

70B

Context Length

8,192 tokens

Modality

Text

Architecture

Dense

License

Meta Llama 3 Community License

Release Date

18 Apr 2024

Knowledge Cutoff

Dec 2023

Technical Specifications

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

8192

Number of Layers

80

Attention Heads

64

Key-Value Heads

8

Activation Function

SwiGLU

Normalization

RMSNorm

Position Embedding

RoPE

Llama 3 70B

Meta Llama 3 70B is a 70-billion-parameter, decoder-only transformer language model developed by Meta. Released in April 2024, it is provided in both pre-trained and instruction-fine-tuned variants. The instruction-tuned model is specifically optimized for dialogue and assistant-style interactions, supporting a wide array of natural language understanding and generation tasks. These include conversational AI applications, creative content generation, code generation, text summarization, classification, and complex reasoning challenges. The model is made available for both commercial and research applications under the Meta Llama 3 Community License.

Architecturally, Llama 3 70B employs a standard decoder-only transformer design. A notable change from Llama 2 is its tokenizer, which features a vocabulary of 128,256 tokens, improving language encoding efficiency and inference throughput. To further improve inference scalability and speed, the model integrates Grouped Query Attention (GQA), applied across both the 8B and 70B parameter versions of Llama 3. Initial training was conducted on sequences of up to 8,192 tokens. For the instruction-tuned variants, supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) were used to align model outputs with human preferences for helpfulness and safety.
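The GQA mechanism described above shrinks the key/value side of attention: Llama 3 70B's 64 query heads share only 8 KV heads, so each KV head serves a group of 8 query heads and the KV cache is reduced by the same factor. A minimal NumPy sketch with toy dimensions (illustrative only, not Meta's implementation):

```python
import numpy as np

def gqa(q, k, v):
    """Grouped-query attention sketch.
    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d), with n_q_heads a
    multiple of n_kv_heads. In Llama 3 70B: 64 query heads, 8 KV heads."""
    group = q.shape[0] // k.shape[0]      # query heads per KV head
    k = np.repeat(k, group, axis=0)       # share each KV head across its group
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    seq = q.shape[1]
    causal = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(causal, -np.inf, scores)     # causal masking
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over keys
    return weights @ v

# Toy shapes: 4 query heads, 2 KV heads, sequence length 3, head dim 8
rng = np.random.default_rng(0)
out = gqa(rng.normal(size=(4, 3, 8)),
          rng.normal(size=(2, 3, 8)),
          rng.normal(size=(2, 3, 8)))
print(out.shape)  # (4, 3, 8)
```

The `np.repeat` makes the sharing explicit; production kernels avoid materializing the repeated tensors and instead index the smaller KV cache directly.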

The Llama 3 70B model is engineered for general-purpose applications, serving as a foundational technology that can be further adapted for domain-specific tasks. Its capabilities extend to powering advanced assistant functionalities, as demonstrated by its integration into Meta AI applications across various platforms. The model's design focuses on enabling developers to build diverse generative AI applications, from complex coding assistants to long-form text summarization tools, while offering control and flexibility in deployment environments, including on-premise, cloud, and local setups.

About Llama 3

Meta's Llama 3 is a series of large language models utilizing a decoder-only transformer architecture. It incorporates a 128K token vocabulary and Grouped Query Attention for efficient processing. Models are trained on substantial public datasets, supporting various parameter scales and extended context lengths.



Evaluation Benchmarks

Rank

#62

Benchmark

WebDev Arena (Web Development)

Score

1276

Rank

#46

Rankings

Overall Rank

#62

Coding Rank

#63

Model Transparency

Total Score

B+

70 / 100

Llama 3 70B Transparency Report


Audit Note

Llama 3 70B exhibits strong transparency in its architectural foundations, compute resources, and technical specifications like tokenization. However, it maintains significant opacity regarding the specific composition of its 15-trillion-token training set and utilizes a restrictive custom license that falls short of true open-source standards. While reproducibility is supported by public weights, the lack of a comprehensive technical paper at launch and reliance on internal evaluation frameworks create gaps in verifiable benchmarking.

Upstream

20.0 / 30

Architectural Provenance

7.5 / 10

Meta provides a clear architectural description of Llama 3 70B as a decoder-only transformer. Key technical details such as the use of Grouped Query Attention (GQA) across all model sizes and the increase in context length to 8,192 tokens are well-documented. The transition to a 128k-token vocabulary tokenizer is also explicitly detailed. However, while the high-level methodology for pre-training and instruction-tuning (SFT, RLHF) is described, the specific architectural hyperparameters (number of layers, attention heads, etc.) are primarily found in the model code rather than a centralized peer-reviewed technical paper, which was not released at the time of the initial 70B launch.

Dataset Composition

3.5 / 10

Meta discloses that the model was trained on over 15 trillion tokens from 'publicly available sources.' While they provide a high-level breakdown (e.g., 5% of the pre-training data is non-English, covering 30+ languages), they do not disclose the specific sources, websites, or datasets used. The methodology for filtering and cleaning is described in general terms (heuristic filters, NSFW filters, semantic deduplication), but the lack of a detailed composition breakdown or access to sample data prevents full verification of the training distribution.

Tokenizer Integrity

9.0 / 10

The tokenizer for Llama 3 70B is publicly available via the official GitHub repository and Hugging Face. Meta has provided clear documentation on the shift to a Tiktoken-based BPE tokenizer with a 128,256-token vocabulary, significantly larger than Llama 2's 32,000. The tokenizer's efficiency across different languages is documented, and the vocabulary is fully inspectable by the public, allowing for verification of claimed language support and tokenization behavior.

Model

29.5 / 40

Parameter Density

8.0 / 10

The model is explicitly defined as a dense transformer with 70 billion parameters. Unlike MoE models where active parameters are often obscured, Llama 3 70B's dense nature means all parameters are active during inference. The parameter count is consistent across all official documentation and third-party implementations. While a precise breakdown of parameters between attention and FFN layers is not in the primary marketing materials, it is easily verifiable through the public model weights and configuration files.
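That breakdown is indeed easy to sanity-check from the published hyperparameters. A back-of-the-envelope count using the figures in the spec table above, plus the 28,672 FFN dimension and untied output head found in the public configuration files (assumed here):

```python
# Approximate parameter count for Llama 3 70B from its architecture
# hyperparameters. Norm and bias terms are negligible and omitted.
hidden = 8192
layers = 80
n_heads = 64
n_kv_heads = 8
head_dim = hidden // n_heads           # 128
vocab = 128_256
ffn = 28_672                           # from public config files (assumed)

kv_dim = n_kv_heads * head_dim         # 1,024 -- GQA shrinks K/V projections
attn = hidden * hidden * 2 + hidden * kv_dim * 2   # Q, O + K, V projections
mlp = hidden * ffn * 3                 # gate, up, down (SwiGLU)
embeddings = vocab * hidden * 2        # input embedding + untied output head
total = layers * (attn + mlp) + embeddings
print(f"{total / 1e9:.1f}B")           # prints 70.6B, matching the advertised 70B
```

The FFN dominates: the three SwiGLU projections account for roughly 80% of each layer's parameters, with GQA keeping the K/V projections to a small fraction of the attention block.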

Training Compute

7.0 / 10

Meta has disclosed significant details regarding the training compute for Llama 3 70B. Official model cards state that pre-training utilized approximately 6.4 million GPU hours on H100-80GB hardware. They also provide environmental impact data, estimating 1,900 tCO2eq for the 70B variant, and note that these emissions were 100% offset. While the exact cluster topology and cost are not fully detailed, the disclosure of GPU hours and hardware type is far above the industry average for transparency.
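The disclosed figures can be cross-checked with the standard training-FLOPs estimate of roughly 6 x parameters x tokens. The per-GPU peak-throughput figure below is an assumption (NVIDIA's quoted dense BF16 peak for the H100 SXM), not something in Meta's documentation:

```python
# Back-of-the-envelope check of the disclosed 6.4M H100 GPU-hours
# against ~15T training tokens, via FLOPs ~= 6 * N * D.
params = 70e9
tokens = 15e12
total_flops = 6 * params * tokens            # ~6.3e24 FLOPs
gpu_seconds = 6.4e6 * 3600
sustained = total_flops / gpu_seconds        # FLOP/s per GPU, sustained
h100_bf16_peak = 989e12                      # assumed dense BF16 peak
mfu = sustained / h100_bf16_peak
print(f"sustained ~{sustained / 1e12:.0f} TFLOP/s, MFU ~{mfu:.0%}")
# prints: sustained ~273 TFLOP/s, MFU ~28%
```

An implied utilization in the high-20s percent is a plausible figure for large-scale pre-training, which suggests the disclosed GPU-hour and token counts are mutually consistent.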

Benchmark Reproducibility

5.0 / 10

Meta reports scores on standard benchmarks (MMLU, ARC, GPQA, etc.) and has released some evaluation details in their GitHub repository. However, the initial release lacked the full evaluation code and exact prompts required for perfect reproduction. Third-party testing has shown discrepancies in scores depending on the evaluation harness used (e.g., LM Eval Harness vs. internal Meta tools). The score is further adjusted due to documented concerns regarding benchmark leakage in common web-scraped datasets used for training.

Identity Consistency

9.5 / 10

Llama 3 70B demonstrates high identity consistency. In instruction-tuned variants, the model correctly identifies itself as a model trained by Meta and is aware of its versioning. It does not typically claim to be a competitor's model. Its system prompts and training are designed to maintain a clear assistant identity without the confusion seen in many fine-tuned derivatives of other base models.

Downstream

20.5 / 30

License Clarity

6.0 / 10

The model is released under the 'Meta Llama 3 Community License.' While it allows for commercial use and redistribution, it is not a standard OSI-approved open-source license. It contains significant restrictions, most notably the requirement for a separate license if the user has more than 700 million monthly active users. It also includes a non-compete clause prohibiting the use of Llama 3 to improve other large language models, which creates legal ambiguity for certain research and development use cases.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented by both Meta and the community. The model card specifies the use of BFloat16 precision, and VRAM requirements for various quantization levels (4-bit, 8-bit) are widely available through official and third-party documentation (e.g., requiring ~40GB for 4-bit and ~140GB for FP16). Context length scaling and its impact on memory are also well-understood due to the public nature of the model weights and inference code.
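The quoted numbers follow directly from parameter count times bytes per weight. A quick weights-only estimate (ignoring KV cache and runtime overhead, which is roughly why 4-bit deployments are commonly quoted at ~40 GB rather than the raw 35 GB computed here):

```python
# Weights-only VRAM at common precisions for a 70B dense model.
PARAMS = 70e9

def weights_gb(bits_per_param):
    # bits -> bytes -> gigabytes, weights only (no KV cache / activations)
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("FP16/BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{name:10s} ~{weights_gb(bits):.0f} GB")
# prints:
# FP16/BF16  ~140 GB
# INT8       ~70 GB
# 4-bit      ~35 GB
```

KV-cache growth with context length comes on top of these figures, which is why long-context serving can need substantially more memory than the weights alone.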

Versioning Drift

6.5 / 10

Meta uses a clear versioning system (Llama 3, 3.1, 3.2, 3.3) and maintains a changelog in their official repository. However, the 70B model has seen updates where behavior changed significantly, such as the transition from Llama 3 to 3.1, which expanded the context window from 8K to 128K while initially retaining the same '70B' designation, leading to some confusion in the developer community regarding which '70B' was being referenced.
