
Llama 3.3 70B

Parameters

70B

Context Length

128K

Modality

Text

Architecture

Dense

License

Llama 3.3 Community License

Release Date

7 Dec 2024

Knowledge Cutoff

Dec 2023

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

64

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

RoPE

RoPE Theta

500,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

8,192

Number of Layers

80

FFN Intermediate Size (Dense)

28,672

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

128,256

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE · Hidden: 8.2k · Context: 128k · Vocab: 128.3k) → 80 layers of [RMSNorm (pre-attention) → Grouped-Query Attention (64 Q / 8 KV heads, head dim 128) → residual add → RMSNorm (pre-FFN) → Feed-Forward Network (SwiGLU, intermediate 28.7k) → residual add] → Final RMSNorm → Output Logits

Llama 3.3 70B

Meta Llama 3.3 70B is a large language model for text generation. It is a dense Transformer that has been instruction-tuned for dialogue, and it targets multilingual chat, code assistance, and synthetic data generation. It was pretrained on approximately 15 trillion tokens from publicly available online sources.

Llama 3.3 70B uses Grouped-Query Attention (GQA), in which its 64 query heads share 8 key-value heads, to reduce the memory and bandwidth cost of inference. Post-training combines supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align outputs with human preferences for helpfulness and safety. The context window extends to 128K tokens, which supports use cases such as long-form summarization and long multi-turn conversations.
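To make the GQA benefit concrete, the following sketch estimates the key-value cache size for one sequence from the attention figures in the specification table. It is a back-of-the-envelope calculation, not Meta's implementation: the 2-byte (FP16/BF16) cache precision and the 131,072-token context are assumptions, and the function name `kv_cache_bytes` is hypothetical.

```python
# Values taken from the specification table on this page.
NUM_LAYERS = 80
NUM_QUERY_HEADS = 64
NUM_KV_HEADS = 8
HEAD_DIM = 128

# Assumption: the cache is stored in FP16/BF16 (2 bytes per value).
BYTES_PER_VALUE = 2


def kv_cache_bytes(num_tokens: int, kv_heads: int) -> int:
    """Bytes needed to cache keys and values for a single sequence."""
    # The leading factor of 2 accounts for storing both K and V.
    per_token = 2 * NUM_LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return num_tokens * per_token


if __name__ == "__main__":
    context = 131_072  # assumption: a full 128K-token context
    gqa = kv_cache_bytes(context, NUM_KV_HEADS)
    # Hypothetical comparison: one KV head per query head (standard MHA).
    mha = kv_cache_bytes(context, NUM_QUERY_HEADS)
    print(f"GQA cache: {gqa / 2**30:.0f} GiB")
    print(f"MHA cache: {mha / 2**30:.0f} GiB")
```

Under these assumptions the cache costs 320 KiB per token, so a full-length sequence needs about 40 GiB with 8 key-value heads, versus eight times that if every query head kept its own keys and values.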

The model supports multilingual input and output in English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. It also supports tool use: developers can extend its functionality through custom function definitions and integrations with third-party services. The design emphasizes efficiency and aims to reduce hardware requirements, which makes the model accessible to a wider range of applications.

About Llama 3.3

Meta's Llama 3.3 is a 70 billion parameter, multilingual large language model. It utilizes an optimized transformer architecture, incorporating Grouped-Query Attention for enhanced inference efficiency. The model features an extended 128k token context window and is designed to support quantization, facilitating deployment on varied hardware configurations.



Evaluation Benchmarks

Rank

#91

Benchmark | Score | Rank

General Knowledge

MMLU

0.86

11

0.895

15

0.681

23

Professional Knowledge

MMLU Pro

0.70

49

Web Development

WebDev Arena

1320

52

Rankings

Overall Rank

#91

Coding Rank

#64

Model Integrity

Total Score

B

69 / 100

Llama 3.3 70B Model Integrity Report


Audit Note

Llama 3.3 70B demonstrates strong transparency in its architectural specifications, tokenizer details, and compute resource disclosure. However, it maintains significant opacity regarding the specific composition of its 15-trillion-token training dataset and relies on a restrictive custom license. While it provides a clear identity and versioning, the reproducibility of its benchmark results remains a challenge for independent verifiers.

Upstream

20.5 / 30

Architectural Provenance

7.5 / 10

Llama 3.3 70B is explicitly documented as an auto-regressive dense Transformer model. Meta provides detailed technical specifications, including the use of Grouped-Query Attention (GQA) for inference efficiency and a 128k token context window. The model's evolution from Llama 3.1 is clear: it uses similar architectural foundations with updated post-training methodologies (SFT and RLHF). While the high-level architecture is well-documented in the Llama 3 technical report and model cards, low-level modifications unique to the 3.3 variant are described as 'optimizations' rather than as fully detailed structural changes.

Dataset Composition

4.0 / 10

Meta discloses that the model was pretrained on approximately 15 trillion tokens from 'publicly available online sources' with a cutoff of December 2023. For fine-tuning, they mention using over 25 million synthetic examples and publicly available instruction datasets. However, there is no specific breakdown of the data sources (e.g., percentage of code, web, books) or detailed disclosure of the filtering and cleaning methodologies beyond general mentions of heuristic and NSFW filters. The lack of granular composition data remains a significant transparency gap.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly accessible via the official GitHub repository and Hugging Face. It features a vocabulary size of 128,256 tokens, which is a significant increase from the 32k used in Llama 2, aimed at improving multilingual efficiency. The tokenization approach is well-documented, and the vocabulary is consistent across official API and local implementations. The alignment with the claimed 8 supported languages is verifiable through the tokenizer's performance on those scripts.

Model

29.5 / 40

Parameter Density

8.5 / 10

The model is clearly defined as a dense architecture with 70.6 billion total parameters. Unlike Mixture-of-Experts (MoE) models where active parameters can be obscured, Llama 3.3 70B's dense nature means all parameters are active during inference. The parameter count is consistent across all official documentation, and the architectural breakdown (e.g., GQA implementation) is provided in technical reports.
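The stated parameter count can be cross-checked against the dimensions in the specification table. The sketch below assumes the standard Llama layout, which this page does not spell out: bias-free linear projections, separate (untied) input embedding and output projection matrices, and two RMSNorm weight vectors per layer plus one final norm. The function name `count_parameters` is hypothetical.

```python
# Values taken from the specification table on this page.
VOCAB = 128_256
HIDDEN = 8_192
LAYERS = 80
Q_HEADS = 64
KV_HEADS = 8
HEAD_DIM = 128
FFN = 28_672


def count_parameters() -> int:
    """Estimate total parameters under the assumptions in the lead-in."""
    embed = VOCAB * HIDDEN    # input token embedding
    lm_head = VOCAB * HIDDEN  # assumption: output projection is not tied

    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM  # K and V share 8 heads (GQA)
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * FFN                      # SwiGLU: gate, up, down
    norms = 2 * HIDDEN                          # pre-attention and pre-FFN

    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    final_norm = HIDDEN
    return embed + lm_head + LAYERS * per_layer + final_norm


if __name__ == "__main__":
    total = count_parameters()
    print(f"{total:,} parameters (~{total / 1e9:.1f}B)")
```

Under these assumptions the total is 70,553,706,496, which rounds to the 70.6 billion quoted above.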

Training Compute

7.0 / 10

Meta provides specific compute metrics, stating that training utilized approximately 39.3 million GPU hours on H100-80GB hardware (700W TDP). They also disclose the estimated environmental impact, citing 11,390 tons of CO2eq for the training process. While the hardware type and total hours are clear, the specific cluster configuration and exact training duration in days/months are less explicitly detailed compared to the most transparent research papers.
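The disclosed figures allow a rough energy estimate. The sketch below multiplies GPU hours by the stated TDP. It assumes every GPU drew its full 700 W for the whole run and ignores datacenter overhead, so it is an order-of-magnitude figure rather than a measured value. The function names are hypothetical.

```python
# Figures quoted in the paragraph above.
GPU_HOURS = 39.3e6
TDP_WATTS = 700
REPORTED_TONNES_CO2EQ = 11_390


def gpu_energy_kwh(gpu_hours: float, watts: float) -> float:
    """Energy in kWh, assuming each GPU runs at the given power throughout."""
    return gpu_hours * watts / 1000


def implied_intensity_kg_per_kwh(tonnes_co2eq: float, kwh: float) -> float:
    """Carbon intensity implied by the reported emissions and estimated energy."""
    return tonnes_co2eq * 1000 / kwh


if __name__ == "__main__":
    kwh = gpu_energy_kwh(GPU_HOURS, TDP_WATTS)
    intensity = implied_intensity_kg_per_kwh(REPORTED_TONNES_CO2EQ, kwh)
    print(f"Estimated GPU energy: {kwh / 1e6:.2f} GWh")
    print(f"Implied intensity: {intensity:.2f} kg CO2eq per kWh")
```

With these inputs the estimate is about 27.5 GWh of GPU energy, which together with the reported emissions implies roughly 0.41 kg CO2eq per kWh.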

Benchmark Reproducibility

5.0 / 10

Meta publishes scores for standard benchmarks like MMLU, GPQA, and HumanEval. However, the exact prompts and few-shot examples used to achieve these specific scores are not always fully disclosed in a single reproducible repository. While third-party tools like 'lm_eval' can be used to approximate these results, discrepancies between official claims and independent audits are common, and the lack of a 'one-click' reproduction script for official numbers limits transparency.

Identity Consistency

9.0 / 10

The model consistently identifies itself as a Meta Llama model and is aware of its versioning (Llama 3.3). It maintains a coherent identity across different platforms and does not typically claim to be a competitor's model. Its capabilities and limitations, such as being text-only and having a December 2023 knowledge cutoff, are clearly stated in the model card and reflected in its behavior.

Downstream

18.5 / 30

License Clarity

6.0 / 10

The model uses the 'Llama 3.3 Community License,' which is a custom license rather than a standard OSI-approved open-source license like Apache 2.0. While it allows for commercial use and derivative works, it includes a significant restriction: companies with over 700 million monthly active users must request a separate license from Meta. This 'open weights' but not 'open source' distinction is clearly stated but introduces legal complexity for large-scale users.

Hardware Footprint

7.5 / 10

VRAM requirements are well-documented by both Meta and the community. For the 70B model, approximately 140GB of VRAM is required for FP16, while 4-bit quantization (INT4) reduces this to roughly 35-40GB. Meta provides guidance on using tools like bitsandbytes for quantization. However, official documentation on the specific accuracy-performance tradeoffs for various quantization levels (e.g., PPL loss per bit) is less comprehensive than community-driven benchmarks.
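The VRAM figures above follow from the parameter count and the number of bits stored per weight. The sketch below covers weights only: activations, the KV cache, and the per-block scale metadata that real quantization formats add are not modeled, which is one reason practical 4-bit footprints land above the raw figure. The 70.6 billion parameter count comes from this page, and the function name `weight_memory_gb` is hypothetical.

```python
# Parameter count as stated on this page.
PARAMS = 70.6e9

# Bits stored per weight for a few common precisions.
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(params: float, bits: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bits / 8 / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: {weight_memory_gb(PARAMS, bits):.1f} GB")
```

This gives about 141 GB for FP16 and about 35 GB for 4-bit weights, consistent with the 140GB and 35-40GB ranges quoted above.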

Versioning Drift

5.0 / 10

Meta uses a versioning system (3.1, 3.2, 3.3), but the changelogs are often high-level, focusing on 'improved reasoning' or 'better coding' rather than detailed technical diffs of weight changes or specific safety alignment shifts. There is no formal mechanism provided by Meta to access specific 'sub-versions' if weights are updated silently, and documentation on model drift over time is primarily left to third-party researchers.
