By Jacob M. on Apr 6, 2025
Llama 4 includes several design updates over Llama 3 that focus on practical issues like memory usage, inference speed, and handling long sequences more reliably. These changes help make the model easier to deploy and more efficient, especially for tasks that involve long documents, large batch inference, or limited compute budgets.
The new architecture includes techniques like Grouped-Query Attention (GQA), updated Rotary Positional Embeddings (RoPE), chunked attention, and a Mixture of Experts (MoE) setup that works well with quantization. Together, these let the model adapt to longer sequences and smaller deployment footprints without a full retraining pass.
GQA offers a middle ground between Multi-Head Attention (MHA) and Multi-Query Attention (MQA). In MHA, each attention head has its own key and value projections. In MQA, all heads share a single set. GQA groups several query heads together so they share key/value projections.
The main benefit is reduced memory use during autoregressive decoding. When generating text token by token, the model caches past keys and values. In standard MHA, this cache grows quickly with long inputs or large batches. GQA cuts it down significantly, often with little to no quality drop compared to MHA: it approaches MQA's memory efficiency while preserving more of MHA's modeling capacity.
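As a rough illustration, here is a minimal grouped-query attention sketch in PyTorch. The head counts, dimensions, and the `repeat_interleave` sharing pattern are assumptions chosen for clarity, not Llama 4's actual configuration.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes, not Llama 4's real configuration.
batch, seq_len, head_dim = 2, 16, 64
n_q_heads, n_kv_heads = 8, 2              # 4 query heads share each key/value head
group_size = n_q_heads // n_kv_heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # KV cache holds n_kv_heads, not n_q_heads
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K/V so each group of query heads reads the same key/value head.
k = k.repeat_interleave(group_size, dim=1)
v = v.repeat_interleave(group_size, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = F.softmax(scores, dim=-1) @ v       # (batch, n_q_heads, seq_len, head_dim)
print(out.shape)
```

With 8 query heads but only 2 key/value heads, the cached keys and values take a quarter of the memory that full MHA would need for the same sequence.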
RoPE is used to encode token positions by rotating the query and key vectors based on their position in the sequence. Llama 4 introduces a scaled version that adapts the rotation frequencies based on context length.
This helps the model generalize better to sequences longer than what it saw during training. In practice, this means Llama 4 can handle sequences of 256K tokens or more with better stability and performance than older models. This is useful for anyone working with long documents or large inputs that previously caused attention issues.
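A rough sketch of the idea in Python: standard RoPE rotation frequencies with a simple global rescaling so that positions from a longer context map into the range seen during training. The `scale` factor and the rotate-by-halves layout are illustrative assumptions, not Meta's exact formula.

```python
import torch

def rope_frequencies(head_dim: int, base: float = 10000.0, scale: float = 1.0) -> torch.Tensor:
    # Inverse frequencies for RoPE; scale > 1 stretches the rotations so longer
    # positions fall in the range seen during training (generic rescaling, an assumption).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return inv_freq / scale

def apply_rope(x: torch.Tensor, positions: torch.Tensor, inv_freq: torch.Tensor) -> torch.Tensor:
    # Rotate channel pairs of x (shape: seq_len x head_dim) by position-dependent angles.
    angles = positions[:, None].float() * inv_freq[None, :]     # (seq_len, head_dim / 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

head_dim = 64
inv_freq = rope_frequencies(head_dim, scale=8.0)    # e.g. stretch an 8K-trained model toward 64K
queries = torch.randn(32, head_dim)                 # 32 query vectors for one attention head
print(apply_rope(queries, torch.arange(32), inv_freq).shape)   # torch.Size([32, 64])
```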
Another update is iRoPE, where the model alternates layers that use RoPE with layers that don’t have any explicit positional encoding (sometimes called NoPE). These NoPE layers rely on other tricks, like modified softmax temperatures or attention masking, to keep track of token order.
This design saves compute by not applying RoPE on every layer. It also appears to help with extremely long contexts. Llama 4 Scout, for example, is designed to handle up to 10 million tokens, and this alternating RoPE setup plays a role in making that feasible without requiring new training from scratch.
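Schematically, the layer stack might alternate as in the sketch below; the 4-layer interval and the `rope`/`nope` labels are assumptions for illustration rather than Llama 4's published layout.

```python
# Illustrative layer schedule for interleaved RoPE; the 4-layer interval is an
# assumption, not Llama 4's published configuration.
N_LAYERS = 16
NOPE_INTERVAL = 4    # every 4th layer skips explicit positional encoding

layer_types = [
    "nope" if (i + 1) % NOPE_INTERVAL == 0 else "rope"
    for i in range(N_LAYERS)
]
print(layer_types)   # ['rope', 'rope', 'rope', 'nope', 'rope', 'rope', 'rope', 'nope', ...]
```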
Instead of computing attention scores for all token pairs, which has O(n²) complexity in the sequence length n, chunked attention limits calculations to smaller windows or blocks. This brings the cost down to roughly O(n·c), where c is the chunk size, which is effectively linear once the chunk size is fixed.
Within these layers, each token attends only to nearby tokens, which is fine for many structured tasks like code or log analysis where the important context is local. Long-range dependencies are sacrificed in those layers, but the speed and memory savings are substantial.
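The sketch below builds the kind of block-diagonal causal mask this implies, where each token may only attend to earlier tokens inside its own chunk; the sequence length and chunk size are illustrative.

```python
import torch

def chunked_causal_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    # True where attention is allowed: each token sees only earlier tokens in its
    # own chunk, so the attention cost scales with seq_len * chunk_size.
    pos = torch.arange(seq_len)
    same_chunk = (pos[:, None] // chunk_size) == (pos[None, :] // chunk_size)
    causal = pos[:, None] >= pos[None, :]
    return same_chunk & causal

print(chunked_causal_mask(seq_len=12, chunk_size=4).int())
```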
Llama 4 uses a sparse Mixture of Experts setup where only a small subset of the model is active per token. In the Maverick variant, for example, each token is routed to one of 128 experts and also passes through a shared expert. This makes it possible to have a model with a large parameter count while keeping the compute per inference step low.
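A minimal top-1 routing sketch with an always-on shared expert, loosely following that description; the `SimpleMoE` class, its router, and the expert sizes are illustrative assumptions, not Llama 4's implementation.

```python
import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    # Top-1 routed experts plus an always-on shared expert (illustrative sketch,
    # not Llama 4's actual implementation).
    def __init__(self, d_model: int, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.shared = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, d_model)
        weights = self.router(x).softmax(dim=-1)
        top_w, top_idx = weights.max(dim=-1)                # one routed expert per token
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            hit = top_idx == e
            if hit.any():                                   # only the selected expert runs
                routed[hit] = top_w[hit, None] * expert(x[hit])
        return self.shared(x) + routed                      # shared path sees every token

moe = SimpleMoE(d_model=32)
print(moe(torch.randn(10, 32)).shape)    # torch.Size([10, 32])
```

Each token touches only two small feed-forward paths per layer, even though the total parameter count grows with the number of experts.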
To support this efficiently, Llama 4 uses quantization that varies across layers. Lower precision formats like INT4 or FP8 are used for heavy expert layers, while more sensitive layers keep higher precision like FP16 or BF16. This helps the model run on a single high-end GPU like an H100, without losing accuracy where it matters.
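One simple way to express such a per-layer precision plan is a mapping from layer names to formats, as in the sketch below; the layer names and format assignments are assumptions, not Llama 4's released quantization recipe.

```python
# Illustrative per-layer precision plan; layer names and format choices are
# assumptions, not Llama 4's released quantization recipe.
precision_plan = {
    "embed_tokens":      "bf16",   # embeddings are sensitive; keep higher precision
    "attention":         "bf16",
    "moe.router":        "bf16",   # routing logits need accuracy
    "moe.shared_expert": "fp8",
    "moe.experts":       "int4",   # bulk of the parameters; quantize aggressively
    "lm_head":           "bf16",
}

def dtype_for(layer_name: str) -> str:
    # Return the first matching prefix's format, defaulting to bf16.
    for prefix, fmt in precision_plan.items():
        if layer_name.startswith(prefix):
            return fmt
    return "bf16"

print(dtype_for("moe.experts.17.w1"))   # int4
```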
Llama 4 focuses on making large models more usable and practical for long-context tasks. Grouped-Query Attention and chunked attention reduce the memory and compute cost. Scaled and interleaved RoPE improve how the model handles long sequences. The sparse Mixture of Experts combined with mixed-precision quantization allows models with billions of parameters to be deployed on limited hardware, without major compromises.