Training large language models pushes the boundaries of computation, making specialized hardware not just beneficial but often essential. While various accelerators exist, NVIDIA GPUs have become the workhorse for most large-scale AI training efforts due to their mature software ecosystem (CUDA) and hardware designed specifically for deep learning workloads. Understanding the architectural features of these GPUs, particularly recent generations like Ampere and Hopper, is important for anyone planning or executing LLM training.
At the heart of NVIDIA GPUs are thousands of processing units called CUDA cores, enabling the massive parallelism suited to matrix multiplications and other operations common in neural networks. However, the real acceleration for deep learning comes from specialized units called Tensor Cores. Introduced in the Volta architecture and significantly enhanced in subsequent generations, Tensor Cores perform mixed-precision matrix multiply-accumulate operations at much higher throughput than standard CUDA cores operating in FP32.
The NVIDIA Ampere architecture, exemplified by the A100 GPU, represented a substantial leap forward for AI training when introduced. It brought several improvements directly benefiting large model training:
Third-Generation Tensor Cores: Ampere Tensor Cores expanded the range of supported precisions. Crucially, they introduced TensorFloat-32 (TF32). TF32 uses the same 10-bit mantissa as FP16 but retains the 8-bit exponent of FP32, striking a balance between the speed and memory benefits of lower precision and the numerical range of FP32. This often delivers near-FP32 accuracy with significantly higher throughput (up to 8x theoretical speedup for matrix math compared to standard FP32 on the same GPU) without requiring explicit code changes in many frameworks. Ampere also improved performance for FP16, BF16 (BFloat16, which offers a wider dynamic range than FP16 and helps training stability), and INT8/INT4 for inference acceleration. These Tensor Cores also introduced fine-grained structured sparsity, which can potentially double throughput when parts of the weight matrices are pruned in a specific 2:4 pattern. In PyTorch, you can check and control TF32 usage as follows:
import torch

# Inspect whether TF32 is enabled for matmuls and cuDNN convolutions.
# Recent PyTorch releases (1.12 and later) disable TF32 for matmul by
# default, while keeping it enabled for cuDNN convolutions.
print(f"TF32 enabled for matmul: {torch.backends.cuda.matmul.allow_tf32}")
print(f"TF32 enabled for cuDNN:  {torch.backends.cudnn.allow_tf32}")

# Explicitly enable TF32 for matmul on Ampere or later GPUs
torch.backends.cuda.matmul.allow_tf32 = True
# ...or disable it to force full-precision FP32 matmuls
# torch.backends.cuda.matmul.allow_tf32 = False
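Ampere's BF16 support is typically used through automatic mixed precision. The following is a minimal sketch of a BF16 training step with torch.autocast; the linear layer, batch size, and loss are placeholders chosen purely for illustration.

import torch
import torch.nn as nn

# Minimal BF16 mixed-precision training step (assumes an Ampere-or-later GPU).
# The model and data here are placeholders for illustration only.
device = "cuda"
model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
inputs = torch.randn(32, 1024, device=device)
targets = torch.randn(32, 1024, device=device)

optimizer.zero_grad()
# Matmuls inside this context run in BF16 on Tensor Cores,
# while parameters and the loss remain in FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = nn.functional.mse_loss(outputs, targets)
loss.backward()
optimizer.step()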
Increased HBM2e Memory: Large models require vast amounts of memory to store parameters, activations, optimizer states, and intermediate gradients. The A100 offered configurations with 40GB or 80GB of High Bandwidth Memory (HBM2e), providing significantly more capacity and bandwidth (up to ~2 TB/s) compared to previous generations. This larger memory footprint directly enables the training of bigger models or the use of larger batch sizes without running into out-of-memory errors.
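To make this memory pressure concrete, the rough sketch below estimates only the static state (parameters, gradients, and Adam optimizer moments) under a common mixed-precision setup of roughly 16 bytes per parameter; activation memory, which grows with batch size and sequence length, comes on top of this.

# Back-of-the-envelope memory estimate for mixed-precision Adam training:
#   2 bytes FP16/BF16 weights + 2 bytes gradients
#   4 bytes FP32 master weights + 4 bytes Adam m + 4 bytes Adam v
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # ~16 bytes per parameter

def static_training_memory_gb(num_params: float) -> float:
    """Memory for parameters, gradients, and optimizer states (no activations)."""
    return num_params * BYTES_PER_PARAM / 1e9

for billions in (7, 13, 70):
    gb = static_training_memory_gb(billions * 1e9)
    print(f"{billions}B parameters -> ~{gb:,.0f} GB before activations")

Even a 7B-parameter model needs roughly 112 GB of such state, more than a single 80 GB A100, which is why optimizer-state sharding and model parallelism are so widely used.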
Third-Generation NVLink: Training models too large for a single GPU's memory necessitates distributing the model and computation across multiple GPUs. The speed of communication between these GPUs becomes a bottleneck. Ampere featured third-generation NVLink, providing much higher GPU-to-GPU direct bandwidth (e.g., 600 GB/s total bandwidth per A100) compared to standard PCIe lanes. This is fundamental for efficient implementation of model parallelism (Tensor Parallelism, Pipeline Parallelism) and reducing communication overhead in data parallelism.
Simplified view of NVLink providing high-bandwidth direct connections between GPUs in a server node, compared to slower PCIe connections to the host CPU.
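A quick way to check whether the GPUs in a node can reach each other directly (over NVLink or PCIe peer-to-peer) is to query peer access from PyTorch, as in the sketch below; the actual link topology, including which pairs are connected by NVLink, can be inspected with nvidia-smi topo -m.

import torch

# Report direct GPU-to-GPU (peer-to-peer) access between all device pairs.
# Peer access alone does not prove NVLink is present (it also works over
# PCIe); `nvidia-smi topo -m` shows the physical links.
num_gpus = torch.cuda.device_count()
for src in range(num_gpus):
    print(f"GPU {src}: {torch.cuda.get_device_properties(src).name}")
    for dst in range(num_gpus):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"  peer access {src} -> {dst}: {ok}")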
Multi-Instance GPU (MIG): While less directly used for single, massive training runs, MIG allows partitioning a single A100 into up to seven independent GPU instances, each with its own memory, cache, and compute cores. This is useful for maximizing utilization by running smaller inference workloads or development tasks in parallel.
The Hopper architecture, typified by the H100 GPU, builds upon Ampere with features explicitly targeting the demands of enormous models like GPT-4 scale Transformers.
Fourth-Generation Tensor Cores and Transformer Engine: Hopper introduces support for a new 8-bit floating-point format (FP8) with two variants: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). FP8 offers double the throughput and half the memory footprint of FP16/BF16. Critically, Hopper includes the Transformer Engine, which uses software and hardware heuristics to dynamically analyze layer statistics during training and decide whether to run specific matrix multiplications within Transformer layers in FP8 or FP16/BF16, while keeping accumulations in higher precision to preserve accuracy. This aims to deliver the speed and memory benefits of FP8 without extensive manual tuning or loss of model accuracy, which is particularly advantageous for multi-trillion parameter models.
Flow showing the Transformer Engine analyzing layer statistics to select the optimal precision (FP8 or higher) for Hopper's Tensor Cores.
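In practice, FP8 training on Hopper is usually expressed through NVIDIA's Transformer Engine library. The sketch below shows the typical pattern with transformer_engine.pytorch, assuming the library is installed and an FP8-capable GPU such as an H100 is available; the layer sizes are illustrative.

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# A DelayedScaling recipe tracks tensor statistics so the engine can choose
# FP8 scaling factors; the HYBRID format uses E4M3 in the forward pass and
# E5M2 for gradients in the backward pass.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
inputs = torch.randn(8, 4096, device="cuda")

# Matrix multiplications inside this context run in FP8 where the recipe
# allows, with accumulation kept in higher precision.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    outputs = layer(inputs)
outputs.sum().backward()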
HBM3 Memory: The H100 utilizes HBM3 memory, pushing capacity (typically 80GB) and significantly increasing bandwidth (up to ~3.35 TB/s) compared to the A100's HBM2e. This further alleviates memory bottlenecks, allowing for even larger models, activations, or training data batches to reside directly on the GPU.
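As a rough way to see how close a particular GPU gets to its quoted bandwidth, the sketch below times a large on-device copy with CUDA events. It is a crude micro-benchmark for illustration, not a rigorous measurement, and real training kernels rarely reach the theoretical peak.

import torch

# Crude bandwidth probe: time a large device-to-device tensor copy.
# The copy reads and writes the tensor, so effective bandwidth ~ 2*bytes/time.
n_bytes = 4 * 1024**3  # 4 GiB source tensor
src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
dst = torch.empty_like(src)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
dst.copy_(src)  # warm-up
torch.cuda.synchronize()
start.record()
dst.copy_(src)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1000.0  # elapsed_time returns milliseconds
print(f"Effective copy bandwidth: {2 * n_bytes / seconds / 1e12:.2f} TB/s")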
Fourth-Generation NVLink and NVLink Switch System: Hopper increases the bandwidth of its direct GPU-to-GPU NVLink connections (up to 900 GB/s total bandwidth per H100). Perhaps more significantly, NVIDIA introduced the NVLink Switch System. This allows connecting up to 256 H100 GPUs within a specialized "NVLink Domain," providing all-to-all high-bandwidth communication directly between GPUs without needing to traverse slower networking fabrics like InfiniBand for certain communication patterns within the domain. This is designed to drastically improve scaling efficiency for extremely large model training runs that span many nodes.
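This bandwidth is consumed by collectives such as all-reduce during training. The sketch below times one large all-reduce with PyTorch's NCCL backend; it is meant to be launched with torchrun --nproc_per_node=<num_gpus>, and the script name and tensor size are arbitrary choices for illustration.

import os
import torch
import torch.distributed as dist

# Minimal all-reduce timing sketch. Launch with, for example:
#   torchrun --nproc_per_node=8 allreduce_bench.py
# NCCL routes the collective over NVLink where it is available.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

tensor = torch.randn(256 * 1024 * 1024, device="cuda")  # ~1 GiB of FP32

dist.all_reduce(tensor)  # warm-up
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
dist.all_reduce(tensor)
end.record()
torch.cuda.synchronize()

if dist.get_rank() == 0:
    print(f"All-reduce of ~1 GiB took {start.elapsed_time(end):.1f} ms")
dist.destroy_process_group()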
DPX Instructions: Hopper includes new instructions designed to accelerate algorithms involving dynamic programming. While potentially applicable to areas like sequence alignment or bioinformatics, their direct impact on standard Transformer training might be less pronounced than the improvements in Tensor Cores or memory bandwidth.
Moving from Ampere (A100) to Hopper (H100) brings tangible benefits for LLM training:
Feature | Ampere (A100) | Hopper (H100) | Significance for LLMs |
---|---|---|---|
Tensor Core Gen | 3rd | 4th | Higher throughput, new precision support. |
Precision | TF32, FP16, BF16 | FP8 (via Transformer Engine), TF32, FP16, BF16 | FP8 dramatically increases speed & reduces memory. |
Memory Type | HBM2e | HBM3 | Higher bandwidth & capacity supports larger models. |
Max Memory | 80 GB | 80 GB (SXM variant) | Accommodates larger states and activations. |
Memory Bandwidth | ~2.0 TB/s | ~3.35 TB/s | Faster data access, less memory-bound execution. |
NVLink Gen | 3rd (600 GB/s) | 4th (900 GB/s) + Switch System | Faster inter-GPU communication, better large-scale scaling. |
Approximate theoretical peak throughput for Tensor Core operations highlights the significant performance gains, especially with FP8 on Hopper. Actual performance varies by workload.
While Hopper offers superior performance, Ampere GPUs remain powerful and widely used. The choice between them, or other potential hardware, involves balancing performance needs against budget constraints and hardware availability. Understanding these architectural differences helps in selecting the appropriate hardware and configuring training jobs to leverage their specific strengths effectively.