Training large language models pushes the limits of current hardware. Storing the parameters, gradients, optimizer states, and activations for models with billions or trillions of parameters often exceeds the memory capacity of a single accelerator such as a GPU or TPU. In addition, the computational cost, especially the quadratic complexity O(N²) of self-attention with respect to sequence length N, makes training times prohibitively long on one device.
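To make the memory pressure concrete, the short sketch below estimates the device memory needed just to hold the model state when training with Adam in mixed precision. The roughly 16-bytes-per-parameter breakdown and the helper function are illustrative assumptions rather than exact requirements, and activation memory comes on top of these figures.

```python
# Back-of-the-envelope estimate of the memory required to hold model state
# (no activations) when training with Adam in mixed precision.
# The 16-bytes-per-parameter breakdown is a common rule of thumb,
# assumed here purely for illustration.

def model_state_memory_gb(num_params: float) -> float:
    """Approximate memory in GB for parameters, gradients, and optimizer state."""
    bytes_per_param = (
        2    # fp16 parameters
        + 2  # fp16 gradients
        + 4  # fp32 master copy of the parameters
        + 4  # fp32 Adam first moment
        + 4  # fp32 Adam second moment
    )
    return num_params * bytes_per_param / 1e9

for billions in (7, 70, 175):
    gb = model_state_memory_gb(billions * 1e9)
    print(f"{billions:>4}B parameters -> ~{gb:,.0f} GB of model state")

# Output:
#    7B parameters -> ~112 GB of model state
#   70B parameters -> ~1,120 GB of model state
#  175B parameters -> ~2,800 GB of model state
```

Even under these simplified assumptions, the model state alone for a 70B-parameter model would need more than a dozen accelerators with 80 GB of memory each, before accounting for activations. This is the basic argument for the parallelism strategies covered in this chapter.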
To overcome these limitations, we must distribute the training process across multiple computational devices. This chapter introduces the fundamental techniques for parallelizing LLM training.
You will learn about:
15.1 Motivation: Why Distributed Training?
15.2 Data Parallelism (DP)
15.3 Tensor Parallelism (TP)
15.4 Pipeline Parallelism (PP)
15.5 Interplay and Hybrid Approaches (DP+TP, DP+PP, etc.)
15.6 Communication Overhead Analysis
Understanding these strategies is essential for effectively training state-of-the-art language models. We will examine the mechanics, benefits, and trade-offs of each approach.