Even with effective sharding strategies, the memory required to train large language models often exceeds the capacity of high-end GPUs. Parameter sharding distributes the model weights, but the transient memory consumed by activations grows with batch size and sequence length, and gradients and optimizer states add a further footprint proportional to the parameter count. This chapter focuses on reducing that footprint to maximize throughput and enable the training of larger architectures.
We begin by analyzing mixed precision training. Moving from FP32 to BFloat16 halves the memory used for parameters and activations and engages specialized tensor cores for faster matrix multiplications. You will learn to configure MixedPrecision policies in FSDP that maintain numerical stability while accelerating training.
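As a preview, here is a minimal sketch of a BFloat16 policy with PyTorch FSDP. It assumes the distributed process group is already initialized, and `build_model()` is a placeholder for your own model constructor.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# BFloat16 for compute, gradient reduction, and buffers.
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,   # parameters are all-gathered and used in bf16
    reduce_dtype=torch.bfloat16,  # gradients are reduce-scattered in bf16
    buffer_dtype=torch.bfloat16,  # buffers such as normalization statistics
)

# build_model() is a placeholder for your model construction code.
model = FSDP(build_model(), mixed_precision=bf16_policy)
```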
Following this, we examine activation checkpointing. Instead of storing every intermediate activation for the backward pass, this technique stores only a subset and recomputes the rest on demand. This trades compute for memory, typically reducing activation memory from O(N) to O(√N) in the number of layers, depending on the checkpointing granularity, and lets you fit significantly deeper models on existing hardware. We also cover CPU offloading, a strategy that uses system RAM to hold optimizer states and parameters when GPU memory is exhausted. By the end of this chapter, you will be able to combine these techniques to optimize the memory-compute trade-off for terabyte-scale training runs.
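The sketch below combines both ideas in a toy example, assuming a recent PyTorch release and an already-initialized process group: each block runs under non-reentrant activation checkpointing, and FSDP's CPUOffload keeps sharded parameters in host RAM between uses. The linear-layer stack is only a stand-in for real transformer blocks.

```python
import torch
from torch.utils.checkpoint import checkpoint
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

class CheckpointedStack(torch.nn.Module):
    """Runs each block under activation checkpointing: only the block's
    inputs are saved, and its internal activations are recomputed on backward."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            x = checkpoint(block, x, use_reentrant=False)
        return x

# Toy stack of linear layers standing in for transformer blocks.
stack = CheckpointedStack([torch.nn.Linear(1024, 1024) for _ in range(8)])

# Wrap with FSDP and spill sharded parameters to host RAM between uses.
model = FSDP(stack, cpu_offload=CPUOffload(offload_params=True))
```

In practice you would apply the checkpoint wrapper per transformer block so that only block boundaries remain resident, and measure whether the added recomputation is paid back by the larger batch size the freed memory allows.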
3.1 BFloat16 vs Float16 Configurations
3.2 Activation Checkpointing Mechanics
3.3 CPU Offloading Implementation
3.4 Gradient Accumulation with Sharding
3.5 Practice: Tuning Memory Constraints