Training large language models over extended periods frequently runs into unexpected difficulties. Runs that appeared stable can abruptly diverge, producing invalid values such as NaN
(Not a Number) losses or sudden, sharp spikes in the loss curve. These events halt progress and waste substantial computational resources.
This chapter concentrates on addressing these training challenges. You will learn to recognize common indicators of instability by monitoring key metrics such as the loss value and the gradient norm (||∇L||). We will cover methods for diagnosing the underlying causes of issues such as loss spikes and numerical precision errors, especially when training in mixed-precision formats (FP16 or BF16). We will also revisit stabilization techniques, including gradient clipping, learning rate adjustment, and warmup schedules, and consider how architectural choices affect training stability. By the end of this chapter, you will be better equipped to anticipate, diagnose, and resolve instabilities during large-scale training runs.
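To make the monitoring and clipping ideas concrete, here is a minimal sketch of a single training step that logs the loss and gradient norm and flags obvious signs of instability. It is illustrative rather than a reference implementation: it assumes a PyTorch setup in which `model`, `optimizer`, and `batch` are supplied by you, the model returns an object with a `.loss` attribute (as Hugging Face-style models do), and the clipping threshold and spike heuristic are placeholder values.

```python
import math
import torch

def training_step(model, optimizer, batch, max_grad_norm=1.0):
    """One training step with basic instability monitoring (illustrative sketch)."""
    optimizer.zero_grad(set_to_none=True)

    # Assumes a model whose forward pass returns an object with a .loss attribute.
    loss = model(**batch).loss
    loss.backward()

    # clip_grad_norm_ returns the pre-clip global norm ||∇L||, which doubles
    # as a cheap health metric worth logging at every step.
    grad_norm = float(
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    )

    # Basic checks: a non-finite loss or an unusually large gradient norm
    # (threshold here is an arbitrary illustrative heuristic).
    if not math.isfinite(loss.item()):
        raise RuntimeError(f"Non-finite loss detected: {loss.item()}")
    if grad_norm > 10 * max_grad_norm:
        print(f"Warning: large gradient norm {grad_norm:.2f}; possible loss spike ahead")

    optimizer.step()
    return loss.item(), grad_norm
```

In practice you would record the returned loss and gradient norm to your experiment tracker and pair this step with a learning rate schedule that includes warmup; both practices are examined in the sections below.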
24.1 Common Symptoms of Instability
24.2 Monitoring Training Metrics (Loss, Grad Norm)
24.3 Diagnosing Loss Spikes
24.4 Debugging Numerical Precision Issues
24.5 Stabilization Techniques Revisited (Clipping, LR, Warmup)
24.6 Impact of Architectural Choices on Stability