Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.) - Introduces the Transformer architecture and its specific learning rate schedule, which combines a linear warmup phase with inverse square root decay.
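The schedule from this paper sets lr = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)): the rate rises linearly for the first warmup_steps steps, then decays with the inverse square root of the step number. A minimal sketch in plain Python (the function name and defaults here are illustrative; d_model = 512 and warmup_steps = 4000 match the paper's base model):

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Transformer schedule: linear warmup, then inverse-square-root decay.

    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    """
    step = max(step, 1)  # the schedule is defined for step >= 1
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The two branches of the `min` cross exactly at `step == warmup_steps`, which is where the learning rate peaks.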
SGDR: Stochastic Gradient Descent with Warm Restarts, Ilya Loshchilov, Frank Hutter, 2017, International Conference on Learning Representations - Proposes cosine annealing as a learning rate scheduling strategy, which is widely used for its smooth decay profile.
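Within one restart cycle, SGDR anneals the rate as eta_t = eta_min + (1/2)(eta_max − eta_min)(1 + cos(pi · T_cur / T_i)), where T_cur counts epochs since the last restart and T_i is the cycle length. A minimal sketch, with illustrative names and default bounds:

```python
import math

def cosine_annealing_lr(t_cur: float, t_i: float,
                        eta_min: float = 0.0, eta_max: float = 0.1) -> float:
    """SGDR cosine annealing within one restart cycle:

    eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t_cur / t_i))
    """
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

The rate starts at `eta_max` when `t_cur == 0` and decays smoothly to `eta_min` at `t_cur == t_i`; a warm restart simply resets `t_cur` to 0.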
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - Covers foundational optimization algorithms and learning rate strategies, providing theoretical context for scheduling techniques.
tf.keras.optimizers.schedules, TensorFlow Developers, 2024 (TensorFlow) - Official documentation for implementing learning rate schedules within the TensorFlow and Keras framework.