Long Short-Term Memory, Sepp Hochreiter, Jürgen Schmidhuber, 1997. Neural Computation, Vol. 9 (MIT Press). DOI: 10.1162/neco.1997.9.8.1735 - The original paper introducing Long Short-Term Memory (LSTM) networks, which effectively mitigate the vanishing gradient problem through their gating mechanisms.
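To make the gating mechanism concrete, here is a minimal NumPy sketch of a single LSTM step. The function name `lstm_step` and the stacked parameter layout (`W`, `U`, `b` holding all four gates) are illustrative choices, not notation from the paper; the key point is the additive cell-state update, which is what lets gradients flow across many time steps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (hypothetical helper, parameters stacked for
    the forget (f), input (i), output (o) gates and candidate (g)."""
    z = W @ x_t + U @ h_prev + b          # shape (4 * hidden_size,)
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)
    # Additive update: gradients through c_t are gated, not repeatedly
    # squashed, which mitigates vanishing gradients over long sequences.
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```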
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - A comprehensive textbook covering the theoretical foundations and practical aspects of deep learning, including detailed explanations of recurrent neural networks and the vanishing/exploding gradient problems.
On the difficulty of training recurrent neural networks, Razvan Pascanu, Tomas Mikolov, Yoshua Bengio, 2013. Proceedings of the 30th International Conference on Machine Learning, Vol. 28 (PMLR) - This paper thoroughly investigates the vanishing and exploding gradient problems in RNNs and proposes practical solutions, such as gradient clipping, for stable training.
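As a rough illustration of the gradient clipping idea discussed in that paper, here is a short NumPy sketch of clipping by global norm. The function name `clip_by_global_norm` and the default threshold are assumptions for illustration; the technique simply rescales the whole gradient when its L2 norm exceeds a chosen threshold, which bounds the size of an update without changing its direction.

```python
import numpy as np

def clip_by_global_norm(grads, threshold=1.0):
    """Rescale a list of gradient arrays when their combined L2 norm
    exceeds `threshold` (hypothetical helper, threshold is a tunable)."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]  # direction preserved
    return grads
```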