Learning Long-Term Dependencies with Gradient Descent Is Difficult, Yoshua Bengio, Patrice Simard, Paolo Frasconi, 1994. IEEE Transactions on Neural Networks, Vol. 5 (IEEE). DOI: 10.1109/72.279181 - This seminal paper mathematically analyzes the vanishing and exploding gradient problems in recurrent neural networks, demonstrating why such networks trained by gradient descent have inherent difficulty learning long-term dependencies.
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - Chapter 10 of this book provides a thorough explanation of recurrent neural networks, including a detailed analysis of backpropagation through time and the vanishing and exploding gradient problems.
Long Short-Term Memory, Sepp Hochreiter, Jürgen Schmidhuber, 1997. Neural Computation, Vol. 9 (MIT Press). DOI: 10.1162/neco.1997.9.8.1735 - This paper introduces Long Short-Term Memory (LSTM) networks, a specialized RNN architecture designed to mitigate the vanishing gradient problem and effectively learn long-term dependencies.