Learning Long-Term Dependencies with Gradient Descent Is Difficult, Yoshua Bengio, Patrice Simard, and Paolo Frasconi, 1994. IEEE Transactions on Neural Networks, Vol. 5 (IEEE). DOI: 10.1109/72.279181 - This paper formally identifies the vanishing and exploding gradient problems in recurrent neural networks, which significantly hinder their ability to learn long-range dependencies.
Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, 1997. Neural Computation, Vol. 9 (MIT Press). DOI: 10.1162/neco.1997.9.8.1735 - The original paper introducing Long Short-Term Memory (LSTM) networks, a significant architectural advancement designed to mitigate the vanishing gradient problem and better capture long-range dependencies.
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, 2014. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics). DOI: 10.3115/v1/D14-1179 - Introduces Gated Recurrent Units (GRUs) as a simpler yet effective alternative to LSTMs for sequence modeling, also addressing the challenges of long-range dependencies.
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - A comprehensive textbook that covers the theoretical foundations and practical aspects of deep learning, including detailed explanations of recurrent neural networks, LSTMs, GRUs, and their limitations in handling long-range dependencies.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.) - The seminal paper that introduced the Transformer architecture, which fundamentally addresses the long-range dependency problem by relying entirely on attention mechanisms, eliminating recurrence.