Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, 1997. Neural Computation, Vol. 9, No. 8 (MIT Press). DOI: 10.1162/neco.1997.9.8.1735 - The original paper introducing the Long Short-Term Memory (LSTM) network architecture, detailing the gates and cell state it uses to address the vanishing gradient problem (sketched below).
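For readers who want the architecture at a glance, here is a sketch of the LSTM update equations in the now-standard notation rather than the paper's original notation (the forget gate was a later refinement and does not appear in the 1997 formulation). Assumed symbols: x_t is the input, h_t the hidden state, c_t the cell state, σ the logistic sigmoid, ⊙ the elementwise product, and W, U, b the input weights, recurrent weights, and bias of each gate.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate, a later addition)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The additive update of c_t is the central idea: error signals can flow through the cell state across many time steps without repeated squashing, which is how the architecture mitigates vanishing gradients.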
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, 2014. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics). DOI: 10.3115/v1/D14-1179 - This paper introduced the Gated Recurrent Unit (GRU) as a simpler alternative to the LSTM, outlining its architecture with reset and update gates (sketched below).
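Likewise, a sketch of the GRU equations under the same assumed notation. The GRU merges the cell and hidden state into a single h_t and uses only two gates; conventions differ across papers and libraries on whether z_t gates the old or the new state, and the form below follows the Cho et al. convention.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) && \text{(candidate state)} \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t && \text{(interpolated update)}
\end{aligned}
```

With no separate cell state and no output gate, the GRU has fewer parameters per unit than the LSTM, which is the sense in which it is the simpler alternative.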
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - A comprehensive textbook; Chapter 10 ("Sequence Modeling: Recurrent and Recursive Nets") covers recurrent neural networks, LSTMs, and GRUs in detail.