Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, 1997. Neural Computation, Vol. 9, No. 8 (The MIT Press). DOI: 10.1162/neco.1997.9.8.1735 - Introduces the Long Short-Term Memory (LSTM) network, a recurrent neural network architecture designed to learn long-term dependencies by overcoming the vanishing gradient problem.
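As a rough illustration of the gating idea (not code from the paper), the sketch below implements one step of the modern LSTM formulation with a forget gate, which postdates the 1997 paper; the function name lstm_step, the params dictionary layout, and the toy dimensions are assumptions for the example only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: gates decide what to forget, write, and expose."""
    z = np.concatenate([h_prev, x])                  # gates act on [h_prev; x]
    f = sigmoid(params["W_f"] @ z + params["b_f"])   # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])   # input gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])   # output gate
    g = np.tanh(params["W_g"] @ z + params["b_g"])   # candidate cell update
    c = f * c_prev + i * g                           # additive cell update keeps gradients flowing
    h = o * np.tanh(c)                               # exposed hidden state
    return h, c

# Tiny usage example with random parameters (hidden size 4, input size 3).
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
params = {f"W_{k}": rng.standard_normal((n_h, n_h + n_x)) * 0.1 for k in "fiog"}
params.update({f"b_{k}": np.zeros(n_h) for k in "fiog"})
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.standard_normal((5, n_x)):             # run over a short input sequence
    h, c = lstm_step(x, h, c, params)
print(h)
```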
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, 2014. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics). DOI: 10.3115/v1/D14-1179 - Introduces the RNN encoder-decoder framework for sequence modeling, along with a gated hidden unit later known as the Gated Recurrent Unit (GRU), a simpler alternative to the LSTM.
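For comparison with the LSTM sketch above, here is a minimal single-step GRU in the same style; it is an illustrative sketch, not the paper's code, and the sign convention for the update gate (which of the old state or the candidate it weights) varies across write-ups.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU step: reset and update gates blend the old state with a candidate."""
    z_in = np.concatenate([h_prev, x])
    r = sigmoid(params["W_r"] @ z_in + params["b_r"])   # reset gate
    z = sigmoid(params["W_z"] @ z_in + params["b_z"])   # update gate
    h_tilde = np.tanh(params["W_h"] @ np.concatenate([r * h_prev, x]) + params["b_h"])
    return (1.0 - z) * h_prev + z * h_tilde             # interpolate old state and candidate

# Usage on a toy sequence (hidden size 4, input size 3), random parameters.
rng = np.random.default_rng(1)
n_h, n_x = 4, 3
params = {f"W_{k}": rng.standard_normal((n_h, n_h + n_x)) * 0.1 for k in "rzh"}
params.update({f"b_{k}": np.zeros(n_h) for k in "rzh"})
h = np.zeros(n_h)
for x in rng.standard_normal((5, n_x)):
    h = gru_step(x, h, params)
print(h)
```

Note the GRU merges the LSTM's cell and hidden states and uses two gates instead of three, which is why it is often described as the simpler of the two units.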
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (NeurIPS 2017) (Curran Associates, Inc.). arXiv:1706.03762 - Introduces the Transformer architecture, which relies entirely on self-attention and dispenses with recurrence, leading to significant advances in sequence modeling.
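The core operation is the paper's scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. The sketch below computes it with NumPy for a toy self-attention case where queries, keys, and values all come from the same sequence; the learned Q/K/V projections and multi-head structure of the full Transformer are omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarity of queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of value vectors

# Self-attention over a toy sequence of 5 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(2)
X = rng.standard_normal((5, 8))
print(scaled_dot_product_attention(X, X, X).shape)   # (5, 8)
```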