Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, 1997, Neural Computation, Vol. 9 (The MIT Press), DOI: 10.1162/neco.1997.9.8.1735 - Introduces Long Short-Term Memory (LSTM) networks, which mitigate vanishing gradients through specific gating mechanisms using sigmoid and tanh activation functions.
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - A standard textbook covering activation functions, recurrent neural networks, and the dynamics of gradient flow, offering a solid theoretical foundation.