Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.) - Foundational paper introducing the Transformer architecture, detailing the decoder's final linear layer and softmax, weight sharing mechanism, and label smoothing.
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - Authoritative textbook covering core concepts of deep learning, including linear layers, the softmax function, probability distributions, and cross-entropy loss.