Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017, Advances in Neural Information Processing Systems (NeurIPS) 30 (Curran Associates, Inc.) - The foundational paper introducing the Transformer architecture, detailing the Position-wise Feed-Forward Network's structure, role, and original parameters.
Transformers for Natural Language Processing: From GPT-2 to BERT and Beyond, Karthikeyan Vijayakumar, Sudharsan Ravichandiran, and Vignesh Jaganathan, 2023 (Packt Publishing) - A comprehensive book offering an accessible explanation of Transformer components, including a dedicated discussion of FFNs within the overall architecture.
Gaussian Error Linear Units (GELUs), Dan Hendrycks and Kevin Gimpel, 2017, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1606.08415 - Introduces the GELU activation function, noted in the section content as an alternative to ReLU in modern Transformer models.