Choice of Activation Functions (ReLU, GeLU, SwiGLU)
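Since this section names all three activations in the context of the Transformer feed-forward block, a minimal NumPy sketch may help make the differences concrete. This is an illustrative sketch, not code from the referenced papers; the weight shapes, hidden sizes, and function names are assumptions chosen for brevity.

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x), used in the original Transformer FFN.
    return np.maximum(0.0, x)

def gelu(x):
    # GeLU, tanh approximation from Hendrycks & Gimpel (2016).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    # Swish / SiLU: x * sigmoid(x), the gate nonlinearity inside SwiGLU.
    return x / (1.0 + np.exp(-x))

def ffn_relu(x, W1, b1, W2, b2):
    # Original Transformer FFN: max(0, x W1 + b1) W2 + b2
    return relu(x @ W1 + b1) @ W2 + b2

def ffn_gelu(x, W1, b1, W2, b2):
    # Same two-layer structure, with GeLU replacing ReLU.
    return gelu(x @ W1 + b1) @ W2 + b2

def ffn_swiglu(x, W, V, W2):
    # SwiGLU FFN (a gated variant): (swish(x W) * (x V)) W2,
    # typically used without bias terms.
    return (swish(x @ W) * (x @ V)) @ W2

# Illustrative shapes only.
d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
W, V, W2g = (rng.standard_normal((d_model, d_ff)),
             rng.standard_normal((d_model, d_ff)),
             rng.standard_normal((d_ff, d_model)))

print(ffn_relu(x, W1, b1, W2, b2).shape)   # (4, 8)
print(ffn_gelu(x, W1, b1, W2, b2).shape)   # (4, 8)
print(ffn_swiglu(x, W, V, W2g).shape)      # (4, 8)
```

Note that the SwiGLU variant has three weight matrices instead of two, so in practice its hidden dimension is often reduced to keep the parameter count comparable to a ReLU or GeLU FFN.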
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (NIPS 2017). DOI: 10.48550/arXiv.1706.03762 - The foundational paper introducing the Transformer architecture, including the structure and role of the feed-forward network.
Gaussian Error Linear Units (GELUs), Dan Hendrycks, Kevin Gimpel, 2016. arXiv preprint. DOI: 10.48550/arXiv.1606.08415 - The original research paper introducing the Gaussian Error Linear Unit (GeLU) activation function, which became a standard choice in many Transformer models.