Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.). DOI: 10.5555/3295222.3295349 - This foundational paper introduces the Transformer architecture and details how residual connections and layer normalization are integrated into its encoder-decoder blocks (a minimal sketch of this pattern follows the list).
Deep Residual Learning for Image Recognition, Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, 2016. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE). DOI: 10.1109/CVPR.2016.90 - This paper introduces Residual Networks (ResNet) and the concept of residual connections, which make it possible to train very deep neural networks by mitigating the vanishing-gradient problem.
Layer Normalization, Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, 2016. arXiv preprint arXiv:1607.06450. DOI: 10.48550/arXiv.1607.06450 - This paper proposes layer normalization, a technique that normalizes activations across the feature dimension within a layer, which is essential for stabilizing the training of sequence models such as recurrent networks and Transformers.
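To tie the three references together, here is a minimal NumPy sketch (not taken from any of the cited papers' code) of a residual connection wrapped with layer normalization, following the post-norm pattern LayerNorm(x + Sublayer(x)) described in the Transformer paper. The function names and the identity placeholder sublayer are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    # Layer normalization (Ba et al., 2016): normalize each position's
    # activations across the feature (last) dimension, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_sublayer(x, sublayer):
    # Post-norm residual pattern used in the Transformer's encoder/decoder
    # blocks (Vaswani et al., 2017): LayerNorm(x + Sublayer(x)).
    # `sublayer` stands in for self-attention or the feed-forward network.
    return layer_norm(x + sublayer(x))

# Toy usage: 2 sequences, 4 positions, 8 features, with an identity
# "sublayer" as a placeholder (hypothetical, for illustration only).
x = np.random.randn(2, 4, 8)
out = residual_sublayer(x, lambda h: h)
print(out.shape)  # (2, 4, 8)
```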