Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.) - Introduces the Transformer architecture, establishing the dense feed-forward network baseline against which sparsely activated models are compared.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean, 2017. International Conference on Learning Representations. DOI: 10.48550/arXiv.1701.06538 - Proposes the modern formulation of the Mixture-of-Experts (MoE) layer for deep learning, detailing sparse activation and its key advantage: scaling the parameter count without increasing per-token computation.
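To make the contrast between the dense feed-forward baseline and sparse activation concrete, below is a minimal sketch of a top-k gated MoE layer. It is not the paper's exact formulation (which adds noisy gating and load-balancing losses); the class name, dimensions, and k=2 choice are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Sketch of a sparsely-gated MoE layer: each token is routed to its
    top-k experts, so parameters grow with num_experts while per-token
    compute stays roughly k times one expert's cost."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.gate(x)                                 # (tokens, experts)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)   # keep only k experts per token
        weights = F.softmax(topk_logits, dim=-1)               # renormalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                  # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 16 tokens of width 64; only 2 of the 8 experts run for each token.
tokens = torch.randn(16, 64)
print(SparseMoE(d_model=64, d_hidden=256)(tokens).shape)  # torch.Size([16, 64])
```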