Scaling Laws for Neural Language Models, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei, 2020arXiv preprint arXiv:2001.08361DOI: 10.48550/arXiv.2001.08361 - A seminal paper establishing the quantitative relationships between model size, computational budget, and dataset size for language model performance.
Language Models are Few-Shot Learners, Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei, 2020Advances in Neural Information Processing Systems, Vol. 33 (NeurIPS)DOI: 10.5591/978-0-9799981-0-1.vol33.abs175 - Introduces GPT-3, a model with 175 billion parameters, demonstrating how larger models can achieve strong performance with few examples (few-shot learning).
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - A foundational textbook providing a thorough explanation of neural networks, including the role of parameters in model learning and representation.
A Survey of Large Language Models, Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen, 2023arXiv preprint arXiv:2303.18223DOI: 10.48550/arXiv.2303.18223 - A recent survey offering a current overview of large language models, their architectures, training methodologies, capabilities, and associated challenges.