Language Models are Few-Shot Learners, Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei, 2020. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). DOI: 10.48550/arXiv.2005.14165 - This paper details the data mixture used to train GPT-3, including the proportions drawn from sources such as Common Crawl, WebText, books, and Wikipedia, and examines their effect on model performance; a minimal sampling sketch follows this list.
The Pile: An 800GB Dataset of Diverse Text for Language Model Training, Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy, 2021. arXiv preprint arXiv:2101.00027. DOI: 10.48550/arXiv.2101.00027 - This work presents a large and diverse dataset that combines 22 distinct high-quality sources to improve the generalization of language models, demonstrating deliberate data-mixture construction.
PaLM: Scaling Language Modeling with Pathways, Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al., 2022. arXiv preprint arXiv:2204.02311. DOI: 10.48550/arXiv.2204.02311 - This paper describes PaLM's highly diverse training dataset, spanning text from webpages, books, and code, and highlights the role of the data mixture in scaling language models.
Llama 2: Open Foundation and Fine-Tuned Chat Models, Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al., 2023. arXiv preprint arXiv:2307.09288. DOI: 10.48550/arXiv.2307.09288 - This work details the pretraining data composition of Llama 2, emphasizing the relevance of data quality, diversity, and safety considerations in developing modern large language models.
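A technique common to all four entries is drawing pretraining examples from heterogeneous corpora under fixed mixture weights rather than in proportion to raw corpus size. Below is a minimal Python sketch of such a weighted sampler; the source names and weight values are illustrative placeholders, not figures taken from any of the papers above.

```python
import random

# Hypothetical mixture weights over data sources. These are placeholders
# loosely patterned on the kind of proportions the GPT-3 paper reports,
# not values from any of the cited papers.
MIXTURE_WEIGHTS = {
    "common_crawl": 0.60,
    "webtext": 0.22,
    "books": 0.16,
    "wikipedia": 0.02,
}

def sample_source(rng: random.Random) -> str:
    """Pick a source with probability proportional to its mixture weight."""
    sources, weights = zip(*MIXTURE_WEIGHTS.items())
    return rng.choices(sources, weights=weights, k=1)[0]

def mixture_stream(datasets: dict, seed: int = 0):
    """Yield (source, example) pairs by first sampling a source, then
    taking the next example from it. Heavily weighted sources appear
    more often; small sources are restarted when exhausted, so they are
    effectively seen for multiple epochs."""
    rng = random.Random(seed)
    iterators = {name: iter(data) for name, data in datasets.items()}
    while True:
        name = sample_source(rng)
        try:
            yield name, next(iterators[name])
        except StopIteration:
            iterators[name] = iter(datasets[name])
            yield name, next(iterators[name])

if __name__ == "__main__":
    from collections import Counter
    from itertools import islice

    # Toy corpora standing in for real shards; the counts come out
    # roughly proportional to MIXTURE_WEIGHTS, not to corpus size.
    datasets = {
        name: [f"{name}-{i}" for i in range(100)] for name in MIXTURE_WEIGHTS
    }
    counts = Counter(name for name, _ in islice(mixture_stream(datasets), 10_000))
    print(counts)
```

The design choice this sketch illustrates is the one the annotations emphasize: upweighting small, high-quality sources (books, Wikipedia) relative to their raw size, at the cost of repeating them across the training run.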