Data Quantity and Variety in Foundational Model Training
The Pile: An 800GB Dataset of Diverse Text for Language Modeling, Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy, 2021. arXiv preprint arXiv:2101.00027. DOI: 10.48550/arXiv.2101.00027 - Introduces a large, diverse dataset specifically designed to improve general language modeling capabilities, highlighting the importance of data variety drawn from multiple sources.
Language Models are Few-Shot Learners, Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei, 2020. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.2005.14165 - Presents the GPT-3 model, highlighting the significant contribution of large-scale pretraining data and model size in achieving strong few-shot learning and emergent capabilities.