Constructing Large-Scale Synthetic Corpora for Pretraining
Textbooks Are All You Need II: phi-1.5 technical report, Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee, 2023. arXiv preprint arXiv:2309.05463. DOI: 10.48550/arXiv.2309.05463 - Introduces 'textbook-quality' data, a combination of synthetically generated text and carefully filtered web data, used to pretrain small language models with strong reasoning and coding abilities (see the generation sketch after this list).
Unsupervised Data Augmentation for Consistency Training, Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, Quoc V. Le, 2020. Advances in Neural Information Processing Systems - Describes data augmentation techniques such as back-translation and TF-IDF-based word replacement and shows their effectiveness in semi-supervised learning for NLP tasks, which makes them relevant for increasing data diversity (see the word-replacement sketch after this list).
Deduplicating Training Data Makes Language Models Better, Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini, 2022. ACL 2022. DOI: 10.48550/arXiv.2107.06499 - Investigates the impact of data deduplication on large language model pretraining, demonstrating that removing near-duplicates improves model performance and training efficiency (see the deduplication sketch after this list).
On the Dangers of Implicit Bias in LLM-Generated Text, Andrea Lampis, Eugenio Lomurno, Matteo Matteucci, 2023. arXiv preprint arXiv:2305.10118. DOI: 10.48550/arXiv.2305.10118 - Examines how implicit biases in LLMs can manifest in generated text and discusses the implications, highlighting the importance of bias mitigation strategies during synthetic corpus construction.
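The phi-1.5 entry centers on prompting a strong model for textbook-style passages and mixing them with filtered web text. The report does not publish its exact prompts or pipeline, so the following is only a minimal sketch of the general pattern: seed topics and audiences are combined into varied prompts and handed to a hypothetical `generate` callable, which stands in for whatever model or API you actually use. The topic and audience lists are illustrative placeholders, not values from the paper.

```python
import random

def generate(prompt: str) -> str:
    # Hypothetical LLM call: any function mapping a prompt string to generated text.
    raise NotImplementedError("plug in your model or API client here")

# Illustrative seeds; a real pipeline would draw these from a much larger taxonomy.
TOPICS = ["binary search", "recursion", "list comprehensions"]
AUDIENCES = ["a beginner", "an intermediate programmer"]

def build_prompt(topic: str, audience: str) -> str:
    # Varying topic and audience pushes the model toward diverse, textbook-style passages.
    return (
        f"Write a short textbook section that teaches {topic} to {audience}. "
        "Include a clear explanation and one worked Python example."
    )

def synthesize_corpus(n_samples: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    return [
        generate(build_prompt(rng.choice(TOPICS), rng.choice(AUDIENCES)))
        for _ in range(n_samples)
    ]
```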
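The UDA entry mentions TF-IDF-based word replacement, where uninformative words are swapped out to diversify text while keywords are kept intact. Below is a simplified sketch of that idea, not the paper's exact sampling scheme: tokens with low IDF under scikit-learn's TfidfVectorizer are treated as uninformative and randomly replaced with other vocabulary words. `replace_frac` and the 0.5 replacement probability are illustrative parameters.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_word_replacement(docs: list[str], replace_frac: float = 0.2, seed: int = 0) -> list[str]:
    """Replace low-IDF (uninformative) words with random vocabulary words."""
    rng = random.Random(seed)
    vec = TfidfVectorizer()
    vec.fit(docs)
    vocab = list(vec.get_feature_names_out())
    # Map token -> IDF; rare, informative words get high IDF and are preserved.
    idf = {tok: vec.idf_[idx] for tok, idx in vec.vocabulary_.items()}
    threshold = sorted(idf.values())[int(len(idf) * replace_frac)]

    augmented = []
    for doc in docs:
        out = []
        for word in doc.split():
            score = idf.get(word.lower(), float("inf"))  # unknown tokens are kept
            if score <= threshold and rng.random() < 0.5:
                out.append(rng.choice(vocab))  # swap in a random vocabulary word
            else:
                out.append(word)
        augmented.append(" ".join(out))
    return augmented

docs = ["the cat sat on the mat", "language models like diverse training data"]
print(tfidf_word_replacement(docs))
```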
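The deduplication entry reports that removing near-duplicate documents improves both model quality and training efficiency. The sketch below is a toy version of approximate deduplication, not the paper's suffix-array or production MinHash/LSH pipeline: it estimates Jaccard similarity between documents' word-shingle sets with a small MinHash signature and flags pairs above a threshold. The pairwise loop is fine for illustration, but real corpora need LSH banding to avoid O(n^2) comparisons; shingle size, signature length, and the 0.8 threshold are illustrative choices.

```python
import hashlib
from itertools import combinations

def shingles(text: str, n: int = 5) -> set[str]:
    """Word n-gram shingles used as the document's comparable units."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """Summarize a set by its minimum hash value under num_hashes seeded hash functions."""
    return [
        min(int(hashlib.sha1(f"{seed}:{item}".encode()).hexdigest(), 16) for item in items)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def near_duplicate_pairs(docs: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    sigs = [minhash_signature(shingles(d)) for d in docs]
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if estimated_jaccard(sigs[i], sigs[j]) >= threshold
    ]
```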