Generating Instruction-Style Data for Pretraining Phases
Self-Instruct: Aligning Language Models with Self-Generated Instructions, Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi, 2023. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), Vol. 1. DOI: 10.48550/arXiv.2212.10560 - Presents a method for using an LLM to generate its own instruction-following data, providing a scalable approach for creating synthetic instruction-style examples.
Textbooks Are All You Need, Zui Chen, Lei Cao, Sam Madden, 2023. arXiv preprint arXiv:2306.11696. DOI: 10.48550/arXiv.2306.11696 - Highlights the significance of high-quality, curated, and synthetic data, including instruction-style data generated by LLMs, for efficient and effective LLM pretraining.
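The self-generation loop described in the Self-Instruct entry above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `generate` callable is a hypothetical stand-in for an LLM API call, and `difflib`'s similarity ratio substitutes for the ROUGE-L novelty filter used in the paper.

```python
import random
import difflib

def similar(a, b, threshold=0.7):
    # Near-duplicate check. Self-Instruct filters with ROUGE-L;
    # this sketch uses difflib's ratio as a stdlib stand-in.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def self_instruct(seed_tasks, generate, rounds=3, k=4):
    """Bootstrap new instructions from a small seed pool.

    `generate` is a hypothetical LLM call: it takes a few-shot
    prompt of example instructions and returns candidate
    instructions as a list of strings.
    """
    pool = list(seed_tasks)
    for _ in range(rounds):
        # Sample a few existing tasks as in-context examples.
        examples = random.sample(pool, min(k, len(pool)))
        prompt = "Come up with a new task:\n" + "\n".join(
            f"Task: {t}" for t in examples
        )
        for candidate in generate(prompt):
            # Keep only sufficiently novel instructions.
            if not any(similar(candidate, t) for t in pool):
                pool.append(candidate)
    return pool
```

With a stub in place of the LLM, a novel candidate is added to the pool while a near-duplicate of a seed instruction is rejected, which is the core of the bootstrapping idea.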