Hugging Face Transformers Library Documentation, Hugging Face, 2024 - A guide to using pre-trained language models for text generation and other NLP tasks, a core tool for synthetic data pipelines.
Hugging Face Datasets Library Documentation, Hugging Face, 2024 - Documentation for efficient data loading, processing, and sharing, important for managing LLM training data.
MLflow Documentation, MLflow Community, 2024 - Official documentation for managing the machine learning lifecycle, including experiment tracking and model management, for reproducible synthetic data projects.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell, 2021, Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT '21), Association for Computing Machinery, DOI: 10.1145/3442188.3445922 - An influential paper discussing the ethical and societal risks of large language models, including bias amplification, relevant for early project planning and mitigation strategies.