The effectiveness of large language models is directly tied to the scale and quality of the text data they are trained on. Obtaining datasets of sufficient scale, often measured in terabytes or even petabytes of raw text, is a foundational step in the LLM building process and presents its own set of engineering hurdles.
This chapter concentrates on the practical methods for sourcing and gathering the massive text collections required for pre-training. You will examine several strategies, including identifying candidate data sources, working with Common Crawl, running web scrapers at scale, and drawing on openly licensed datasets, along with the legal considerations that accompany acquisition. A brief Common Crawl sketch follows the section list below.
By the end of this chapter, you will have a clear understanding of the common approaches and challenges involved in acquiring the raw text data needed to begin building an LLM.
6.1 Identifying Potential Data Sources
6.2 Utilizing Common Crawl Data
6.3 Web Scraping Techniques at Scale
6.4 Leveraging Open Licensed Datasets
6.5 Data Acquisition Legal Considerations
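As a preview of the kind of tooling Section 6.2 covers, the sketch below streams a few plain-text records from a Common Crawl WET file. This is a minimal sketch under stated assumptions: the crawl ID `CC-MAIN-2024-10`, the choice of the `warcio` library, and the five-record limit are illustrative, not the chapter's prescribed method.

```python
# Minimal sketch: stream a handful of text records from one Common Crawl
# WET file. Assumptions: the crawl ID and record limit are examples only.
import gzip

import requests
from warcio.archiveiterator import ArchiveIterator

BASE = "https://data.commoncrawl.org"
CRAWL = "CC-MAIN-2024-10"  # example snapshot; any published crawl works

# Step 1: fetch the list of WET file paths for this crawl snapshot.
resp = requests.get(f"{BASE}/crawl-data/{CRAWL}/wet.paths.gz", timeout=60)
resp.raise_for_status()
wet_paths = gzip.decompress(resp.content).decode("utf-8").splitlines()

# Step 2: stream the first WET file and print a few extracted-text records.
printed = 0
with requests.get(f"{BASE}/{wet_paths[0]}", stream=True, timeout=60) as wet:
    wet.raise_for_status()
    for record in ArchiveIterator(wet.raw):
        # WET files store extracted page text in 'conversion' records.
        if record.rec_type != "conversion":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        print(url, text[:80].replace("\n", " "))
        printed += 1
        if printed == 5:  # stop early; this is only a preview
            break
```

Streaming the response rather than downloading whole files matters at this scale: a single crawl snapshot's WET files alone run to multiple terabytes, which is exactly the engineering hurdle this chapter addresses.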