As introduced earlier, acquiring datasets vast enough for LLM pre-training is a significant engineering challenge. Among the most substantial and widely used resources for this purpose is the Common Crawl (CC) corpus. It represents a snapshot of a significant portion of the public web, gathered through large-scale web crawling operations. Common Crawl makes petabytes of raw web page data, metadata, and extracted text available publicly, typically releasing a new crawl dataset every month or two. Its sheer scale makes it an indispensable, albeit challenging, resource for training state-of-the-art language models.
Common Crawl is an open repository of web crawl data. Think of it as a massive library containing copies of billions of web pages collected over many years. This data is stored on Amazon Web Services (AWS) Simple Storage Service (S3) and is accessible to anyone, although data transfer costs often apply under the "requester pays" model.
The data within each crawl is primarily organized into three formats:
- WARC files: the raw crawl data, including the full HTTP responses (headers and HTML) for each fetched page.
- WAT files: metadata computed from each WARC record (such as response headers and extracted links), stored as JSON.
- WET files: the plain text extracted from each page.
The most direct way to access the bulk data is via AWS S3. Each crawl has a unique identifier (e.g., CC-MAIN-2023-50), and its data is stored under a corresponding prefix in the s3://commoncrawl/ bucket. You can use standard AWS tools such as the AWS Command Line Interface (CLI) or SDKs (like boto3 for Python) to list and download files.
# Example: Listing WET file paths for a specific crawl segment
aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700495631538.95/wet/ --request-payer requester
Remember that the --request-payer requester flag is necessary because the Common Crawl bucket is configured so that the person requesting the data pays for the download bandwidth. Downloading terabytes or petabytes can incur substantial costs.
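The same listing can be done from Python with boto3. Below is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the segment prefix matches the CLI example above.
# Example: listing WET files for the same segment with boto3 (requester pays)
import boto3

s3 = boto3.client('s3')
response = s3.list_objects_v2(
    Bucket='commoncrawl',
    Prefix='crawl-data/CC-MAIN-2023-50/segments/1700495631538.95/wet/',
    RequestPayer='requester',
)
# list_objects_v2 returns at most 1,000 keys per call; use a paginator for full listings
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])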
Since WET files contain extracted plain text, they are often the most convenient format to work with. A typical WET file (.wet.gz) is a gzipped archive containing multiple text records, each corresponding to a web page. Each record starts with WARC headers providing metadata (such as the target URI and content length), followed by the extracted plain text content.
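For orientation, an individual WET record looks roughly like this (abbreviated, with illustrative values):
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://example.com/some-page
WARC-Date: 2023-12-05T08:47:56Z
WARC-Record-ID: <urn:uuid:...>
Content-Type: text/plain
Content-Length: 1234

The extracted plain text of the page follows the blank line...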
Here's a Python snippet using the warcio library (which simplifies reading WARC-formatted files, including WET files) to iterate through the records in a WET file:
import gzip
import io

from warcio.archiveiterator import ArchiveIterator

# Assume wet_file_path points to a downloaded .wet.gz file.
# In practice, you'd likely stream this from S3 or a distributed filesystem.
wet_file_path = 'example.warc.wet.gz'  # placeholder path

try:
    with open(wet_file_path, 'rb') as stream:
        # Use gzip.open for transparent decompression of the local file.
        # Note: warcio can often handle gzip decompression directly when reading
        # from a stream, but for local files being explicit is robust.
        with gzip.open(stream, 'rb') as compressed_stream:
            # Create an in-memory bytes stream for warcio. This avoids issues if
            # the underlying stream is not seekable. For very large files, more
            # sophisticated streaming would be needed.
            bytes_stream = io.BytesIO(compressed_stream.read())

            for record in ArchiveIterator(bytes_stream):
                # Only 'conversion' records hold the extracted text
                if record.rec_type == 'conversion':
                    # record.rec_headers contains metadata like WARC-Target-URI
                    uri = record.rec_headers.get_header('WARC-Target-URI')

                    # Read the payload (extracted text) and decode it
                    content_bytes = record.content_stream().read()
                    try:
                        content_text = content_bytes.decode('utf-8')

                        # --- Your text processing logic goes here ---
                        # Example: basic filtering and printing
                        if len(content_text.strip()) > 100:  # filter very short texts
                            print(f"URI: {uri}")
                            print(f"Content Length: {len(content_text)}")
                            # print(content_text[:500] + "...")  # print a snippet
                            # Add to your dataset, perform cleaning, etc.
                    except UnicodeDecodeError:
                        # Handle content that isn't valid UTF-8
                        print(f"Skipping record for URI {uri} due to decoding error.")
                        continue
except FileNotFoundError:
    print(f"Error: File not found at {wet_file_path}")
except Exception as e:
    print(f"An error occurred: {e}")
A simplified overview of processing a WET file using Python and the warcio library.
Processing the entire Common Crawl dataset requires distributed computing frameworks like Apache Spark or Dask, running on a cluster with access to S3. These frameworks allow you to parallelize the reading, parsing, and filtering of WET files across many machines, which is essential given the petabyte scale. We will discuss building such scalable pipelines in Chapter 7.
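To make that pattern concrete before then, here is a minimal sketch of one common approach: distribute the list of WET file keys across a Spark cluster and let each task fetch and parse one file. It assumes PySpark, boto3, and warcio are available on the cluster; the WET key and output path are placeholders.
# Example sketch: parallel WET processing with PySpark (key and output path are placeholders)
import io

import boto3
from pyspark import SparkContext
from warcio.archiveiterator import ArchiveIterator

def extract_texts(wet_key):
    """Fetch one WET file from the commoncrawl bucket (requester pays) and yield its texts."""
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='commoncrawl', Key=wet_key, RequestPayer='requester')
    stream = io.BytesIO(obj['Body'].read())  # warcio handles the gzip layer itself
    for record in ArchiveIterator(stream):
        if record.rec_type == 'conversion':
            text = record.content_stream().read().decode('utf-8', errors='ignore')
            if len(text.strip()) > 100:
                yield text

sc = SparkContext(appName='cc-wet-extraction')

# In practice, read the full list from the crawl's wet.paths.gz manifest;
# this single key is only a placeholder.
wet_keys = ['crawl-data/CC-MAIN-2023-50/segments/1700495631538.95/wet/<file>.warc.wet.gz']

# One WET file per task; each task yields cleaned text lines
(sc.parallelize(wet_keys, numSlices=len(wet_keys))
   .flatMap(extract_texts)
   .saveAsTextFile('s3://your-bucket/cc-extracted-text/'))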
Downloading entire multi-terabyte segments just to find data from specific websites or types of pages is inefficient. Common Crawl provides an index service (CC-Index) that allows you to query the metadata of the crawls without downloading the large WARC/WET files first.
The index contains information about each crawled URL, including its MIME type, timestamp, HTTP status code, and location within the corresponding WARC file. You can query this index using tools like cdx-toolkit or by making direct HTTP requests to the CDX API endpoint.
For example, you could query the index to find all pages crawled under a specific domain (example.com) or all pages identified as being in a specific language, allowing for more targeted data acquisition. Once you have a record's location (WARC filename, byte offset, and length) from the index query, you can perform a ranged HTTP GET request to fetch only that specific WARC record, significantly reducing download volume when you only need a subset of the data.
# Example: query the CDX index directly over HTTP
# Find up to 10 captures under 'example.com' in the CC-MAIN-2023-50 crawl.
# Each output line is a JSON object with fields such as url, mime, status,
# filename, offset, and length.
curl "https://index.commoncrawl.org/CC-MAIN-2023-50-index?url=example.com/*&output=json&limit=10"
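Putting the two steps together in Python: query the index, then fetch only the referenced record over the public HTTPS endpoint. This is a minimal sketch assuming the requests and warcio packages; example.com is purely illustrative.
# Example sketch: index lookup followed by a ranged fetch of a single WARC record
import io
import json

import requests
from warcio.archiveiterator import ArchiveIterator

# 1. Query the CDX index for a capture of example.com in this crawl
index_url = 'https://index.commoncrawl.org/CC-MAIN-2023-50-index'
resp = requests.get(index_url, params={'url': 'example.com/*', 'output': 'json', 'limit': 1})
resp.raise_for_status()
capture = json.loads(resp.text.splitlines()[0])

# 2. Fetch only the bytes of that record with a ranged GET
offset, length = int(capture['offset']), int(capture['length'])
warc_url = 'https://data.commoncrawl.org/' + capture['filename']
byte_range = f'bytes={offset}-{offset + length - 1}'
record_resp = requests.get(warc_url, headers={'Range': byte_range})
record_resp.raise_for_status()

# 3. The response body is a single gzipped WARC record; warcio parses it directly
for record in ArchiveIterator(io.BytesIO(record_resp.content)):
    print(record.rec_headers.get_header('WARC-Target-URI'))
    payload = record.content_stream().read()  # the page payload (typically HTML)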
While Common Crawl offers immense scale, the raw data, even in WET format, is inherently noisy and requires substantial cleaning. Common issues include:
- Boilerplate that survives text extraction, such as navigation menus, ads, and cookie notices.
- Exact and near-duplicate pages repeated within and across crawls.
- Content in many languages, much of it outside the target language(s) for your model.
- Low-quality text, including spam, keyword stuffing, and machine-generated content.
- Undesirable or harmful content, along with occasional encoding errors.
A typical filtering pipeline applied to Common Crawl WET data before using it for LLM training.
Addressing these issues typically involves building a multi-stage pipeline incorporating language identification, heuristic-based quality filtering (e.g., based on text length, symbol ratios, stopword frequencies), and near-duplicate detection techniques (like MinHash). These important preprocessing steps are the focus of Chapter 7.
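As a taste of what the heuristic stage looks like in code, here is a minimal, illustrative filter; the thresholds and stopword list are placeholders rather than tuned values.
# Example sketch: simple heuristic quality filters (thresholds are illustrative only)
import re

STOPWORDS = {'the', 'and', 'of', 'to', 'in', 'a', 'is', 'that', 'it', 'for'}

def passes_quality_filters(text):
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < 50:  # discard very short documents
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.6:  # too many symbols, digits, or markup remnants
        return False
    stopword_ratio = sum(w in STOPWORDS for w in words) / len(words)
    if stopword_ratio < 0.05:  # natural English prose normally contains common stopwords
        return False
    return True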
In summary, Common Crawl is a foundational resource for sourcing web data at the scale required for LLM pre-training. Accessing and processing its WET files provides a direct path to large quantities of text, but requires navigating AWS S3, managing costs, and implementing robust distributed processing and cleaning pipelines to handle the inherent noise and variability of web content. Understanding how to effectively leverage Common Crawl is a significant step in the data acquisition phase of LLM development.