Web pages sourced from large crawls like Common Crawl are typically encoded in HTML, which includes not only the primary text content but also structural markup, styling information (CSS), interactive elements (JavaScript), navigation menus, advertisements, headers, footers, and other peripheral text often referred to as "boilerplate". Leaving this extraneous material in the training data can introduce noise, dilute the signal from high-quality text, and skew the model's understanding of natural language structure. Therefore, effectively removing markup and boilerplate is a standard step in preparing web data for LLM pre-training.
The first step in processing HTML documents is to parse their structure. Raw HTML is a string of text, but libraries can convert this into a structured representation, typically a tree, making it easier to navigate and manipulate. Python offers several excellent libraries for this purpose. Beautiful Soup is widely used for its flexibility and ease of use, particularly in handling imperfect or malformed HTML often found on the web. lxml is another powerful option, known for its speed, and is often used as a backend parser for Beautiful Soup.
Here's a basic example using Beautiful Soup to parse an HTML string and extract all the text content, stripping away the tags:
from bs4 import BeautifulSoup
# Sample HTML content (often much more complex in reality)
html_doc = """
<html><head><title>Sample Page</title>
<style>body {font-size: 12px;}</style></head>
<body>
<header><h1>Main Title</h1><nav><a>Home</a> <a>About</a></nav></header>
<main>
<p>This is the primary article content we want to keep.
It discusses important topics.</p>
<p>Another paragraph of useful information.</p>
<script>console.log('Some script');</script>
</main>
<footer><p>Copyright 2023. Some footer links.</p></footer>
</body></html>
"""
soup = BeautifulSoup(html_doc, 'lxml') # Using the lxml parser
# Get all text, stripping tags
all_text = soup.get_text(separator=' ', strip=True)
print("All Text (including boilerplate):")
print(all_text)
# Output: Sample Page Main Title Home About This is the primary article
# content we want to keep. It discusses important topics. Another
# paragraph of useful information. console.log('Some script'); Copyright
# 2023. Some footer links.
# Attempt to get only main content text (simple example)
main_content_tag = soup.find('main')
if main_content_tag:
    main_text = main_content_tag.get_text(separator=' ', strip=True)
    print("\nMain Content Text (simple extraction):")
    print(main_text)
    # Output: This is the primary article content we want to keep. It
    # discusses important topics. Another paragraph of useful
    # information. console.log('Some script');
else:
    print("\n'main' tag not found.")
As the example shows, simply calling get_text() on the whole document often includes unwanted text from headers, footers, and potentially scripts if they contain text nodes. While finding specific tags like <main> can help, this relies on semantic HTML usage, which isn't always consistent across websites. Notice also that the simple extraction above still included the content of the <script> tag within <main>.
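One pragmatic workaround, sketched below, is to try a short list of selectors that commonly wrap main content and fall back to the full page text when none of them match. The selector list and the helper name are illustrative assumptions, not a standard, and would need tuning for a real crawl.
from bs4 import BeautifulSoup

# Illustrative selectors that often wrap main content; this list is an
# assumption to tune for your own crawl, not a standard.
CANDIDATE_SELECTORS = ['main', 'article', 'div#content', 'div.post']

def extract_with_fallback(html):
    """Return text from the first matching content container, else the whole page."""
    soup = BeautifulSoup(html, 'lxml')
    # Drop script and style first so their text never leaks into the output
    for tag in soup(['script', 'style']):
        tag.decompose()
    for selector in CANDIDATE_SELECTORS:
        node = soup.select_one(selector)
        if node is not None:
            return node.get_text(separator=' ', strip=True)
    # No known container matched: fall back to all remaining text
    return soup.get_text(separator=' ', strip=True)

# Reusing html_doc from the example above
print(extract_with_fallback(html_doc))
This kind of positive selection complements the tag-removal rules described next.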
A common starting point is to use rule-based methods. This involves identifying HTML tags commonly associated with boilerplate and removing them and their contents entirely before extracting text. Common targets include:

- <script>: Contains JavaScript code.
- <style>: Contains CSS styling rules.
- <nav>: Typically holds site navigation links.
- <header>: Often contains logos, site titles, and top navigation.
- <footer>: Usually includes copyright notices, contact information, and bottom links.
- <aside>: Represents sidebars or content tangentially related to the main content.
- Ad-related elements (e.g., <iframe> tags from ad networks).

We can use Beautiful Soup's decompose() method to remove these elements from the parsed tree before text extraction.
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>Sample Page</title>
<style>body {font-size: 12px;}</style></head>
<body>
<header><h1>Main Title</h1><nav><a>Home</a> <a>About</a></nav></header>
<main>
<p>This is the primary article content we want to keep.
It discusses important topics.</p>
<div class="sidebar"><p>Related Links</p></div>
<p>Another paragraph of useful information.</p>
<script>console.log('Some script');</script>
</main>
<footer><p>Copyright 2023. Some footer links.</p></footer>
</body></html>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# Tags to remove
tags_to_remove = ['script', 'style', 'nav', 'header', 'footer', 'aside']
# Also remove elements identified by common boilerplate classes/ids (example)
selectors_to_remove = ['.sidebar', '#ads']
for tag_name in tags_to_remove:
    for tag in soup.find_all(tag_name):
        tag.decompose()  # Remove the tag and its contents
for selector in selectors_to_remove:
    for tag in soup.select(selector):  # Use CSS selectors
        tag.decompose()
# Extract text from the modified soup
cleaned_text = soup.get_text(separator=' ', strip=True)
print("Text after Rule-Based Removal:")
print(cleaned_text)
# Output: Sample Page This is the primary article content we want to keep.
# It discusses important topics. Another paragraph of useful information.
While effective for removing obvious boilerplate, rule-based methods have limitations. They can be brittle; websites structure content differently, and relying on specific tags like <main> or classes like sidebar isn't universally reliable. Aggressive removal might also discard useful information if, for instance, a figure caption is placed within a <footer> tag. Maintaining comprehensive rules across the diversity of the web is challenging.
More advanced techniques focus on identifying the main content block(s) of a page rather than just removing known boilerplate tags. These algorithms often work on the principle that primary content typically has a higher density of text relative to the amount of HTML markup within its section, compared to navigation menus or footers which have many links (markup) but less contiguous text.
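To make the density intuition concrete, here is a minimal sketch that scores block-level candidates by how much of their text comes from links and keeps the largest block whose link density stays low. The candidate tag list and the 0.5 threshold are assumptions for illustration; production extractors combine many more signals.
from bs4 import BeautifulSoup

def densest_block(html, candidate_tags=('main', 'article', 'section', 'div'),
                  max_link_density=0.5):
    """Return text of the largest low-link-density block, or None if nothing qualifies.

    Link density = characters inside <a> tags / total characters in the block.
    Both the candidate tags and the threshold are illustrative assumptions.
    """
    soup = BeautifulSoup(html, 'lxml')
    for tag in soup(['script', 'style']):  # ignore code and styling text
        tag.decompose()

    best_text, best_len = None, 0
    for block in soup.find_all(list(candidate_tags)):
        text = block.get_text(separator=' ', strip=True)
        if not text:
            continue
        link_chars = sum(len(a.get_text(strip=True)) for a in block.find_all('a'))
        link_density = link_chars / len(text)
        # Navigation menus and footers tend to have high link density; skip them
        if link_density <= max_link_density and len(text) > best_len:
            best_text, best_len = text, len(text)
    return best_text

# Reusing html_doc from the earlier examples
print(densest_block(html_doc))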
Libraries like trafilatura are specifically designed for this task, often incorporating heuristics about text density, link density, tag patterns, and document structure. They aim to robustly extract the core article text from news pages, blogs, and other web documents while discarding surrounding boilerplate.
Using trafilatura is generally straightforward:
import trafilatura
# Assuming 'html_doc' contains the raw HTML string from previous examples
# For trafilatura, it's often better to feed it the raw HTML string
# Example with the second HTML document
html_doc = """
<html><head><title>Sample Page</title><style>body {font-size: 12px;}</style></head>
<body>
<header><h1>Main Title</h1><nav><a>Home</a> <a>About</a></nav></header>
<main>
<p>This is the primary article content we want to keep. It discusses important topics.</p>
<div class="sidebar"><p>Related Links</p></div>
<p>Another paragraph of useful information.</p>
<script>console.log('Some script');</script>
</main>
<footer><p>Copyright 2023. Some footer links.</p></footer>
</body></html>
"""
# Setting include_comments=False is typical for LLM data.
# favor_recall=True might include slightly more text, potentially
# borderline boilerplate; favor_precision=True is stricter and might
# miss some main content.
extracted_text = trafilatura.extract(
    html_doc,
    include_comments=False,
    include_tables=True
)
print("Text extracted with Trafilatura:")
if extracted_text:
    print(extracted_text)
else:
    print("Trafilatura could not extract main content.")
# Expected Output (may vary slightly based on library version/heuristics):
# Main Title  <- Trafilatura sometimes includes the H1 title if it is close to the main text
# This is the primary article content we want to keep. It discusses important topics.
# Another paragraph of useful information.
These tools often provide a good balance between accuracy and computational efficiency for large-scale processing. They encapsulate complex heuristics derived from analyzing many web pages, saving you from developing and maintaining intricate rule sets yourself.
However, a few practical considerations remain:

- Dynamically loaded content: Many sites populate their main content with JavaScript after the initial page load. Tools like trafilatura that work on the static HTML source won't see this dynamically loaded content. Handling such sites requires rendering the page in a headless browser (e.g., using tools like Selenium, Playwright, or Puppeteer), which significantly increases processing time and resource consumption. This is often impractical for processing billions of web pages and is a reason why datasets derived from sources like Common Crawl might underrepresent content from highly dynamic sites. A minimal rendering sketch follows this list.
- Non-HTML documents: Crawls also include other formats such as PDFs, which require their own extraction tools (e.g., PyPDF2 or Apache Tika) before text cleaning can proceed.
- Multilingual content: Libraries like trafilatura often have some language awareness, but performance can vary.
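If rendered HTML is needed for a small, high-value subset of pages, a headless browser can produce the post-JavaScript source, which then feeds into the same extraction step. The sketch below uses Playwright's synchronous API under that assumption; the URL is a placeholder, and waiting for network idle is a simplification rather than a universal fix.
import trafilatura
from playwright.sync_api import sync_playwright

def extract_rendered(url):
    """Render a JavaScript-heavy page, then extract its main content."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')  # wait for dynamic requests to settle
        rendered_html = page.content()            # HTML after scripts have run
        browser.close()
    return trafilatura.extract(rendered_html, include_comments=False)

# Placeholder URL for illustration only
# print(extract_rendered('https://example.com/dynamic-article'))
Because rendering each page costs orders of magnitude more than parsing static HTML, this path is usually reserved for a selected fraction of a crawl.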
Boilerplate and markup removal is typically performed early in the data processing pipeline, often immediately after fetching the raw HTML content and before computationally heavier steps like near-duplicate detection or quality filtering. By removing irrelevant markup and text first, subsequent steps operate on smaller, cleaner data, improving efficiency and effectiveness. For instance, calculating text quality scores or detecting duplicates is more meaningful on the extracted main content rather than the full HTML source including navigation links and advertisements. This step, whether rule-based or using dedicated libraries, should be designed for scalability, capable of processing documents in parallel within a distributed framework like Apache Spark or Dask, as discussed later in the course.
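As a rough illustration of that placement, the sketch below maps trafilatura over a Dask bag of raw HTML strings and drops pages where extraction fails. In a real pipeline the bag would be built from crawl files on distributed storage rather than an in-memory list, and the partition count here is arbitrary.
import dask.bag as db
import trafilatura

def extract_main_content(html):
    # Returns None when trafilatura cannot identify main content
    return trafilatura.extract(html, include_comments=False, include_tables=True)

# Stand-in for a crawl shard; reusing html_doc from the earlier examples
raw_pages = [html_doc]

bag = db.from_sequence(raw_pages, npartitions=1)
cleaned = (
    bag.map(extract_main_content)
       .filter(lambda text: text)  # drop failed or empty extractions
)
print(cleaned.compute())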