Massive datasets scraped from sources like Common Crawl inevitably contain redundant content. While some repetition might reflect natural language patterns, excessive duplication, both exact and near, can negatively impact LLM training. It wastes computational resources processing the same information multiple times, potentially skews the model's understanding of data distributions, and can amplify biases present in the repeated content. Removing duplicates is therefore a standard and important step in data preprocessing.
Identifying documents that are byte-for-byte identical is relatively straightforward. The most common approach involves computing a cryptographic hash (like SHA-256 or MD5) for the content of each document. Documents sharing the same hash value are considered exact duplicates.
import hashlib
import os

def get_document_hash(doc_content):
    """Computes the SHA-256 hash of document content."""
    # Ensure content is bytes
    if isinstance(doc_content, str):
        doc_content = doc_content.encode('utf-8')
    return hashlib.sha256(doc_content).hexdigest()

# Example usage with files (adapt for in-memory data)
hashes_seen = set()
duplicates_to_remove = []
data_directory = "/path/to/your/text/files"  # Replace with actual path

# In a real pipeline, you'd process distributed data streams
for filename in os.listdir(data_directory):
    filepath = os.path.join(data_directory, filename)
    if os.path.isfile(filepath):
        try:
            with open(filepath, 'rb') as f:
                content = f.read()
            doc_hash = get_document_hash(content)
            if doc_hash in hashes_seen:
                duplicates_to_remove.append(filepath)
                # print(f"Found duplicate: {filepath}")
            else:
                hashes_seen.add(doc_hash)
        except Exception as e:
            print(f"Error processing {filepath}: {e}")

# Now duplicates_to_remove contains paths to duplicate files
# In a large-scale system, you'd typically record hashes and process later
print(
    f"Identified {len(duplicates_to_remove)} exact duplicates "
    f"based on content hash."
)
While effective for identical content, this hash-based method has limitations. First, storing and comparing hashes for billions or trillions of documents requires significant memory or distributed storage and lookup infrastructure. Second, and more significantly, it fails to identify documents that are almost identical, differing only by minor formatting changes, timestamps, usernames, advertisements, or slight rephrasing. These near-duplicates still represent redundant information.
Handling near-duplicates requires moving beyond exact hashing to methods that measure content similarity. The challenge is to do this efficiently without comparing every document pair, which is computationally infeasible (O(n²) complexity for n documents). Techniques like MinHash combined with Locality-Sensitive Hashing (LSH) provide a probabilistic, scalable solution.
The first step is to transform each document into a representation suitable for similarity comparison. A common technique is shingling, where a document is broken down into a set of overlapping sequences of tokens or characters, known as shingles or k-grams.
For example, consider the sentence "the quick brown fox". Using character 3-grams (shingles of length k=3):
{"the", "he ", "e q", " qu", "qui", "uic", "ick", "ck ", "k b", " br", "bro", "row", "own", "wn ", "n f", " fo", "fox"}
Choosing the shingle type (characters vs. words) and length (k) affects granularity. Character shingles are more robust to minor word changes, while word shingles capture more semantic structure. A typical k value for character shingles might be 5 to 10.
def get_character_shingles(text, k=5):
    """Generates a set of character k-grams (shingles) for a string."""
    text = text.lower()  # Normalize case
    shingles = set()
    if len(text) < k:
        return shingles  # Handle short texts
    for i in range(len(text) - k + 1):
        shingles.add(text[i:i+k])
    return shingles

doc1 = "The quick brown fox jumps over the lazy dog."
doc2 = "The quick brown fox jumped over the lazy dogs."  # Slight change
doc3 = "A completely different sentence."

shingles1 = get_character_shingles(doc1, k=5)
shingles2 = get_character_shingles(doc2, k=5)
shingles3 = get_character_shingles(doc3, k=5)

print(f"Doc 1 Shingles (sample): {list(shingles1)[:5]}")
print(f"Doc 2 Shingles (sample): {list(shingles2)[:5]}")
Once documents are represented as sets of shingles (S1, S2), their similarity can be quantified using the Jaccard Similarity coefficient:
J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|
This measures the size of the intersection of the sets divided by the size of their union. A value of 1 means the sets are identical, while 0 means they have no elements in common. Calculating this directly still requires comparing shingle sets, which can be large.
# Calculating Jaccard Similarity directly
intersection_size = len(shingles1.intersection(shingles2))
union_size = len(shingles1.union(shingles2))
jaccard_1_2 = intersection_size / union_size if union_size > 0 else 0
intersection_size_1_3 = len(shingles1.intersection(shingles3))
union_size_1_3 = len(shingles1.union(shingles3))
jaccard_1_3 = intersection_size_1_3 / union_size_1_3 if union_size_1_3 > 0 else 0
print(f"Jaccard(Doc1, Doc2): {jaccard_1_2:.4f}") # Should be high
print(f"Jaccard(Doc1, Doc3): {jaccard_1_3:.4f}") # Should be low
MinHash is a clever technique that allows us to estimate the Jaccard Similarity without explicitly computing intersections and unions of potentially massive shingle sets. It relies on hashing the shingles and observing minimum hash values.
The core idea is this: if we apply a random hash function h to all shingles in both sets S1 and S2, the probability that the minimum hash value obtained from S1 is the same as the minimum hash value obtained from S2 is equal to their Jaccard Similarity.
P(min(h(S1)) = min(h(S2))) = J(S1, S2)
Why does this work? Consider the combined set S1 ∪ S2. If we randomly pick an element x from this union, the probability that x is also in the intersection S1 ∩ S2 is exactly J(S1, S2). Applying a random hash function and picking the minimum hash value acts like randomly sampling an element based on its hash ordering. The shingle that produces the minimum hash value for the union S1 ∪ S2 must belong to either S1 or S2 (or both). It will only yield the same minimum value for both sets if that particular minimum-hashed shingle exists in their intersection.
To get a reliable estimate, we don't use just one hash function. Instead, we use m different hash functions (h1,h2,...,hm). For each document, we compute the minimum hash value obtained across its shingles for each of these m functions. This produces a "signature" vector of m minimum hash values for each document.
Signature(S) = [min(h1(S)), min(h2(S)), ..., min(hm(S))]
The estimated Jaccard Similarity between two documents S1 and S2 is then simply the fraction of positions in their signature vectors where the minimum hash values match.
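To make the signature construction concrete, here is a minimal from-scratch sketch. It simulates a family of hash functions by salting SHA-1 from Python's hashlib; the helper names minhash_signature and estimate_jaccard are illustrative, and production systems use carefully designed hash families such as those in datasketch.
import hashlib

def salted_hash(shingle, seed):
    """Hashes a shingle with an integer seed, simulating one hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def minhash_signature(shingles, num_hashes=128):
    """Builds a signature of per-hash-function minimum values."""
    return [min(salted_hash(s, seed) for s in shingles)
            for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the minima agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

sig1 = minhash_signature(shingles1)
sig2 = minhash_signature(shingles2)
print(f"From-scratch estimate, Jaccard(Doc1, Doc2): {estimate_jaccard(sig1, sig2):.4f}")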
Libraries like datasketch provide efficient implementations.
# Using datasketch for MinHash (Illustrative)
# pip install datasketch
from datasketch import MinHash, MinHashLSH

# Parameters (example values)
num_perm = 128  # Number of hash functions (m)

# Create MinHash objects for each document
m1 = MinHash(num_perm=num_perm)
for shingle in shingles1:
    m1.update(shingle.encode('utf8'))

m2 = MinHash(num_perm=num_perm)
for shingle in shingles2:
    m2.update(shingle.encode('utf8'))

m3 = MinHash(num_perm=num_perm)
for shingle in shingles3:
    m3.update(shingle.encode('utf8'))

# Estimate Jaccard similarity using MinHash signatures
print(f"Estimated Jaccard(Doc1, Doc2): {m1.jaccard(m2):.4f}")
print(f"Estimated Jaccard(Doc1, Doc3): {m1.jaccard(m3):.4f}")
Even with MinHash signatures, comparing all pairs of signatures (O(n²) comparisons) is too slow for billions of documents. Locality-Sensitive Hashing (LSH) is a technique used to efficiently find candidate pairs of documents that are likely to be similar, drastically reducing the number of explicit comparisons needed.
LSH works by hashing the MinHash signatures themselves in a way that similar signatures are more likely to hash to the same bucket. The strategy involves dividing the MinHash signature (of length m) into b bands, each containing r rows (so m = b × r).
For each band, we hash the portion of the signature it contains. Two documents are considered a candidate pair if they hash to the same bucket for at least one band.
By adjusting the parameters b and r, we can tune the trade-off between the probability of finding true near-duplicates (recall) and the number of non-similar pairs flagged as candidates (precision). Increasing r (more rows per band, and therefore fewer bands for a fixed m) makes the LSH criterion stricter, since two documents must agree on all r rows of some band; this reduces false positives but can miss some true near-duplicates. Increasing b (more bands, fewer rows per band) makes it easier for documents to become candidates, increasing recall but also increasing the number of candidate pairs to verify. For a pair with Jaccard similarity s, the probability of becoming a candidate is 1 − (1 − s^r)^b, as the sketch below illustrates.
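A small calculation (a standalone sketch, not part of datasketch) shows how this trade-off behaves for a fixed signature length m = 128 under a few assumed (b, r) splits:
# Probability that a pair with Jaccard similarity s shares a bucket
# in at least one band, for a signature of length m = b * r
def candidate_probability(s, b, r):
    return 1.0 - (1.0 - s ** r) ** b

for b, r in [(32, 4), (16, 8), (8, 16)]:  # all satisfy b * r = 128
    probs = [candidate_probability(s, b, r) for s in (0.5, 0.8, 0.9)]
    print(f"b={b:2d}, r={r:2d} -> "
          f"P(s=0.5)={probs[0]:.3f}, P(s=0.8)={probs[1]:.3f}, P(s=0.9)={probs[2]:.3f}")
Many narrow bands (b=32, r=4) flag even weakly similar pairs, while a few wide bands (b=8, r=16) only flag pairs that are already very similar.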
# Using datasketch for LSH (Illustrative)
# LSH parameters
threshold = 0.8 # Example Jaccard similarity threshold
# Create LSH index - b and r are calculated
# internally by datasketch based on threshold
lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
# Index the MinHash signatures (assuming unique IDs 'doc1', 'doc2', 'doc3')
lsh.insert("doc1", m1)
lsh.insert("doc2", m2)
lsh.insert("doc3", m3)
# ... index millions/billions more
# Query for potential near-duplicates of doc1
result_doc1 = lsh.query(m1)
print(f"Candidates near doc1: {result_doc1}")
# Query for potential near-duplicates of doc3
result_doc3 = lsh.query(m3)
print(f"Candidates near doc3: {result_doc3}")
# After finding candidates via LSH, you would typically
# retrieve the full documents (or detailed hashes) for final verification
# and decide which duplicates to remove (e.g., keep the first encountered).
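A hedged sketch of that verification step follows, reusing the shingle sets and LSH index built above and keeping the first document seen in each duplicate group; the exact_jaccard helper and the reuse of the 0.8 threshold for the exact check are illustrative assumptions.
def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity of two shingle sets."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

shingle_sets = {"doc1": shingles1, "doc2": shingles2, "doc3": shingles3}
kept, removed = [], []

for doc_id, minhash in [("doc1", m1), ("doc2", m2), ("doc3", m3)]:
    # Check LSH candidates against documents we have already decided to keep
    candidates = [c for c in lsh.query(minhash) if c in kept]
    is_duplicate = any(
        exact_jaccard(shingle_sets[doc_id], shingle_sets[c]) >= threshold
        for c in candidates
    )
    (removed if is_duplicate else kept).append(doc_id)

print(f"Kept: {kept}, removed as near-duplicates: {removed}")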
Implementing near-duplicate detection at the scale required for LLM pre-training datasets involves several further considerations: the shingling, MinHash, and LSH indexing steps must be distributed across many workers rather than run on a single machine; signatures and LSH buckets for billions of documents require substantial storage and lookup infrastructure; parameters such as the shingle length k, the number of hash functions, b, r, and the similarity threshold must be tuned to balance recall against verification cost; and a deterministic policy is needed for which document in each duplicate cluster to keep (for example, the first one encountered).
Effectively removing exact and near-duplicates ensures that the final training dataset is more diverse, reducing redundancy and potentially mitigating the amplification of biases present in repeated content. This leads to more efficient training and often results in models with better generalization capabilities.