In the previous sections, we examined advanced strategies for enhancing synthetic datasets, including the critical need for data filtering and cleansing pipelines. Generating vast amounts of synthetic data is only the first step; ensuring its quality, relevance, and safety is essential for successful LLM pretraining and fine-tuning. This practical exercise will guide you through building a Python script to filter a synthetically generated text dataset, a core component of any automated data quality assurance workflow.
We'll focus on creating a flexible script that can identify and remove undesirable data points based on common criteria such as length, repetitiveness, presence of placeholder text, and simple content heuristics. This hands-on experience will equip you to start building your own custom filtering pipelines tailored to your specific data and model requirements.
Synthetic data generation, while powerful, isn't always perfect. Generated samples might be too short, overly verbose, contain repetitive phrases, include leftover template markers, or even exhibit undesirable content patterns. Feeding such noisy data into your LLM can degrade performance, introduce biases, or lead to unpredictable model behavior. A well-designed filtering script acts as a quality gate, ensuring that only high-utility data proceeds to the training stage.
Let's imagine we have a dataset of synthetically generated instruction-response pairs, perhaps created using one of the LLM-based methods discussed earlier. Our goal is to clean this dataset.
Our script will consist of several parts: a set of configuration constants that define the filter thresholds, a collection of filter functions, each of which examines one aspect of a sample and returns True if the sample passes the filter and False otherwise (together with a reason when it fails), and a main routine that loads the data, applies the filters, and saves the results.
The overall flow is straightforward: raw synthetic data is loaded and sequentially passed through the various filters. Data failing any filter is moved to a "Discarded Data" collection, while data passing all filters forms the "Cleaned Synthetic Dataset."
Let's start building our script. First, ensure you have Python installed. You might also need to install a library for text complexity metrics, like textstat. You can install it using pip:
pip install textstat
Now, let's create our Python script, filter_synthetic_data.py.
We'll need json for handling JSONL data, re for regular expressions (useful for placeholder detection), collections.Counter for n-gram analysis in repetition detection, and textstat for complexity scores.
import json
import re
from collections import Counter
import textstat # For Flesch Reading Ease
# Configuration for filters
MIN_RESPONSE_WORDS = 10
MAX_RESPONSE_WORDS = 300
MAX_TRIGRAM_REPETITION_RATIO = 0.3 # Max 30% of trigrams can be the most common one
PLACEHOLDER_PATTERNS = [
r"\[insert.*here\]", r"\(your answer\)", r"__+", r"YOUR_RESPONSE_HERE"
]
FORBIDDEN_KEYWORDS = ["unsafe_content", "problematic_phrase"] # Example keywords
MIN_FLESCH_READING_EASE = 30.0 # Scores below 30 are considered difficult (more complex)
# Helper for word count
def word_count(text):
    return len(text.split())
Each filter function will take the relevant text (e.g., the model's response) and return a (passed, reason) pair: True if the text passes, or False together with a short explanation of why it failed.
This filter checks if the response length (in words) falls within a specified range.
def filter_by_length(text, min_words, max_words):
    count = word_count(text)
    if not (min_words <= count <= max_words):
        return False, f"Length ({count} words) out of range [{min_words}, {max_words}]"
    return True, ""
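As a quick sanity check (for example in an interactive session, after defining the constants and function above), calling the filter on an obviously short response returns a failure with an explanatory reason:
# Example usage of the length filter.
passed, reason = filter_by_length("Two.", MIN_RESPONSE_WORDS, MAX_RESPONSE_WORDS)
print(passed, reason)  # False Length (1 words) out of range [10, 300]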
To detect excessive repetition, we can look at n-gram frequencies. Here, we'll check for trigram repetition. If a single trigram makes up too large a percentage of all trigrams, the text might be repetitive.
def get_ngrams(text, n):
    words = text.lower().split()
    return [" ".join(words[i:i+n]) for i in range(len(words)-n+1)]

def filter_by_repetition(text, n=3, max_ratio=0.3):
    if not text.strip():  # Avoid division by zero for empty strings
        return True, ""  # Or False, "Empty text" depending on desired behavior
    ngrams = get_ngrams(text, n)
    if not ngrams:  # Not enough words for n-grams
        return True, ""  # Pass if too short for n-gram analysis (length filter handles short text)
    counts = Counter(ngrams)
    most_common_ngram_count = counts.most_common(1)[0][1]
    repetition_ratio = most_common_ngram_count / len(ngrams)
    if repetition_ratio > max_ratio:
        return False, f"High {n}-gram repetition ratio: {repetition_ratio:.2f} > {max_ratio}"
    return True, ""
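Note that this check targets degenerate, token-level repetition: when a very short phrase repeats over and over, one trigram dominates the counts. Whole repeated sentences spread across many distinct trigrams and may still pass, so pairing this filter with sentence-level deduplication is a common refinement. A quick illustration, again in an interactive session:
# A degenerate sample where a single trigram accounts for all trigrams.
passed, reason = filter_by_repetition("ok ok ok ok ok ok ok ok", max_ratio=MAX_TRIGRAM_REPETITION_RATIO)
print(passed, reason)  # False High 3-gram repetition ratio: 1.00 > 0.3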
This filter uses regular expressions to find common placeholder patterns.
def filter_by_placeholder(text, patterns):
    for pattern in patterns:
        if re.search(pattern, text, re.IGNORECASE):
            return False, f"Placeholder found: matched '{pattern}'"
    return True, ""
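For example, a response that still contains a template marker is rejected, and the failure reason records which pattern matched:
# Example usage of the placeholder filter.
passed, reason = filter_by_placeholder("[insert summary here] Some text.", PLACEHOLDER_PATTERNS)
print(passed, reason)  # False Placeholder found: matched '\[insert.*here\]'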
A simple filter to check for the presence of specific unwanted keywords. In practice, this would be more sophisticated, possibly using a dedicated content moderation API or model.
def filter_by_keyword(text, keywords):
    text_lower = text.lower()
    for keyword in keywords:
        if keyword.lower() in text_lower:
            return False, f"Forbidden keyword found: '{keyword}'"
    return True, ""
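If you later swap the keyword list for a proper moderation model, the replacement can keep the same (passed, reason) interface. The sketch below assumes a hypothetical classifier object exposing a predict(text) method that returns a label and a confidence score; it is not tied to any particular library or API:
# Hypothetical model-based moderation filter; `model` is any object with
# predict(text) -> (label, score), e.g. a thin wrapper around your moderation service.
def filter_by_moderation_model(text, model, unsafe_label="unsafe", threshold=0.5):
    label, score = model.predict(text)  # hypothetical interface, substitute your own
    if label == unsafe_label and score >= threshold:
        return False, f"Moderation model flagged text as '{label}' (score {score:.2f})"
    return True, ""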
This filter uses the Flesch Reading Ease score. Higher scores indicate easier readability. Depending on the use case, we might want to filter out responses that are too simple (score above some maximum) or too complex (score below some minimum). For this example, we'll make sure the text is not excessively difficult by requiring a minimum Flesch Reading Ease score.
def filter_by_complexity_flesch(text, min_score):
    # Readability formulas like Flesch are designed for passages of roughly 100+ words;
    # they still return a score for shorter text, but it is less reliable.
    if word_count(text) < 5:  # Flesch score can be unreliable for very short text
        return True, "Text too short for reliable complexity score, passing."
    try:
        score = textstat.flesch_reading_ease(text)
        if score < min_score:  # Lower scores mean more complex text
            return False, f"Flesch Reading Ease score ({score:.2f}) is below minimum ({min_score}), too complex."
        # Example: one might also filter out text that is TOO simple:
        # if score > max_score:
        #     return False, f"Flesch Reading Ease score ({score:.2f}) is above maximum ({max_score}), too simple."
    except Exception as e:
        # textstat might fail on certain edge-case inputs
        return True, f"Could not compute Flesch score: {e}, passing."
    return True, ""
Note: The Flesch Reading Ease score typically ranges from 0 to 100. A score of 60-70 corresponds to plain English, while a score of 0-30 is usually best understood by university graduates (i.e., quite complex text). Because we require the score to be at least MIN_FLESCH_READING_EASE = 30.0, any text scoring below that is treated as too difficult and discarded; in other words, we filter out text harder than university-graduate level, which is a reasonable starting point. If you instead want to discard overly simple text, add a maximum-score check as sketched in the commented-out lines above.
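If you are unsure where to set the threshold for your own data, a quick way to calibrate it is to print scores for a handful of representative samples and see where they land. A minimal sketch, reusing the textstat import from above (the sample strings here are just stand-ins for real responses):
# Print Flesch Reading Ease scores for a few samples to pick a sensible threshold.
sample_responses = [
    "The cat sat on the mat. It was a sunny day.",
    "Quantum chromodynamics describes the strong interaction between quarks and gluons.",
]
for sample in sample_responses:
    print(f"{textstat.flesch_reading_ease(sample):6.2f}  {sample}")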
Now, let's write the main part of the script that loads data, applies these filters, and saves the results. We'll assume our input data is in a JSONL file, where each line is a JSON object like {"instruction": "...", "response": "..."}.
def process_dataset(input_filepath, output_filepath_passed, output_filepath_failed):
    passed_items = []
    failed_items_log = []
    filters_to_apply = [
        ("length", lambda item: filter_by_length(item['response'], MIN_RESPONSE_WORDS, MAX_RESPONSE_WORDS)),
        ("repetition", lambda item: filter_by_repetition(item['response'], max_ratio=MAX_TRIGRAM_REPETITION_RATIO)),
        ("placeholder", lambda item: filter_by_placeholder(item['response'], PLACEHOLDER_PATTERNS)),
        ("keyword", lambda item: filter_by_keyword(item['response'], FORBIDDEN_KEYWORDS)),
        ("complexity", lambda item: filter_by_complexity_flesch(item['response'], MIN_FLESCH_READING_EASE))
    ]
    try:
        with open(input_filepath, 'r', encoding='utf-8') as infile:
            for i, line in enumerate(infile):
                try:
                    item = json.loads(line)
                    if 'response' not in item:  # Basic validation
                        failed_items_log.append({"item_index": i, "original_item": item, "reason": "Missing 'response' field"})
                        continue
                    original_item_for_log = item.copy()  # Keep original for logging if it fails
                    passes_all_filters = True
                    fail_reason = ""
                    for filter_name, filter_func in filters_to_apply:
                        passed, reason = filter_func(item)
                        if not passed:
                            passes_all_filters = False
                            fail_reason = f"Failed {filter_name}: {reason}"
                            break  # Stop on first failure
                    if passes_all_filters:
                        passed_items.append(item)
                    else:
                        failed_items_log.append({"item_index": i, "original_item": original_item_for_log, "reason": fail_reason})
                except json.JSONDecodeError:
                    failed_items_log.append({"item_index": i, "line_content": line.strip(), "reason": "JSON Decode Error"})
                    continue
    except FileNotFoundError:
        print(f"Error: Input file not found at {input_filepath}")
        return

    with open(output_filepath_passed, 'w', encoding='utf-8') as outfile_passed:
        for item in passed_items:
            outfile_passed.write(json.dumps(item) + '\n')

    with open(output_filepath_failed, 'w', encoding='utf-8') as outfile_failed:
        for log_entry in failed_items_log:
            outfile_failed.write(json.dumps(log_entry) + '\n')

    print("Processing complete.")
    print(f"Total items processed: {len(passed_items) + len(failed_items_log)}")
    print(f"Items passed: {len(passed_items)}")
    print(f"Items failed: {len(failed_items_log)}")
    print(f"Passed items saved to: {output_filepath_passed}")
    print(f"Failed items log saved to: {output_filepath_failed}")
if __name__ == "__main__":
    # Create a dummy input file for demonstration
    dummy_data = [
        {"instruction": "Explain gravity.", "response": "Gravity is a fundamental force of attraction that acts between all objects with mass. It's what keeps planets in orbit and what makes apples fall to the ground. The more mass an object has, the stronger its gravitational pull. This concept was famously formulated by Sir Isaac Newton and later refined by Albert Einstein's theory of general relativity, which describes gravity as a curvature of spacetime caused by mass and energy."},
        {"instruction": "What is 1+1?", "response": "Two."}, # Fails length
        {"instruction": "Tell me a joke.", "response": "Why did the chicken cross the road? Why did the chicken cross the road? Why did the chicken cross the road? To get to the other side."}, # Repetitive, but may pass the trigram filter: each repeated sentence spreads over many distinct trigrams
        {"instruction": "Summarize this document.", "response": "[insert summary here] I cannot create a summary yet because no document was provided, please share the document first."}, # Fails placeholder
        {"instruction": "Describe a cat.", "response": "A cat is a small carnivorous mammal. It is the only domesticated species in the family Felidae and has been cohabiting with humans for at least 9,500 years. Cats are valued by humans for companionship and their ability to hunt vermin. This is an example of unsafe_content that should be filtered."}, # Fails keyword
        {"instruction": "Define photosynthesis.", "response": "Photosynthesis is the process used by plants, algae, and certain bacteria to convert energy from sunlight into chemical energy. This is very important for life on Earth. The equation is complex. It's really really really critical."}, # Might fail complexity or pass.
        {"instruction": "Another short one", "response": "ok " * 40}, # Fails repetition (a single trigram dominates)
        {"instruction": "What is Python?", "response": "Python is a high-level, interpreted programming language known for its readability and versatility. It supports multiple programming paradigms and has a vast standard library, making it suitable for web development, data science, artificial intelligence, and more. It's good."} # May pass, or fail the complexity filter if textstat scores the dense wording below the threshold
    ]
    input_file = "dummy_synthetic_data.jsonl"
    with open(input_file, 'w', encoding='utf-8') as f:
        for entry in dummy_data:
            f.write(json.dumps(entry) + '\n')

    output_passed_file = "filtered_data_passed.jsonl"
    output_failed_file = "filtered_data_failed_log.jsonl"
    process_dataset(input_file, output_passed_file, output_failed_file)
Save the complete code as filter_synthetic_data.py. When you run it from your terminal:
python filter_synthetic_data.py
It will create a dummy_synthetic_data.jsonl input file, write filtered_data_passed.jsonl containing the items that passed all filters, and write filtered_data_failed_log.jsonl detailing which items failed and why.
You can then inspect the output files to see the filtering in action. For example, filtered_data_failed_log.jsonl might contain entries like:
{"item_index": 1, "original_item": {"instruction": "What is 1+1?", "response": "Two."}, "reason": "Failed length: Length (1 words) out of range [10, 300]"}
{"item_index": 2, "original_item": {"instruction": "Tell me a joke.", "response": "Why did the chicken cross the road? Why did the chicken cross the road? Why did the chicken cross the road? To get to the other side."}, "reason": "Failed repetition: High 3-gram repetition ratio: 0.33 > 0.3"}
This script provides a solid foundation for filtering synthetic data. As part of an iterative refinement process, you can extend and improve it, for example by adding new checks (deduplication, language identification, model-based quality scoring), tuning thresholds against samples you inspect by hand, and logging per-filter rejection counts to see which checks remove the most data. One such extension is sketched below.
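The following sketch shows one possible extension: an exact-duplicate check on responses. It follows the same (passed, reason) convention as the filters above, but because it has to remember previously seen responses, it is created once per run rather than defined as a pure function; treat it as a minimal illustration and adapt the normalization to your data.
# Possible extension: drop responses that exactly duplicate one already accepted.
def make_duplicate_filter():
    seen_responses = set()
    def filter_by_duplicate(text):
        normalized = " ".join(text.lower().split())
        if normalized in seen_responses:
            return False, "Exact duplicate of a previously seen response"
        seen_responses.add(normalized)
        return True, ""
    return filter_by_duplicate

# It can then be registered alongside the other filters, for example:
# duplicate_filter = make_duplicate_filter()
# filters_to_apply.append(("duplicate", lambda item: duplicate_filter(item['response'])))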
Building effective data filtering pipelines is an ongoing process. As you gain more experience with your specific synthetic data generation methods and observe the failure modes, you will continuously refine your filters and thresholds. This iterative approach is significant for maintaining high data quality and, ultimately, building more capable and reliable Large Language Models.