Enhancing synthetic datasets requires effective data filtering and cleansing pipelines. Generating synthetic data is only the first step; ensuring its quality, relevance, and safety is essential for successful LLM pretraining and fine-tuning. This exercise will guide you through building a Python script to filter a synthetically generated text dataset, a core component of any automated data quality assurance workflow.

We'll focus on creating a flexible script that can identify and remove undesirable data points based on common criteria such as length, repetitiveness, presence of placeholder text, and simple content heuristics. This hands-on experience will equip you to start building your own custom filtering pipelines tailored to your specific data and model requirements.

## Setting the Stage: Why Filter Synthetic Data?

Synthetic data generation, while powerful, isn't always perfect. Generated samples might be too short, overly verbose, contain repetitive phrases, include leftover template markers, or even exhibit undesirable content patterns. Feeding such noisy data into your LLM can degrade performance, introduce biases, or lead to unpredictable model behavior. A well-designed filtering script acts as a quality gate, ensuring that only high-utility data proceeds to the training stage.

Let's imagine we have a dataset of synthetically generated instruction-response pairs, perhaps created using one of the LLM-based methods discussed earlier. Our goal is to clean this dataset.

## Core Components of Our Filtering Script

Our script will consist of several parts:

1. **Data loading:** We'll start by loading our synthetic data, typically from a JSONL file where each line is a JSON object representing a data sample (e.g., an instruction and its corresponding response).
2. **Filter definitions:** We will define a set of individual filter functions. Each function takes a data sample (or a part of it, such as the response text) and returns whether the sample passes the filter, along with a reason when it does not.
3. **Filtering logic:** We'll orchestrate the application of these filters to each data point.
4. **Reporting:** It's important to track which data points are filtered out and why.
5. **Saving results:** Finally, we'll save the cleaned dataset and, optionally, the discarded items for review.

Below is a diagram illustrating the general flow of data through our filtering script.

```dot
digraph G {
    graph [fontname="Arial"];
    node  [fontname="Arial", shape=box, style=rounded, color="#495057", fontcolor="#495057"];
    edge  [fontname="Arial", color="#adb5bd"];
    rankdir=LR;

    RawData           [label="Raw Synthetic\nDataset", shape=cylinder, style="filled", fillcolor="#a5d8ff"];
    LoadData          [label="Load Data\n(e.g., JSONL)"];
    LengthFilter      [label="Length Filter", style="filled", fillcolor="#96f2d7"];
    RepetitionFilter  [label="Repetition Filter", style="filled", fillcolor="#96f2d7"];
    PlaceholderFilter [label="Placeholder Filter", style="filled", fillcolor="#96f2d7"];
    KeywordFilter     [label="Keyword Filter", style="filled", fillcolor="#96f2d7"];
    ComplexityFilter  [label="Complexity Filter", style="filled", fillcolor="#96f2d7"];
    FilteredData      [label="Cleaned Synthetic\nDataset", shape=cylinder, style="filled", fillcolor="#69db7c"];
    DiscardedData     [label="Discarded Data\n(with reasons)", shape=cylinder, style="filled", fillcolor="#ffc9c9"];

    RawData -> LoadData;
    LoadData -> LengthFilter;
    LengthFilter -> RepetitionFilter       [label=" passed"];
    LengthFilter -> DiscardedData          [label=" failed", fontcolor="#f03e3e", color="#f03e3e"];
    RepetitionFilter -> PlaceholderFilter  [label=" passed"];
    RepetitionFilter -> DiscardedData      [label=" failed", fontcolor="#f03e3e", color="#f03e3e"];
    PlaceholderFilter -> KeywordFilter     [label=" passed"];
    PlaceholderFilter -> DiscardedData     [label=" failed", fontcolor="#f03e3e", color="#f03e3e"];
    KeywordFilter -> ComplexityFilter      [label=" passed"];
    KeywordFilter -> DiscardedData         [label=" failed", fontcolor="#f03e3e", color="#f03e3e"];
    ComplexityFilter -> FilteredData       [label=" passed"];
    ComplexityFilter -> DiscardedData      [label=" failed", fontcolor="#f03e3e", color="#f03e3e"];
}
```

The diagram shows how raw synthetic data is loaded and sequentially passed through various filters. Data failing any filter is moved to a "Discarded Data" collection, while data passing all filters forms the "Cleaned Synthetic Dataset."

## Implementing the Filtering Script in Python

Let's start building our script. First, ensure you have Python installed. You might also need to install a library for text complexity metrics, like `textstat`. You can install it using pip:

```bash
pip install textstat
```

Now, let's create our Python script, `filter_synthetic_data.py`.
### 1. Imports and Initial Setup

We'll need `json` for handling JSONL data, `re` for regular expressions (useful for placeholder detection), `collections.Counter` for n-gram analysis in repetition detection, and `textstat` for complexity scores.

```python
import json
import re
from collections import Counter

import textstat  # For Flesch Reading Ease

# Configuration for filters
MIN_RESPONSE_WORDS = 10
MAX_RESPONSE_WORDS = 300
MAX_TRIGRAM_REPETITION_RATIO = 0.3  # Max 30% of trigrams can be the most common one
PLACEHOLDER_PATTERNS = [
    r"\[insert.*here\]",
    r"\(your answer\)",
    r"__+",
    r"YOUR_RESPONSE_HERE",
]
FORBIDDEN_KEYWORDS = ["unsafe_content", "problematic_phrase"]  # Example keywords
MIN_FLESCH_READING_EASE = 30.0  # Scores below 30 are considered difficult (more complex)


# Helper for word count
def word_count(text):
    return len(text.split())
```

### 2. Defining Filter Functions

Each filter function takes the relevant text (e.g., the model's response) and returns a tuple: a boolean indicating whether the text passes, plus a reason string (empty when it passes).

#### Length Filter

This filter checks whether the response length (in words) falls within a specified range.

```python
def filter_by_length(text, min_words, max_words):
    count = word_count(text)
    if not (min_words <= count <= max_words):
        return False, f"Length ({count} words) out of range [{min_words}, {max_words}]"
    return True, ""
```

#### Repetition Filter

To detect excessive repetition, we can look at n-gram frequencies. Here, we'll check for trigram repetition: if a single trigram makes up too large a share of all trigrams, the text is likely repetitive. Note that this simple check only flags extreme repetition, where one n-gram dominates; subtler repetition, such as a repeated sentence within otherwise varied text, can slip through.

```python
def get_ngrams(text, n):
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def filter_by_repetition(text, n=3, max_ratio=0.3):
    if not text.strip():  # Avoid division by zero for empty strings
        return True, ""  # Or False, "Empty text", depending on desired behavior
    ngrams = get_ngrams(text, n)
    if not ngrams:  # Not enough words for n-grams
        return True, ""  # Pass if too short for n-gram analysis (the length filter handles short text)
    counts = Counter(ngrams)
    most_common_ngram_count = counts.most_common(1)[0][1]
    repetition_ratio = most_common_ngram_count / len(ngrams)
    if repetition_ratio > max_ratio:
        return False, f"High {n}-gram repetition ratio: {repetition_ratio:.2f} > {max_ratio}"
    return True, ""
```

#### Placeholder Filter

This filter uses regular expressions to find common placeholder patterns.

```python
def filter_by_placeholder(text, patterns):
    for pattern in patterns:
        if re.search(pattern, text, re.IGNORECASE):
            return False, f"Placeholder found: matched '{pattern}'"
    return True, ""
```

#### Forbidden Keyword Filter

A simple filter to check for the presence of specific unwanted keywords. In practice, this would be more sophisticated, possibly using a dedicated content moderation API or model.

```python
def filter_by_keyword(text, keywords):
    text_lower = text.lower()
    for keyword in keywords:
        if keyword.lower() in text_lower:
            return False, f"Forbidden keyword found: '{keyword}'"
    return True, ""
```

#### Complexity Filter (Flesch Reading Ease)

This filter uses the Flesch Reading Ease score. Higher scores indicate easier readability. Depending on the use case, you might want to filter out responses that are too simple (score above some maximum) or too complex (score below some minimum). For this example, we discard responses whose score falls below a minimum, i.e., text that is excessively difficult to read.
```python
def filter_by_complexity_flesch(text, min_score):
    # Readability formulas like Flesch are calibrated on longer passages
    # (roughly 100-word samples), so scores on very short texts are noisy.
    if word_count(text) < 5:  # Flesch score can be unreliable for very short text
        return True, "Text too short for reliable complexity score, passing."
    try:
        score = textstat.flesch_reading_ease(text)
        if score < min_score:  # Lower scores mean more complex text
            return False, f"Flesch Reading Ease score ({score:.2f}) is below minimum ({min_score}), too complex."
        # Example: one might also filter out text that is TOO simple:
        # if score > max_score:
        #     return False, f"Flesch Reading Ease score ({score:.2f}) is above maximum ({max_score}), too simple."
    except Exception as e:  # textstat might fail on certain edge-case inputs
        return True, f"Could not compute Flesch score: {e}, passing."
    return True, ""
```

Note: The Flesch Reading Ease score typically ranges from 0 to 100. A score of 60-70 corresponds to plain English, while a score of 0-30 is usually best understood by university graduates (i.e., complex text). With `MIN_FLESCH_READING_EASE = 30.0`, we therefore discard responses scoring below 30, that is, text harder to read than roughly university-graduate level, which is a reasonable starting point. If you instead want to discard overly simple text, add a maximum-score threshold and reverse the comparison.
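If you want to sanity-check how the metric behaves before picking a threshold, a quick experiment like the following can help. This is only a sketch: the exact scores depend on the `textstat` version and the sample sentences are invented here, but the ordering (simple prose scores high, dense prose scores low) holds.

```python
import textstat

# Hypothetical sample texts for eyeballing the metric's range.
simple = "The cat sat on the mat. It was warm. The cat was happy there."
dense = ("Quantum chromodynamics parameterizes hadronic interactions via "
         "non-Abelian gauge symmetries, necessitating perturbative renormalization.")

print(textstat.flesch_reading_ease(simple))  # high score: easy to read, would pass our filter
print(textstat.flesch_reading_ease(dense))   # low (possibly negative) score: would be discarded
```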
### 3. Main Filtering Logic

Now, let's write the main part of the script that loads the data, applies these filters, and saves the results. We'll assume our input data is in a JSONL file, where each line is a JSON object like `{"instruction": "...", "response": "..."}`.

```python
def process_dataset(input_filepath, output_filepath_passed, output_filepath_failed):
    passed_items = []
    failed_items_log = []

    filters_to_apply = [
        ("length", lambda item: filter_by_length(item['response'], MIN_RESPONSE_WORDS, MAX_RESPONSE_WORDS)),
        ("repetition", lambda item: filter_by_repetition(item['response'], max_ratio=MAX_TRIGRAM_REPETITION_RATIO)),
        ("placeholder", lambda item: filter_by_placeholder(item['response'], PLACEHOLDER_PATTERNS)),
        ("keyword", lambda item: filter_by_keyword(item['response'], FORBIDDEN_KEYWORDS)),
        ("complexity", lambda item: filter_by_complexity_flesch(item['response'], MIN_FLESCH_READING_EASE)),
    ]

    try:
        with open(input_filepath, 'r', encoding='utf-8') as infile:
            for i, line in enumerate(infile):
                try:
                    item = json.loads(line)
                    if 'response' not in item:  # Basic validation
                        failed_items_log.append({"item_index": i, "original_item": item, "reason": "Missing 'response' field"})
                        continue

                    original_item_for_log = item.copy()  # Keep original for logging if it fails
                    passes_all_filters = True
                    fail_reason = ""

                    for filter_name, filter_func in filters_to_apply:
                        passed, reason = filter_func(item)
                        if not passed:
                            passes_all_filters = False
                            fail_reason = f"Failed {filter_name}: {reason}"
                            break  # Stop on first failure

                    if passes_all_filters:
                        passed_items.append(item)
                    else:
                        failed_items_log.append({"item_index": i, "original_item": original_item_for_log, "reason": fail_reason})
                except json.JSONDecodeError:
                    failed_items_log.append({"item_index": i, "line_content": line.strip(), "reason": "JSON Decode Error"})
                    continue
    except FileNotFoundError:
        print(f"Error: Input file not found at {input_filepath}")
        return

    with open(output_filepath_passed, 'w', encoding='utf-8') as outfile_passed:
        for item in passed_items:
            outfile_passed.write(json.dumps(item) + '\n')

    with open(output_filepath_failed, 'w', encoding='utf-8') as outfile_failed:
        for log_entry in failed_items_log:
            outfile_failed.write(json.dumps(log_entry) + '\n')

    print("Processing complete.")
    print(f"Total items processed: {len(passed_items) + len(failed_items_log)}")
    print(f"Items passed: {len(passed_items)}")
    print(f"Items failed: {len(failed_items_log)}")
    print(f"Passed items saved to: {output_filepath_passed}")
    print(f"Failed items log saved to: {output_filepath_failed}")


if __name__ == "__main__":
    # Create a dummy input file for demonstration
    dummy_data = [
        {"instruction": "Explain gravity.",
         "response": "Gravity is a fundamental force of attraction that acts between all objects with mass. It's what keeps planets in orbit and what makes apples fall to the ground. The more mass an object has, the stronger its gravitational pull. This concept was famously formulated by Sir Isaac Newton and later refined by Albert Einstein's theory of general relativity, which describes gravity as a curvature of spacetime caused by mass and energy."},
        # Fails length (only 1 word)
        {"instruction": "What is 1+1?", "response": "Two."},
        # Repetitive to a human reader, but no single trigram exceeds the 0.3 ratio, so it passes this simple check
        {"instruction": "Tell me a joke.",
         "response": "Why did the chicken cross the road? Why did the chicken cross the road? Why did the chicken cross the road? To get to the other side."},
        # Fails placeholder
        {"instruction": "Summarize this document.",
         "response": "[insert summary here] Please provide the full document text so that I can summarize it for you."},
        # Fails keyword
        {"instruction": "Describe a cat.",
         "response": "A cat is a small carnivorous mammal. It is the only domesticated species in the family Felidae and has been cohabiting with humans for at least 9,500 years. Cats are valued by humans for companionship and their ability to hunt vermin. This is an example of unsafe_content that should be filtered."},
        # Might fail complexity or pass
        {"instruction": "Define photosynthesis.",
         "response": "Photosynthesis is the process used by plants, algae, and certain bacteria to convert energy from sunlight into chemical energy. This is very important for life on Earth. The equation is complex. It's really really really critical."},
        # Fails repetition (100 identical words, so one trigram dominates)
        {"instruction": "Another short one", "response": ("ok " * 100).strip()},
        # Should pass
        {"instruction": "What is Python?",
         "response": "Python is a high-level, interpreted programming language known for its readability and versatility. It supports multiple programming approaches and has a standard library, making it suitable for web development, data science, artificial intelligence, and more. It's good."},
    ]

    input_file = "dummy_synthetic_data.jsonl"
    with open(input_file, 'w', encoding='utf-8') as f:
        for entry in dummy_data:
            f.write(json.dumps(entry) + '\n')

    output_passed_file = "filtered_data_passed.jsonl"
    output_failed_file = "filtered_data_failed_log.jsonl"

    process_dataset(input_file, output_passed_file, output_failed_file)
```
### 4. Running the Script

Save the complete code as `filter_synthetic_data.py`. When you run it from your terminal:

```bash
python filter_synthetic_data.py
```

It will:

- Create a `dummy_synthetic_data.jsonl` file.
- Process this file using the defined filters.
- Create `filtered_data_passed.jsonl` with items that passed all filters.
- Create `filtered_data_failed_log.jsonl` detailing which items failed and why.

You can then inspect the output files to see the filtering in action. For example, `filtered_data_failed_log.jsonl` might contain entries like:

```jsonl
{"item_index": 1, "original_item": {"instruction": "What is 1+1?", "response": "Two."}, "reason": "Failed length: Length (1 words) out of range [10, 300]"}
{"item_index": 3, "original_item": {"instruction": "Summarize this document.", "response": "[insert summary here] Please provide the full document text so that I can summarize it for you."}, "reason": "Failed placeholder: Placeholder found: matched '\\[insert.*here\\]'"}
```

## Discussion and Iterative Refinement

This script provides a solid foundation for filtering synthetic data. Here are some ways you can extend and improve it as part of an iterative refinement process:
- **More Sophisticated Filters:**
  - *Perplexity-based filtering:* Use another LLM to score the perplexity of generated text. Unusually high or low perplexity can indicate poor quality (see the sketch at the end of this section).
  - *Semantic Similarity:* Filter out responses that are too similar to their prompts or to other responses in the dataset (to improve diversity).
  - *Factuality Checking:* Integrate external knowledge bases or models to verify factual claims, though this is a complex area.
  - *Toxicity and Bias Classifiers:* Use specialized models to detect and filter harmful content more reliably than simple keyword matching.
- **Configuration Management:** Move filter parameters (thresholds, keywords, patterns) to a separate configuration file (e.g., YAML or JSON) for easier management.
- **Modularity:** Organize filters into classes or separate modules for better code structure, especially as the number of filters grows.
- **Integration into Pipelines:** This script can be a single step in a larger data processing pipeline (e.g., using Apache Airflow, Kubeflow Pipelines, or custom scripting) that automates generation, filtering, formatting, and training.
- **Impact Analysis:** Always analyze the impact of your filtering strategy. How much data is being discarded? Are you inadvertently removing valuable, albeit slightly imperfect, data? Test the performance of LLMs trained on data filtered with different stringencies to find the optimal balance.

Building effective data filtering pipelines is an ongoing process. As you gain more experience with your specific synthetic data generation methods and observe their failure modes, you will continuously refine your filters and thresholds. This iterative approach is essential to maintaining high data quality and, ultimately, to building more capable and reliable Large Language Models.
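To make the perplexity suggestion above more concrete, here is a minimal sketch of such a filter. It assumes the Hugging Face `transformers` and `torch` packages are installed and uses GPT-2 purely as an example scoring model; the `min_ppl`/`max_ppl` thresholds are illustrative placeholders, not recommendations.

```python
# Sketch of a perplexity-based filter (assumes `pip install torch transformers`).
# GPT-2 is an arbitrary example scorer; any causal LM can be swapped in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

_tokenizer = AutoTokenizer.from_pretrained("gpt2")
_model = AutoModelForCausalLM.from_pretrained("gpt2")
_model.eval()


def filter_by_perplexity(text, min_ppl=5.0, max_ppl=1000.0):
    """Pass text whose perplexity under the scoring model lies in [min_ppl, max_ppl].

    Very low perplexity often signals degenerate, repetitive text; very high
    perplexity often signals noise. The thresholds here are illustrative only.
    """
    inputs = _tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    if inputs["input_ids"].shape[1] < 2:
        return True, "Too short to score, passing."
    with torch.no_grad():
        # For causal LMs, passing the input ids as labels yields the mean
        # cross-entropy loss; exponentiating it gives perplexity.
        loss = _model(**inputs, labels=inputs["input_ids"]).loss
    ppl = torch.exp(loss).item()
    if not (min_ppl <= ppl <= max_ppl):
        return False, f"Perplexity ({ppl:.1f}) outside range [{min_ppl}, {max_ppl}]"
    return True, ""
```

Because it returns the same `(passed, reason)` tuple as the filters above, it could be appended to `filters_to_apply` in `process_dataset` without further changes, though loading a scoring model adds noticeable startup and per-item cost.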