\"\"\" clean_content = extract_text_from_html(html_content) print(clean_content) # Main Title # This is important content.Similarly, if you are working with Markdown, you might want to strip the formatting to get to the underlying text. The strip_markdown function can be used for this purpose, converting **bold** text into just bold.Removing Irrelevant and Sensitive ContentFor many RAG applications, elements like URLs, email addresses, and phone numbers are noise that can distort the semantic meaning of a chunk. Removing them helps the embedding model focus on the primary content. More importantly, removing or redacting Personally Identifiable Information (PII) is a standard practice for privacy and data protection.You can remove these elements individually or as part of a larger cleaning pipeline. For sensitive applications, instead of just removing PII, you might choose to redact it, replacing the sensitive data with a generic placeholder.from kerb.safety import redact_pii sensitive_text = \"Please contact support at [email protected] or call 555-123-4567 for help.\" redacted_text = redact_pii(sensitive_text) print(f\"Original: {sensitive_text}\") print(f\"Redacted: {redacted_text}\") # Original: Please contact support at [email protected] or call 555-123-4567 for help. # Redacted: Please contact support at [REDACTED_EMAIL] or call [REDACTED_PHONE] for help.This approach preserves the context that a piece of information was present while protecting the sensitive data itself.Building a Preprocessing PipelineWhile you can apply these functions one by one, it's more efficient to define a reusable preprocessing pipeline. The normalize_text function, combined with a NormalizationConfig, allows you to orchestrate multiple cleaning and normalization steps in a declarative way. You can define different configurations for different types of content or stages of your RAG pipeline.Let's build a standard pipeline for processing web-scraped articles. The goal is to lowercase the text, remove URLs and emails, and normalize all whitespace and quotes.from kerb.preprocessing import normalize_text, NormalizationConfig, NormalizationLevel # Define a configuration for our pipeline web_article_config = NormalizationConfig( level=NormalizationLevel.STANDARD, # Applies standard unicode, quote, and whitespace normalization lowercase=True, remove_urls=True, remove_emails=True, remove_extra_spaces=True ) messy_article_snippet = \"\"\" Check out our new \"Report\" at https://example.com/report! Contact [email protected] for info. It's an in-depth analysis of AI’s impact. \"\"\" # Apply the entire pipeline in one call processed_text = normalize_text(messy_article_snippet, config=web_article_config) print(\"### Before Preprocessing ###\") print(messy_article_snippet) print(\"\\n### After Preprocessing ###\") print(processed_text)This \"before and after\" comparison shows the result of our pipeline:Before PreprocessingCheck out our new \"Report\" at https://example.com/report! Contact [email protected] for info. It's an in-depth analysis of AI’s impact.After Preprocessingcheck out our new \"report\" at ! contact for info. it's an in-depth analysis of ai's impact.The output is now a clean, standardized block of text. This cleaned version is far more suitable for an embedding model, as the noise has been stripped away, leaving only the core textual content. The level of preprocessing is a balance. 
With our documents loaded, chunked, and now thoroughly cleaned, we have a high-quality dataset of text fragments. Each fragment is a semantically coherent and normalized unit of information, ready for the next step: converting text into numerical representations through embeddings.
