Your RAG system's effectiveness hinges directly on the quality and accessibility of the knowledge you provide it. As mentioned in the chapter introduction, this knowledge rarely arrives in a perfectly prepared state. It typically exists across various file formats and locations. The first practical step in preparing your data is ingestion or loading: bringing the raw text content from these diverse sources into your processing pipeline.
Think of this stage as gathering your raw materials. Before you can refine gold, you first need to mine the ore. Similarly, before you can chunk and embed text for retrieval, you must first extract that text from its container, whether it's a simple text file, a structured PDF, or a dynamic web page.
Plain text files are often the simplest format to handle. Standard Python libraries provide straightforward ways to read their content.
def load_text_file(file_path):
    """Reads content from a text file."""
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
        return content
    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
        return None
    except Exception as e:
        print(f"Error reading file {file_path}: {e}")
        return None

# Example usage:
file_path = 'documents/my_document.txt'
text_content = load_text_file(file_path)
if text_content:
    print(f"Successfully loaded content from {file_path}")
    # Proceed with processing text_content...
A significant consideration when working with text files is character encoding. While UTF-8 is a widely accepted standard, you might encounter files saved with different encodings (like latin-1 or cp1252). If you see garbled text or encounter UnicodeDecodeError, you may need to specify the correct encoding when opening the file. Identifying the correct encoding can sometimes require inspection or trial-and-error.
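When the encoding is unknown, one pragmatic approach is to try a short list of likely encodings before giving up. The sketch below does exactly that; the candidate list and the errors='replace' fallback are illustrative choices, not fixed rules.

# A minimal sketch: try a few common encodings before falling back.
# The candidate list is an assumption; adjust it for your own data sources.
def load_text_file_flexible(file_path, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Attempts to read a text file using several candidate encodings."""
    for encoding in encodings:
        try:
            with open(file_path, 'r', encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # Try the next candidate encoding
    # Last resort: decode as UTF-8 but replace undecodable bytes
    with open(file_path, 'r', encoding='utf-8', errors='replace') as f:
        return f.read()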
PDF (Portable Document Format) files are ubiquitous for sharing documents while preserving layout, but extracting text from them programmatically can be more involved than from plain text files. PDFs can contain text, images, vector graphics, and complex layout information.
Libraries like pypdf (a popular Python library for working with PDFs) or higher-level tools available in frameworks like LangChain and LlamaIndex can simplify text extraction.
# Using pypdf as an example (install with: pip install pypdf)
from pypdf import PdfReader

def load_pdf_text(file_path):
    """Extracts text content from a PDF file."""
    try:
        reader = PdfReader(file_path)
        text = ""
        for page in reader.pages:
            page_text = page.extract_text()
            if page_text:  # Check if text extraction was successful
                text += page_text + "\n"  # Add newline between pages
        return text.strip()  # Remove leading/trailing whitespace
    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
        return None
    except Exception as e:
        print(f"Error reading PDF file {file_path}: {e}")
        return None

# Example usage:
pdf_path = 'documents/research_paper.pdf'
pdf_text_content = load_pdf_text(pdf_path)
if pdf_text_content:
    print(f"Successfully extracted text from {pdf_path}")
    # Proceed with processing pdf_text_content...
Challenges with PDFs include:
- Scanned or image-based PDFs that contain no text layer at all; these require OCR (optical character recognition) before any text can be extracted (see the sketch after this list).
- Complex layouts, such as multi-column pages, where the extracted text can come out in the wrong reading order.
- Headers, footers, and page numbers that get mixed into the extracted content.
- Tables and figures whose structure is lost when flattened into plain text.
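A page that yields no extractable text is often a scanned image. The helper below is a rough heuristic for flagging such pages, assuming pypdf is installed; it simply reports pages where extract_text() returns nothing, which you could then route to an OCR tool.

# A minimal sketch: flag pages with no extractable text layer (likely scanned images).
from pypdf import PdfReader

def find_pages_needing_ocr(file_path):
    """Returns the indices of pages where pypdf extracts no text."""
    reader = PdfReader(file_path)
    return [i for i, page in enumerate(reader.pages)
            if not (page.extract_text() or "").strip()]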
Web pages are another common source of information. Loading requires two steps: fetching the HTML content from a URL and then parsing the HTML to extract the relevant text, discarding boilerplate like navigation menus, ads, and footers.
Python's requests library is commonly used for fetching web content, and libraries like BeautifulSoup are excellent for parsing HTML.
# Example using requests and BeautifulSoup
# Install with: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

def load_web_page_text(url):
    """Fetches and extracts text content from a web page URL."""
    try:
        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})  # Be a polite scraper
        response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Basic text extraction (can be refined significantly)
        # Remove script and style elements
        for script_or_style in soup(["script", "style"]):
            script_or_style.decompose()
        # Get text, using a separator to ensure space between elements
        text = soup.get_text(separator=' ', strip=True)
        return text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL {url}: {e}")
        return None
    except Exception as e:
        print(f"Error parsing HTML from {url}: {e}")
        return None

# Example usage:
web_url = 'https://example-documentation-site.com/page'
web_text_content = load_web_page_text(web_url)
if web_text_content:
    print(f"Successfully loaded text from {web_url}")
    # Proceed with processing web_text_content...
Web scraping can be fragile. Website structures change, and extracting only the main article content often requires specific targeting of HTML tags (like <article>, <main>, or specific CSS classes) identified by inspecting the page source. Be mindful of website robots.txt files and terms of service regarding automated access.
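For instance, if a site wraps its primary content in an <article> or <main> tag, you can target that element first and fall back to the whole page when it is absent. The sketch below assumes those tags are present on the pages you care about; you would confirm the right selectors by inspecting each site's markup.

# Sketch: prefer the main content container when the page provides one.
# The tag choices ('article', 'main') are assumptions based on common markup.
from bs4 import BeautifulSoup

def extract_main_text(html):
    """Extracts text from the main content area of an HTML page, if one exists."""
    soup = BeautifulSoup(html, 'html.parser')
    for script_or_style in soup(["script", "style"]):
        script_or_style.decompose()
    target = soup.find('article') or soup.find('main') or soup
    return target.get_text(separator=' ', strip=True)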
You might encounter other formats like Microsoft Word (.docx), Markdown (.md), JSON, or CSV files containing textual data relevant to your RAG system. For most common formats, Python libraries exist to parse them (e.g., python-docx for Word documents, json for JSON, csv for CSV).
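As a rough illustration, the helpers below pull plain text out of a Word document and a CSV file. Which paragraphs or columns are worth keeping depends entirely on your data, so treat these as starting points rather than complete loaders; the text_column parameter, for instance, is a placeholder you would set for your own files.

# Sketch: extracting text from a .docx and a .csv file.
# Install python-docx with: pip install python-docx
import csv
from docx import Document as DocxDocument

def load_docx_text(file_path):
    """Joins the paragraph text of a Word document into a single string."""
    doc = DocxDocument(file_path)
    return "\n".join(paragraph.text for paragraph in doc.paragraphs)

def load_csv_text(file_path, text_column):
    """Collects the values of one column from a CSV file."""
    with open(file_path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        return "\n".join(row[text_column] for row in reader if row.get(text_column))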
The general pattern remains the same:
- Identify the format of the source.
- Use an appropriate library to open and parse it.
- Extract the raw text content (along with any useful metadata, such as the source path).
- Handle errors gracefully so one bad file does not halt the whole pipeline.
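One way to express this pattern in code is a small dispatcher that maps file extensions to loader functions. The sketch below assumes the load_text_file, load_pdf_text, and load_docx_text helpers from the earlier examples are in scope; the extension-to-loader mapping is an illustrative choice, not a fixed rule.

# Sketch: route each file to a loader based on its extension.
# Assumes load_text_file, load_pdf_text, and load_docx_text are defined as above.
import os

LOADERS = {
    '.txt': load_text_file,
    '.md': load_text_file,   # Markdown can be read as plain text
    '.pdf': load_pdf_text,
    '.docx': load_docx_text,
}

def load_document(file_path):
    """Selects a loader by file extension and returns the extracted text (or None)."""
    extension = os.path.splitext(file_path)[1].lower()
    loader = LOADERS.get(extension)
    if loader is None:
        print(f"Unsupported format: {extension}")
        return None
    return loader(file_path)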
Many RAG frameworks (like LangChain and LlamaIndex) provide convenient abstractions called Document Loaders. These loaders encapsulate the logic for reading various data sources (files, web pages, databases, APIs like Notion or Slack) and converting them into a standardized Document object. A Document typically contains the extracted text (page_content) and associated metadata (e.g., source file path or URL, page number, chapter).
# Example (syntax may vary based on framework)
# from rag_framework.document_loaders import PyPDFLoader, WebBaseLoader
# pdf_loader = PyPDFLoader("documents/research_paper.pdf")
# web_loader = WebBaseLoader("https://example-documentation-site.com/page")
# pdf_documents = pdf_loader.load() # Returns list of Document objects
# web_documents = web_loader.load() # Returns list of Document objects
# print(pdf_documents[0].page_content)
# print(pdf_documents[0].metadata) # {'source': 'documents/research_paper.pdf', 'page': 0}
Using these loaders can significantly speed up development by providing pre-built integrations for many common data sources. However, understanding the underlying principles of file I/O and parsing, as discussed above, is valuable for troubleshooting and handling less common or custom formats.
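As a concrete illustration, here is roughly what this looks like with LangChain's community loaders, assuming langchain-community and pypdf are installed. Import paths have shifted between LangChain releases, so check the documentation for the version you use.

# Sketch using LangChain's community loaders (pip install langchain-community pypdf)
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("documents/research_paper.pdf")
documents = loader.load()  # One Document per page, with text and metadata
print(documents[0].page_content[:200])
print(documents[0].metadata)  # e.g. {'source': 'documents/research_paper.pdf', 'page': 0}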
Regardless of the method used, the goal of this loading stage is to convert your diverse knowledge sources into a consistent in-memory representation, usually strings or these Document objects, ready for the next crucial step: chunking.