The first step in building a Retrieval-Augmented Generation (RAG) system is to load your source information. This data might exist in various formats: plain text files, structured Markdown documents, web pages, or even tabular data in CSV files. To prepare this diverse information for an LLM, you first need to bring it into a consistent, standardized format.
This is where the document module comes in. It provides a straightforward way to ingest content from different sources and represent it as a universal Document object. This object acts as a standard container, holding both the text content and any associated metadata, which simplifies the subsequent steps of chunking, embedding, and retrieval.
Think of the Document object as a wrapper that standardizes your raw data. Regardless of whether the original source is a simple .txt file or a complex .json file, the pipeline treats it as a Document. This object primarily consists of two parts: content, which holds the extracted text, and metadata, a dictionary of associated information such as the source path, format, or parsed structure.
This abstraction allows the rest of the RAG pipeline to operate on a consistent data structure, without needing to know the specifics of each original file format.
The Document object standardizes ingested data, separating raw text from its associated metadata.
The most convenient way to load files is with the load_document() function. It automatically detects the file type from its extension and uses the appropriate loader, making your ingestion code clean and simple.
For unstructured or semi-structured text, the loader extracts the content and any available metadata. Let's start with a plain text file, sample.txt.
from kerb.document import load_document
# Assumes 'sample.txt' exists in your project directory
doc = load_document("sample.txt")
print(f"Content Preview: {doc.content[:50]}...")
print(f"Metadata: {doc.metadata}")
The output would show the text content and metadata containing the file's source path.
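For reference, the printed result might look like the following. This is illustrative only: the preview text depends on what your sample.txt contains, and the exact metadata keys depend on the loader version.

Content Preview: Retrieval-Augmented Generation lets a model cons...
Metadata: {'source': 'sample.txt'}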
Markdown files are handled similarly, but with an added benefit: the loader automatically parses YAML frontmatter. This is a common pattern for storing metadata like titles, authors, or tags directly within the document.
Consider a file article.md with the following content:
---
title: RAG Systems Explained
author: AI Assistant
tags: [rag, llm, python]
---
# Introduction to RAG
Retrieval-Augmented Generation is a technique...
When you load this file, the frontmatter is extracted into the metadata dictionary.
from kerb.document import load_markdown
md_doc = load_markdown("article.md")
print(f"Title: {md_doc.metadata['frontmatter']['title']}")
print(f"Author: {md_doc.metadata['frontmatter']['author']}")
print(f"Tags: {md_doc.metadata['frontmatter']['tags']}")
This automatic metadata extraction becomes valuable as your RAG system grows more sophisticated, because it lets you filter or search on document properties later in the pipeline.
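For instance, once several Markdown files are loaded, the frontmatter can drive simple metadata filtering. The snippet below is a minimal sketch: the file names are hypothetical, but the frontmatter access pattern matches the example above.

from kerb.document import load_markdown

# Illustrative: keep only documents tagged 'rag' (file names are hypothetical)
paths = ["article.md", "retrieval.md", "evaluation.md"]
docs = [load_markdown(p) for p in paths]

rag_docs = [
    d for d in docs
    if "rag" in d.metadata.get("frontmatter", {}).get("tags", [])
]
print(f"Matched {len(rag_docs)} of {len(docs)} documents")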
While RAG systems are often associated with unstructured text, they can also draw from structured sources like JSON or CSV files. The loaders for these formats convert the structured data into a human-readable string for the content field, while preserving the original structure in the metadata.
When loading a JSON file, the raw string is placed in content, and the parsed dictionary is stored in metadata['parsed_content'].
from kerb.document import load_json
# Assuming 'product.json' contains {"name": "AI Toolkit", "version": "1.2"}
json_doc = load_json("product.json")
print(f"Content:\n{json_doc.content}")
print(f"Parsed Name: {json_doc.metadata['parsed_content']['name']}")
Similarly, for a CSV file, the loader converts the table into a string for content and enriches the metadata with headers and a list of rows, where each row is a dictionary.
from kerb.document import load_csv
# Assuming 'data.csv' has columns: id, category, relevance
csv_doc = load_csv("data.csv")
print(f"Headers: {csv_doc.metadata['headers']}")
print(f"Number of Rows: {csv_doc.metadata['num_rows']}")
print(f"First Row: {csv_doc.metadata['rows'][0]}")
This dual representation gives you the flexibility to use the text for semantic retrieval while retaining the structured data for filtering or other processing tasks.
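As a sketch of that flexibility, you can select rows by a column value while csv_doc.content remains available for embedding and semantic retrieval. This assumes the relevance column from the example above holds numeric values; the threshold is arbitrary.

# Illustrative: filter structured rows while the text form stays
# available in csv_doc.content for semantic retrieval.
relevant_rows = [
    row for row in csv_doc.metadata["rows"]
    if float(row["relevance"]) >= 0.8  # threshold chosen for illustration
]
print(f"Kept {len(relevant_rows)} of {csv_doc.metadata['num_rows']} rows")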
Your data won't always reside in local files. It may come from a web page, a database query, or an API response. In these cases, you can create a Document object programmatically.
For example, to process a web page, you would first fetch its HTML content using a library like requests. Then, you would use a function from the preprocessing module to clean the HTML and extract the main text. Finally, you would instantiate a Document object with this text.
from kerb.document import Document, DocumentFormat
# from kerb.preprocessing import extract_text_from_html
# import requests # This is an external library
# Step 1: Fetch web page content (using an external library)
# html_content = requests.get("https://example.com/article").text
# Step 2: Extract clean text (using a function we'll cover later)
# clean_text = extract_text_from_html(html_content)
clean_text = "This is the clean text extracted from a web page."
# Step 3: Create a Document object programmatically
web_doc = Document(
    content=clean_text,
    metadata={
        "source": "https://example.com/article",
        "fetch_date": "2024-10-26"
    },
    format=DocumentFormat.HTML
)
print(f"Source: {web_doc.metadata['source']}")
print(f"Content Preview: {web_doc.content[:50]}...")
This pattern is extremely versatile. Any data that can be represented as a string can be loaded into a Document, making it a universal entry point into your RAG pipeline.
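The same pattern works for any other string source, such as an API response or a database query result. Here is a minimal sketch: the metadata keys are illustrative, and the format argument is omitted on the assumption that it is optional in this constructor.

# Illustrative: wrap an API response string in a Document.
# Assumes the format argument is optional; include it if your version requires one.
api_payload = '{"status": "ok", "summary": "Ingestion pipeline shipped."}'
api_doc = Document(
    content=api_payload,
    metadata={"source": "reports-api", "endpoint": "/reports/latest"},
)
print(f"Source: {api_doc.metadata['source']}")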
With your raw data now loaded into standardized Document objects, the next challenge is to break these documents into smaller, manageable pieces suitable for an LLM.