The first step in building a Retrieval-Augmented Generation (RAG) system is to load your source information. This data might exist in various formats: plain text files, structured Markdown documents, web pages, or even tabular data in CSV files. To prepare this diverse information for an LLM, you first need to bring it into a consistent, standardized format.
This is where the document module comes in. It provides a straightforward way to ingest content from different sources and represent it as a universal Document object. This object acts as a standard container, holding both the text content and any associated metadata, which simplifies the subsequent steps of chunking, embedding, and retrieval.
Think of the Document object as a wrapper that standardizes your raw data. Regardless of whether the original source is a simple .txt file or a complex .json file, the pipeline treats it as a Document. This object primarily consists of two parts: content, which holds the extracted text, and metadata, a dictionary of associated information such as the source path, format, or parsed structure.
This abstraction allows the rest of the RAG pipeline to operate on a consistent data structure, without needing to know the specifics of each original file format.
The Document object standardizes ingested data, separating raw text from its associated metadata.
The most convenient way to load files is with the load_document() function. It automatically detects the file type from its extension and uses the appropriate loader, making your ingestion code clean and simple.
For unstructured or semi-structured text, the loader extracts the content and any available metadata. Let's start with a plain text file, sample.txt.
from kerb.document import load_document
# Assumes 'sample.txt' exists in your project directory
doc = load_document("sample.txt")
print(f"Content Preview: {doc.content[:50]}...")
print(f"Metadata: {doc.metadata}")
The output would show the text content and metadata containing the file's source path.
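For reference, the printed result might look like the following. This is illustrative only: the preview text depends on what your sample.txt contains, and the exact metadata keys depend on the loader version.

Content Preview: Retrieval-Augmented Generation lets a model cons...
Metadata: {'source': 'sample.txt'}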
Markdown files are handled similarly, but with an added benefit: the loader automatically parses YAML frontmatter. This is a common pattern for storing metadata like titles, authors, or tags directly within the document.
Consider a file article.md with the following content:
---
title: RAG Systems Explained
author: AI Assistant
tags: [rag, llm, python]
---
# Introduction to RAG
Retrieval-Augmented Generation is a technique...
When you load this file, the frontmatter is extracted into the metadata dictionary.
from kerb.document import load_markdown
md_doc = load_markdown("article.md")
print(f"Title: {md_doc.metadata['frontmatter']['title']}")
print(f"Author: {md_doc.metadata['frontmatter']['author']}")
print(f"Tags: {md_doc.metadata['frontmatter']['tags']}")
This automatic metadata extraction becomes valuable as your RAG system grows more sophisticated, because it lets you filter or search on document properties later in the pipeline.
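For instance, once several Markdown files are loaded, the frontmatter can drive simple metadata filtering. The snippet below is a minimal sketch: the file names are hypothetical, but the frontmatter access pattern matches the example above.

from kerb.document import load_markdown

# Illustrative: keep only documents tagged 'rag' (file names are hypothetical)
paths = ["article.md", "retrieval.md", "evaluation.md"]
docs = [load_markdown(p) for p in paths]

rag_docs = [
    d for d in docs
    if "rag" in d.metadata.get("frontmatter", {}).get("tags", [])
]
print(f"Matched {len(rag_docs)} of {len(docs)} documents")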
While RAG systems are often associated with unstructured text, they can also draw from structured sources like JSON or CSV files. The loaders for these formats convert the structured data into a human-readable string for the content field, while preserving the original structure in the metadata.
When loading a JSON file, the raw string is placed in content, and the parsed dictionary is stored in metadata['parsed_content'].
from kerb.document import load_json
# Assuming 'product.json' contains {"name": "AI Toolkit", "version": "1.2"}
json_doc = load_json("product.json")
print(f"Content:\n{json_doc.content}")
print(f"Parsed Name: {json_doc.metadata['parsed_content']['name']}")
Similarly, for a CSV file, the loader converts the table into a string for content and enriches the metadata with headers and a list of rows, where each row is a dictionary.
from kerb.document import load_csv
# Assuming 'data.csv' has columns: id, category, relevance
csv_doc = load_csv("data.csv")
print(f"Headers: {csv_doc.metadata['headers']}")
print(f"Number of Rows: {csv_doc.metadata['num_rows']}")
print(f"First Row: {csv_doc.metadata['rows'][0]}")
This dual representation gives you the flexibility to use the text for semantic retrieval while retaining the structured data for filtering or other processing tasks.
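As a sketch of that flexibility, you can select rows by a column value while csv_doc.content remains available for embedding and semantic retrieval. This assumes the relevance column from the example above holds numeric values; the threshold is arbitrary.

# Illustrative: filter structured rows while the text form stays
# available in csv_doc.content for semantic retrieval.
relevant_rows = [
    row for row in csv_doc.metadata["rows"]
    if float(row["relevance"]) >= 0.8  # threshold chosen for illustration
]
print(f"Kept {len(relevant_rows)} of {csv_doc.metadata['num_rows']} rows")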
Your data won't always reside in local files. It may come from a web page, a database query, or an API response. In these cases, you can create a Document object programmatically.
For example, to process a web page, you would first fetch its HTML content using a library like requests. Then, you would use a function from the preprocessing module to clean the HTML and extract the main text. Finally, you would instantiate a Document object with this text.
from kerb.document import Document, DocumentFormat
# from kerb.preprocessing import extract_text_from_html
# import requests # This is an external library
# Step 1: Fetch web page content (using an external library)
# html_content = requests.get("https://example.com/article").text
# Step 2: Extract clean text (using a function we'll cover later)
# clean_text = extract_text_from_html(html_content)
clean_text = "This is the clean text extracted from a web page."
# Step 3: Create a Document object programmatically
web_doc = Document(
    content=clean_text,
    metadata={
        "source": "https://example.com/article",
        "fetch_date": "2024-10-26"
    },
    format=DocumentFormat.HTML
)
print(f"Source: {web_doc.metadata['source']}")
print(f"Content Preview: {web_doc.content[:50]}...")
This pattern is extremely versatile. Any data that can be represented as a string can be loaded into a Document, making it a universal entry point into your RAG pipeline.
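The same pattern works for any other string source, such as an API response or a database query result. Here is a minimal sketch: the metadata keys are illustrative, and the format argument is omitted on the assumption that it is optional in this constructor.

# Illustrative: wrap an API response string in a Document.
# Assumes the format argument is optional; include it if your version requires one.
api_payload = '{"status": "ok", "summary": "Ingestion pipeline shipped."}'
api_doc = Document(
    content=api_payload,
    metadata={"source": "reports-api", "endpoint": "/reports/latest"},
)
print(f"Source: {api_doc.metadata['source']}")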
With your raw data now loaded into standardized Document objects, the next challenge is to break these documents into smaller, manageable pieces suitable for an LLM.