A high-quality knowledge base is essential for an effective Retrieval-Augmented Generation (RAG) system. Before an LLM can answer questions about your data, that data must be collected, standardized, and structured. Data loading, or ingestion, is the process of transforming raw information from diverse sources into a consistent format that subsequent stages of the pipeline can process.
Your source material might exist in many forms: plain text files, structured Markdown documents, web pages, PDFs, or even records from a database. The goal of the loading stage is to read this diverse information and represent it in a uniform way.
To handle this variety, we use a standardized data structure: the Document object. Think of it as a container that holds not just the text content, but also important metadata about its origin and characteristics. This structure ensures that no matter where the data comes from, our RAG pipeline knows how to handle it.
A Document object typically contains:

- content: the text of the document itself.
- metadata: a dictionary of information about the document, such as its source, author, or creation date.
- format: the document's format (for example, DocumentFormat.TXT).
- id: a unique identifier for referencing the document throughout the pipeline.
You can create a Document programmatically when dealing with data that doesn't originate from a file, such as a database record or an API response.
from kerb.document import Document, DocumentFormat
# Creating a Document object from an API response
api_response_content = "This document was created programmatically for LLM processing."
custom_doc = Document(
    content=api_response_content,
    metadata={
        "source": "llm_generation",
        "timestamp": "2025-10-14",
        "model": "gpt-4",
        "purpose": "rag_corpus"
    },
    format=DocumentFormat.TXT,
    id="doc_001"
)
print(f"Document ID: {custom_doc.id}")
print(f"Content: {custom_doc.content}")
This ability to create documents on the fly is useful for building dynamic knowledge bases that can be updated from multiple sources.
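For example, you might wrap database rows or API payloads in Document objects as they arrive. Below is a minimal sketch; the records list and its field names are illustrative, and only Document and DocumentFormat come from the toolkit.

from kerb.document import Document, DocumentFormat

# Hypothetical records pulled from a database or API (illustrative data)
records = [
    {"id": "faq_001", "text": "Our API rate limit is 60 requests per minute."},
    {"id": "faq_002", "text": "Embeddings are cached for 24 hours."},
]

# Wrap each record in a Document so the rest of the pipeline
# treats it exactly like file-based content
knowledge_base = [
    Document(
        content=record["text"],
        metadata={"source": "faq_database", "record_id": record["id"]},
        format=DocumentFormat.TXT,
        id=record["id"],
    )
    for record in records
]

print(f"Built {len(knowledge_base)} documents from database records")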
The most common starting point is a collection of files. The load_document function provides a straightforward way to ingest file-based content: it automatically detects the file format from the extension and dispatches to the appropriate loader, which simplifies handling a directory of mixed file types.
Let's say you have a simple text file, guide.txt:
This is a sample document for LLM processing.
It contains multiple lines of text.
LLMs can use this content for various tasks.
You can load it with a single function call:
from kerb.document import load_document
# Assuming 'guide.txt' is in the current directory
doc = load_document("guide.txt")
print(f"Loaded format: {doc.format.value}")
print(f"Content length: {len(doc.content)} characters")
print(f"Metadata: {doc.metadata}")
The resulting doc object now holds the content of the file, along with metadata like the source path, ready for the next step in the pipeline.
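Because load_document dispatches on the file extension, ingesting a folder of mixed file types reduces to a simple loop. Here is a minimal sketch, assuming a docs/ directory containing .txt, .md, and .json files:

from pathlib import Path
from kerb.document import load_document

# Load every supported file in the directory, regardless of format
documents = []
for path in sorted(Path("docs").glob("*")):
    if path.suffix in {".txt", ".md", ".json"}:
        documents.append(load_document(str(path)))

for doc in documents:
    print(f"{doc.format.value}: {doc.metadata.get('source', doc.id)}")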
While text content is the primary payload, metadata is what makes advanced RAG features possible. It allows you to filter searches, verify information, and provide source attribution in the LLM's final response. Some file formats, like Markdown, have established conventions for metadata that can be extracted automatically.
Consider a Markdown file, article.md, with a YAML frontmatter block:
---
title: Sample Document
author: AI Assistant
tags: [example, llm, rag]
---
# Sample Document
This document demonstrates markdown loading.
The load_markdown function (which load_document would use for .md files) automatically parses this frontmatter and places it into the metadata attribute.
from kerb.document import load_markdown
# Assuming 'article.md' is created with the content above
md_doc = load_markdown("article.md")
print(f"Frontmatter: {md_doc.metadata.get('frontmatter', {})}")
print(f"Title: {md_doc.metadata.get('frontmatter', {}).get('title')}")
This automatic extraction helps build a richer, more queryable knowledge base without extra manual effort.
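Once frontmatter lives in metadata, filtering a corpus becomes ordinary dictionary access. The sketch below builds a one-document corpus from the article.md example above and keeps only documents tagged "rag":

from kerb.document import load_markdown

# A small corpus; in practice this would hold many loaded documents
docs = [load_markdown("article.md")]

# Keep only documents whose frontmatter tags include "rag"
rag_docs = [
    doc for doc in docs
    if "rag" in doc.metadata.get("frontmatter", {}).get("tags", [])
]

print(f"Matched {len(rag_docs)} document(s)")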
Not all data is unstructured text. You might have structured data in JSON or CSV files that you want your LLM to reason about. When loading these formats, the loader converts the structured content into a human-readable text representation for the LLM while preserving the original data structure in the metadata.
For a JSON file product.json:
{
    "product": "AI Assistant",
    "features": ["NLP", "RAG", "Embeddings"]
}
Loading it transforms the data like this:
from kerb.document import load_json
json_doc = load_json("product.json")
# The 'content' is a string representation for the LLM
print("### Text Content ###")
print(json_doc.content)
# The 'metadata' holds the original parsed structure
print("\n### Parsed Metadata ###")
print(json_doc.metadata.get('parsed_content', {}))
This approach gives the LLM readable text to work with while making the original structured data available for other parts of the application, such as filtering or data validation.
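Continuing the example, the application can run programmatic checks against the parsed structure while the LLM sees only the text representation:

# Use the parsed structure for checks the LLM never needs to see
parsed = json_doc.metadata.get("parsed_content", {})

if "RAG" in parsed.get("features", []):
    print(f"{parsed['product']} supports RAG")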
Loading clean text files is straightforward, but real-world data is often messy. Web pages are filled with HTML tags, navigation bars, and advertisements. PDFs can introduce awkward line breaks, headers, and footers during text extraction.
While the toolkit provides loaders for these sources, the extracted raw text often requires a dedicated cleaning step. This is where preprocessing, covered in the final section of this chapter, becomes important. The initial loading stage focuses on just one thing: getting the raw text and structure out of the source file. Subsequent steps will refine it.
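As a preview of that cleaning step, even a few lines of plain Python can remove common whitespace artifacts left over from extraction. This is a naive sketch, not the toolkit's preprocessing API:

import re

def naive_clean(text: str) -> str:
    """Strip trailing whitespace and collapse runs of blank lines."""
    text = re.sub(r"[ \t]+\n", "\n", text)  # trailing whitespace on each line
    text = re.sub(r"\n{3,}", "\n\n", text)  # three or more newlines become two
    return text.strip()

print(naive_clean("Extracted text  \n\n\n\nwith PDF artifacts\n"))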
After this loading stage, all your information, regardless of its original format, is standardized into Document objects. You now have a consistent collection of content and metadata, setting a solid foundation for the next critical steps in building your RAG system: text chunking and preprocessing.