A high-quality knowledge base is essential for an effective Retrieval-Augmented Generation (RAG) system. Before an LLM can answer questions about your data, that data must be collected, standardized, and structured. Data loading, or ingestion, is the process of transforming raw information from diverse sources into a consistent format that subsequent stages of the pipeline can process.
Your source material might exist in many forms: plain text files, structured Markdown documents, web pages, PDFs, or even records from a database. The goal of the loading stage is to read this diverse information and represent it in a uniform way.
To handle this variety, we use a standardized data structure: the Document object. Think of it as a container that holds not just the text content, but also important metadata about its origin and characteristics. This structure ensures that no matter where the data comes from, our RAG pipeline knows how to handle it.
A Document object typically contains:

- content: the text of the document itself.
- metadata: a dictionary of information about the document, such as its source, author, or creation date.
- format: the document's format (for example, DocumentFormat.TXT).
- id: a unique identifier for referencing the document throughout the pipeline.
You can create a Document programmatically when dealing with data that doesn't originate from a file, such as a database record or an API response.
from kerb.document import Document, DocumentFormat
# Creating a Document object from an API response
api_response_content = "This document was created programmatically for LLM processing."
custom_doc = Document(
    content=api_response_content,
    metadata={
        "source": "llm_generation",
        "timestamp": "2025-10-14",
        "model": "gpt-4",
        "purpose": "rag_corpus"
    },
    format=DocumentFormat.TXT,
    id="doc_001"
)
print(f"Document ID: {custom_doc.id}")
print(f"Content: {custom_doc.content}")
This ability to create documents on the fly is useful for building dynamic knowledge bases that can be updated from multiple sources.
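For example, you might wrap database rows or API payloads in Document objects as they arrive. Below is a minimal sketch; the records list and its field names are illustrative, and only Document and DocumentFormat come from the toolkit.

from kerb.document import Document, DocumentFormat

# Hypothetical records pulled from a database or API (illustrative data)
records = [
    {"id": "faq_001", "text": "Our API rate limit is 60 requests per minute."},
    {"id": "faq_002", "text": "Embeddings are cached for 24 hours."},
]

# Wrap each record in a Document so the rest of the pipeline
# treats it exactly like file-based content
knowledge_base = [
    Document(
        content=record["text"],
        metadata={"source": "faq_database", "record_id": record["id"]},
        format=DocumentFormat.TXT,
        id=record["id"],
    )
    for record in records
]

print(f"Built {len(knowledge_base)} documents from database records")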
The most common starting point is a collection of files. The load_document function provides a straightforward way to ingest file-based content: it automatically detects the file format from the extension and dispatches to the appropriate loader, which simplifies handling a directory of mixed file types.
Let's say you have a simple text file, guide.txt:
This is a sample document for LLM processing.
It contains multiple lines of text.
LLMs can use this content for various tasks.
You can load it with a single function call:
from kerb.document import load_document
# Assuming 'guide.txt' is in the current directory
doc = load_document("guide.txt")
print(f"Loaded format: {doc.format.value}")
print(f"Content length: {len(doc.content)} characters")
print(f"Metadata: {doc.metadata}")
The resulting doc object now holds the content of the file, along with metadata like the source path, ready for the next step in the pipeline.
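Because load_document dispatches on the file extension, ingesting a folder of mixed file types reduces to a simple loop. Here is a minimal sketch, assuming a docs/ directory containing .txt, .md, and .json files:

from pathlib import Path
from kerb.document import load_document

# Load every supported file in the directory, regardless of format
documents = []
for path in sorted(Path("docs").glob("*")):
    if path.suffix in {".txt", ".md", ".json"}:
        documents.append(load_document(str(path)))

for doc in documents:
    print(f"{doc.format.value}: {doc.metadata.get('source', doc.id)}")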
While text content is the primary payload, metadata is what makes advanced RAG features possible. It allows you to filter searches, verify information, and provide source attribution in the LLM's final response. Some file formats, like Markdown, have established conventions for metadata that can be extracted automatically.
Consider a Markdown file, article.md, with a YAML frontmatter block:
---
title: Sample Document
author: AI Assistant
tags: [example, llm, rag]
---
# Sample Document
This document demonstrates markdown loading.
The load_markdown function (which load_document would use for .md files) automatically parses this frontmatter and places it into the metadata attribute.
from kerb.document import load_markdown
# Assuming 'article.md' is created with the content above
md_doc = load_markdown("article.md")
print(f"Frontmatter: {md_doc.metadata.get('frontmatter', {})}")
print(f"Title: {md_doc.metadata.get('frontmatter', {}).get('title')}")
This automatic extraction helps build a richer, more queryable knowledge base without extra manual effort.
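Once frontmatter lives in metadata, filtering a corpus becomes ordinary dictionary access. The sketch below builds a one-document corpus from the article.md example above and keeps only documents tagged "rag":

from kerb.document import load_markdown

# A small corpus; in practice this would hold many loaded documents
docs = [load_markdown("article.md")]

# Keep only documents whose frontmatter tags include "rag"
rag_docs = [
    doc for doc in docs
    if "rag" in doc.metadata.get("frontmatter", {}).get("tags", [])
]

print(f"Matched {len(rag_docs)} document(s)")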
Not all data is unstructured text. You might have structured data in JSON or CSV files that you want your LLM to reason about. When loading these formats, the loader converts the structured content into a human-readable text representation for the LLM while preserving the original data structure in the metadata.
For a JSON file product.json:
{
    "product": "AI Assistant",
    "features": ["NLP", "RAG", "Embeddings"]
}
Loading it transforms the data like this:
from kerb.document import load_json
json_doc = load_json("product.json")
# The 'content' is a string representation for the LLM
print("### Text Content ###")
print(json_doc.content)
# The 'metadata' holds the original parsed structure
print("\n### Parsed Metadata ###")
print(json_doc.metadata.get('parsed_content', {}))
This approach gives the LLM readable text to work with while making the original structured data available for other parts of the application, such as filtering or data validation.
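Continuing the example, the application can run programmatic checks against the parsed structure while the LLM sees only the text representation:

# Use the parsed structure for checks the LLM never needs to see
parsed = json_doc.metadata.get("parsed_content", {})

if "RAG" in parsed.get("features", []):
    print(f"{parsed['product']} supports RAG")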
Loading clean text files is straightforward, but real-world data is often messy. Web pages are filled with HTML tags, navigation bars, and advertisements. PDFs can introduce awkward line breaks, headers, and footers during text extraction.
While the toolkit provides loaders for these sources, the extracted raw text often requires a dedicated cleaning step. This is where preprocessing, covered in the final section of this chapter, becomes important. The initial loading stage focuses on just one thing: getting the raw text and structure out of the source file. Subsequent steps will refine it.
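As a preview of that cleaning step, even a few lines of plain Python can remove common whitespace artifacts left over from extraction. This is a naive sketch, not the toolkit's preprocessing API:

import re

def naive_clean(text: str) -> str:
    """Strip trailing whitespace and collapse runs of blank lines."""
    text = re.sub(r"[ \t]+\n", "\n", text)  # trailing whitespace on each line
    text = re.sub(r"\n{3,}", "\n\n", text)  # three or more newlines become two
    return text.strip()

print(naive_clean("Extracted text  \n\n\n\nwith PDF artifacts\n"))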
After this loading stage, all your information, regardless of its original format, is standardized into Document objects. You now have a consistent collection of content and metadata, setting a solid foundation for the next critical steps in building your RAG system: text chunking and preprocessing.