Before an LLM can leverage your specific data, that data must first be brought into a format LlamaIndex understands. This initial step transforms raw information from its original source, whether a plain text file, a PDF document, a web page, or even a database, into a standardized representation. LlamaIndex handles this ingestion through components referred to as Readers or Data Loaders.
The fundamental unit of data within LlamaIndex is the Document object. A Document typically contains the text content extracted from the source, along with associated metadata (such as the filename or URL). The primary role of a Reader is to take input data from a specific source and produce one or more Document objects.
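Conceptually, a Document is little more than a block of text paired with a metadata dictionary. The following is a simplified stand-in written in plain Python to illustrate the shape of the object, not the real LlamaIndex class:

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Simplified stand-in for a LlamaIndex Document: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)

doc = SimpleDocument(
    text="LlamaIndex ingests raw data into Document objects.",
    metadata={"file_name": "notes.txt"},
)
print(doc.metadata["file_name"])  # the source filename travels with the text
```

Because the metadata rides along with the text, later stages (indexing, retrieval) can cite where a piece of content came from.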
Let's look at how to load data from common sources.
Readers abstract away the complexities of accessing and parsing different file types and data sources. LlamaIndex provides built-in readers for many common formats and maintains a larger collection of community-contributed connectors on LlamaHub for more specialized sources.
The general pattern is to import the appropriate reader, instantiate it (often pointing it at the data source), and call its load_data() method.
```python
# Example structure (the specific reader varies by source)
from llama_index.core import SimpleDirectoryReader  # or another specific reader

# Instantiate the reader, pointing it at the data source.
# For file-based readers, this is often a directory path.
reader = SimpleDirectoryReader("./data_directory")

# Load the data into Document objects
documents = reader.load_data()

# 'documents' is now a list of LlamaIndex Document objects
print(f"Loaded {len(documents)} document(s).")

# Example: accessing the text of the first document
if documents:
    print(f"First document text snippet: {documents[0].text[:100]}...")
```
Loading Local Files (SimpleDirectoryReader)

One of the most common tasks is loading data from files stored on your local machine. The SimpleDirectoryReader is a versatile tool for this. By default, it can handle various file types, including .txt, .pdf, .docx, .json, and .md, provided the necessary underlying libraries are installed.
```python
# Some file types need extra libraries for parsing
# pip install llama-index pypdf  # pypdf adds PDF support

from llama_index.core import SimpleDirectoryReader

# Point SimpleDirectoryReader at a folder containing your documents
reader = SimpleDirectoryReader("./path/to/your/docs")
documents = reader.load_data()

print(f"Successfully loaded {len(documents)} documents from the directory.")
```
SimpleDirectoryReader iterates through the specified directory. For each supported file it finds, it extracts the text content and creates a Document object; the document's metadata automatically includes details such as the file path and name. You can also configure it to search subdirectories recursively or to filter for specific file extensions.
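The recursive traversal and extension filtering can be pictured with the standard library's pathlib. This is only a conceptual sketch of the behavior, not the reader's actual implementation:

```python
import tempfile
from pathlib import Path

def find_supported_files(root, extensions, recursive=True):
    """Roughly what a directory reader does first: walk a folder and
    keep only files whose extension is in the supported set."""
    pattern = "**/*" if recursive else "*"
    return sorted(
        p for p in Path(root).glob(pattern)
        if p.is_file() and p.suffix.lower() in extensions
    )

# Demo on a throwaway directory with a subfolder
root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "a.txt").write_text("hello")
(root / "sub" / "b.md").write_text("world")
(root / "ignore.bin").write_bytes(b"\x00")

files = find_supported_files(root, {".txt", ".md"})
print([p.name for p in files])  # ['a.txt', 'b.md']
```

The unsupported .bin file is skipped, while the recursive glob picks up the Markdown file in the subdirectory.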
To ingest content directly from web pages, LlamaIndex offers readers designed for web scraping. A common choice is SimpleWebPageReader (other readers and frameworks offer similar capabilities, often built on libraries like BeautifulSoup).
You'll typically need to install libraries for web requests and HTML parsing.
```python
# Install the web reader package (it pulls in its HTML-parsing dependencies)
# pip install llama-index llama-index-readers-web

from llama_index.readers.web import SimpleWebPageReader

# List of URLs to load
urls = ["https://www.example.com/page1", "https://www.anothersite.org/article"]

# Instantiate the reader.
# Set html_to_text=True to attempt basic HTML tag removal.
reader = SimpleWebPageReader(html_to_text=True)

# Load data from the URLs; this returns a list of Documents, one per URL
web_documents = reader.load_data(urls)

print(f"Loaded content from {len(web_documents)} web page(s).")
if web_documents:
    print(f"Snippet from first web page: {web_documents[0].text[:150]}...")
```
The web reader fetches the HTML content from each URL, optionally processes it to extract the main text content, and creates a Document object for each page. The URL is usually stored in the document's metadata.
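The HTML-to-text step can be pictured with the standard library's HTMLParser. This is just a conceptual sketch of tag stripping, not the actual mechanism the reader uses:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML page, ignoring tags
    and the contents of <script>/<style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

    def text(self):
        return " ".join(self.parts)

html = "<html><head><style>p{color:red}</style></head><body><h1>Title</h1><p>Some body text.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print(parser.text())  # Title Some body text.
```

Real scrapers do considerably more (boilerplate removal, link handling, encoding detection), but the core idea is the same: discard markup, keep the readable text.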
Data from various sources (text files, PDFs, web pages) is processed by the appropriate LlamaIndex Reader (e.g., SimpleDirectoryReader, SimpleWebPageReader) to create standardized Document objects containing text and metadata.
LlamaIndex supports a wide array of data sources beyond simple files and web pages. Through built-in readers and the extensive collection available on LlamaHub (a community registry of connectors), you can connect to many other systems, including databases, APIs, and popular SaaS applications.
Exploring the LlamaIndex documentation or LlamaHub reveals the breadth of available connectors. Using them generally follows the same pattern: install the dependencies, import the specific reader, instantiate it with the necessary configuration (such as API keys or connection strings), and call load_data().
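The same pattern applies if you sketch a connector of your own: configure it at construction time, then emit documents from load_data(). The toy CSV reader below follows that shape using a simplified Document stand-in, not the real LlamaIndex base classes:

```python
import csv
import io

class MiniDocument:
    """Simplified stand-in for a LlamaIndex Document."""
    def __init__(self, text, metadata=None):
        self.text = text
        self.metadata = metadata or {}

class MiniCSVReader:
    """Toy reader following the usual connector pattern:
    configuration on construction, documents out of load_data()."""
    def __init__(self, delimiter=","):
        self.delimiter = delimiter

    def load_data(self, file_obj, source_name="<csv>"):
        rows = csv.reader(file_obj, delimiter=self.delimiter)
        return [
            MiniDocument(
                text=", ".join(row),
                metadata={"source": source_name, "row": i},
            )
            for i, row in enumerate(rows)
        ]

data = io.StringIO("name,role\nAda,engineer")
docs = MiniCSVReader().load_data(data, source_name="people.csv")
print(len(docs), "-", docs[0].text)  # 2 - name, role
```

Each row becomes one document, and the metadata records both the source and the row number, so downstream steps can trace answers back to the original data.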
The outcome of this loading stage is consistent: a list of Document objects. These objects serve as the standardized input for the next crucial step in preparing your data for the LLM: indexing.
© 2025 ApX Machine Learning