Before an LLM can leverage your specific data, that data must first be brought into a format LlamaIndex understands. This initial step transforms raw information from its original source, whether a plain text file, a PDF document, a web page, or even a database, into a standardized representation. LlamaIndex handles this ingestion through components referred to as Readers or Data Loaders.
The fundamental unit of data within LlamaIndex is the Document object. A Document typically contains the text content extracted from the source, along with associated metadata (such as the filename or URL). The primary role of a Reader is to take input data from a specific source and produce one or more Document objects.
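Conceptually, a Document is little more than a block of text paired with a metadata dictionary. The following is a simplified stand-in written in plain Python to illustrate the shape of the object, not the real LlamaIndex class:

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Simplified stand-in for a LlamaIndex Document: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)

doc = SimpleDocument(
    text="LlamaIndex ingests raw data into Document objects.",
    metadata={"file_name": "notes.txt"},
)
print(doc.metadata["file_name"])  # the source filename travels with the text
```

Because the metadata rides along with the text, later stages (indexing, retrieval) can cite where a piece of content came from.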
Let's look at how to load data from common sources.
Readers abstract away the complexities of accessing and parsing different file types and data sources. LlamaIndex provides built-in readers for many common formats and maintains a larger collection of community-contributed connectors on LlamaHub for more specialized sources.
The general pattern is to import the appropriate reader, instantiate it (often pointing it at the data source), and call its load_data() method.
```python
# Example structure (the specific reader varies by source)
from llama_index.core import SimpleDirectoryReader  # or another specific reader

# Instantiate the reader, pointing it at the data source.
# For file-based readers, this is often a directory path.
reader = SimpleDirectoryReader("./data_directory")

# Load the data into Document objects
documents = reader.load_data()

# 'documents' is now a list of LlamaIndex Document objects
print(f"Loaded {len(documents)} document(s).")

# Example: accessing the text of the first document
if documents:
    print(f"First document text snippet: {documents[0].text[:100]}...")
```
Loading Local Files (SimpleDirectoryReader)

One of the most common tasks is loading data from files stored on your local machine. The SimpleDirectoryReader is a versatile tool for this. By default, it can handle various file types, including .txt, .pdf, .docx, .json, and .md, provided the necessary underlying libraries are installed.
```python
# Some file types need extra libraries for parsing
# pip install llama-index pypdf  # pypdf adds PDF support

from llama_index.core import SimpleDirectoryReader

# Point SimpleDirectoryReader at a folder containing your documents
reader = SimpleDirectoryReader("./path/to/your/docs")
documents = reader.load_data()

print(f"Successfully loaded {len(documents)} documents from the directory.")
```
SimpleDirectoryReader iterates through the specified directory. For each supported file it finds, it extracts the text content and creates a Document object; the document's metadata automatically includes details such as the file path and name. You can also configure it to search subdirectories recursively or to filter for specific file extensions.
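The recursive traversal and extension filtering can be pictured with the standard library's pathlib. This is only a conceptual sketch of the behavior, not the reader's actual implementation:

```python
import tempfile
from pathlib import Path

def find_supported_files(root, extensions, recursive=True):
    """Roughly what a directory reader does first: walk a folder and
    keep only files whose extension is in the supported set."""
    pattern = "**/*" if recursive else "*"
    return sorted(
        p for p in Path(root).glob(pattern)
        if p.is_file() and p.suffix.lower() in extensions
    )

# Demo on a throwaway directory with a subfolder
root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "a.txt").write_text("hello")
(root / "sub" / "b.md").write_text("world")
(root / "ignore.bin").write_bytes(b"\x00")

files = find_supported_files(root, {".txt", ".md"})
print([p.name for p in files])  # ['a.txt', 'b.md']
```

The unsupported .bin file is skipped, while the recursive glob picks up the Markdown file in the subdirectory.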
To ingest content directly from web pages, LlamaIndex offers readers designed for web scraping. A common choice is SimpleWebPageReader (other readers and frameworks offer similar capabilities, often built on libraries like BeautifulSoup).
You'll typically need to install libraries for web requests and HTML parsing.
```python
# Install the web reader package (it pulls in its HTML-parsing dependencies)
# pip install llama-index llama-index-readers-web

from llama_index.readers.web import SimpleWebPageReader

# List of URLs to load
urls = ["https://www.example.com/page1", "https://www.anothersite.org/article"]

# Instantiate the reader.
# Set html_to_text=True to attempt basic HTML tag removal.
reader = SimpleWebPageReader(html_to_text=True)

# Load data from the URLs; this returns a list of Documents, one per URL
web_documents = reader.load_data(urls)

print(f"Loaded content from {len(web_documents)} web page(s).")
if web_documents:
    print(f"Snippet from first web page: {web_documents[0].text[:150]}...")
```
The web reader fetches the HTML content from each URL, optionally processes it to extract the main text content, and creates a Document object for each page. The URL is usually stored in the document's metadata.
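The HTML-to-text step can be pictured with the standard library's HTMLParser. This is just a conceptual sketch of tag stripping, not the actual mechanism the reader uses:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML page, ignoring tags
    and the contents of <script>/<style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

    def text(self):
        return " ".join(self.parts)

html = "<html><head><style>p{color:red}</style></head><body><h1>Title</h1><p>Some body text.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
print(parser.text())  # Title Some body text.
```

Real scrapers do considerably more (boilerplate removal, link handling, encoding detection), but the core idea is the same: discard markup, keep the readable text.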
Data from various sources (text files, PDFs, web pages) is processed by the appropriate LlamaIndex Reader (e.g., SimpleDirectoryReader, SimpleWebPageReader) to create standardized Document objects containing text and metadata.
LlamaIndex supports a wide array of data sources beyond simple files and web pages. Through built-in readers and the extensive collection available on LlamaHub (a community registry of connectors), you can connect to many other systems, including databases, APIs, and popular SaaS applications.
Exploring the LlamaIndex documentation or LlamaHub reveals the breadth of available connectors. Using them generally follows the same pattern: install the dependencies, import the specific reader, instantiate it with the necessary configuration (such as API keys or connection strings), and call load_data().
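The same pattern applies if you sketch a connector of your own: configure it at construction time, then emit documents from load_data(). The toy CSV reader below follows that shape using a simplified Document stand-in, not the real LlamaIndex base classes:

```python
import csv
import io

class MiniDocument:
    """Simplified stand-in for a LlamaIndex Document."""
    def __init__(self, text, metadata=None):
        self.text = text
        self.metadata = metadata or {}

class MiniCSVReader:
    """Toy reader following the usual connector pattern:
    configuration on construction, documents out of load_data()."""
    def __init__(self, delimiter=","):
        self.delimiter = delimiter

    def load_data(self, file_obj, source_name="<csv>"):
        rows = csv.reader(file_obj, delimiter=self.delimiter)
        return [
            MiniDocument(
                text=", ".join(row),
                metadata={"source": source_name, "row": i},
            )
            for i, row in enumerate(rows)
        ]

data = io.StringIO("name,role\nAda,engineer")
docs = MiniCSVReader().load_data(data, source_name="people.csv")
print(len(docs), "-", docs[0].text)  # 2 - name, role
```

Each row becomes one document, and the metadata records both the source and the row number, so downstream steps can trace answers back to the original data.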
The outcome of this loading stage is consistent: a list of Document objects. These objects serve as the standardized input for the next crucial step in preparing your data for the LLM: indexing.
© 2025 ApX Machine Learning