To effectively connect your external data sources with Large Language Models, LlamaIndex introduces two fundamental building blocks: `Nodes` and `Indexes`. Understanding these is foundational to using the library for data ingestion and retrieval.
Think of a `Node` as the smallest, self-contained "chunk" of information derived from your source documents. When you load data (a PDF file, a web page, or text from a database) into LlamaIndex, it is typically broken down into these manageable `Node` objects. Each `Node` encapsulates not just a segment of the original text but also associated metadata and relationships. The primary components of a `Node` are:

- Text: the actual content of the chunk.
- Metadata: contextual information about the chunk, such as the source file name or page number.
- Relationships: links to other `Node` objects, such as the previous and next chunks from the same document.
For example, if you load a 10-page report, LlamaIndex might parse it into multiple `Node` objects. One `Node` might contain a paragraph from page 3, with metadata indicating `{"source_document": "report.pdf", "page_number": 3}`. Another `Node` might contain the next paragraph, linked to the first via relationships. This process of breaking larger documents down into smaller, indexable `Node` objects is often referred to as "chunking".
```python
# Simplified representation of a Node object
class Node:
    def __init__(self, text, metadata=None, relationships=None):
        self.text = text                          # The actual text content of the chunk
        self.metadata = metadata or {}            # e.g., {'file_name': 'chapter_1.txt', 'section': 'Introduction'}
        self.relationships = relationships or {}  # e.g., {'previous': node_id_1, 'next': node_id_3}

# Example:
node_text = "LlamaIndex uses Nodes to represent chunks of data..."
node_metadata = {"source_doc": "documentation.html", "chunk_id": 101}

a_node = Node(text=node_text, metadata=node_metadata)
print(f"Node Text: {a_node.text}")
print(f"Node Metadata: {a_node.metadata}")
```
While `Nodes` represent the individual pieces of data, an `Index` is the data structure that organizes those `Nodes` to enable efficient searching and retrieval. You build an `Index` over a collection of `Node` objects.

The primary purpose of an `Index` is to let you quickly find the most relevant `Nodes` for a query, which could be a natural language question, keywords, or another piece of text. Different types of indexes employ different strategies for this organization.
A common example is the vector store index, which stores `Node` text embeddings (numerical representations capturing semantic meaning) in a specialized database called a vector store. Queries are embedded in the same way, and the index finds the nodes whose embeddings are closest in meaning to the query embedding, which makes it powerful for semantic search.

LlamaIndex abstracts away much of the complexity of creating and managing these structures. You typically load your data, which creates `Nodes`, and then instantiate an `Index` class with those `Nodes`.
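As a minimal sketch of that workflow, assuming a recent llama-index release, an OpenAI API key in the environment for the default embedding model, and a local `data/` directory of source files:

```python
# Sketch: load documents, chunk them into Nodes, and build a vector index.
# Assumes pip install llama-index, OPENAI_API_KEY set for the default
# embedding model, and a local "data/" directory containing source files.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()  # parse files into Documents
index = VectorStoreIndex.from_documents(documents)     # chunks into Nodes, embeds, stores
```

Behind the scenes, `from_documents` runs the chunking step shown earlier, embeds each resulting `Node`, and stores the vectors for later lookup.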
Diagram: a source document is broken down into Nodes, which are then organized within a LlamaIndex Index structure to facilitate retrieval based on a user query.
This separation of data into `Nodes` and their organization into `Indexes` provides several advantages:

- Data can be managed and updated at the granularity of individual `Nodes`.
- `Indexes` are optimized for fast retrieval, which is essential when dealing with large amounts of data. Searching through raw documents for relevant passages would be far too slow for real-time applications.
- Different `Index` types cater to different retrieval needs (semantic search, keyword search, summarization).
- By retrieving only the `Nodes` relevant to a query, you can provide focused, concise context to an LLM, respecting context window limits and improving response quality.

When you query a LlamaIndex `Index`, it returns the most relevant `Node`(s). The text and metadata from these `Nodes` are then typically formatted into a prompt and sent to an LLM, as sketched below. This is the core mechanism behind Retrieval-Augmented Generation (RAG), enabling LLMs to answer questions or generate text based on your specific, external data rather than just their internal training knowledge. You'll learn more about building full RAG systems in the next chapter.
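Here is a sketch of that query path, continuing from the `index` built above (same assumptions: a recent llama-index release and a configured LLM and embedding model):

```python
# Sketch: query the index; the retrieved Nodes are packed into the LLM prompt.
query_engine = index.as_query_engine(similarity_top_k=3)  # retrieve up to 3 Nodes per query
response = query_engine.query("What does the report say about revenue?")

print(response)                    # the LLM's answer, grounded in the retrieved Nodes
for src in response.source_nodes:  # inspect which chunks were used and how well they matched
    print(src.score, src.node.metadata)
```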