Once your massive text corpus has been scraped, cleaned, and preprocessed, you face the non-trivial challenge of storing it efficiently and accessing it rapidly during training. Choosing the right storage format is a foundational decision that impacts storage costs, I/O performance, and the overall efficiency of your data loading pipeline. While simple formats are easy to start with, specialized formats become increasingly beneficial as datasets scale into terabytes or petabytes.
Let's examine the common choices: plain text files, Apache Arrow, and Apache Parquet.
Plain Text Files (.txt, .jsonl)

The most straightforward approach is often storing text data in plain text files. Each line might represent a document, or you might use formats like JSON Lines (.jsonl), where each line is a valid JSON object representing a data record (e.g., containing text and metadata).
Pros:

- Simple to create, inspect, and debug; human readable.
- Standard Unix tools (grep, sed, awk) work directly.

Cons:

- Row-based: reading a single field still means parsing entire records.
- Parsing text (especially JSON) line by line is CPU-intensive at scale.
- No built-in compression or schema; compression must be applied externally (e.g., gzip, zstd).
A typical .jsonl file might look like this:
{"text": "The quick brown fox jumps over the lazy dog.", "source": "wikipedia", "id": "doc_001"}
{"text": "Large language models require vast amounts of text data.", "source": "common_crawl", "id": "cc_abc"}
{"text": "...", "source": "...", "id": "..."}
Reading this in Python is simple, but potentially slow for very large files, especially if deserializing complex JSON:
import json
import gzip

def read_jsonl_gz(filepath):
    """Reads records from a gzipped JSON Lines file."""
    with gzip.open(filepath, 'rt', encoding='utf-8') as f:
        for line in f:
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Handle or log malformed lines
                print(f"Skipping malformed line: {line.strip()}")

# Usage example
# A dataloader would iterate through this generator
# data_generator = read_jsonl_gz("my_large_dataset.jsonl.gz")
# for record in data_generator:
#     process(record['text'])
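For completeness, writing records back out is just as simple. The sketch below mirrors the reader above; the write_jsonl_gz helper name is illustrative, not part of any library:

import json
import gzip

def write_jsonl_gz(filepath, records):
    """Writes an iterable of dict records to a gzipped JSON Lines file."""
    with gzip.open(filepath, 'wt', encoding='utf-8') as f:
        for record in records:
            # One JSON object per line, preserving non-ASCII text as-is
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

# write_jsonl_gz("my_large_dataset.jsonl.gz", records)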
While acceptable for smaller datasets or initial prototyping, the CPU overhead of parsing text line by line becomes a bottleneck when feeding multiple GPUs that consume data at high speed during LLM training.
Apache Arrow is an in-memory columnar data format standard. It's designed for analytical query performance and efficient data interchange between systems and languages with minimal serialization/deserialization overhead (often zero-copy reads).
Key Concepts:

- Columnar in-memory layout: values for each column are stored contiguously, enabling vectorized processing.
- Zero-copy access: data can be memory-mapped and shared between processes and libraries without deserialization.
- Language-agnostic: the same memory layout is understood by implementations in C++, Python, Java, Rust, and more.

Pros:

- Very fast reads; memory mapping lets you work with datasets larger than RAM.
- Minimal serialization overhead when exchanging data between tools such as pandas, Spark, and Hugging Face datasets.

Cons:

- While Arrow data can be written to disk (e.g., as .arrow or .feather files), it's primarily optimized for in-memory representation, and the on-disk IPC format is often stored raw. Parquet is generally preferred for persistent, compressed storage on disk.

Arrow is heavily used under the hood by libraries like Hugging Face datasets. When you load or map datasets using datasets, it often uses Arrow tables internally for caching and fast access.
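For instance, loading the gzipped JSON Lines file from the earlier example with datasets parses the JSON once and caches the result as an Arrow file that is then memory mapped. A minimal sketch, assuming the Hugging Face datasets package is installed:

from datasets import load_dataset

# datasets parses the JSON once, then caches it as an Arrow file on disk
ds = load_dataset("json", data_files="my_large_dataset.jsonl.gz", split="train")

print(ds.cache_files)   # location of the Arrow cache file(s) backing the dataset
print(ds[0]["text"])    # rows are served from the memory-mapped Arrow table

You can also build and read Arrow tables directly with pyarrow, as in the next example.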
import pyarrow as pa
import pyarrow.feather as feather
import time

# Example: Creating and writing an Arrow Table
# Assume 'data' is a list of dictionaries like [{'text': '...', 'id': '...'}, ...]

# Convert Python objects to Arrow Arrays
texts = pa.array([d['text'] for d in data], type=pa.string())
ids = pa.array([d['id'] for d in data], type=pa.string())

# Create an Arrow Table
table = pa.Table.from_arrays([texts, ids], names=['text', 'id'])

# Write to Feather (Arrow file format)
feather.write_feather(table, 'my_dataset.arrow')

# Reading is typically very fast
start_time = time.time()
read_table = feather.read_table('my_dataset.arrow')
end_time = time.time()
print(f"Arrow read time: {end_time - start_time:.4f} seconds")

# Accessing a column is efficient
text_column = read_table['text']
# print(text_column[0].as_py())  # Access the first text entry
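The zero-copy behavior can be made concrete with memory mapping. A minimal sketch, assuming a copy of the table is first written uncompressed so its buffers can be mapped directly (the my_dataset_mmap.arrow name is illustrative):

import pyarrow as pa
import pyarrow.feather as feather

# Write a copy without compression so the IPC buffers can be mapped as-is
# ('table' is the pyarrow.Table from the example above)
feather.write_feather(table, 'my_dataset_mmap.arrow', compression='uncompressed')

# Memory-map the file: the OS pages data in lazily, and the resulting
# Arrow buffers point into the mapped region rather than being copied.
source = pa.memory_map('my_dataset_mmap.arrow', 'r')
mapped_table = pa.ipc.open_file(source).read_all()
print(mapped_table.num_rows)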
The real power comes when multiple stages of your pipeline use Arrow, avoiding costly serialization steps. A data preprocessing job running on Spark might output Arrow files directly to cloud storage, which are then read efficiently by a PyTorch DataLoader using pyarrow.
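A rough sketch of that last step, assuming PyTorch is installed and reusing the my_dataset.arrow file written above (the ArrowTextDataset class is illustrative, not a library API):

import pyarrow.feather as feather
from torch.utils.data import IterableDataset, DataLoader

class ArrowTextDataset(IterableDataset):
    """Streams the 'text' column of an Arrow/Feather file."""

    def __init__(self, path):
        super().__init__()
        self.path = path

    def __iter__(self):
        # Column projection: only the 'text' column is loaded
        table = feather.read_table(self.path, columns=['text'])
        for chunk in table['text'].chunks:  # iterate chunk by chunk
            for value in chunk:
                yield value.as_py()

# loader = DataLoader(ArrowTextDataset('my_dataset.arrow'), batch_size=32)
# for batch in loader:
#     ...  # batch is a list of 32 raw strings, ready for tokenization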
Apache Parquet is a widely adopted columnar storage format optimized for large-scale data warehousing and analytics, but also highly effective for storing LLM datasets on disk.
Key Concepts:

- Columnar on-disk layout: data is organized into row groups, and within each row group into column chunks, one per column.
- Built-in compression (e.g., Snappy, Zstd, Gzip) and encoding schemes keep files small.
- Column pruning: readers can fetch only the columns they need, reducing I/O.

Pros:

- Excellent compression and a small storage footprint.
- Reading a subset of columns is cheap, with broad support across big data tools and Python libraries (pandas, pyarrow).

Cons:

- Writes are slower than plain text or Arrow because of encoding and compression.
- Not human readable, and decoding adds some CPU cost on read.
Parquet is often the format of choice for the final, processed dataset that will be repeatedly read during training.
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd # Often used as an intermediary
# Example using PyArrow directly (similar to Arrow Feather example)
# Assume 'table' is a pyarrow.Table as created in the Arrow example
# Writing a Parquet file (with Snappy compression by default)
pq.write_table(table, 'my_dataset.parquet', compression='snappy')
# Reading the Parquet file
start_time = time.time()
# Can read specific columns only, reducing I/O
read_parquet_table = pq.read_table('my_dataset.parquet', columns=['text'])
end_time = time.time()
print(
f"Parquet read time (text column only): "
f"{end_time - start_time:.4f} seconds"
)
# Example using Pandas (common workflow)
# df = pd.DataFrame(data) # data is list of dicts
# df.to_parquet('my_dataset_pd.parquet', engine='pyarrow', compression='snappy')
# read_df = pd.read_parquet('my_dataset_pd.parquet', engine='pyarrow', columns=['text'])
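Because a Parquet file is divided into row groups, you can also inspect its layout and stream it in batches rather than loading it wholesale. A minimal sketch using pyarrow.parquet against the my_dataset.parquet file written above:

import pyarrow.parquet as pq

pf = pq.ParquetFile('my_dataset.parquet')
meta = pf.metadata
print(f"{meta.num_rows} rows, {meta.num_row_groups} row group(s), "
      f"{meta.num_columns} columns")

# Stream the 'text' column in fixed-size record batches; only the
# required column chunks are read and decompressed.
for batch in pf.iter_batches(batch_size=10_000, columns=['text']):
    texts = batch.column('text').to_pylist()
    # hand 'texts' to the tokenizer / training pipeline here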
Here's a summary of the trade-offs:
Feature | Text (.txt, .jsonl) | Apache Arrow | Apache Parquet |
---|---|---|---|
Type | Row-based | Columnar (In-Memory) | Columnar (On-Disk) |
Primary Use | Simplicity, Debugging | Fast In-Memory Analytics, IPC | Efficient Storage, Analytics |
Human Readable | Yes | No | No |
Compression | External (Gzip, Zstd) | Limited (IPC often raw) | Excellent (Built-in) |
Read Speed | Slow (Parsing Overhead) | Very Fast (Zero-Copy) | Fast (Columnar Scan) |
Write Speed | Fast | Moderate | Moderate-Slow (Encoding) |
CPU Usage (Read) | High (Parsing) | Low | Low-Moderate (Decoding) |
Random Access | Poor | Moderate (In-Memory) | Moderate (Row Groups) |
Ecosystem | Universal | Growing (Pandas, Spark) | Excellent (Big Data) |
Comparison of common data storage formats for large text datasets.
Recommendations for LLM Datasets:

- Plain text / JSON Lines: convenient for raw dumps, debugging, and small-scale prototyping, but avoid it as the format you read repeatedly during training.
- Parquet: the usual choice for the final, processed dataset that lives on disk or in cloud object storage.
- Arrow: best suited as the in-memory and interchange representation; libraries like Hugging Face datasets leverage Arrow heavily for caching and memory mapping.

Choosing Parquet for your large-scale training data stored on cloud storage or HDFS, potentially read via libraries that leverage Arrow for in-memory representation (such as datasets or custom DataLoaders using pyarrow.parquet), provides a robust and efficient foundation for feeding data into your distributed training jobs. This minimizes I/O bottlenecks and storage footprint, allowing you to focus compute resources on the model training itself.
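In practice this usually means a one-time conversion job from raw JSON Lines shards to Parquet. A minimal sketch, assuming the read_jsonl_gz generator defined earlier, consistent keys and types across records, and illustrative file names:

import itertools
import pyarrow as pa
import pyarrow.parquet as pq

def jsonl_gz_to_parquet(src, dst, batch_size=50_000):
    """Converts a gzipped JSON Lines file to a compressed Parquet file, batch by batch."""
    records = read_jsonl_gz(src)  # generator defined earlier in this section
    writer = None
    while True:
        batch = list(itertools.islice(records, batch_size))
        if not batch:
            break
        table = pa.Table.from_pylist(batch)  # assumes a consistent schema across batches
        if writer is None:
            writer = pq.ParquetWriter(dst, table.schema, compression='zstd')
        writer.write_table(table)
    if writer is not None:
        writer.close()

# jsonl_gz_to_parquet("my_large_dataset.jsonl.gz", "my_large_dataset.parquet")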