Okay, let's put theory into practice. Managing petabyte-scale datasets requires robust and scalable storage solutions. Object storage services provided by major cloud platforms (like AWS S3, Google Cloud Storage, Azure Blob Storage) are fundamental building blocks for LLMOps data infrastructure. They offer virtually unlimited capacity, high durability, and accessibility patterns suitable for large files and distributed access.
This practice session focuses on setting up and interacting with a scalable object storage bucket using AWS S3 as the primary example. The principles and commands are analogous for other cloud providers.
By the end of this practice, you will be able to:

- Create a scalable object storage bucket using the AWS CLI.
- Upload, list, and download files with both the CLI and Python's boto3 library.
- Apply basic organization and performance practices for large-scale data.

Prerequisites:

- An AWS account with the AWS CLI installed and configured (aws configure). Ensure your IAM user has permissions to create and manage S3 buckets (s3:CreateBucket, s3:PutObject, s3:GetObject, s3:ListBucket, s3:DeleteObject, s3:DeleteBucket).
- Python with the boto3 library installed (pip install boto3).

Object storage buckets serve as containers for your data. Choosing the right region and name is important for performance and organization.
We'll use the AWS CLI to create a bucket. Bucket names must be globally unique, so choose a unique name, perhaps incorporating your initials, project name, and a date. Also, select a region close to your compute resources or users to minimize latency.
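If you'd rather script the naming, a small helper along these lines can stamp initials, a project tag, and the date into a candidate name (the exact pattern here is an illustrative assumption, not a required convention):

import datetime
import uuid

def make_bucket_name(initials: str, project: str) -> str:
    """Build an S3-compatible candidate name: lowercase, hyphenated, max 63 chars."""
    date = datetime.date.today().strftime("%Y%m%d")
    suffix = uuid.uuid4().hex[:6]  # short random tail to reduce collision risk
    return f"{initials}-{project}-{date}-{suffix}".lower()[:63]

print(make_bucket_name("ab", "llmops-data"))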
# Choose a unique bucket name (replace 'your-unique-llmops-data-bucket-yyyymmdd')
BUCKET_NAME="your-unique-llmops-data-bucket-yyyymmdd"
# Choose your preferred AWS region (e.g., us-west-2)
AWS_REGION="us-west-2"
# Create the bucket
aws s3api create-bucket \
    --bucket ${BUCKET_NAME} \
    --region ${AWS_REGION} \
    --create-bucket-configuration LocationConstraint=${AWS_REGION}
echo "Bucket ${BUCKET_NAME} created in region ${AWS_REGION}."
Note: For regions other than us-east-1, specifying LocationConstraint is required. For us-east-1, you must omit --create-bucket-configuration entirely (the API rejects LocationConstraint=us-east-1), and you can omit --region as well.
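The same region rule applies when creating the bucket from Python. A minimal boto3 sketch (the bucket name and region below are placeholders to replace with your own):

import boto3

region = "us-west-2"  # placeholder; use your preferred region
bucket_name = "your-unique-llmops-data-bucket-yyyymmdd"

s3 = boto3.client("s3", region_name=region)

if region == "us-east-1":
    # us-east-1 rejects an explicit LocationConstraint
    s3.create_bucket(Bucket=bucket_name)
else:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )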
Let's create a small text file for testing uploads and downloads. In a real scenario, this could be a large dataset shard, model checkpoint, or configuration file.
echo "This is a sample file for testing S3 operations." > sample_data.txt
echo "LLMOps requires managing large artifacts efficiently." >> sample_data.txt
The aws s3 cp command handles uploads. For larger files (above a configurable size threshold, 8 MB by default), the CLI automatically uses multipart uploads, breaking the file into parts and uploading them in parallel for better throughput and resilience. While our sample file is small, the command is the same.
# Upload the file to the root of the bucket
aws s3 cp sample_data.txt s3://${BUCKET_NAME}/
# You can also upload to a specific 'folder' (prefix)
aws s3 cp sample_data.txt s3://${BUCKET_NAME}/raw_data/sample_data_v1.txt
echo "Uploaded sample_data.txt to s3://${BUCKET_NAME}/ and s3://${BUCKET_NAME}/raw_data/"
Tip: Prefixes (like raw_data/) are crucial for organizing large numbers of files in a bucket, allowing efficient listing and access control.
You can list objects at the root level or within specific prefixes.
# List objects at the root
echo "Objects at the root:"
aws s3 ls s3://${BUCKET_NAME}/
# List objects under the 'raw_data/' prefix recursively
echo "Objects under raw_data/:"
aws s3 ls s3://${BUCKET_NAME}/raw_data/ --recursive
For buckets with millions or billions of objects, a simple ls can be slow or incomplete. Use prefixes effectively and consider S3 Inventory for comprehensive, scheduled reports if needed.
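The same concern applies programmatically: list_objects_v2 returns at most 1,000 keys per call, so large prefixes should be walked with a paginator. A minimal sketch, reusing the bucket placeholder and raw_data/ prefix from above:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total_objects = 0
total_bytes = 0
# Iterate page by page instead of assuming one response holds everything
for page in paginator.paginate(Bucket="your-unique-llmops-data-bucket-yyyymmdd",
                               Prefix="raw_data/"):
    for obj in page.get("Contents", []):
        total_objects += 1
        total_bytes += obj["Size"]

print(f"{total_objects} objects, {total_bytes} bytes under raw_data/")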
Downloading uses the same cp command, with the source and destination reversed.
# Create a directory to download into
mkdir downloaded_files
# Download the file from the root
aws s3 cp s3://${BUCKET_NAME}/sample_data.txt downloaded_files/downloaded_sample_root.txt
# Download the file from the prefix
aws s3 cp s3://${BUCKET_NAME}/raw_data/sample_data_v1.txt downloaded_files/downloaded_sample_raw.txt
echo "Downloaded files to the 'downloaded_files' directory."
cat downloaded_files/downloaded_sample_root.txt
Verify that the content of the downloaded files matches the original sample_data.txt.
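One straightforward check is to compare content hashes of the original and a downloaded copy; here is a small sketch using Python's hashlib, with paths matching the files created above:

import hashlib

def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

original = sha256_of("sample_data.txt")
downloaded = sha256_of("downloaded_files/downloaded_sample_root.txt")
print("Match!" if original == downloaded else "Mismatch - investigate.")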
Automated pipelines rely on programmatic access. Here's how to perform similar operations using Python's boto3 library.
Create a Python script named s3_interact.py:
import boto3
import os
# Use the bucket name you created
bucket_name = os.environ.get("S3_BUCKET_NAME", "your-unique-llmops-data-bucket-yyyymmdd")
local_file_path = "sample_data.txt"
upload_key_python = "programmatic_uploads/sample_via_python.txt"
download_path_python = "downloaded_files/downloaded_via_python.txt"
# Create an S3 client
s3 = boto3.client('s3')
# Or use resource interface for higher-level operations
# s3_resource = boto3.resource('s3')
# bucket = s3_resource.Bucket(bucket_name)
print(f"Interacting with bucket: {bucket_name}")
# 1. Upload a file
try:
    print(f"Uploading {local_file_path} to s3://{bucket_name}/{upload_key_python}")
    s3.upload_file(local_file_path, bucket_name, upload_key_python)
    print("Upload successful.")
except Exception as e:
    print(f"Error uploading file: {e}")

# 2. List objects (using a prefix)
try:
    print("\nListing objects with prefix 'programmatic_uploads/':")
    response = s3.list_objects_v2(Bucket=bucket_name, Prefix="programmatic_uploads/")
    if 'Contents' in response:
        for obj in response['Contents']:
            print(f"- {obj['Key']} (Size: {obj['Size']})")
    else:
        print("No objects found with that prefix.")
except Exception as e:
    print(f"Error listing objects: {e}")

# 3. Download the file
try:
    print(f"\nDownloading s3://{bucket_name}/{upload_key_python} to {download_path_python}")
    # Ensure download directory exists
    os.makedirs(os.path.dirname(download_path_python), exist_ok=True)
    s3.download_file(bucket_name, upload_key_python, download_path_python)
    print("Download successful.")
    # Verify content
    with open(download_path_python, 'r') as f:
        print(f"Content of downloaded file:\n{f.read()}")
except Exception as e:
    print(f"Error downloading file: {e}")
Before running, set the environment variable for your bucket name:
export S3_BUCKET_NAME="your-unique-llmops-data-bucket-yyyymmdd"
python s3_interact.py
This script demonstrates the core upload, list, and download operations essential for integrating storage into LLM data pipelines and workflows. boto3 offers many more capabilities, including managing bucket policies, lifecycle rules, and parallel operations (via libraries like s3transfer used under the hood, or higher-level abstractions).
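As one example, a lifecycle rule can automatically move older objects under a prefix to a cheaper storage class. A sketch of such a rule via boto3 (the raw_data/ prefix, 30-day threshold, and STANDARD_IA target are illustrative assumptions):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="your-unique-llmops-data-bucket-yyyymmdd",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "cool-down-raw-data",
                "Filter": {"Prefix": "raw_data/"},
                "Status": "Enabled",
                # Transition objects to Infrequent Access 30 days after creation
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            }
        ]
    },
)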
While these examples use small files, keep these points in mind for PB-scale data:

- Use tools that parallelize transfers (aws s3 sync or libraries like s5cmd) for large files or large numbers of files.
- A well-designed prefix structure (e.g., /dataset_name/version/split/part-*.parquet) significantly speeds up listing and processing subsets of data. Avoid putting millions of files directly in the bucket root.

To avoid ongoing AWS charges, delete the objects and the bucket.
# Delete the objects first (use --recursive for prefixes)
aws s3 rm s3://${BUCKET_NAME}/sample_data.txt
aws s3 rm s3://${BUCKET_NAME}/raw_data/ --recursive
aws s3 rm s3://${BUCKET_NAME}/programmatic_uploads/ --recursive
# Now delete the empty bucket
aws s3api delete-bucket --bucket ${BUCKET_NAME}
# Clean up local files
rm sample_data.txt
rm -rf downloaded_files
echo "Cleaned up S3 objects, bucket ${BUCKET_NAME}, and local files."
You have successfully set up an S3 bucket, performed basic file operations using both the CLI and Python, and learned about organizational and performance considerations relevant to large-scale data management in LLMOps. This scalable object storage serves as the foundation upon which data preprocessing pipelines, versioning systems (like DVC pointing to S3), and distributed training jobs (reading data directly from S3) are built. Mastering interaction with these storage systems is a fundamental skill for managing the data lifecycle of large models.