Operating large language models introduces infrastructure and data management requirements that differ significantly from those of smaller models. Handling training datasets that can reach petabyte scale and coordinating hundreds or thousands of GPUs demand specific architectural patterns and operational practices. Simply scaling up standard MLOps techniques often proves insufficient or prohibitively expensive.
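To make the scale concrete, here is a quick back-of-envelope sketch. All the numbers in it (a 2 PB raw corpus, a 100 Gb/s network link) are illustrative assumptions, not figures from any specific system:

```python
# Back-of-envelope estimate with illustrative, assumed numbers.
# Adjust the corpus size and link speed to match your own setup.

PB = 10**15  # bytes in a petabyte (SI)

raw_corpus_bytes = 2 * PB   # assumed raw corpus size before filtering
nic_gbps = 100              # assumed per-node network bandwidth (Gbit/s)
nic_bytes_per_sec = nic_gbps * 1e9 / 8

# Time for a single node to stream the full corpus over one link:
seconds = raw_corpus_bytes / nic_bytes_per_sec
print(f"Full corpus over one {nic_gbps} Gb/s link: "
      f"{seconds / 86400:.1f} days")
# Roughly 1.9 days for a single pass, ignoring storage and retry
# overhead -- one reason storage access and preprocessing must be
# parallelized rather than funneled through a single machine.
```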
This chapter focuses on the practical aspects of building and managing the foundation for LLM operations. You will learn to:

- Design compute and networking infrastructure that scales to distributed training workloads.
- Manage, store, and preprocess petabyte-scale training datasets.
- Apply version control to large datasets and model artifacts.
- Weigh the trade-offs between cloud and on-premise infrastructure.
We begin by examining the design principles for scalable compute and networking, then move into managing the data itself, from storage and preprocessing to versioning.
2.1 Designing Scalable Compute Infrastructure
2.2 Networking Considerations for Distributed Systems
2.3 Managing Petabyte-Scale Datasets
2.4 Data Preprocessing Pipelines for LLMs
2.5 Version Control for Large Data and Models
2.6 Cloud vs On-Premise Infrastructure Trade-offs
2.7 Practice: Setting up Scalable Storage