While processing power from CPUs and GPUs is a primary concern, the amount of available memory often becomes the first practical bottleneck when working with large models. A processor is useless if it cannot access the data and model parameters it needs to compute. Understanding memory requirements is therefore essential for both training and deploying modern AI systems.
When we talk about memory in the context of AI hardware, we are primarily referring to two types:

- System RAM: the main memory attached to the CPU, used for loading and preprocessing data and for anything that does not live on the accelerator.
- VRAM (video RAM): the dedicated high-bandwidth memory on the GPU, which holds the model parameters, activations, and other state needed during computation.
For GPU-accelerated workloads, VRAM is almost always the more significant constraint. A training job will fail if its components cannot fit into the VRAM of a single GPU, or across the pooled VRAM in a multi-GPU setup.
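Before launching a job, it helps to check how much VRAM the target device actually exposes. The snippet below is a minimal sketch that assumes PyTorch with a CUDA-capable GPU is installed; `torch.cuda.mem_get_info` reports the free and total device memory in bytes.

```python
import torch

def report_vram(device: int = 0) -> None:
    """Print free and total VRAM for one CUDA device (assumes PyTorch with CUDA)."""
    if not torch.cuda.is_available():
        print("No CUDA device visible; only system RAM is available.")
        return
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    print(f"GPU {device}: {free_bytes / 1e9:.1f} GB free of {total_bytes / 1e9:.1f} GB total")

report_vram()
```

Comparing the free figure against the estimates later in this section gives an early warning of whether a training run can fit at all.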
During the training of a neural network, several components must reside in VRAM simultaneously. The total memory footprint is much larger than just the model itself.
Figure: The primary consumers of GPU VRAM during a model training cycle. The total required memory significantly exceeds the size of the model parameters alone.
Let's break down these components:

- Model parameters: the weights of the network itself. At FP32 precision, each parameter occupies 4 bytes.
- Gradients: during backpropagation, one gradient value is stored for every parameter, typically at the same precision as the weights.
- Optimizer states: stateful optimizers such as Adam keep additional values per parameter (a momentum and a variance estimate), roughly doubling the parameter memory again.
- Activations: the intermediate outputs of each layer, saved during the forward pass so they can be reused in the backward pass. Their size grows with the batch size and input length.
- Framework overhead: CUDA kernels, temporary buffers, and memory fragmentation consume additional space on top of everything above.
To make this tangible, let's estimate the VRAM needed to train a 7-billion parameter language model using the Adam optimizer and standard 32-bit precision (FP32), where each parameter requires 4 bytes.
- Model parameters: 7 billion × 4 bytes = 28 GB
- Gradients: one FP32 value per parameter = 28 GB
- Optimizer states: Adam keeps two FP32 values per parameter = 2 × 28 GB = 56 GB

Just for these three components, the total VRAM required is:
28 GB + 28 GB + 56 GB = 112 GB

This calculation doesn't even include the memory for activations or the CUDA kernel overhead, yet it already exceeds the capacity of high-end GPUs like the NVIDIA A100 80GB. This is why you frequently hear about "Out of Memory" (OOM) errors. An OOM error simply means you tried to allocate more data to the GPU's VRAM than it has available. This could be because the model is too large, the batch_size is too high (which increases activation memory), or a combination of factors.
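If you want to repeat this arithmetic for other model sizes, a small helper makes the assumptions explicit. This is a rough sketch, not a profiler: it counts only parameters, gradients, and Adam's two optimizer states, and deliberately ignores activations and CUDA overhead, which the text above notes can be substantial.

```python
def estimate_training_vram_gb(num_params: float,
                              bytes_per_param: int = 4,
                              optimizer_states_per_param: int = 2) -> dict:
    """Rough lower-bound estimate of training VRAM in GB.

    Counts parameters, gradients, and optimizer states only; activations,
    temporary buffers, and CUDA overhead are not included.
    """
    params_gb = num_params * bytes_per_param / 1e9
    grads_gb = params_gb                                # one gradient per parameter
    optim_gb = params_gb * optimizer_states_per_param   # Adam: momentum + variance
    return {
        "parameters_gb": params_gb,
        "gradients_gb": grads_gb,
        "optimizer_states_gb": optim_gb,
        "total_gb": params_gb + grads_gb + optim_gb,
    }

# 7B-parameter model, FP32 weights, Adam optimizer
print(estimate_training_vram_gb(7e9))
# {'parameters_gb': 28.0, 'gradients_gb': 28.0, 'optimizer_states_gb': 56.0, 'total_gb': 112.0}
```

Because the result is a lower bound, comparing it against a single GPU's capacity tells you quickly whether a run is even plausible before you worry about activation memory.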
When a model is deployed for inference (i.e., making predictions), the memory requirements are much lower. You no longer need to store gradients or optimizer states. The primary memory consumer is the model's parameters themselves, plus the activations for the current input.
For our 7B model, the base memory for inference would be the 28 GB for the model weights, plus memory for the activations of a single inference request. This is far more manageable and explains why it's possible to run inference for a model on a GPU that would be too small to train it from scratch.
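The same back-of-the-envelope approach applies to serving. The sketch below counts only the stored weights at a given precision and leaves out the per-request activation memory mentioned above, so treat the number as a floor rather than an exact requirement.

```python
def estimate_inference_vram_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Weights-only inference memory in GB; per-request activations add a little more."""
    return num_params * bytes_per_param / 1e9

# 7B-parameter model served with FP32 weights (4 bytes each)
print(estimate_inference_vram_gb(7e9))  # 28.0
```

Lowering bytes_per_param in this helper shows immediately how much serving at a reduced precision would save, which previews the optimization techniques discussed next.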
Understanding this breakdown is a prerequisite for optimization. Techniques like mixed-precision training (using 16-bit floats) or choosing a different optimizer can drastically reduce this memory footprint, topics we will cover in a later chapter. For now, the main takeaway is that when planning your infrastructure, the model size and training configuration directly determine your minimum memory requirements.