When training datasets grow from gigabytes to terabytes, they overwhelm the memory and processing capacity of any single machine. Feature engineering, transformation, and validation at this scale require distributed computing. To address this challenge, Apache Spark and Ray have emerged as two powerful, yet architecturally distinct, frameworks for processing data in parallel. Your choice between them will significantly influence your platform's efficiency, cost, and flexibility.
Apache Spark is the long-standing industry standard for large-scale, fault-tolerant batch data processing. It excels at Extract, Transform, Load (ETL) workloads, making it a natural fit for generating the clean, structured datasets required for model training. Spark's core abstraction, the resilient distributed dataset (RDD), and its higher-level DataFrame API provide a powerful, declarative way to express complex data transformations.
The Spark DataFrame API, which closely mirrors pandas, allows you to build a logical plan of transformations. Spark’s Catalyst optimizer then translates this plan into an efficient physical execution plan distributed across a cluster of worker nodes. This separation of logical and physical plans is one of Spark's most significant strengths, as it can optimize data shuffling, predicate pushdown, and query execution without requiring manual tuning from the user.
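You can see this separation directly by asking Spark to print the physical plan Catalyst chose for a chain of transformations. The sketch below is a minimal illustration (the events path and column names are hypothetical, not part of the example that follows); the formatted plan will show details such as the event_type filter pushed down into the Parquet scan.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("PlanInspection").getOrCreate()

# DataFrame operations are lazy: this only builds a logical plan.
events = spark.read.parquet("s3a://my-ml-datalake/raw/events/")  # hypothetical path
daily_clicks = (
    events
    .filter(F.col("event_type") == "click")   # candidate for predicate pushdown
    .groupBy("user_id")
    .agg(F.count("*").alias("click_count"))
)

# Ask Catalyst for the physical plan it derived (Spark 3.0+ syntax); the
# Parquet scan node should list the event_type predicate under PushedFilters.
daily_clicks.explain(mode="formatted")

spark.stop()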
For a typical machine learning data preparation pipeline, you would use Spark to read raw data from object storage, define a series of transformation stages, chain them into a pipeline, and write the processed features back to the data lake for training.
Here is a simplified PySpark example demonstrating feature engineering for a user activity dataset.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml import Pipeline
# Initialize Spark Session
spark = SparkSession.builder \
    .appName("FeatureEngineering") \
    .getOrCreate()
# 1. Read raw data from object storage
raw_data = spark.read.parquet("s3a://my-ml-datalake/raw/user_activity/")
# 2. Define transformation stages
# Convert categorical 'device_type' to a numerical index
indexer = StringIndexer(inputCol="device_type", outputCol="device_index")
# Assemble feature columns into a single vector
assembler = VectorAssembler(
    inputCols=["session_duration", "clicks", "device_index"],
    outputCol="features_raw"
)
# Scale each feature to unit standard deviation (StandardScaler only centers
# to zero mean if withMean=True, which also densifies sparse vectors)
scaler = StandardScaler(inputCol="features_raw", outputCol="features_scaled")
# 3. Create a pipeline to chain the stages together
pipeline = Pipeline(stages=[indexer, assembler, scaler])
model = pipeline.fit(raw_data)
# 4. Transform the data and select the final columns
processed_data = model.transform(raw_data) \
    .select("user_id", "features_scaled", "has_purchased")
# 5. Write the processed data back to the data lake for training
processed_data.write.mode("overwrite").parquet("s3a://my-ml-datalake/processed/training_set/")
spark.stop()
While Spark is exceptionally powerful for batch jobs, its architecture has trade-offs. It was designed for large, coarse-grained tasks. The overhead of starting a Spark job can be substantial, making it less suitable for low-latency or highly iterative workloads. Its Java Virtual Machine (JVM) foundation can also create friction in Python-dominant machine learning ecosystems.
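One small way to see the Python/JVM friction in practice is to compare a row-at-a-time Python UDF with a vectorized pandas UDF. The sketch below is an assumed illustration (it requires pandas and pyarrow alongside PySpark and uses synthetic data): the plain UDF serializes every value between the JVM and a Python worker process, while the pandas UDF moves data in Arrow batches, which amortizes that cost.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("UdfOverhead").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "x")

# Row-at-a-time Python UDF: each value crosses the JVM/Python boundary individually.
@udf(returnType=DoubleType())
def slow_square(x):
    return float(x * x)

# Vectorized pandas UDF: values cross the boundary in Arrow batches.
@pandas_udf(DoubleType())
def fast_square(x: pd.Series) -> pd.Series:
    return (x * x).astype("float64")

df.select(slow_square("x")).count()  # typically noticeably slower
df.select(fast_square("x")).count()  # typically much faster
spark.stop()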
Ray offers a different approach. Instead of being a specialized data processing engine, it is a general-purpose framework for scaling any Python workload. Its design philosophy is to provide a simple, universal API for parallelism, making it feel like a natural extension of Python. This makes it particularly well-suited for heterogeneous ML workloads that combine data processing, model training, and hyperparameter tuning within a single application.
Ray's core primitives are simple:
Tasks: stateless Python functions that run asynchronously on the cluster, declared by decorating a function with ray.remote.
Actors: stateful worker processes, declared by decorating a Python class with ray.remote and instantiating it remotely.
This model provides fine-grained control over distributed resources. For data processing, the Ray Data library offers a DataFrame-like API that can execute on a Ray cluster. Ray Data is designed to be a "distributed data interchange" layer, capable of integrating with other libraries like Dask, Spark, and Mars, and serving as a high-performance link between data ingestion and model training.
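As a minimal, self-contained sketch of the task and actor primitives described above (assuming only a local Ray installation), the following defines a stateless task and a stateful actor and calls them asynchronously:
import ray

ray.init()

# A task: a stateless function executed remotely. Calling .remote() returns
# an object reference immediately; ray.get() blocks until results are ready.
@ray.remote
def square(x):
    return x * x

# An actor: a stateful worker process. Method calls run inside the actor and
# share its state across invocations.
@ray.remote
class Counter:
    def __init__(self):
        self.count = 0
    def increment(self):
        self.count += 1
        return self.count

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get([counter.increment.remote() for _ in range(3)]))  # [1, 2, 3]

ray.shutdown()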
Consider a scenario where you need to preprocess data shards in parallel to feed a distributed training job. With Ray, the data workers and training workers can be part of the same cluster, passing object references in-memory and avoiding the overhead of writing intermediate results to external storage.
import ray
import pandas as pd
import numpy as np
from typing import Dict
# Initialize Ray cluster
ray.init()
# Define a preprocessing function for one batch (shard) of the dataset.
# Ray Data's map_batches runs it in parallel as Ray tasks under the hood,
# so the function itself does not need the @ray.remote decorator.
def preprocess_shard(shard: Dict[str, np.ndarray]) -> pd.DataFrame:
    df = pd.DataFrame(shard)
    # Perform some transformations
    df['feature_c'] = df['feature_a'] * df['feature_b']
    return df
# Use Ray Data to load a dataset
# Ray automatically parallelizes the read and represents it as a distributed dataset
ds = ray.data.read_parquet("s3://my-ml-datalake/raw/some_data/")
# Apply the preprocessing function to each batch of the dataset in parallel
processed_ds = ds.map_batches(preprocess_shard, batch_format="numpy")
# The processed dataset can now be directly consumed by a Ray Train job
# without being written back to S3.
# for epoch in range(5):
#     for batch in processed_ds.iter_batches():
#         train_model(batch)
# Peek at one processed row (show() prints rows and returns None)
processed_ds.show(limit=1)
ray.shutdown()
The choice between Spark and Ray is not about which is better, but which is right for the job. Their different architectures lead to distinct performance characteristics for common ML tasks.
Data flow for batch preprocessing: Spark uses separate, coordinated jobs with intermediate storage, whereas Ray can integrate data processing and training within a single cluster, enabling efficient in-memory data transfer.
Here's a breakdown of their suitability:
Large-Scale Batch ETL: Spark is generally superior for this task. Its mature Catalyst optimizer and fault tolerance mechanisms are purpose-built for transforming terabytes of structured data reliably. If your data preparation is a distinct, offline step performed by a data engineering team, Spark is an excellent choice.
Integrated ML Workloads: Ray excels where data processing is tightly coupled with other parts of the ML lifecycle, like training or tuning. By avoiding the need to materialize the full dataset to intermediate storage, Ray can significantly reduce I/O bottlenecks and overall execution time. This is a considerable advantage for iterative development and complex pipelines.
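As a rough sketch of what this coupling can look like (assuming Ray 2.x with Ray Train and PyTorch installed, and that processed_ds from the earlier Ray Data example is still in scope; the actual training step is elided), a dataset can be handed directly to a trainer, which streams shards to each training worker in memory:
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Each training worker pulls its shard of the preprocessed dataset straight
    # from the Ray object store; nothing is written back to S3 in between.
    shard = ray.train.get_dataset_shard("train")
    for batch in shard.iter_batches(batch_size=1024):
        _ = batch  # placeholder for a real forward/backward pass

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4),   # 4 training workers
    datasets={"train": processed_ds},              # in-memory handoff from Ray Data
)
result = trainer.fit()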
Ecosystem and APIs: Spark has a rich ecosystem including Spark SQL, Structured Streaming, and MLlib. Its declarative, SQL-like API is familiar to many data analysts and engineers. Ray is Python-native, which appeals to ML engineers and researchers who prefer an imperative, flexible programming model without leaving the Python ecosystem.
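As a brief, assumed illustration of that declarative style (reading the same hypothetical user activity data as the earlier example), an aggregation can be expressed in Spark SQL rather than imperative Python:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DeclarativeExample").getOrCreate()
activity = spark.read.parquet("s3a://my-ml-datalake/raw/user_activity/")

# Register the DataFrame as a temporary view and express the logic in SQL.
activity.createOrReplaceTempView("user_activity")
device_stats = spark.sql("""
    SELECT device_type,
           COUNT(*)              AS sessions,
           AVG(session_duration) AS avg_session_duration
    FROM user_activity
    GROUP BY device_type
""")
device_stats.show()
spark.stop()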
When integrating these tools into your platform on Kubernetes, you will use operators such as the Spark on Kubernetes Operator or KubeRay to manage the lifecycle of distributed jobs. A primary consideration is data locality: to avoid expensive cross-region data transfer and reduce latency, run your Spark or Ray clusters in the same cloud region as your object storage buckets.
Ultimately, the decision rests on your team's skills and your platform's architectural philosophy. Do you prefer a collection of specialized, best-in-class tools (a "polylith" approach) where Spark handles data and another system handles training? Or do you prefer a unified framework (a "monolith" approach) like Ray that can manage the end-to-end computational needs of your ML application? Both are valid strategies, and understanding the trade-offs is fundamental to building a high-performance AI platform.