Kubeflow Training Operator Documentation, The Kubeflow Authors, 2025 - Official guide on using the Kubeflow Training Operator to manage and run distributed machine learning training jobs (including PyTorch and TensorFlow) on Kubernetes clusters.
torch.distributed (Distributed communication package), PyTorch Contributors, 2017 - Official PyTorch documentation detailing its distributed communication package, which is fundamental for setting up and coordinating worker processes in distributed training.
Slurm Workload Manager Documentation, SchedMD LLC, 2025 - Comprehensive official documentation for the Slurm Workload Manager, describing its features for job scheduling, resource management, and distributed process launching in HPC environments.
Custom training overview (Google Cloud AI Platform Training), Google Cloud, 2024 - An overview of Google Cloud's managed service for custom model training, illustrating how cloud platforms abstract infrastructure and orchestration for distributed ML jobs.