Preparing your workstation and cloud environment for practical work is essential. These setup instructions cover the command-line tools and libraries you will use throughout the course. A properly configured environment is a prerequisite for successfully executing the distributed training, resource orchestration, and model deployment tasks in the following chapters.
We assume you have administrative access to your local machine and have an active account with at least one major cloud provider (AWS, GCP, or Azure) with permissions to create and manage compute and storage resources.
Your local machine will act as the control plane for orchestrating cloud resources. The following tools are essential for interacting with container and cluster management systems.
Containerization is fundamental to modern MLOps. We will use Docker to build and manage images for our training and inference applications. If you do not have it installed, download and install Docker Desktop for your operating system.
After installation, verify that the Docker daemon is running:
docker ps
This command should execute without errors and print the table header for running containers (the table will be empty on a fresh installation).
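As a preview of how we will package applications, here is a minimal sketch of a Dockerfile for a training image. The base image tag and the script name (train.py) are placeholders, not files from this course:

```dockerfile
# Minimal sketch of a training image; base tag and script name are placeholders
FROM python:3.10-slim

WORKDIR /app

# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY train.py .
ENTRYPOINT ["python", "train.py"]
```

Copying and installing requirements.txt before the application code is a common pattern: it lets Docker reuse the cached dependency layer when only your code changes.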
kubectl is the primary command-line interface for interacting with any Kubernetes cluster. You will use it to deploy applications, inspect cluster resources, and manage networking configurations.
Install kubectl following the official Kubernetes documentation for your operating system. Verify the client-side installation with:
kubectl version --client
This will output the client version, confirming that the binary is in your system's PATH. We will configure its connection to a cloud-based cluster later.
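To give a sense of what deploying with kubectl looks like, here is a minimal sketch of a Deployment manifest. The image name and labels are placeholders for illustration:

```yaml
# deployment.yaml -- minimal sketch; the image name is a placeholder
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
      - name: server
        image: my-registry/inference-server:latest
        ports:
        - containerPort: 8000
```

Once connected to a cluster, you would apply it with `kubectl apply -f deployment.yaml` and inspect the result with `kubectl get deployments`.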
Helm helps manage Kubernetes applications through "charts", which are pre-packaged sets of resource definitions. We will use Helm to deploy more complex systems such as the NVIDIA Triton Inference Server and Kubeflow.
Install Helm using the instructions on its official website. Verify the installation by running:
helm version
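The typical Helm workflow is to register a chart repository, refresh the local index, and install a release. The commands below are illustrative and require a running cluster; the Bitnami nginx chart is used here only as a well-known example, not a course component:

```shell
# Register a public chart repository and refresh the local chart index
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Install a release into its own namespace (illustrative chart choice)
helm install my-nginx bitnami/nginx --namespace demo --create-namespace

# List installed releases in that namespace
helm list --namespace demo
```

We will use the same repo add / install pattern later for Triton and Kubeflow charts.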
You must install and configure the command-line interface for your chosen cloud provider. This enables programmatic access to provision infrastructure, such as Kubernetes clusters and GPU instances.
For Amazon Web Services, install the AWS CLI. Once installed, configure it with your credentials:
aws configure
You will be prompted for your AWS Access Key ID, Secret Access Key, default region (e.g., us-east-1), and default output format (e.g., json). To confirm the setup is working, run the following command to check your identity:
aws sts get-caller-identity
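The command returns a small JSON document describing your identity. If you later want to consume it in a script, it parses directly; the sample below uses placeholder values in the standard response shape:

```python
import json

# Sample get-caller-identity response; all values are placeholders
sample = (
    '{"UserId": "AIDAEXAMPLE", '
    '"Account": "123456789012", '
    '"Arn": "arn:aws:iam::123456789012:user/alice"}'
)

identity = json.loads(sample)
print(identity["Account"])  # the 12-digit AWS account ID
```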
For Google Cloud Platform, install the Google Cloud SDK, which includes the gcloud command-line tool. After installation, initialize the SDK:
gcloud init
This command will walk you through authenticating your account, selecting a project, and configuring a default region (e.g., us-central1) and zone. Verify authentication by listing your active projects:
gcloud projects list
For Microsoft Azure, install the Azure CLI. Authenticate by running:
az login
This command will open a browser window for you to sign in. After authenticating, set your default subscription if you have more than one:
az account set --subscription "Your-Subscription-Name-or-ID"
Verify the setup by listing the available resource groups in your account:
az group list --output table
The developer's local workstation uses CLI tools to orchestrate and deploy applications to managed services within a cloud provider's environment.
The hands-on exercises use Python 3.10 or newer. We strongly recommend using a dedicated virtual environment to manage dependencies and avoid conflicts with system-level packages.
Create and activate a virtual environment using venv:
python3 -m venv aii-env
source aii-env/bin/activate
On Windows, run aii-env\Scripts\activate instead.
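You can confirm that the interpreter you are running actually lives inside the virtual environment. A small check, using the fact that sys.prefix diverges from sys.base_prefix inside a venv:

```python
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points into the environment directory,
    # while sys.base_prefix still points at the original interpreter.
    return sys.prefix != sys.base_prefix

print(f"Python {sys.version_info.major}.{sys.version_info.minor}, "
      f"venv active: {in_virtualenv()}")
```

If this reports that no venv is active after you ran the activate script, check that `which python3` resolves to the aii-env directory.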
Next, create a file named requirements.txt and populate it with the core libraries we will use. This list includes frameworks for deep learning, distributed training, data management, and cloud provider interaction.
requirements.txt:
# Deep Learning & Distributed Training
torch>=2.0.0
torchvision
torchaudio
deepspeed
transformers
accelerate
# Data & MLOps
dvc[s3] # or [gcs], [azure]
feast
pachyderm-sdk
pyarrow
pandas
# Kubernetes & Cloud SDKs
kubernetes
boto3 # For AWS
google-cloud-aiplatform # For GCP
azure-ai-ml # For Azure
# Utilities
numpy
scikit-learn
tqdm
Install these packages using pip:
pip install -r requirements.txt
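After installation, a quick spot-check confirms the core packages are importable without actually loading them (which can be slow for torch). This sketch checks a subset of the list; adjust the names to match your provider choice:

```python
from importlib.util import find_spec

# Subset of requirements.txt to spot-check; edit to match your setup
packages = ["torch", "transformers", "kubernetes", "numpy", "pandas"]

# find_spec returns None when a package cannot be located on sys.path
missing = [p for p in packages if find_spec(p) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All core packages importable.")
```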
Note: Depending on your chosen cloud provider, you may only need one of boto3, google-cloud-aiplatform, or azure-ai-ml. The dvc extra ([s3], [gcs], or [azure]) should likewise match your provider for remote storage access.
With your environment fully configured, you are now prepared to build and manage the high-performance systems central to this course. In the next chapter, we will use these tools to implement our first distributed model training job.