Preparing your workstation and cloud environment for practical work is essential. These setup instructions cover the command-line tools and libraries you will use throughout the course. A properly configured environment is a prerequisite for successfully executing the distributed training, resource orchestration, and model deployment tasks in the following chapters.
We assume you have administrative access to your local machine and have an active account with at least one major cloud provider (AWS, GCP, or Azure) with permissions to create and manage compute and storage resources.
Your local machine will act as the control plane for orchestrating cloud resources. The following tools are essential for interacting with container and cluster management systems.
Containerization is fundamental to modern MLOps. We will use Docker to build and manage images for our training and inference applications. If you do not have it installed, download and install Docker Desktop for your operating system.
After installation, verify that the Docker daemon is running:
docker ps
This command should execute without errors and print the table header for running containers (the table will be empty on a fresh installation).
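As a preview of how we will package applications, here is a minimal sketch of a Dockerfile for a training image. The base image tag and the script name (train.py) are placeholders, not files from this course:

```dockerfile
# Minimal sketch of a training image; base tag and script name are placeholders
FROM python:3.10-slim

WORKDIR /app

# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY train.py .
ENTRYPOINT ["python", "train.py"]
```

Copying and installing requirements.txt before the application code is a common pattern: it lets Docker reuse the cached dependency layer when only your code changes.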
kubectl is the primary command-line interface for interacting with any Kubernetes cluster. You will use it to deploy applications, inspect cluster resources, and manage networking configurations.
Install kubectl following the official Kubernetes documentation for your operating system. Verify the client-side installation with:
kubectl version --client
This will output the client version, confirming that the binary is in your system's PATH. We will configure its connection to a cloud-based cluster later.
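To give a sense of what deploying with kubectl looks like, here is a minimal sketch of a Deployment manifest. The image name and labels are placeholders for illustration:

```yaml
# deployment.yaml -- minimal sketch; the image name is a placeholder
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
      - name: server
        image: my-registry/inference-server:latest
        ports:
        - containerPort: 8000
```

Once connected to a cluster, you would apply it with `kubectl apply -f deployment.yaml` and inspect the result with `kubectl get deployments`.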
Helm helps manage Kubernetes applications through "charts", which are pre-packaged sets of resource definitions. We will use Helm to deploy more complex systems such as the NVIDIA Triton Inference Server and Kubeflow.
Install Helm using the instructions on its official website. Verify the installation by running:
helm version
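The typical Helm workflow is to register a chart repository, refresh the local index, and install a release. The commands below are illustrative and require a running cluster; the Bitnami nginx chart is used here only as a well-known example, not a course component:

```shell
# Register a public chart repository and refresh the local chart index
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Install a release into its own namespace (illustrative chart choice)
helm install my-nginx bitnami/nginx --namespace demo --create-namespace

# List installed releases in that namespace
helm list --namespace demo
```

We will use the same repo add / install pattern later for Triton and Kubeflow charts.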
You must install and configure the command-line interface for your chosen cloud provider. This enables programmatic access to provision infrastructure, such as Kubernetes clusters and GPU instances.
For Amazon Web Services, install the AWS CLI. Once installed, configure it with your credentials:
aws configure
You will be prompted for your AWS Access Key ID, Secret Access Key, default region (e.g., us-east-1), and default output format (e.g., json). To confirm the setup is working, run the following command to check your identity:
aws sts get-caller-identity
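The command returns a small JSON document describing your identity. If you later want to consume it in a script, it parses directly; the sample below uses placeholder values in the standard response shape:

```python
import json

# Sample get-caller-identity response; all values are placeholders
sample = (
    '{"UserId": "AIDAEXAMPLE", '
    '"Account": "123456789012", '
    '"Arn": "arn:aws:iam::123456789012:user/alice"}'
)

identity = json.loads(sample)
print(identity["Account"])  # the 12-digit AWS account ID
```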
For Google Cloud Platform, install the Google Cloud SDK, which includes the gcloud command-line tool. After installation, initialize the SDK:
gcloud init
This command will walk you through authenticating your account, selecting a project, and configuring a default region (e.g., us-central1) and zone. Verify authentication by listing your active projects:
gcloud projects list
For Microsoft Azure, install the Azure CLI. Authenticate by running:
az login
This command will open a browser window for you to sign in. After authenticating, set your default subscription if you have more than one:
az account set --subscription "Your-Subscription-Name-or-ID"
Verify the setup by listing the available resource groups in your account:
az group list --output table
The developer's local workstation uses CLI tools to orchestrate and deploy applications to managed services within a cloud provider's environment.
The hands-on exercises use Python 3.10 or newer. We strongly recommend using a dedicated virtual environment to manage dependencies and avoid conflicts with system-level packages.
Create and activate a virtual environment using venv:
python3 -m venv aii-env
source aii-env/bin/activate
On Windows, run aii-env\Scripts\activate instead.
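You can confirm that the interpreter you are running actually lives inside the virtual environment. A small check, using the fact that sys.prefix diverges from sys.base_prefix inside a venv:

```python
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points into the environment directory,
    # while sys.base_prefix still points at the original interpreter.
    return sys.prefix != sys.base_prefix

print(f"Python {sys.version_info.major}.{sys.version_info.minor}, "
      f"venv active: {in_virtualenv()}")
```

If this reports that no venv is active after you ran the activate script, check that `which python3` resolves to the aii-env directory.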
Next, create a file named requirements.txt and populate it with the core libraries we will use. This list includes frameworks for deep learning, distributed training, data management, and cloud provider interaction.
requirements.txt:
# Deep Learning & Distributed Training
torch>=2.0.0
torchvision
torchaudio
deepspeed
transformers
accelerate
# Data & MLOps
dvc[s3] # or [gcs], [azure]
feast
pachyderm-sdk
pyarrow
pandas
# Kubernetes & Cloud SDKs
kubernetes
boto3 # For AWS
google-cloud-aiplatform # For GCP
azure-ai-ml # For Azure
# Utilities
numpy
scikit-learn
tqdm
Install these packages using pip:
pip install -r requirements.txt
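After installation, a quick spot-check confirms the core packages are importable without actually loading them (which can be slow for torch). This sketch checks a subset of the list; adjust the names to match your provider choice:

```python
from importlib.util import find_spec

# Subset of requirements.txt to spot-check; edit to match your setup
packages = ["torch", "transformers", "kubernetes", "numpy", "pandas"]

# find_spec returns None when a package cannot be located on sys.path
missing = [p for p in packages if find_spec(p) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All core packages importable.")
```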
Note: Depending on your chosen cloud provider, you may only need one of boto3, google-cloud-aiplatform, or azure-ai-ml. The dvc extra ([s3], [gcs], or [azure]) should likewise match your provider for remote storage access.
With your environment fully configured, you are now prepared to build and manage the high-performance systems central to this course. In the next chapter, we will use these tools to implement our first distributed model training job.