The principles of cloud infrastructure are best understood through direct application. This content guides you through provisioning, configuring, and accessing a GPU-powered virtual machine on a major cloud platform. The exercise demonstrates a foundational, repeatable workflow for setting up a remote environment for machine learning development and training.
While the steps are demonstrated using Amazon Web Services (AWS), the process is similar for Google Cloud Platform (GCP) and Microsoft Azure. The objective is to understand the core components: selecting an instance, choosing a pre-configured software image, managing network access, and connecting securely.
Before you begin, ensure you have the following ready: an AWS account with permission to launch EC2 instances, and the AWS CLI installed and configured with your credentials via aws configure.
Our first decision is to select a virtual machine configuration. As discussed previously, this involves balancing performance and cost. For this exercise, we will use an AWS g4dn.xlarge instance. It features an NVIDIA T4 GPU, which provides a good entry point for general-purpose ML tasks without the higher cost of top-tier training accelerators.
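To make the performance/cost trade-off concrete, the short sketch below projects the bill for a few session lengths. The hourly rate is an illustrative assumption, not a quoted AWS price; check current on-demand pricing for your region before budgeting.

```python
# Rough cost projection for a GPU instance session.
# HOURLY_RATE_USD is an illustrative assumption, not an official AWS price.
HOURLY_RATE_USD = 0.526  # assumed on-demand rate for g4dn.xlarge

def session_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Projected on-demand cost for keeping the instance up for `hours` hours."""
    return round(hours * rate, 2)

for hours in (1, 8, 24 * 7):
    print(f"{hours:>4} h of g4dn.xlarge -> ~${session_cost(hours):.2f}")
```

Even at a sub-dollar hourly rate, an instance forgotten for a week adds up, which is why the teardown step at the end of this lab matters.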
Next, we must select an Amazon Machine Image (AMI). An AMI is a template that contains the operating system and pre-installed software. To save significant setup time, we will use an official AWS Deep Learning AMI. These images come with NVIDIA drivers, CUDA, cuDNN, and major ML frameworks like TensorFlow and PyTorch already installed.
We will look for an AMI named something similar to Deep Learning AMI GPU TensorFlow X.X.X (Ubuntu 20.04). You can find the latest AMI ID for your chosen region in the AWS EC2 console or by using the AWS CLI. For this example, we will use a placeholder AMI ID. You must replace ami-0123456789abcdef0 with a valid AMI ID from your region.
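When several Deep Learning AMIs match the name filter, you generally want the most recently published one. The sketch below picks the newest entry from a describe-images-style listing; the image IDs and dates are made-up examples of the real response shape.

```python
# Choose the newest image from a describe-images style listing.
# These entries are made-up examples of the real response shape.
images = [
    {"ImageId": "ami-0a1b2c3d4e5f60718", "CreationDate": "2023-01-15T08:00:00.000Z"},
    {"ImageId": "ami-0b2c3d4e5f6a7b8c9", "CreationDate": "2023-06-02T11:30:00.000Z"},
    {"ImageId": "ami-0c3d4e5f6a7b8c9d0", "CreationDate": "2022-11-20T16:45:00.000Z"},
]

# ISO 8601 timestamps in a uniform format sort correctly as plain strings,
# so max() over CreationDate yields the most recent image.
latest = max(images, key=lambda img: img["CreationDate"])
print(latest["ImageId"])
```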
Before launching the instance, we need to define who can access it. This is managed through a security group, which acts as a virtual firewall. For this lab, we only need to allow inbound SSH traffic (on port 22) from our own IP address.
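The /32 suffix on the CIDR restricts the rule to exactly one source address. Python's standard ipaddress module can confirm which addresses a CIDR admits; the IPs below are documentation-range examples, not real infrastructure.

```python
import ipaddress

def ip_allowed(source_ip: str, cidr: str) -> bool:
    """Return True if source_ip falls within the authorized CIDR block."""
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(cidr)

# A /32 rule admits exactly one address (example IPs from the documentation range):
print(ip_allowed("203.0.113.7", "203.0.113.7/32"))   # the authorized host
print(ip_allowed("203.0.113.8", "203.0.113.7/32"))   # any other host is rejected
```

If your ISP rotates your public IP, the SSH rule will silently start rejecting you; re-run the authorize command with your new address rather than widening the CIDR.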
We also need an SSH key pair to authenticate our connection. You can create a new key pair through the AWS console or with the CLI. The following command creates a key pair named ai-infra-key and saves the private key to a local file named ai-infra-key.pem.
aws ec2 create-key-pair --key-name ai-infra-key --query 'KeyMaterial' --output text > ai-infra-key.pem
Important: Secure this .pem file. It is the only way to access your instance. You also need to restrict its file permissions, because SSH refuses to use a private key that is readable by other users.
chmod 400 ai-infra-key.pem
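Mode 400 grants read-only access to the file's owner and nothing to anyone else. Purely to illustrate what those octal bits mean, Python's stat module can render them in the familiar ls -l notation (the 0o100000 prefix marks a regular file).

```python
import stat

# 0o100400 = regular file (0o100000) with owner-read-only permissions (0o400),
# i.e. what `chmod 400` produces on an ordinary file.
mode = 0o100400
print(stat.filemode(mode))  # -r--------
```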
With our instance type, AMI, and security settings defined, we are ready to launch the virtual machine. The following AWS CLI command bundles all these configurations into a single request.
This command performs several actions:
- --image-id: Specifies the Deep Learning AMI. Remember to replace the placeholder.
- --instance-type: Sets the hardware to g4dn.xlarge.
- --key-name: Associates the SSH key pair we just created.
- --security-group-ids: You should first create a security group that allows port 22 access and use its ID here (e.g., sg-012345abcdef).
- --tag-specifications: Assigns a descriptive name to the instance, making it easy to find later.

# First, find a valid Deep Learning AMI ID in your region (e.g., us-east-1)
# aws ec2 describe-images --owners amazon --filters "Name=name,Values=Deep Learning AMI GPU TensorFlow*" --query 'Images[?CreationDate>`2023-01-01`].{ID:ImageId,Name:Name}' --region us-east-1
# Then, create a security group
# aws ec2 create-security-group --group-name my-gpu-sg --description "SG for GPU instance"
# aws ec2 authorize-security-group-ingress --group-name my-gpu-sg --protocol tcp --port 22 --cidr YOUR_IP_ADDRESS/32
# Now, launch the instance with your AMI and SG IDs
aws ec2 run-instances \
--image-id ami-0123456789abcdef0 \
--instance-type g4dn.xlarge \
--key-name ai-infra-key \
--security-group-ids sg-012345abcdef \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=AI-Infra-Lab-Instance}]'
After running this command, AWS will return a JSON object describing the new instance. Note the InstanceId from the output, as you will need it to manage the instance.
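If you are scripting the workflow, you can pull the InstanceId out of that JSON instead of copying it by hand. The snippet below parses a trimmed, hypothetical example of the run-instances response structure.

```python
import json

# Trimmed, hypothetical example of a run-instances JSON response.
response_text = """
{
  "Instances": [
    {"InstanceId": "i-0abc1234def567890", "InstanceType": "g4dn.xlarge"}
  ]
}
"""

response = json.loads(response_text)
instance_id = response["Instances"][0]["InstanceId"]
print(instance_id)  # the ID you will need for describe/terminate calls
```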
It takes a few minutes for the instance to initialize. You can get its public IP address using the InstanceId.
# Replace i-012345abcdef with your actual InstanceId
aws ec2 describe-instances --instance-ids i-012345abcdef --query 'Reservations[].Instances[].PublicIpAddress' --output text
Once you have the IP address, connect to the instance using SSH. The default username for Ubuntu-based AMIs on AWS is ubuntu.
ssh -i "ai-infra-key.pem" ubuntu@YOUR_INSTANCE_PUBLIC_IP
If the connection is successful, you will see the command prompt of your new cloud server. The first thing to do is verify that the GPU is recognized and the drivers are working correctly.
Run the NVIDIA System Management Interface (nvidia-smi) tool:
nvidia-smi
You should see an output table detailing the NVIDIA driver version, CUDA version, and information about the attached GPU, in this case, a Tesla T4.
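Beyond eyeballing the table, nvidia-smi can emit machine-readable CSV via its --query-gpu flag, which is handy in setup scripts. The sketch below parses one such line; the sample string stands in for a live run on the instance, and the driver version shown is illustrative.

```python
# Parse machine-readable output from a command like:
#   nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
# The sample line below stands in for a real run on the instance.
sample = "Tesla T4, 525.85.12, 15360 MiB"

name, driver, memory = [field.strip() for field in sample.split(",")]
print(f"GPU: {name}, driver {driver}, {memory} of memory")
```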
Figure: The workflow for provisioning and verifying a cloud GPU instance.
As a final check, let's run a short Python script to confirm that PyTorch can access the GPU.
# Save this as verify_gpu.py
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"PyTorch has access to a CUDA device: {torch.cuda.get_device_name(0)}")
    # Create a tensor and move it to the GPU
    x = torch.tensor([1.0, 2.0, 3.0], device=device)
    print(f"Tensor successfully created on the GPU: {x}")
    print(f"Tensor device: {x.device}")
else:
    print("Error: PyTorch cannot find a CUDA-enabled GPU.")
Run the script from your SSH session: python3 verify_gpu.py. A successful output confirms your environment is fully operational.
Cloud resources incur costs as long as they are running. A "stopped" instance stops accruing compute charges, but its storage volume (EBS) is still billed. To avoid any further costs from this lab, you must terminate the instance.
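To see why a merely stopped instance still costs money, the sketch below estimates the monthly charge for the root EBS volume left behind. Both the per-GB rate and the volume size are illustrative assumptions, not quoted AWS prices.

```python
# Monthly storage cost for a stopped instance's leftover EBS volume.
# Both figures below are illustrative assumptions, not official AWS prices.
GB_MONTH_RATE_USD = 0.08  # assumed gp3 rate per GB-month
VOLUME_SIZE_GB = 100      # assumed root volume size for a Deep Learning AMI

monthly_cost = VOLUME_SIZE_GB * GB_MONTH_RATE_USD
print(f"~${monthly_cost:.2f}/month for a {VOLUME_SIZE_GB} GB volume left behind")
```

A few dollars a month sounds small, but forgotten volumes accumulate across labs; termination deletes the instance's storage along with it.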
Warning: Termination is an irreversible action. All data on the instance's local storage will be permanently deleted.
Use the InstanceId you noted earlier to terminate the instance.
# Replace i-012345abcdef with your actual InstanceId
aws ec2 terminate-instances --instance-ids i-012345abcdef
This command will schedule the instance for termination. After a few minutes, it will be completely removed, and billing will cease. Always double-check in the AWS Management Console to ensure your resources are terminated after completing your work. This habit is one of the most important aspects of effective cloud cost management.