You will configure a GPU-enabled node pool that scales from zero, ensuring these expensive resources are only provisioned when actively needed. This setup is fundamental for building a cost-efficient, shared ML platform.
We will combine Kubernetes taints, tolerations, and resource requests to create a system where the Cluster Autoscaler can make intelligent, GPU-aware scaling decisions. You will deploy a pod that specifically requests a GPU and observe the entire automated process, from the pod's initial Pending state to the provisioning of a new GPU node and the final scheduling of the workload.
Before you begin, ensure your environment is prepared with the following:
- A running Kubernetes cluster on GKE, EKS, or AKS with the Cluster Autoscaler enabled.
- kubectl configured to communicate with your cluster.
- Your cloud provider's CLI (gcloud, aws, or az) installed and authenticated.
- Helm installed, which we will use to deploy the NVIDIA GPU Operator.

The logic for GPU-aware autoscaling does not rely on a special "GPU mode" in the Cluster Autoscaler. Instead, it is an interaction of standard Kubernetes scheduling primitives. The process works as follows:
1. Every node in the GPU node pool carries the taint nvidia.com/gpu=present:NoSchedule. This taint prevents any pod from being scheduled on these nodes unless it explicitly has a matching toleration. This ring-fences our expensive GPU resources for workloads that truly need them.
2. A GPU workload is defined to request nvidia.com/gpu in its resources.limits and to include a toleration for the taint applied in the previous step.
3. Because the GPU node pool is scaled to zero, no suitable node exists, and the pod remains in the Pending state.
4. The Cluster Autoscaler notices the Pending pod. It analyzes the pod's requirements (GPU resource, toleration) and determines that adding a node from the tainted GPU node pool would allow the pod to be scheduled.
5. The autoscaler provisions a new node from the GPU pool. Once that node is Ready, the Kubernetes scheduler places the pending pod onto it.

The following diagram illustrates this entire sequence of events.
The autoscaling flow from pod submission to successful scheduling on a newly provisioned GPU node.
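The taint at the center of this flow is a standard Kubernetes primitive, not something specific to managed node pools. As a point of reference, the same taint could be applied by hand to any node; the node name below is a placeholder.

# Equivalent manual taint for a self-managed node (<node-name> is a placeholder)
kubectl taint nodes <node-name> nvidia.com/gpu=present:NoSchedule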
First, we define a node pool that is eligible for autoscaling and is specifically designated for GPU workloads. The two most important parameters are setting the minimum size to 0 and applying a NoSchedule taint. Setting min-nodes to zero is the foundation of our cost-saving strategy.
Here is an example command for creating such a node pool in Google Kubernetes Engine (GKE). The principles are identical for EKS or AKS, though the specific flags will differ.
# Example for GKE
gcloud container node-pools create gpu-pool \
--project "<your-gcp-project>" \
--cluster "<your-cluster-name>" \
--zone "<your-cluster-zone>" \
--machine-type "n1-standard-4" \
--accelerator "type=nvidia-tesla-t4,count=1" \
--enable-autoscaling \
--min-nodes "0" \
--max-nodes "5" \
--node-taints "nvidia.com/gpu=present:NoSchedule" \
--node-labels "app-type=gpu-workloads"
After running this command, you have a node pool named gpu-pool. It currently has zero nodes, but it's ready to scale up to five T4-equipped nodes. Any node created in this pool will automatically be tainted, repelling pods that do not have the correct toleration.
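If you want to double-check the result, gcloud can print the pool's autoscaling bounds, taints, and labels. The format projection below is only one way to narrow the output; adjust it to taste.

# Optional: confirm the autoscaling bounds and taint on the new pool
gcloud container node-pools describe gpu-pool \
  --cluster "<your-cluster-name>" \
  --zone "<your-cluster-zone>" \
  --format "yaml(autoscaling, config.taints, config.labels)"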
Kubernetes itself does not natively understand what a GPU is. It requires a device plugin to discover and expose the GPU hardware on a node. The NVIDIA GPU Operator is the recommended way to manage this, as it handles the driver installation, device plugin registration, and monitoring components automatically.
If you have not already installed it, add the NVIDIA Helm repository and install the operator.
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator \
--wait \
--namespace gpu-operator \
--create-namespace
Once the operator's pods are running, any node with an NVIDIA GPU will automatically receive NVIDIA-specific labels (such as nvidia.com/gpu.present=true from the operator's GPU Feature Discovery component), and its GPU capacity will become visible to the Kubernetes scheduler as an allocatable resource named nvidia.com/gpu.
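You can verify the installation even before any GPU node exists, and check the advertised GPU capacity once one joins. The node name below is a placeholder; the jsonpath expression simply reads the allocatable nvidia.com/gpu count.

# Verify the operator components are running
kubectl get pods -n gpu-operator

# Once a GPU node has joined the cluster, inspect its allocatable GPU count
# (<gpu-node-name> is a placeholder)
kubectl get node <gpu-node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'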
Now, create the workload that will trigger the scale-up. The following YAML manifest defines a simple pod that does nothing but hold a GPU. Notice the two critical sections: resources.limits to request the GPU and tolerations to allow it to be scheduled on our tainted nodes.
Save this content as gpu-test-pod.yaml.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.4.2-base-ubuntu20.04
    command: ["/bin/bash", "-c", "sleep 3600"] # Sleep for 1 hour
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
With all the pieces in place, apply the manifest and watch the automation unfold.
Deploy the Pod:
kubectl apply -f gpu-test-pod.yaml
Watch the Pod's Status:
Immediately after, check its status. You will see it is Pending.
kubectl get pods -w
# Output will look like this initially
# NAME READY STATUS RESTARTS AGE
# gpu-test-pod 0/1 Pending 0 2s
Inspect the Pod's Events:
To understand why it's pending, describe the pod. In the Events section, you will see a message from the scheduler indicating that it couldn't find a suitable node.
kubectl describe pod gpu-test-pod
You should see an event similar to: Warning FailedScheduling ... 0/X nodes are available: X node(s) had taints that the pod didn't tolerate. This is expected and is the trigger for the Cluster Autoscaler.
Check the Cluster Autoscaler Logs:
In another terminal, tail the logs of the Cluster Autoscaler deployment. You will see it recognize the pending pod and trigger a scale-up of the gpu-pool.
# Find the autoscaler pod
kubectl get pods -n kube-system | grep cluster-autoscaler
# Tail its logs
kubectl logs -f <cluster-autoscaler-pod-name> -n kube-system
Look for lines containing Scale-up, pod gpu-test-pod triggered scale-up, and ... expanding node group .../gpu-pool from 0 to 1.
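If the log volume is high, piping the stream through grep is a convenient way to surface only the relevant messages; the pattern below is just a suggestion.

# Optional: filter the autoscaler logs for scale-up activity
kubectl logs -f <cluster-autoscaler-pod-name> -n kube-system | grep -iE "scale-up|gpu-pool"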
Watch the New Node Appear:
While the autoscaler log is running, watch the nodes in your cluster. A new node from the gpu-pool will appear, initially in a NotReady state, then transitioning to Ready after a few minutes.
kubectl get nodes -w
Confirm Pod Scheduling:
Once the new node is ready, the Kubernetes scheduler will automatically place gpu-test-pod onto it. The pod's status will change from Pending to ContainerCreating and finally to Running.
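Two quick checks confirm the outcome: the wide output shows which node the pod landed on, and, assuming the NVIDIA runtime has injected the driver utilities into the container, nvidia-smi should list the attached T4.

# See which node the pod was scheduled onto
kubectl get pod gpu-test-pod -o wide

# Optional: confirm the container can actually see the GPU
kubectl exec gpu-test-pod -- nvidia-smi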
The system also handles scaling down to conserve costs. Once you are finished with the GPU workload, delete the pod.
kubectl delete -f gpu-test-pod.yaml
After the pod is deleted, the new GPU node is now idle. The Cluster Autoscaler will recognize this underutilization. Following its configured timeout period (typically 10 minutes), it will terminate the node and scale the gpu-pool back down to zero. You have successfully created a fully elastic, on-demand pool of GPU resources. To avoid any further costs from this exercise, you can also delete the node pool itself.
# Example for GKE
gcloud container node-pools delete gpu-pool \
--cluster "<your-cluster-name>" \
--zone "<your-cluster-zone>"
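To confirm the cleanup, you can list the remaining node pools; gpu-pool should no longer appear once the deletion completes.

# Optional: confirm the pool has been removed
gcloud container node-pools list \
  --cluster "<your-cluster-name>" \
  --zone "<your-cluster-zone>"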