You will configure a GPU-enabled node pool that scales from zero, ensuring these expensive resources are only provisioned when actively needed. This setup is fundamental for building a cost-efficient, shared ML platform.
We will combine Kubernetes taints, tolerations, and resource requests to create a system where the Cluster Autoscaler can make intelligent, GPU-aware scaling decisions. You will deploy a pod that specifically requests a GPU and observe the entire automated process, from the pod's initial Pending state to the provisioning of a new GPU node and the final scheduling of the workload.
Before you begin, ensure your environment is prepared with the following:
- A running Kubernetes cluster on GKE, EKS, or AKS with the Cluster Autoscaler enabled.
- kubectl configured to communicate with your cluster.
- Your cloud provider's CLI (gcloud, aws, or az) installed and authenticated.
- Helm installed, which we will use to deploy the NVIDIA GPU Operator.

The logic for GPU-aware autoscaling does not rely on a special "GPU mode" in the Cluster Autoscaler. Instead, it is an interaction of standard Kubernetes scheduling primitives. The process works as follows:
1. Every node in the GPU node pool carries the taint nvidia.com/gpu=present:NoSchedule. This taint prevents any pod from being scheduled on these nodes unless it explicitly has a matching toleration. This ring-fences our expensive GPU resources for workloads that truly need them.
2. A GPU workload is defined to request nvidia.com/gpu in its resources.limits and to include a toleration for the taint applied in the previous step.
3. Because the GPU node pool is scaled to zero, no suitable node exists, and the pod remains in the Pending state.
4. The Cluster Autoscaler notices the Pending pod. It analyzes the pod's requirements (GPU resource, toleration) and determines that adding a node from the tainted GPU node pool would allow the pod to be scheduled.
5. The autoscaler provisions a new node from the GPU pool. Once that node is Ready, the Kubernetes scheduler places the pending pod onto it.

The following diagram illustrates this entire sequence of events.
The autoscaling flow from pod submission to successful scheduling on a newly provisioned GPU node.
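The taint at the center of this flow is a standard Kubernetes primitive, not something specific to managed node pools. As a point of reference, the same taint could be applied by hand to any node; the node name below is a placeholder.

# Equivalent manual taint for a self-managed node (<node-name> is a placeholder)
kubectl taint nodes <node-name> nvidia.com/gpu=present:NoSchedule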
First, we define a node pool that is eligible for autoscaling and is specifically designated for GPU workloads. The two most important parameters are setting the minimum size to 0 and applying a NoSchedule taint. Setting min-nodes to zero is the foundation of our cost-saving strategy.
Here is an example command for creating such a node pool in Google Kubernetes Engine (GKE). The principles are identical for EKS or AKS, though the specific flags will differ.
# Example for GKE
gcloud container node-pools create gpu-pool \
--project "<your-gcp-project>" \
--cluster "<your-cluster-name>" \
--zone "<your-cluster-zone>" \
--machine-type "n1-standard-4" \
--accelerator "type=nvidia-tesla-t4,count=1" \
--enable-autoscaling \
--min-nodes "0" \
--max-nodes "5" \
--node-taints "nvidia.com/gpu=present:NoSchedule" \
--node-labels "app-type=gpu-workloads"
After running this command, you have a node pool named gpu-pool. It currently has zero nodes, but it's ready to scale up to five T4-equipped nodes. Any node created in this pool will automatically be tainted, repelling pods that do not have the correct toleration.
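If you want to double-check the result, gcloud can print the pool's autoscaling bounds, taints, and labels. The format projection below is only one way to narrow the output; adjust it to taste.

# Optional: confirm the autoscaling bounds and taint on the new pool
gcloud container node-pools describe gpu-pool \
  --cluster "<your-cluster-name>" \
  --zone "<your-cluster-zone>" \
  --format "yaml(autoscaling, config.taints, config.labels)"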
Kubernetes itself does not natively understand what a GPU is. It requires a device plugin to discover and expose the GPU hardware on a node. The NVIDIA GPU Operator is the recommended way to manage this, as it handles the driver installation, device plugin registration, and monitoring components automatically.
If you have not already installed it, add the NVIDIA Helm repository and install the operator.
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator \
--wait \
--namespace gpu-operator \
--create-namespace
Once the operator's pods are running, any node with an NVIDIA GPU will automatically receive NVIDIA-specific labels (such as nvidia.com/gpu.present=true from the operator's GPU Feature Discovery component), and its GPU capacity will become visible to the Kubernetes scheduler as an allocatable resource named nvidia.com/gpu.
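You can verify the installation even before any GPU node exists, and check the advertised GPU capacity once one joins. The node name below is a placeholder; the jsonpath expression simply reads the allocatable nvidia.com/gpu count.

# Verify the operator components are running
kubectl get pods -n gpu-operator

# Once a GPU node has joined the cluster, inspect its allocatable GPU count
# (<gpu-node-name> is a placeholder)
kubectl get node <gpu-node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'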
Now, create the workload that will trigger the scale-up. The following YAML manifest defines a simple pod that does nothing but hold a GPU. Notice the two critical sections: resources.limits to request the GPU and tolerations to allow it to be scheduled on our tainted nodes.
Save this content as gpu-test-pod.yaml.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.4.2-base-ubuntu20.04
    command: ["/bin/bash", "-c", "sleep 3600"] # Sleep for 1 hour
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
With all the pieces in place, apply the manifest and watch the automation unfold.
Deploy the Pod:
kubectl apply -f gpu-test-pod.yaml
Watch the Pod's Status:
Immediately after, check its status. You will see it is Pending.
kubectl get pods -w
# Output will look like this initially
# NAME READY STATUS RESTARTS AGE
# gpu-test-pod 0/1 Pending 0 2s
Inspect the Pod's Events:
To understand why it's pending, describe the pod. In the Events section, you will see a message from the scheduler indicating that it couldn't find a suitable node.
kubectl describe pod gpu-test-pod
You should see an event similar to: Warning FailedScheduling ... 0/X nodes are available: X node(s) had taints that the pod didn't tolerate. This is expected and is the trigger for the Cluster Autoscaler.
Check the Cluster Autoscaler Logs:
In another terminal, tail the logs of the Cluster Autoscaler deployment. You will see it recognize the pending pod and trigger a scale-up of the gpu-pool.
# Find the autoscaler pod
kubectl get pods -n kube-system | grep cluster-autoscaler
# Tail its logs
kubectl logs -f <cluster-autoscaler-pod-name> -n kube-system
Look for lines containing Scale-up, pod gpu-test-pod triggered scale-up, and ... expanding node group .../gpu-pool from 0 to 1.
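If the log volume is high, piping the stream through grep is a convenient way to surface only the relevant messages; the pattern below is just a suggestion.

# Optional: filter the autoscaler logs for scale-up activity
kubectl logs -f <cluster-autoscaler-pod-name> -n kube-system | grep -iE "scale-up|gpu-pool"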
Watch the New Node Appear:
While the autoscaler log is running, watch the nodes in your cluster. A new node from the gpu-pool will appear, initially in a NotReady state, then transitioning to Ready after a few minutes.
kubectl get nodes -w
Confirm Pod Scheduling:
Once the new node is ready, the Kubernetes scheduler will automatically place gpu-test-pod onto it. The pod's status will change from Pending to ContainerCreating and finally to Running.
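Two quick checks confirm the outcome: the wide output shows which node the pod landed on, and, assuming the NVIDIA runtime has injected the driver utilities into the container, nvidia-smi should list the attached T4.

# See which node the pod was scheduled onto
kubectl get pod gpu-test-pod -o wide

# Optional: confirm the container can actually see the GPU
kubectl exec gpu-test-pod -- nvidia-smi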
The system also handles scaling down to conserve costs. Once you are finished with the GPU workload, delete the pod.
kubectl delete -f gpu-test-pod.yaml
After the pod is deleted, the new GPU node is now idle. The Cluster Autoscaler will recognize this underutilization. Following its configured timeout period (typically 10 minutes), it will terminate the node and scale the gpu-pool back down to zero. You have successfully created a fully elastic, on-demand pool of GPU resources. To avoid any further costs from this exercise, you can also delete the node pool itself.
# Example for GKE
gcloud container node-pools delete gpu-pool \
--cluster "<your-cluster-name>" \
--zone "<your-cluster-zone>"
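To confirm the cleanup, you can list the remaining node pools; gpu-pool should no longer appear once the deletion completes.

# Optional: confirm the pool has been removed
gcloud container node-pools list \
  --cluster "<your-cluster-name>" \
  --zone "<your-cluster-zone>"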