Machine learning workloads are inherently dynamic. A large-scale training job can demand dozens of GPUs for several hours or days, after which those resources sit idle. Conversely, a newly deployed inference service might experience a sudden surge in traffic, requiring an immediate increase in serving capacity. Managing a static cluster of compute nodes in such an environment leads to a difficult choice: overprovision resources and incur high costs for idle capacity, or underprovision and risk jobs waiting indefinitely in a queue.
The Kubernetes Cluster Autoscaler directly addresses this challenge by automatically adjusting the number of nodes in your cluster. It watches for pods that cannot be scheduled due to resource constraints and adds new nodes to accommodate them. It also consolidates workloads and removes underutilized nodes, providing a mechanism for elastic infrastructure that aligns resource supply with application demand.
The Cluster Autoscaler operates on a simple yet effective control loop. Its primary trigger for scaling up is the presence of pods in the Pending state with a specific event reason: FailedScheduling. This event indicates that the Kubernetes scheduler could not find any existing node with sufficient available resources (like CPU, memory, or GPUs) to run the pod.
When the Cluster Autoscaler detects such a pod, it evaluates the configured node pools, which are groups of nodes with identical instance types managed by your cloud provider (e.g., AWS Auto Scaling Groups, GCP Managed Instance Groups). It simulates the scheduling of the pending pod onto a new node from each applicable pool and selects one that can satisfy the pod's resource requests.
Scaling down is driven by utilization. The autoscaler periodically checks the resource usage of all nodes. If a node's utilization drops below a configured threshold (typically around 50%) for a sustained period, and all pods running on it can be safely rescheduled elsewhere, the node becomes a candidate for removal. The autoscaler then drains the node, evicting the pods gracefully, and terminates the underlying cloud instance.
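Both the utilization threshold and the waiting period are set with command-line flags on the Cluster Autoscaler deployment itself. The fragment below is a sketch of how those flags typically appear; the image tag and flag values are illustrative rather than prescriptive.

# Fragment of a cluster-autoscaler Deployment; flag values shown are common defaults.
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws                      # or gce, azure, etc.
  - --scale-down-utilization-threshold=0.5    # node becomes a removal candidate below 50% utilization
  - --scale-down-unneeded-time=10m            # how long it must stay underutilized before removal
  - --scale-down-delay-after-add=10m          # cooldown after a scale-up before scale-down resumes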
A generic autoscaler configuration is insufficient for ML platforms. We must make it aware of specialized hardware like GPUs and ensure it makes intelligent decisions about which type of node to provision.
The standard practice is to segregate your cluster into multiple node pools based on hardware capabilities. For instance, you might have:
- A cpu-general pool with cost-effective, CPU-optimized instances for general-purpose workloads and system components.
- A gpu-training-a100 pool with powerful NVIDIA A100 GPUs for large-scale training.
- A gpu-inference-t4 pool with NVIDIA T4 GPUs, optimized for cost-effective, low-latency inference.

When a training pod requests an A100 GPU, the autoscaler identifies that only the gpu-training-a100 pool can satisfy this request and scales up that specific pool.
Diagram: the autoscaling process for a GPU-requesting pod. The Cluster Autoscaler identifies the pending pod and provisions a new node from the correct GPU-equipped node pool.
To enforce this segregation and prevent non-GPU workloads from occupying expensive GPU nodes, we use taints and tolerations. You should apply a taint to all nodes in a GPU pool.
For example, a node in the A100 pool could be tainted with nvidia.com/gpu=present:NoSchedule. This taint prevents any pod from being scheduled on it unless the pod has a matching toleration.
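In most managed Kubernetes offerings the taint is declared in the node pool configuration so that every node joins the cluster already tainted. On the node object itself it ends up in the spec, sketched below with a hypothetical node name:

# Taint as it appears on a node in the A100 pool (normally set via the node pool configuration).
apiVersion: v1
kind: Node
metadata:
  name: gpu-training-a100-node-1   # hypothetical node name
spec:
  taints:
  - key: "nvidia.com/gpu"
    value: "present"
    effect: "NoSchedule"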
A training pod that requires an A100 GPU would then include this toleration in its specification, along with a nodeSelector or nodeAffinity rule to explicitly request the correct hardware.
Here is a manifest snippet for a pod designed to run on a tainted A100 node:
apiVersion: v1
kind: Pod
metadata:
  name: large-model-training
spec:
  containers:
  - name: training-container
    image: my-registry/pytorch-a100:latest
    resources:
      limits:
        nvidia.com/gpu: 1 # This requests one GPU
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100 # Example for GKE
This combination ensures that:

- The toleration permits the pod to land on nodes tainted with nvidia.com/gpu=present:NoSchedule.
- The nodeSelector restricts it to nodes that actually expose an A100 accelerator.
- The nvidia.com/gpu resource limit tells the device plugin to allocate one GPU to the container.

When the Cluster Autoscaler sees this pending pod, it knows it must scale up a node pool that can provide a node matching all these criteria.
While the core mechanism is straightforward, production environments require careful tuning.
Expanders: The Cluster Autoscaler can be configured with an expander strategy to decide which node pool to scale when multiple options are available. Common strategies include least-waste, which picks the pool that will have the least idle CPU or memory after the pod is scheduled, and priority, which uses a user-defined priority list for node pools. For ML, least-waste is often a good starting point to improve cost efficiency.
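The strategy is chosen with the --expander flag; a minimal sketch of the relevant argument:

# Selecting the expander strategy on the cluster-autoscaler command line.
command:
- ./cluster-autoscaler
- --expander=least-waste   # alternatives include random, most-pods, and priority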
Scale-Up Latency: Provisioning a new cloud VM, especially one with GPUs, is not instantaneous. It can take several minutes. For workloads sensitive to start-up time, you can use a cluster-overprovisioner. This involves deploying low-priority "pause" pods that reserve resources. When a high-priority ML pod arrives, it preempts a pause pod, which is then rescheduled, triggering the autoscaler to add a new node in the background. This keeps a buffer of "hot" capacity ready.
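A sketch of this pattern is shown below. The PriorityClass value, buffer size, and resource requests are assumptions you would tune to your expected burst; the pause image simply occupies the reservation without doing any work.

# Hypothetical overprovisioning buffer: low-priority pause pods that hold warm GPU capacity.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                         # lower than any real workload, so these pods are preempted first
globalDefault: false
description: "Placeholder pods that reserve spare capacity for bursty ML workloads."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-capacity-buffer        # hypothetical name
spec:
  replicas: 1                      # size of the warm buffer
  selector:
    matchLabels:
      app: gpu-capacity-buffer
  template:
    metadata:
      labels:
        app: gpu-capacity-buffer
    spec:
      priorityClassName: overprovisioning
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          limits:
            nvidia.com/gpu: 1      # reserves one GPU's worth of node capacity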
Scale-Down Behavior: The --scale-down-unneeded-time flag determines how long a node must be underutilized before it's considered for termination. A short duration (e.g., 5 minutes) optimizes for cost but can lead to cluster "thrashing" if demand fluctuates rapidly. A longer duration (e.g., 20 minutes) provides more stability at a higher cost. Furthermore, you must protect certain pods from eviction during scale-down. Using a PodDisruptionBudget (PDB) for critical services ensures that the autoscaler will not remove a node if doing so would violate the budget (e.g., bring the number of available replicas for a service below a required minimum).
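As a sketch, a PodDisruptionBudget for a hypothetical inference Deployment labeled app: model-server might look like this; the autoscaler will refuse to drain a node if evicting its pods would drop the service below two available replicas:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: model-server-pdb          # hypothetical name
spec:
  minAvailable: 2                 # never allow voluntary evictions below two replicas
  selector:
    matchLabels:
      app: model-server           # hypothetical label on the inference pods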
By integrating the Cluster Autoscaler with a well-architected set of node pools and scheduling primitives, you can build a truly elastic infrastructure platform. This system dynamically provides expensive GPU resources precisely when they are needed and relinquishes them when they are not, forming a critical component of a cost-effective and responsive AI platform.