While Kubernetes can schedule pods to nodes with GPUs, its default behavior treats each accelerator as an indivisible unit. This "all-or-nothing" allocation often leads to significant resource waste, as many ML tasks, from interactive development to lightweight inference, do not require the full power of a modern GPU. To build a cost-effective and efficient ML platform, you must implement more sophisticated scheduling and sharing mechanisms.
This is where the NVIDIA GPU Operator for Kubernetes becomes indispensable. It automates the management of all necessary NVIDIA software components, including drivers, the container toolkit, and the device plugin, which exposes GPUs as schedulable resources. We will build on this foundation to implement two advanced sharing strategies: software-based time-slicing and hardware-based Multi-Instance GPU (MIG).
Before partitioning resources, you need a mechanism to manage them. The NVIDIA GPU Operator is the standard for production environments. It uses the operator pattern in Kubernetes to manage the lifecycle of GPU-related software on each node of your cluster. Once installed, the Kubernetes scheduler becomes aware of the nvidia.com/gpu resource, allowing you to request GPUs in your pod specifications.
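Once the operator's components are running, each GPU node advertises its accelerators through the node's capacity and allocatable fields. The excerpt below is a sketch of what this looks like in the output of kubectl get node <node-name> -o yaml; the CPU, memory, and GPU counts are illustrative, not values from a specific cluster.

```yaml
# Illustrative excerpt of a GPU node's status after the GPU Operator is installed.
# The nvidia.com/gpu entry is registered by the NVIDIA device plugin; counts are examples.
status:
  capacity:
    cpu: "64"
    memory: 512Gi
    nvidia.com/gpu: "8"
  allocatable:
    nvidia.com/gpu: "8"
```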
A basic pod requesting a full GPU looks like this:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: full-gpu-training-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do nvidia-smi; sleep 30; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1 # Request one full, dedicated GPU
```
This works perfectly for large training jobs but is inefficient for smaller tasks.
Time-slicing allows multiple containers to share a single physical GPU. The GPU's scheduling mechanism rapidly switches context between the processes of different containers, giving each a slice of the GPU's execution time. This is a software-level solution that does not provide memory isolation, but it is highly effective at increasing utilization for workloads that are not compute-saturated or latency-sensitive, such as:

- Interactive development environments and Jupyter notebooks
- Low-traffic or experimental inference endpoints
- Lightweight jobs that use the GPU only intermittently
You enable time-slicing by applying a configuration to the GPU Operator. This configuration defines how a GPU should be partitioned. For example, you can specify that a GPU should be divided into three "shares." The device plugin then advertises this fractional resource to the Kubernetes scheduler.
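One common way to express this configuration is a ConfigMap holding the device plugin's sharing settings, which the GPU Operator's ClusterPolicy is then pointed at via its devicePlugin.config fields. The sketch below assumes a ConfigMap named time-slicing-config in the gpu-operator namespace and three replicas per GPU; treat the names and values as illustrative and confirm the exact schema against the GPU Operator documentation for your version.

```yaml
# Illustrative time-slicing configuration consumed by the NVIDIA device plugin.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # referenced from the ClusterPolicy's devicePlugin.config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true    # advertise shares as nvidia.com/gpu.shared
        resources:
        - name: nvidia.com/gpu
          replicas: 3            # each physical GPU appears as 3 schedulable shares
```

With renameByDefault enabled, the shares are advertised under a distinct resource name (nvidia.com/gpu.shared) rather than inflating the count of nvidia.com/gpu, which makes shared and dedicated GPUs easy to tell apart in pod specifications.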
A pod requesting one of these shares would use a resource limit like this:
```yaml
# Assumes the node has been configured to offer 3 time-slices per GPU
# and advertises the resource 'nvidia.com/gpu.shared'.
apiVersion: v1
kind: Pod
metadata:
  name: dev-notebook-pod
spec:
  containers:
  - name: jupyter-container
    image: jupyter/tensorflow-notebook
    resources:
      limits:
        nvidia.com/gpu.shared: 1 # Request one share of a GPU
```
The main trade-off with time-slicing is performance. Context switching introduces latency, and because memory is not isolated, a single "noisy neighbor" pod can exhaust GPU memory and trigger CUDA out-of-memory errors in the other pods sharing the same GPU.
For workloads requiring strict performance guarantees and security isolation, Multi-Instance GPU (MIG) is the superior solution. Available on GPUs built on the NVIDIA Ampere architecture (such as the A100) and newer architectures (such as the Hopper-based H100), MIG partitions a single GPU into up to seven independent, hardware-isolated GPU instances.
Each MIG instance has its own:

- Dedicated compute units (streaming multiprocessors)
- A dedicated slice of GPU memory and memory bandwidth
- Isolated L2 cache banks and memory controller paths
This hardware-level partitioning provides predictable performance and strong fault isolation. If one pod's kernel fails on its MIG instance, it does not affect other pods running on different instances on the same physical GPU. This makes MIG an excellent technology for secure multi-tenancy and for hosting multiple inference models with strict Quality of Service (QoS) requirements on a single GPU.
When MIG is enabled on a node, the device plugin advertises the available MIG profiles as distinct resource types. A pod requests a specific MIG instance profile in its resource limits.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference-pod
spec:
  containers:
  - name: triton-server
    image: nvcr.io/nvidia/tritonserver:23.10-py3
    resources:
      limits:
        # Request a specific MIG profile: 1 GPC, 10GB memory
        nvidia.com/mig-1g.10gb: 1
```
Kubernetes will schedule this pod only on a node that has an available 1g.10gb MIG instance.
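How a node comes to offer 1g.10gb instances is itself a configuration step. With the GPU Operator, the MIG manager component watches the nvidia.com/mig.config node label and applies the matching partitioning profile. The sketch below assumes the built-in all-1g.10gb profile and an illustrative node name; in practice you would set the label with kubectl rather than editing the Node object directly.

```yaml
# Illustrative node label consumed by the GPU Operator's MIG manager.
# Typically applied with:
#   kubectl label node gpu-node-01 nvidia.com/mig.config=all-1g.10gb --overwrite
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01                       # example node name
  labels:
    nvidia.com/mig.config: all-1g.10gb    # partition each GPU on this node into 1g.10gb instances
```

After the MIG manager reconfigures the GPUs, the device plugin advertises nvidia.com/mig-1g.10gb on that node and the pod above becomes schedulable.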
The decision between default allocation, time-slicing, and MIG depends entirely on your workload's requirements for performance, isolation, and cost.
| Feature | Best For | Isolation | Performance | Hardware |
|---|---|---|---|---|
| Default (1 Pod/GPU) | Heavy Training, HPC | Process-level | Maximum, dedicated | Any NVIDIA GPU |
| Time-Slicing | Dev Notebooks, Low-Traffic APIs | None (Shared Memory) | Variable, with overhead | Any NVIDIA GPU |
| MIG | Multi-Tenant Inference, Strict QoS | Strong (Hardware) | Predictable, partitioned | Ampere Arch & newer |
By mastering these advanced scheduling techniques, you can transform a cluster of expensive GPUs from a collection of monolithic resources into a flexible, fine-grained, and efficient platform. This allows you to serve a wider variety of ML workloads, from development to production, on the same shared infrastructure, directly improving resource utilization and reducing operational costs. These fine-grained resource definitions also provide the necessary signals for the cluster autoscaler to make more intelligent decisions, a topic we will address next.