While Kubernetes can schedule pods to nodes with GPUs, its default behavior treats each accelerator as an indivisible unit. This "all-or-nothing" allocation often leads to significant resource waste, as many ML tasks, from interactive development to lightweight inference, do not require the full power of a modern GPU. To build a cost-effective and efficient ML platform, you must implement more sophisticated scheduling and sharing mechanisms.
This is where the NVIDIA GPU Operator for Kubernetes becomes indispensable. It automates the management of all necessary NVIDIA software components, including drivers, the container toolkit, and the device plugin, which exposes GPUs as schedulable resources. We will build on this foundation to implement two advanced sharing strategies: software-based time-slicing and hardware-based Multi-Instance GPU (MIG).
Before partitioning resources, you need a mechanism to manage them. The NVIDIA GPU Operator is the standard for production environments. It uses the operator pattern in Kubernetes to manage the lifecycle of GPU-related software on each node of your cluster. Once installed, the Kubernetes scheduler becomes aware of the nvidia.com/gpu resource, allowing you to request GPUs in your pod specifications.
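Once the operator's components are running, each GPU node advertises its accelerators through the node's capacity and allocatable fields. The excerpt below is a sketch of what this looks like in the output of kubectl get node <node-name> -o yaml; the CPU, memory, and GPU counts are illustrative, not values from a specific cluster.

```yaml
# Illustrative excerpt of a GPU node's status after the GPU Operator is installed.
# The nvidia.com/gpu entry is registered by the NVIDIA device plugin; counts are examples.
status:
  capacity:
    cpu: "64"
    memory: 512Gi
    nvidia.com/gpu: "8"
  allocatable:
    nvidia.com/gpu: "8"
```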
A basic pod requesting a full GPU looks like this:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: full-gpu-training-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do nvidia-smi; sleep 30; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1 # Request one full, dedicated GPU
```
This works perfectly for large training jobs but is inefficient for smaller tasks.
Time-slicing allows multiple containers to share a single physical GPU. The GPU's scheduling mechanism rapidly switches context between the processes of different containers, giving each a slice of the GPU's execution time. This is a software-level solution that does not provide memory isolation, but it is highly effective at increasing utilization for workloads that are not compute-saturated or latency-sensitive, such as:

- Interactive development environments and Jupyter notebooks
- Low-traffic or experimental inference endpoints
- Lightweight jobs that use the GPU only intermittently
You enable time-slicing by applying a configuration to the GPU Operator. This configuration defines how a GPU should be partitioned. For example, you can specify that a GPU should be divided into three "shares." The device plugin then advertises this fractional resource to the Kubernetes scheduler.
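One common way to express this configuration is a ConfigMap holding the device plugin's sharing settings, which the GPU Operator's ClusterPolicy is then pointed at via its devicePlugin.config fields. The sketch below assumes a ConfigMap named time-slicing-config in the gpu-operator namespace and three replicas per GPU; treat the names and values as illustrative and confirm the exact schema against the GPU Operator documentation for your version.

```yaml
# Illustrative time-slicing configuration consumed by the NVIDIA device plugin.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # referenced from the ClusterPolicy's devicePlugin.config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true    # advertise shares as nvidia.com/gpu.shared
        resources:
        - name: nvidia.com/gpu
          replicas: 3            # each physical GPU appears as 3 schedulable shares
```

With renameByDefault enabled, the shares are advertised under a distinct resource name (nvidia.com/gpu.shared) rather than inflating the count of nvidia.com/gpu, which makes shared and dedicated GPUs easy to tell apart in pod specifications.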
A pod requesting one of these shares would use a resource limit like this:
```yaml
# Assumes the node has been configured to offer 3 time-slices per GPU
# and advertises the resource 'nvidia.com/gpu.shared'.
apiVersion: v1
kind: Pod
metadata:
  name: dev-notebook-pod
spec:
  containers:
  - name: jupyter-container
    image: jupyter/tensorflow-notebook
    resources:
      limits:
        nvidia.com/gpu.shared: 1 # Request one share of a GPU
```
The main trade-off with time-slicing is performance. Context switching introduces latency, and because memory is not isolated, a single "noisy neighbor" pod can exhaust GPU memory and trigger CUDA out-of-memory errors in the other pods sharing the same GPU.
For workloads requiring strict performance guarantees and security isolation, Multi-Instance GPU (MIG) is the superior solution. Available on GPUs built on the NVIDIA Ampere architecture (such as the A100) and newer architectures (such as the Hopper-based H100), MIG partitions a single GPU into up to seven independent, hardware-isolated GPU instances.
Each MIG instance has its own:

- Dedicated compute units (streaming multiprocessors)
- A dedicated slice of GPU memory and memory bandwidth
- Isolated L2 cache banks and memory controller paths
This hardware-level partitioning provides predictable performance and strong fault isolation. If one pod's kernel fails on its MIG instance, it does not affect other pods running on different instances on the same physical GPU. This makes MIG an excellent technology for secure multi-tenancy and for hosting multiple inference models with strict Quality of Service (QoS) requirements on a single GPU.
When MIG is enabled on a node, the device plugin advertises the available MIG profiles as distinct resource types. A pod requests a specific MIG instance profile in its resource limits.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference-pod
spec:
  containers:
  - name: triton-server
    image: nvcr.io/nvidia/tritonserver:23.10-py3
    resources:
      limits:
        # Request a specific MIG profile: 1 GPC, 10GB memory
        nvidia.com/mig-1g.10gb: 1
```
Kubernetes will schedule this pod only on a node that has an available 1g.10gb MIG instance.
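How a node comes to offer 1g.10gb instances is itself a configuration step. With the GPU Operator, the MIG manager component watches the nvidia.com/mig.config node label and applies the matching partitioning profile. The sketch below assumes the built-in all-1g.10gb profile and an illustrative node name; in practice you would set the label with kubectl rather than editing the Node object directly.

```yaml
# Illustrative node label consumed by the GPU Operator's MIG manager.
# Typically applied with:
#   kubectl label node gpu-node-01 nvidia.com/mig.config=all-1g.10gb --overwrite
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01                       # example node name
  labels:
    nvidia.com/mig.config: all-1g.10gb    # partition each GPU on this node into 1g.10gb instances
```

After the MIG manager reconfigures the GPUs, the device plugin advertises nvidia.com/mig-1g.10gb on that node and the pod above becomes schedulable.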
The decision between default allocation, time-slicing, and MIG depends entirely on your workload's requirements for performance, isolation, and cost.
| Feature | Best For | Isolation | Performance | Hardware |
|---|---|---|---|---|
| Default (1 Pod/GPU) | Heavy Training, HPC | Process-level | Maximum, dedicated | Any NVIDIA GPU |
| Time-Slicing | Dev Notebooks, Low-Traffic APIs | None (Shared Memory) | Variable, with overhead | Any NVIDIA GPU |
| MIG | Multi-Tenant Inference, Strict QoS | Strong (Hardware) | Predictable, partitioned | Ampere Arch & newer |
By mastering these advanced scheduling techniques, you can transform a cluster of expensive GPUs from a collection of monolithic resources into a flexible, fine-grained, and efficient platform. This allows you to serve a wider variety of ML workloads, from development to production, on the same shared infrastructure, directly improving resource utilization and reducing operational costs. These fine-grained resource definitions also provide the necessary signals for the cluster autoscaler to make more intelligent decisions, a topic we will address next.