Now that you've containerized your diffusion model application using Docker, the next significant step is ensuring it can effectively access and utilize the specialized hardware it needs, specifically GPUs. Standard Docker containers are isolated from the host's hardware devices by default, including GPUs. This isolation is generally a security and portability feature, but for computationally intensive tasks like diffusion model inference, direct access to GPU acceleration is essential for performance.
To enable containers to use NVIDIA GPUs, you need a mechanism to bridge the gap between the container's isolated environment and the host system's NVIDIA drivers and hardware. This is precisely the role of the NVIDIA Container Toolkit.
Formerly known as NVIDIA Docker, this toolkit extends standard container runtimes (like Docker Engine or containerd) with the necessary components to make NVIDIA GPUs accessible inside containers. It achieves this through a combination of:
- A runtime hook that prepares each container for GPU access when it starts.
- Mounting the host's NVIDIA driver libraries and utilities into the container's filesystem.
- Exposing the host's GPU device nodes (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm) inside the container.

Crucially, this means you do not package the NVIDIA drivers inside your container image. The container uses the drivers installed on the host machine. This approach keeps container images smaller and ensures compatibility with the host's hardware setup, but it also introduces a dependency: the container relies on a compatible driver version being present on the host where it runs.
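Setting this up on a host involves installing the toolkit and pointing Docker at the NVIDIA runtime. The commands below are a minimal sketch of the configuration and verification steps; the package installation itself varies by Linux distribution, so follow NVIDIA's install guide for your platform.
# Check that the host driver is present and note its version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Register the NVIDIA runtime with Docker (assumes the nvidia-container-toolkit
# package is already installed), then restart the Docker daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker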
When running a container directly with Docker, you request GPU access using the --gpus flag. This flag instructs the NVIDIA Container Toolkit (via the Docker daemon) to perform the necessary setup.
# Example: Run a container requesting access to all available GPUs
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# Example: Request 2 specific GPUs by index
docker run --rm --gpus '"device=0,1"' my-diffusion-model-image:latest python infer.py --prompt "A cat wearing a hat"
# Example: Request GPUs with specific driver capabilities (e.g., compute, utility)
# docker run --rm --gpus 'all,capabilities=compute,utility' ...
The nvidia-smi command run inside the container should list the GPUs made available to it, confirming that the toolkit has successfully exposed the hardware.
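For a quick end-to-end check, you can also verify that your inference framework detects the device from inside the container. The sketch below assumes my-diffusion-model-image bundles PyTorch; substitute the equivalent call for whichever framework your image uses.
# Example: Confirm the framework inside the container can see the GPU
docker run --rm --gpus all my-diffusion-model-image:latest \
  python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"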
When deploying containerized applications at scale, orchestration platforms like Kubernetes become indispensable. Kubernetes requires its own mechanism to understand and manage specialized hardware like GPUs across a cluster of nodes. This is handled by Device Plugins.
The NVIDIA Device Plugin for Kubernetes is a component deployed on each GPU-enabled node in your cluster (typically as a DaemonSet). Its responsibilities include:
- Discovering the NVIDIA GPUs present on the node.
- Advertising them to the Kubelet as a schedulable extended resource (nvidia.com/gpu).
- Tracking the health of those GPUs so that failed devices are not handed out to new Pods.
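Installation is typically a single kubectl apply of the DaemonSet manifest published in the NVIDIA/k8s-device-plugin repository, followed by a check that GPU nodes now advertise the resource. The version in the URL below is illustrative (pick the release that matches your cluster), and <gpu-node-name> is a placeholder for one of your GPU nodes.
# Deploy the device plugin DaemonSet (version in the URL is illustrative)
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify that GPU nodes now report nvidia.com/gpu under Capacity/Allocatable
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu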
Once the device plugin is running, you can request GPUs in your Kubernetes Pod specifications using the standard resources.requests and resources.limits fields.
# Example Kubernetes Pod spec requesting one GPU
apiVersion: v1
kind: Pod
metadata:
  name: diffusion-inference-pod
spec:
  containers:
  - name: diffusion-worker
    image: my-diffusion-model-image:latest
    command: ["python", "worker.py"]
    resources:
      limits:
        nvidia.com/gpu: 1 # Request 1 GPU
      requests:
        nvidia.com/gpu: 1 # Optional; if specified, must equal the limit
  # Add node selectors or tolerations if needed for GPU nodes
  # nodeSelector:
  #   cloud.google.com/gke-accelerator: nvidia-tesla-t4
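To submit this spec, save it to a file (the filename below is arbitrary) and apply it with kubectl:
# Create the Pod and watch it get scheduled onto a GPU node
kubectl apply -f diffusion-inference-pod.yaml
kubectl get pod diffusion-inference-pod -o wide --watch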
When this Pod is submitted, the Kubernetes scheduler looks for nodes that have at least one nvidia.com/gpu resource available (as reported by the device plugin). Once scheduled, the Kubelet on the chosen node interacts with the device plugin and the container runtime (via the NVIDIA Container Toolkit) to mount the correct GPU device(s) into the container before it starts.
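Once the Pod reports Running, you can confirm the allocation from inside the container; nvidia-smi is injected by the toolkit as long as the image's driver capabilities include utility, which is the default for CUDA base images.
# Example: List the GPU(s) the Kubelet attached to the running Pod
kubectl exec diffusion-inference-pod -- nvidia-smi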
Diagram illustrating the components involved in making a host GPU accessible within a container, both directly via Docker and orchestrated by Kubernetes using a device plugin.
A few practical considerations when running GPU workloads in containers:
- Driver and CUDA compatibility: there is a dependency between the CUDA version used in your container image (e.g., a base image like nvidia/cuda:<version>) and the NVIDIA driver version installed on the host machine. Ensure the host driver is recent enough to support the CUDA version used inside the container. NVIDIA provides compatibility matrices for reference.
- Monitoring: while nvidia-smi inside the container shows allocated GPUs, monitoring overall GPU utilization, memory usage, and temperature across the cluster typically requires integrating node-level metrics (often exposed by the device plugin or tools like dcgm-exporter) into your monitoring stack (e.g., Prometheus, Grafana). This is essential for autoscaling and performance tuning, covered later.
- Resource requests and limits: setting requests and limits for nvidia.com/gpu to the same value (usually 1) is common for dedicated GPU workloads and guarantees the resource for the Pod. Unlike CPU and memory, GPUs are generally not considered compressible or oversubscribable resources in the same way.

Effectively managing GPU resources within containers is fundamental for deploying diffusion models reliably and efficiently. By leveraging the NVIDIA Container Toolkit and Kubernetes device plugins, you can integrate GPU acceleration seamlessly into your containerized workflows, allowing your optimized models to perform inference tasks at the required scale and speed. This infrastructure forms the basis for building scalable and resilient inference services.