Containerizing a diffusion model application with Docker is the first step in its deployment. The next consideration is ensuring the application can effectively access and utilize specialized hardware, specifically GPUs. By default, standard Docker containers are isolated from the host's hardware devices, including GPUs. While this isolation provides security and portability, direct access to GPU acceleration is essential for the performance of computationally intensive tasks such as diffusion model inference.
To enable containers to use NVIDIA GPUs, you need a mechanism to bridge the gap between the container's isolated environment and the host system's NVIDIA drivers and hardware. This is precisely the role of the NVIDIA Container Toolkit.
Formerly known as NVIDIA Docker, this toolkit extends standard container runtimes (like Docker Engine or containerd) with the necessary components to make NVIDIA GPUs accessible inside containers. It achieves this through a combination of a runtime hook that configures containers at creation time and the mounting of the host's NVIDIA driver libraries and GPU device files (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm) into the container from the host system.
Crucially, this means you do not package the NVIDIA drivers inside your container image. The container uses the drivers installed on the host machine. This approach keeps container images smaller and ensures compatibility with the host's hardware setup, but it also introduces a dependency: the container relies on a compatible driver version being present on the host where it runs.
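Before containers can request GPUs, the toolkit itself has to be installed on the host and registered with Docker. The following is a minimal sketch for a Debian or Ubuntu host, assuming the NVIDIA driver is already installed and the NVIDIA package repository has been added; package sources and commands may differ on other distributions.
# Confirm the host driver is installed and working
nvidia-smi
# Install the NVIDIA Container Toolkit (assumes the NVIDIA apt repository is configured)
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Register the toolkit with the Docker daemon and restart it
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker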
When running a container directly with Docker, you request GPU access using the --gpus flag. This flag instructs the NVIDIA Container Toolkit (via the Docker daemon) to perform the necessary setup.
# Example: Run a container requesting access to all available GPUs
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# Example: Request 2 specific GPUs by index
docker run --rm --gpus '"device=0,1"' my-diffusion-model-image:latest python infer.py --prompt "A cat wearing a hat"
# Example: Request GPUs with specific driver capabilities (e.g., utility for nvidia-smi)
# docker run --rm --gpus 'all,capabilities=utility' ...
The nvidia-smi command run inside the container should list the GPUs made available to it, confirming that the toolkit has successfully exposed the hardware.
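Beyond nvidia-smi, it is worth confirming that the deep learning framework inside your image can actually see the allocated device. A quick sketch, assuming the hypothetical my-diffusion-model-image bundles PyTorch:
# Check that PyTorch inside the container can see the allocated GPU(s)
docker run --rm --gpus all my-diffusion-model-image:latest \
  python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"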
When deploying containerized applications at scale, orchestration platforms like Kubernetes become indispensable. Kubernetes requires its own mechanism to understand and manage specialized hardware like GPUs across a cluster of nodes. This is handled by Device Plugins.
The NVIDIA Device Plugin for Kubernetes is a component deployed on each GPU-enabled node in your cluster (typically as a DaemonSet). Its responsibilities include discovering the GPUs on the node, monitoring their health, and advertising them to the Kubelet as a schedulable extended resource (nvidia.com/gpu).
Once the device plugin is running, you can request GPUs in your Kubernetes Pod specifications using the standard resources.requests and resources.limits fields, as in the Pod spec shown after the deployment sketch below.
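Deploying the plugin is typically a one-time cluster setup step. A minimal sketch, assuming kubectl access to the cluster; the manifest URL and version below are illustrative, so check the k8s-device-plugin project's documentation for the current ones.
# Deploy the NVIDIA device plugin as a DaemonSet (version/URL illustrative; see the project README)
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
# Confirm the plugin Pods are running
kubectl get pods -n kube-system | grep nvidia-device-plugin
# Confirm a GPU node now advertises the nvidia.com/gpu resource (<gpu-node-name> is a placeholder)
kubectl describe node <gpu-node-name> | grep -i 'nvidia.com/gpu'
Once a node reports nvidia.com/gpu capacity, the Pod specification below can consume it.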
# Example Kubernetes Pod spec requesting one GPU
apiVersion: v1
kind: Pod
metadata:
  name: diffusion-inference-pod
spec:
  containers:
  - name: diffusion-worker
    image: my-diffusion-model-image:latest
    command: ["python", "worker.py"]
    resources:
      limits:
        nvidia.com/gpu: 1 # Request 1 GPU
      requests:
        nvidia.com/gpu: 1 # Request 1 GPU (optional; must equal the limit if set)
  # Add node selectors or tolerations if needed for GPU nodes
  # nodeSelector:
  #   cloud.google.com/gke-accelerator: nvidia-tesla-t4
When this Pod is submitted, the Kubernetes scheduler looks for nodes that have at least one nvidia.com/gpu resource available (as reported by the device plugin). Once scheduled, the Kubelet on the chosen node interacts with the device plugin and the container runtime (via the NVIDIA Container Toolkit) to mount the correct GPU device(s) into the container before it starts.
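A quick way to confirm the allocation end to end is to apply the manifest and run nvidia-smi inside the scheduled Pod. The manifest filename below is illustrative, and nvidia-smi is available inside the container because the toolkit injects the host's utility binaries along with the driver libraries.
# Apply the Pod manifest (filename is illustrative)
kubectl apply -f diffusion-inference-pod.yaml
# Wait for the Pod to reach the Running state
kubectl get pod diffusion-inference-pod
# Verify that exactly one GPU is visible inside the container
kubectl exec diffusion-inference-pod -- nvidia-smi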
Diagram illustrating the components involved in making a host GPU accessible within a container, both directly via Docker and orchestrated by Kubernetes using a device plugin.
Keep a few points in mind when working with GPUs in containers:
- Driver and CUDA compatibility: Match the CUDA version of your container image (for example, a base image such as nvidia/cuda:<version>) with the NVIDIA driver version installed on the host machine. Ensure the host driver is recent enough to support the CUDA version used inside the container. NVIDIA provides compatibility matrices for reference.
- Monitoring: While nvidia-smi inside the container shows allocated GPUs, monitoring overall GPU utilization, memory usage, and temperature across the cluster typically requires integrating node-level metrics (often exposed by the device plugin or tools like dcgm-exporter) into your monitoring stack (e.g., Prometheus, Grafana). This is essential for autoscaling and performance tuning, covered later.
- Resource requests and limits: Setting requests and limits for nvidia.com/gpu to the same value (usually 1) is common for dedicated GPU workloads. This guarantees the resource for the Pod. Unlike CPU and memory, GPUs are generally not considered compressible or oversubscribable resources in the same way.
Effectively managing GPU resources within containers is fundamental for deploying diffusion models reliably and efficiently. By using the NVIDIA Container Toolkit and Kubernetes device plugins, you can integrate GPU acceleration into your containerized workflows, allowing your optimized models to perform inference tasks at the required scale and speed. This infrastructure forms the basis for building scalable and resilient inference services.