Training machine learning models, especially deep learning models, often involves significant computational effort. While CPUs can handle smaller tasks, large datasets and complex architectures demand the parallel processing power of Graphics Processing Units (GPUs) to complete training in a reasonable timeframe. Running these demanding training jobs within containers adds another layer. How do you ensure your containerized training script can access the host machine's powerful GPU? Standard Docker containers are isolated from the host's hardware by default.
Fortunately, NVIDIA provides a solution specifically for this: the NVIDIA Container Toolkit. This toolkit extends Docker, allowing containers to access NVIDIA GPUs installed on the host system safely and efficiently. It manages the interaction between the container, the Docker runtime, and the host's NVIDIA drivers.
Before you can leverage GPUs within your Docker containers, your host system must meet a few requirements: a compatible NVIDIA GPU with the NVIDIA drivers installed on the host, a working Docker Engine installation, and the nvidia-container-toolkit package (or its predecessors like nvidia-docker2, depending on your setup, though the toolkit is the modern standard). Installation instructions vary by Linux distribution and are available on the NVIDIA documentation website.

The NVIDIA Container Toolkit acts as a bridge. When you instruct Docker to run a container with GPU access, the toolkit intercepts the command. It automatically mounts the necessary NVIDIA driver libraries and GPU device files from the host system into the container's filesystem. It also ensures the container has the correct permissions to interact with the GPU hardware. This way, the application running inside the container (like TensorFlow or PyTorch) can discover and utilize the GPU just as if it were running directly on the host, but without requiring the full NVIDIA drivers to be installed inside the container image itself.
Diagram: Interaction between host components, the NVIDIA Container Toolkit, and a GPU-enabled Docker container.
While the NVIDIA Container Toolkit handles the runtime access, your Docker image still needs the necessary software libraries to use the GPU, primarily the CUDA toolkit and potentially libraries like cuDNN. The easiest way to achieve this is by using official base images provided by NVIDIA or popular ML frameworks:

- nvidia/cuda:[version]-base-[os] or nvidia/cuda:[version]-cudnn[version]-devel-[os]: These provide CUDA and cuDNN environments.
- pytorch/pytorch:[version]-cuda[version]-cudnn[version]-runtime: Official PyTorch images with specific CUDA/cuDNN versions.
- tensorflow/tensorflow:[version]-gpu: Official TensorFlow images built with GPU support.

Choosing one of these as your base image in the Dockerfile significantly simplifies setup, as they come pre-packaged with compatible versions of the required GPU libraries.
Here's a simple Dockerfile snippet using a TensorFlow GPU base image:
# Use an official TensorFlow GPU base image
FROM tensorflow/tensorflow:2.10.0-gpu
# Set the working directory
WORKDIR /app
# Copy requirements file first for layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the rest of the application code
COPY . .
# Command to run the training script by default
CMD ["python", "train.py"]
This Dockerfile doesn't contain explicit instructions about the GPU itself. The base image tensorflow/tensorflow:2.10.0-gpu handles the inclusion of CUDA and cuDNN; the runtime configuration connects it to the host's GPU.
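The train.py script referenced in the CMD is not shown in this chapter. For orientation, a minimal sketch of what it might contain follows; the model, the synthetic data, and the script structure are placeholders, and only the argument names mirror those used in the docker run example later in this section.

# train.py -- minimal sketch of a GPU-ready training script (placeholder model and data)
import argparse
import os

import numpy as np
import tensorflow as tf


def main():
    parser = argparse.ArgumentParser(description="Toy training job")
    parser.add_argument("--data_dir", default="/app/data")
    parser.add_argument("--output_dir", default="/app/output")
    parser.add_argument("--epochs", type=int, default=5)
    args = parser.parse_args()

    # TensorFlow places operations on the GPU automatically when one is visible.
    print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))

    # Synthetic data stands in for whatever a real script would load from args.data_dir.
    x = np.random.rand(1024, 32).astype("float32")
    y = np.random.randint(0, 2, size=(1024, 1)).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(x, y, epochs=args.epochs, batch_size=64)

    # Persist the trained model so the bind-mounted output directory captures it.
    os.makedirs(args.output_dir, exist_ok=True)
    model.save(os.path.join(args.output_dir, "model"))


if __name__ == "__main__":
    main()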
To grant a container access to the host's GPUs when you run it, you use the --gpus flag with the docker run command. This flag, managed by the NVIDIA Container Toolkit integration, tells Docker which GPUs to expose to the container.

Common usage patterns for --gpus:
- --gpus all: Expose all available NVIDIA GPUs to the container.
- --gpus device=0: Expose only the GPU with index 0.
- --gpus '"device=1,2"': Expose the GPUs with indices 1 and 2. The extra quoting keeps the comma-separated device list together when Docker parses the flag's value.
- --gpus count=2: Expose the first 2 available GPUs.

Here's how you might run the training container built from the Dockerfile above, giving it access to all available GPUs:
docker run --rm --gpus all \
-v $(pwd)/data:/app/data \
-v $(pwd)/output:/app/output \
my-gpu-training-image:latest \
python train.py --data_dir /app/data --output_dir /app/output --epochs 50
This command:

- Starts a container with docker run.
- Uses --gpus all to enable access to all host GPUs.
- Mounts the local ./data and ./output directories using bind mounts (as covered in Chapter 3) to provide input data and store trained models/logs.
- Specifies the image to run (my-gpu-training-image:latest).
- Overrides the default CMD to pass specific arguments to the training script.

Once your container is running, you'll want to confirm it can actually see and use the GPU.
Using nvidia-smi: The NVIDIA System Management Interface (nvidia-smi) utility is typically available inside GPU-enabled containers because the NVIDIA Container Toolkit mounts it from the host's driver installation. You can run it inside the container:

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

If successful, this command will print the familiar nvidia-smi output, listing the GPUs visible to the container.
Framework-Specific Checks: Your ML framework typically provides functions to check for GPU availability.
TensorFlow:
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
print("Num GPUs Available: ", len(gpus))
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        print("Memory growth enabled.")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
PyTorch:
import torch
available = torch.cuda.is_available()
count = torch.cuda.device_count()
name = torch.cuda.get_device_name(0) if available else "N/A"
print(f"CUDA Available: {available}")
print(f"Device Count: {count}")
print(f"Device Name (GPU 0): {name}")
Running these code snippets within your container (e.g., in an interactive session or as part of your script's startup) will confirm if the framework detects the GPU passed through by Docker and the NVIDIA Container Toolkit.
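If you prefer a framework-independent check at container startup, you can shell out to nvidia-smi from Python. This sketch assumes the nvidia-smi binary is available inside the container, which it normally is once the NVIDIA Container Toolkit mounts the host's driver utilities:

import subprocess
import sys


def list_visible_gpus():
    """Return the names and memory sizes of GPUs visible in the container, or an empty list."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_visible_gpus()
if not gpus:
    sys.exit("No GPU visible; check the --gpus flag and the NVIDIA Container Toolkit setup.")
print("Visible GPUs:", gpus)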
If you're managing your training environment with Docker Compose (covered in more detail later), you can request GPU resources within your docker-compose.yml file using the deploy key (available in Compose specification version 3.8+ and Docker Engine 19.03+).
version: '3.8'
services:
  training:
    build: .
    image: my-gpu-training-image:latest
    volumes:
      - ./data:/app/data
      - ./output:/app/output
    environment:
      - NVIDIA_VISIBLE_DEVICES=all  # Optional, often handled by the deploy section
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]  # Required for GPU reservations
              count: 1             # Request one GPU
              # Or request specific GPUs by ID instead of a count:
              # device_ids: ['0']
    command: python train.py --data_dir /app/data --output_dir /app/output
This configuration requests one NVIDIA GPU for the training service. When you run docker compose up, Compose will instruct the Docker Engine to allocate the GPU resource via the NVIDIA Container Toolkit.
By leveraging the NVIDIA Container Toolkit and appropriate base images, you can seamlessly integrate GPU acceleration into your containerized ML training workflows. This ensures your computationally intensive training jobs run efficiently while benefiting from the consistency and portability that Docker provides.