A Docker image is a blueprint for an application, packaging it into a consistent, portable environment. The image is built from a simple text file known as a Dockerfile. Think of it as a recipe: each line in the file is an instruction that tells Docker how to assemble the image layer by layer, starting from a base operating system and adding libraries, code, and configuration on top.
For machine learning applications, this layering is particularly important. A typical ML image is not just your code and Python. It must also include the specific, and often large, CUDA toolkit and cuDNN libraries required to communicate with the host machine's GPU.
Figure: The layered structure of a typical machine learning Docker image. Each layer builds on the previous one and is cached by Docker, so rebuilds are faster when only the top layers change.
While a Dockerfile can have many instructions, a few core ones do most of the work. Let's examine the ones most relevant to building an ML environment.
Every Dockerfile must begin with a FROM instruction. It specifies the parent image from which you are building. The choice of base image is a significant decision for an ML application.
Standard Python Image: You could start with an official Python image, like python:3.9-slim. This is lightweight and great for CPU-only applications. However, it contains no NVIDIA libraries and cannot be used for GPU acceleration.
NVIDIA CUDA Image: For GPU-accelerated workloads, the recommended approach is to use an official base image from NVIDIA. An image like nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 comes pre-packaged with a specific version of the CUDA toolkit and cuDNN libraries. This saves you the complex and error-prone process of installing the NVIDIA drivers and toolkits yourself. The runtime tag indicates it has the libraries needed to run a pre-compiled CUDA application, while a devel tag would include the full SDK for compiling CUDA code. For most ML training, the runtime image is sufficient.
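To make the trade-off concrete, here are the two choices written as FROM lines. They are alternatives, not meant to be combined in a single-stage Dockerfile:

# Option A: lightweight, CPU-only base with Python preinstalled
FROM python:3.9-slim

# Option B: GPU-ready base with CUDA 11.8 and cuDNN 8 preinstalled (Python must be added separately)
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04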
Note on Reproducibility: Always use a specific version tag for your base image (e.g., nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04) instead of a generic one like latest. The latest tag can be updated by the publisher at any time, which could break your build in the future. Pinning to a specific version ensures your environment is reproducible.
The WORKDIR instruction sets the working directory for any subsequent RUN, CMD, ENTRYPOINT, COPY, and ADD instructions. It is a best practice to set a WORKDIR early to keep your container's filesystem organized.
The COPY instruction copies files and directories from your local machine into the container's filesystem. A common pattern is to first COPY only the file that lists your dependencies, install them, and then COPY the rest of your application code. This uses Docker's layer caching. If your application code changes but your dependencies do not, Docker can reuse the cached layer where the libraries were installed, making subsequent builds much faster.
# Set the working directory
WORKDIR /app
# Copy only the requirements file first
COPY requirements.txt .
The RUN instruction executes commands in a new layer on top of the current image and commits the results. This is how you install your Python packages. You can chain commands together with && and a \ for line continuation. This executes all commands within a single RUN instruction, creating only one new layer and keeping your final image size smaller.
# Install system dependencies and Python packages in a single layer
RUN apt-get update && \
    apt-get install -y python3-pip && \
    pip3 install --no-cache-dir -r requirements.txt
Using the --no-cache-dir flag with pip prevents it from storing the package cache, which is unnecessary inside the final image and helps reduce its size.
The CMD instruction provides the default command to execute when a container is started from your image. Only one CMD takes effect per Dockerfile; if you write several, only the last one is used. If you want to run a training script named train.py, your CMD would look like this:
CMD ["python3", "train.py", "--epochs", "10"]
This is the "exec form" of CMD, which is the preferred syntax. It doesn't invoke a command shell and avoids potential issues with signal handling.
Let's put this all together. Assume we have a simple project structure:
.
├── Dockerfile
├── requirements.txt
└── train.py
Our requirements.txt file specifies the libraries we need:
torch==1.13.1
torchvision==0.14.1
numpy==1.23.5
Here is a complete Dockerfile to containerize this application for GPU training:
# 1. Use a specific NVIDIA CUDA runtime image as the base
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
# 2. Set environment variables to prevent interactive prompts during installation
ENV DEBIAN_FRONTEND=noninteractive
# 3. Set the working directory inside the container
WORKDIR /app
# 4. Copy the requirements file to leverage Docker's build cache
COPY requirements.txt .
# 5. Update package lists, install Python and Pip, then install Python packages
# Chaining commands reduces the number of image layers
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    pip3 install --no-cache-dir -r requirements.txt && \
    rm -rf /var/lib/apt/lists/*
# 6. Copy the rest of the application code into the working directory
COPY . .
# 7. Specify the command to run when the container starts
CMD ["python3", "train.py"]
With the Dockerfile in place, you build the image using the docker build command from your terminal. The -t flag tags the image with a human-readable name and version.
# Build the Docker image from the current directory
docker build -t pytorch-training-app:1.0 .
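If you want to confirm the image was created, you can list it by name (the exact column layout of the output depends on your Docker version):

# List the locally built image and its tag
docker image ls pytorch-training-app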
After the build completes, you have a self-contained, portable image. To run it and give it access to the host's GPUs, you use the docker run command with the --gpus all flag. This flag is handled by the NVIDIA Container Toolkit on the host machine, which automatically mounts the necessary GPU drivers and libraries into the container.
# Run the container, granting it access to all available host GPUs
docker run --gpus all pytorch-training-app:1.0
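A quick sanity check, assuming the NVIDIA Container Toolkit is set up on the host, is to override the default command and run nvidia-smi inside the container; if the GPUs are listed there, your training script will be able to see them too:

# Override the default CMD to verify GPU visibility inside the container
docker run --rm --gpus all pytorch-training-app:1.0 nvidia-smi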
You have now successfully packaged an ML application with its specific CUDA and Python dependencies into a Docker image, ready to be run on any machine with Docker and NVIDIA GPUs installed. This forms the foundation for building the scalable and reproducible systems we will discuss next.