Once you have defined the base environment and installed the necessary dependencies in your `Dockerfile`, the next step is to bring your own project code and related artifacts into the Docker image. This ensures that the container has everything it needs to execute your machine learning tasks, whether that's training a model or running an inference service. The primary instructions for this are `COPY` and `ADD`.
The `COPY` instruction is the most straightforward and commonly used method for getting files from your local machine (specifically, the build context) into the image filesystem. Its syntax is simple:

```dockerfile
COPY <src>... <dest>
```
- `<src>`: Specifies the file or directory on your local machine (relative to the build context root) that you want to copy. You can specify multiple sources, and wildcards are supported.
- `<dest>`: Specifies the path inside the container image where the source files or directories should be copied. If the destination doesn't exist, Docker will create it. If the source is a directory, the destination must end with a `/` or be an existing directory.

For a typical ML project structured with a `src` directory containing Python code and a `requirements.txt` file in the root, you might use `COPY` like this:
```dockerfile
# Define the working directory (covered previously)
WORKDIR /app

# Copy the requirements file first to leverage caching
COPY requirements.txt .

# Install dependencies (covered previously)
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the project code
COPY ./src ./src
```
In this example:

- `WORKDIR /app` sets the working directory inside the image to `/app`.
- `COPY requirements.txt .` copies `requirements.txt` from the build context root to the `/app` directory inside the image.
- `RUN pip install ...` installs the dependencies before any application code is copied.
- `COPY ./src ./src` copies your local `src` directory into a directory named `src` within the `/app` directory inside the image (i.e., `/app/src`).

This order is significant for build performance, as we'll discuss shortly regarding build caching.
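Note that `COPY` can also gather several files in a single instruction, since it accepts multiple sources and wildcards. A minimal sketch, with hypothetical file names:

```dockerfile
# Copy two specific config files into the working directory (hypothetical names)
COPY train_config.yaml serve_config.yaml ./

# Copy every Python file from the build context root into /app/scripts/
COPY *.py ./scripts/
```

When multiple sources are given, the destination must be a directory ending in `/`, matching the rule described above.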
Docker also provides the `ADD` instruction, which has similar syntax to `COPY`:

```dockerfile
ADD <src>... <dest>
```
`ADD` performs the same function as `COPY` for local files and directories but includes two additional features:

- If `<src>` is a URL, Docker will download the file from the URL and copy it to `<dest>`. Permissions are set to 600.
- If `<src>` is a local archive in a recognized compressed format (like `tar`, `gzip`, `bzip2`, or `xz`), Docker will automatically unpack it into `<dest>` as a directory.

While these features might seem convenient, they can also make your builds less predictable. For instance, downloading from URLs during the build couples your image build process to network availability and the stability of the remote source. Auto-extraction can sometimes have unexpected results depending on the archive structure.
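For illustration, here is a sketch of both behaviors (the URL and archive name are hypothetical):

```dockerfile
# Download a remote file into the image; its permissions will be set to 600
ADD https://example.com/assets/vocab.txt /app/assets/vocab.txt

# A local tar.gz archive from the build context is unpacked into /app/models/
ADD pretrained_weights.tar.gz /app/models/
```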
Recommendation: For clarity and predictability, prefer `COPY` over `ADD` for transferring local files and directories. Use `ADD` only if you specifically need its URL download or auto-extraction capabilities, and understand the potential implications. For downloading files, it's often better practice to use tools like `curl` or `wget` within a `RUN` instruction, which provides more control over the download process (e.g., error handling, retries).
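A minimal sketch of that pattern, assuming `curl` is available in the base image and using a hypothetical URL and destination path:

```dockerfile
# -f fails on HTTP errors, -L follows redirects, --retry handles transient failures
RUN curl -fsSL --retry 3 -o /app/assets/vocab.txt \
    https://example.com/assets/vocab.txt
```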
When you run `docker build`, the Docker client first bundles the directory specified as the build context (usually `.`) and sends it to the Docker daemon. This context includes all files and subdirectories. Sending large, unnecessary files (like datasets, virtual environments, Git history, and IDE configuration) slows down the build process and can bloat your image if they are accidentally copied.
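For reference, a typical build command; everything under the context directory (minus any `.dockerignore` exclusions) is sent to the daemon. The image tag here is hypothetical:

```bash
docker build -t ml-project:latest .
```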
To prevent this, create a file named `.dockerignore` in the root of your build context (the same directory as your `Dockerfile`). List the files or directories you want to exclude, using syntax similar to `.gitignore`.

Here's an example `.dockerignore` for a typical ML project:
```
# Git files
.git
.gitignore

# Python virtual environment
venv/
*.pyc
__pycache__/

# IDE / Editor specific files
.vscode/
.idea/
*.swp

# Large data files (manage via volumes/mounts later)
data/
datasets/

# Model artifacts (if large or managed separately)
models/
*.pt
*.h5
*.onnx
*.pkl

# Docker files
Dockerfile
.dockerignore

# Other temporary or local files
*.log
notebooks/output/
```
By using `.dockerignore`, you ensure that only the essential code and configuration files are sent to the daemon and are available to be copied into your image via `COPY` or `ADD`. This leads to faster builds and smaller, more secure images.
Docker builds images in layers. Each instruction in the `Dockerfile` (like `RUN`, `COPY`, or `ADD`) creates a new layer. Docker uses a build cache: if the files related to a `COPY` instruction haven't changed since the last build, and the preceding layers are also cached, Docker reuses the existing layer instead of executing the instruction again.
This is particularly important for time-consuming steps like dependency installation. Consider these two approaches:
Less Optimal:

```dockerfile
WORKDIR /app

# Copy everything at once
COPY . .

RUN pip install -r requirements.txt

# ... rest of Dockerfile
```
If any file in your project changes (even a minor code edit), the `COPY . .` layer becomes invalid, and Docker must re-run the potentially lengthy `pip install` command, even if `requirements.txt` didn't change.
More Optimal:

```dockerfile
WORKDIR /app

# 1. Copy only the requirements file
COPY requirements.txt .

# 2. Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# 3. Copy the rest of the application code
COPY . .

# ... rest of Dockerfile
```
In this improved version:

- If only your application code (e.g., `.py` files) changes, Docker reuses the cached layers up to and including the `RUN pip install` step (because `requirements.txt` hasn't changed). It only needs to re-execute the final `COPY . .` instruction, which is very fast.
- The time-consuming dependency installation re-runs only when `requirements.txt` itself is modified.

Structuring your `COPY` instructions thoughtfully, separating dependency definitions from application code, significantly speeds up iterative development cycles.
Should you `COPY` pre-trained models, datasets, or other large artifacts directly into your image?
Generally, for datasets and larger or frequently updated model artifacts, copying them directly into the image is not the recommended approach. Alternative strategies using Docker volumes or bind mounts provide more flexibility and efficiency. These techniques allow you to decouple the data and artifacts from the image itself, making updates easier and keeping images smaller. We will explore these data management techniques in detail in Chapter 3, "Managing Data and Models in Containers".
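As a small preview of that approach, a bind mount can supply a dataset at runtime without baking it into the image (the paths, image tag, and training script are hypothetical):

```bash
# Mount the local data/ directory into the container at /app/data
docker run --rm -v "$(pwd)/data:/app/data" ml-project:latest python src/train.py
```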
For now, understand that `COPY` and `ADD` are essential for getting your application code and configuration into the image. Use `.dockerignore` to exclude unnecessary files, and structure your `COPY` operations to maximize the benefits of Docker's build cache.