Machine Learning projects often depend on a specific set of Python libraries, such as Scikit-learn, Pandas, NumPy, TensorFlow, or PyTorch, along with their precise versions. Ensuring that every team member, testing environment, and deployment target uses the exact same dependencies is fundamental for reproducibility. A slight difference in a library version can lead to subtle bugs, different model behavior, or outright failures. Docker provides a robust way to package these dependencies along with your application code, and pip, the standard Python package installer, is a common tool used within Dockerfiles to manage these libraries.
The most straightforward way to install a Python package inside a Docker image is to use the RUN instruction combined with a pip install command:
# Example: Installing scikit-learn directly
FROM python:3.9-slim
RUN pip install scikit-learn
This command instructs Docker to execute pip install scikit-learn during the image build process. While simple for one or two packages, this approach quickly becomes unwieldy for projects with multiple dependencies. It also doesn't explicitly define the required versions, potentially leading to different versions being installed each time the image is built.
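If you do install a package directly like this, you can reduce that variability by pinning the version in the command itself. A minimal sketch, where the version number is only illustrative:

# Example: Pinning a version in a direct install (version shown is illustrative)
FROM python:3.9-slim
RUN pip install scikit-learn==1.3.0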
A more organized and reproducible approach is to list your project's Python dependencies in a requirements.txt file. This is a standard practice in Python development.

A typical requirements.txt file looks like this:
# requirements.txt
pandas==2.0.3
scikit-learn==1.3.0
numpy>=1.24.0,<1.26.0
matplotlib==3.7.2
Key recommendations for your requirements.txt:

- Pin exact versions. Use == to specify exact versions (e.g., pandas==2.0.3). This guarantees that the same version is installed every time, preventing unexpected changes due to library updates.
- Use version ranges sparingly. While version ranges (e.g., numpy>=1.24.0,<1.26.0) can be used, they introduce a small risk of variability if a new patch version fitting the range is released between builds. For maximum reproducibility in ML, pinning exact versions is often preferred.
- Generate the file from a clean environment. Run pip freeze > requirements.txt in a clean local virtual environment after installing your primary dependencies (a sketch of this workflow follows this list). This captures the exact versions of all packages, including those installed indirectly, ensuring a complete snapshot of the environment.
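The following shell session is one way to do this; the package names are placeholders for your project's actual top-level dependencies:

# Create and activate a clean virtual environment (names are illustrative)
python -m venv .venv
source .venv/bin/activate

# Install only the top-level packages your project needs
pip install pandas scikit-learn matplotlib

# Record the exact versions of everything that was installed
pip freeze > requirements.txt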
To use this file within your Dockerfile, you first COPY it from the build context into the image and then run pip install with the -r flag:
# Example: Using requirements.txt
FROM python:3.9-slim
WORKDIR /app
# Copy only the requirements file first
COPY requirements.txt .
# Install dependencies
RUN pip install -r requirements.txt
# Now copy the rest of the application code
COPY . .
# Define how to run the application (example)
# CMD ["python", "train.py"]
Docker builds images in layers. Each instruction in the Dockerfile (like COPY, RUN, WORKDIR) creates a new layer. If the files or commands related to a layer haven't changed since the last build, Docker reuses the cached layer instead of rebuilding it. This significantly speeds up build times.
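You can see this when rebuilding the same project; the image tag below is only an example:

# First build: every step runs
docker build -t ml-app .

# Second build with no changes: Docker reuses the cached layers and finishes much faster
docker build -t ml-app .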
Notice the order in the previous example:
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
This structure is intentional and optimizes caching. Installing dependencies (pip install) is often time-consuming. By copying only the requirements.txt file and installing dependencies before copying the rest of your application code, we ensure that the dependency installation layer is only rebuilt if requirements.txt itself changes. Frequent changes to your source code (COPY . .) will only invalidate the cache from that point onwards, reusing the potentially large and slow-to-build dependency layer.
If you copied all your code first (COPY . .) and then ran pip install -r requirements.txt, any change to any file in your project would cause Docker to re-run the lengthy pip install step, negating the benefits of the build cache for dependencies.
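For contrast, here is a sketch of that less cache-friendly ordering; avoid structuring your Dockerfile this way:

# Anti-pattern: copying all code before installing dependencies
FROM python:3.9-slim
WORKDIR /app

# Any change to any project file invalidates the cache from here onwards...
COPY . .

# ...so this slow step re-runs even when requirements.txt is unchanged
RUN pip install -r requirements.txt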
Reducing Image Size with --no-cache-dir
When pip installs packages, it typically caches the downloaded wheel files for potential reuse. While useful locally, this cache is often unnecessary inside a Docker image and contributes to its final size. You can prevent this caching using the --no-cache-dir flag:
# Example: Using --no-cache-dir
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
# Install dependencies without caching downloads
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# CMD ["python", "your_script.py"]
Using --no-cache-dir is a common practice for creating smaller production images, as the intermediate cache is unlikely to be needed once the image is built.
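If you want to check the effect, you can build the image and inspect its reported size; the tag name here is only illustrative:

# Build the image and list it to compare sizes with and without --no-cache-dir
docker build -t ml-app:slim .
docker images ml-app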
By combining requirements.txt for defining dependencies, careful ordering of COPY and RUN instructions to optimize caching, and flags like --no-cache-dir to minimize size, you can effectively manage Python dependencies using pip within your Dockerfiles, creating lean, reproducible, and efficient environments for your Machine Learning projects.