While running individual training jobs using docker run offers significant benefits for reproducibility, as discussed earlier in this chapter, many real-world machine learning workflows involve more than just a single training script container. You might need a database to fetch training data or log metrics, a message queue to trigger jobs, or perhaps a separate service to manage experiment configurations. Managing the lifecycle, networking, and data persistence for multiple interconnected containers using only docker run commands quickly becomes cumbersome and error-prone.
This is where Docker Compose comes into play. Docker Compose is a tool specifically designed for defining and running multi-container Docker applications. Instead of issuing complex docker run commands for each container, you describe your entire application stack, including its services, networks, and volumes, in a single declarative YAML file, typically named docker-compose.yml.
Think of a scenario where your training script needs to:
- read training data from, and log metrics to, a database, and
- pull its run configuration from a separate configuration service.
Attempting to orchestrate this with individual docker run commands would involve manually setting up networks for communication, managing container startup order, and handling volume mounts consistently across containers. Docker Compose simplifies this significantly.
With Compose, you define each component (the training script container, the database container, the configuration service container) as a separate service within the docker-compose.yml file. Compose handles the underlying Docker operations, such as creating a shared network the services use to reach each other, wiring up volume mounts consistently, and starting containers in order: a single docker-compose up starts all defined services in the correct dependency order (if specified), and docker-compose down stops and removes them cleanly.

Consider a simplified training setup involving just the training container and a PostgreSQL database for logging results. A docker-compose.yml file might look something like this:
version: '3.8' # Specifies the Compose file version

services:
  # Service for the training script
  training:
    build: . # Instructs Compose to build an image from the Dockerfile in the current directory
    volumes:
      - ./data:/app/data # Mount local data directory into the container
      - model_output:/app/output # Mount a named volume for model output
    environment:
      - DB_HOST=db # Use the service name 'db' as the hostname
      - DB_NAME=training_logs
      - DB_USER=trainer
      - DB_PASSWORD=secretpass
    depends_on: # Ensures the database starts before the training service
      - db
    # command: python train.py --config /app/config.yaml # Optional command override

  # Service for the PostgreSQL database
  db:
    image: postgres:14-alpine # Use a pre-built PostgreSQL image
    volumes:
      - db_data:/var/lib/postgresql/data # Mount a named volume for persistent DB data
    environment:
      - POSTGRES_DB=training_logs
      - POSTGRES_USER=trainer
      - POSTGRES_PASSWORD=secretpass

# Define named volumes for persistent storage
volumes:
  db_data:
  model_output:
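One caveat about this file: the database password is hard-coded in plain text, which is acceptable for a throwaway local experiment but not much else. Compose supports variable substitution, so a common pattern is to keep the value in a .env file in the project directory (which Compose reads automatically) and reference it from the Compose file. A minimal sketch, reusing the DB_PASSWORD name from above:

# .env (kept out of version control)
DB_PASSWORD=secretpass

# In docker-compose.yml, reference the variable in each service:
training:
  environment:
    - DB_PASSWORD=${DB_PASSWORD} # substituted by Compose at startup
db:
  environment:
    - POSTGRES_PASSWORD=${DB_PASSWORD}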
In this example:
- services defines two services: training and db.
- The training service builds its image from a local Dockerfile; the db service uses a public postgres image.
- volumes are used for input data (the ./data bind mount), output models (the model_output named volume), and persistent database storage (the db_data named volume).
- environment variables configure both services, notably providing the database connection details to the training service. DB_HOST is set to db, the service name of the PostgreSQL container, which Compose makes resolvable on the internal network.
- depends_on ensures the db service is started before the training service attempts to connect, though not that the database is ready to accept connections; see the sketch below.

Diagram illustrating services defined in a docker-compose.yml file. Services like training and db communicate over a shared network, managed by Compose. Volumes (model_output, db_data) provide persistent storage. An optional configuration service is also shown.
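To make the wiring concrete, here is a hypothetical snippet of what the training script's metric-logging setup might look like. It is a sketch, not part of the example above: it assumes the psycopg2 driver is installed in the training image, and it retries the connection because depends_on only controls startup order; PostgreSQL may still be initializing when the training container starts.

import os
import time

import psycopg2  # assumed to be in the training image's requirements


def connect_with_retry(retries=10, delay=3):
    # DB_HOST is 'db', the Compose service name, resolvable on the shared network.
    for attempt in range(retries):
        try:
            return psycopg2.connect(
                host=os.environ["DB_HOST"],
                dbname=os.environ["DB_NAME"],
                user=os.environ["DB_USER"],
                password=os.environ["DB_PASSWORD"],
            )
        except psycopg2.OperationalError:
            # The db container is up, but Postgres may not be accepting connections yet.
            time.sleep(delay)
    raise RuntimeError("Database never became reachable")


conn = connect_with_retry()
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS metrics (step INTEGER, loss REAL)")
    cur.execute("INSERT INTO metrics (step, loss) VALUES (%s, %s)", (0, 0.0))

Newer Compose versions can also express this readiness requirement declaratively, via a healthcheck on the db service combined with the long form of depends_on (condition: service_healthy).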
With the docker-compose.yml file defined, starting the entire stack is as simple as running:
docker-compose up --build
The --build flag tells Compose to rebuild the image for the training service before starting the containers. Without it, Compose reuses an existing image even if the Dockerfile or its build context has changed, so passing --build is a sensible habit while you are iterating on the training code.
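While the stack is running, you can inspect it with the usual Compose subcommands, for example:

docker-compose ps                 # list the services and their state
docker-compose logs -f training   # follow the training container's output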
To stop and remove the containers and the network defined in the Compose file, you use:
docker-compose down
Named volumes such as db_data and model_output are preserved by default; add the -v flag if you also want to remove them.
Using Docker Compose significantly simplifies the management of multi-container setups common in more developed ML training pipelines. It provides a standardized way to define and share development or testing environments that closely mirror aspects of production deployments, enhancing reproducibility beyond just the single training container. While powerful orchestration systems like Kubernetes are often used for large-scale distributed training, Docker Compose is an invaluable tool for local development, testing, and managing moderately complex training stacks. The following sections will delve deeper into specific techniques for containerizing training scripts and handling associated aspects like configuration and GPU usage.