Once you have built a Docker image containing your training script and all its dependencies, the next step is to execute the training process within a container instance derived from that image. The primary command for this is `docker run`. Using `docker run` ensures your training environment is precisely what you defined in your Dockerfile, isolating it from variations on the host machine and making your results more reproducible.

This section focuses on using the `docker run` command effectively to launch, manage, and configure containerized machine learning training jobs.
At its core, `docker run` creates and starts a new container from a specified image. If your Dockerfile includes an `ENTRYPOINT` or `CMD` instruction that points to your training script, running the container might be as simple as:

```bash
docker run your-ml-training-image:latest
```
This command instructs Docker to:

1. Create a new container from the image `your-ml-training-image` with the tag `latest`.
2. Execute the default `ENTRYPOINT` or `CMD` defined in the Dockerfile.

Often, you'll want to provide specific arguments to your training script, such as hyperparameters or data paths, or even run a different script within the image. You can do this by appending the command and its arguments after the image name:
```bash
# Run train.py with specific arguments
docker run your-ml-training-image:latest python train.py --epochs 20 --batch-size 64

# Run a different script, e.g., data preprocessing
docker run your-ml-training-image:latest python preprocess_data.py --input /raw_data --output /processed_data
```
When you provide a command after the image name, it overrides the `CMD` instruction in the Dockerfile. If an `ENTRYPOINT` is defined, the command you provide is instead passed as arguments to the `ENTRYPOINT`.
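On the script side, these flags arrive as ordinary process arguments. A minimal sketch of the argument parsing a hypothetical `train.py` might use (illustrative, not taken from the image above):

```python
import argparse

# Sketch of the argument parsing a train.py entrypoint might use.
parser = argparse.ArgumentParser(description="Containerized training entrypoint")
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--batch-size", type=int, default=32)

# `docker run ... python train.py --epochs 20 --batch-size 64` would populate
# sys.argv; here we pass the list explicitly to keep the sketch self-contained.
args = parser.parse_args(["--epochs", "20", "--batch-size", "64"])
print(f"epochs={args.epochs}, batch_size={args.batch_size}")
```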
Training jobs are rarely self-contained; they need access to datasets and produce outputs like trained models, logs, or evaluation metrics. As discussed in Chapter 3, Docker volumes and bind mounts are the standard mechanisms for this.
You integrate these data management techniques using the `-v` or `--mount` flag with `docker run`.
Using Bind Mounts: Useful for development or when data resides directly on the host machine.
```bash
# Mount local ./data to /app/data in the container
# Mount local ./output to /app/output in the container
docker run \
  -v $(pwd)/data:/app/data:ro \
  -v $(pwd)/output:/app/output \
  your-ml-training-image:latest \
  python train.py --data-dir /app/data --model-dir /app/output/models
```
In this example, `$(pwd)/data` (the `data` directory in the current host working directory) is mounted read-only (`:ro`) inside the container at `/app/data`. The `$(pwd)/output` directory is mounted read-write at `/app/output`. The training script inside the container accesses data via `/app/data` and saves models to `/app/output/models`, which appear directly in the `./output/models` directory on the host.
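Within the container, the script only ever sees the container-side paths. A sketch of how a hypothetical `train.py` might consume them, including creating the output directory in case a fresh container lacks it (the `/tmp` path here just keeps the sketch runnable outside a container; the defaults match the mounted `/app` paths):

```python
import argparse
import pathlib

parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", default="/app/data")            # read-only mount
parser.add_argument("--model-dir", default="/app/output/models")  # read-write mount

# docker run would supply the real argv; explicit values keep the sketch runnable.
args = parser.parse_args(["--model-dir", "/tmp/example_output/models"])

# Ensure the model directory exists before the training loop tries to write to it.
model_dir = pathlib.Path(args.model_dir)
model_dir.mkdir(parents=True, exist_ok=True)
```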
Using Docker Volumes: Preferred for managing persistent data independent of the host filesystem structure.
```bash
# Assume volumes 'training_data_v1' and 'model_artifacts_v1' exist
docker run \
  -v training_data_v1:/app/data:ro \
  -v model_artifacts_v1:/app/output \
  your-ml-training-image:latest \
  python train.py --data-dir /app/data --model-dir /app/output/models
```
Here, Docker-managed volumes are mounted into the container. This decouples data storage from the host's directory layout.
Remember to ensure the paths inside the container (`/app/data`, `/app/output/models`) match what your training script expects.
As covered previously, environment variables and command-line arguments are common ways to configure training jobs. `docker run` provides flags for both:
Environment Variables (`-e` or `--env`): Suitable for passing configuration like API keys, learning rates, or flags.
```bash
docker run \
  -e LEARNING_RATE=0.005 \
  -e NUM_EPOCHS=30 \
  -e WANDB_API_KEY=your_secret_key \
  -v ... \
  your-ml-training-image:latest \
  python train.py
# Assumes train.py reads LEARNING_RATE, NUM_EPOCHS from environment
```
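On the receiving end, environment variables always arrive as strings, so the script must convert them explicitly. A sketch of how a hypothetical `train.py` might read the values injected above:

```python
import os

# Environment variables set via `docker run -e` are plain strings;
# convert them to the types the training loop expects.
learning_rate = float(os.environ.get("LEARNING_RATE", "0.01"))
num_epochs = int(os.environ.get("NUM_EPOCHS", "10"))
wandb_api_key = os.environ.get("WANDB_API_KEY")  # None when not provided

print(f"lr={learning_rate}, epochs={num_epochs}")
```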
Command-Line Arguments: Passed directly after the image name (or after the overriding command).
```bash
docker run \
  -v ... \
  your-ml-training-image:latest \
  python train.py --learning-rate 0.005 --epochs 30 --log-to-wandb
```
The choice between them often depends on the training script's design. Environment variables are useful for secrets or settings that might apply across different script executions, while command-line arguments are explicit and often used for run-specific parameters like hyperparameters.
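The two styles also combine well: a common pattern is to let an environment variable supply the default while an explicit command-line flag still wins. A sketch of that pattern:

```python
import argparse
import os

parser = argparse.ArgumentParser()
# The environment variable provides the default; a CLI flag overrides it.
parser.add_argument(
    "--learning-rate",
    type=float,
    default=float(os.environ.get("LEARNING_RATE", "0.01")),
)

# With no flag given, the env-derived default applies:
defaults = parser.parse_args([])
# An explicit flag takes precedence over the environment:
overridden = parser.parse_args(["--learning-rate", "0.005"])
```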
By default, `docker run` attaches your terminal to the container's standard input, output, and error streams. You'll see the training logs directly in your terminal, and the command prompt will block until the container exits. This is useful for short jobs or interactive debugging.
For long-running training jobs, you'll typically want to run the container in detached mode using the `-d` flag:
```bash
docker run -d \
  --name long_training_run \
  -v training_data_v1:/app/data:ro \
  -v model_artifacts_v1:/app/output \
  -e LEARNING_RATE=0.001 \
  your-ml-training-image:latest \
  python train.py --data-dir /app/data --model-dir /app/output/models
```
This command starts the container in the background and prints the container ID. Your terminal prompt returns immediately.
To view the logs of a detached container, use the `docker logs` command:
```bash
# Follow logs in real-time
docker logs -f long_training_run

# Show all existing logs
docker logs long_training_run
```
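`docker logs` only shows what the container writes to standard output and standard error, so make sure the training script logs there rather than solely to a file inside the container. A minimal sketch:

```python
import logging
import sys

# Send log records to stdout so `docker logs` can capture them.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    force=True,  # reconfigure even if logging was already set up elsewhere
)
logging.info("starting training run")
```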
Machine learning training can be computationally intensive. `docker run` allows you to limit the resources a container can consume, preventing a single job from monopolizing host resources.
- CPU limit (`--cpus`): Specify the number of CPU cores the container can use.
- Memory limit (`--memory`): Set a maximum amount of RAM.

```bash
docker run -d \
  --cpus="4" \
  --memory="16g" \
  --name resource_limited_training \
  -v ... \
  your-ml-training-image:latest \
  python train.py ...
```
This ensures the container uses at most 4 CPU cores and 16 GB of RAM.
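One caveat: `--cpus` enforces a CPU-time quota but does not hide cores, so inside the container `os.cpu_count()` still reports every host core, and a script that sizes thread or worker pools from it can oversubscribe. On Linux, checking CPU affinity (which reflects `--cpuset-cpus` pinning, though not a `--cpus` quota) gives a more conservative count. This is a best-effort sketch, not a complete solution:

```python
import os

def usable_cpus() -> int:
    """Best-effort count of CPUs this process may actually run on.

    On Linux this reflects cpuset pinning (e.g. --cpuset-cpus); a --cpus
    quota is not visible here, so treat the result as an upper bound.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

print(usable_cpus())
```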
Naming Containers (`--name`): Assigning a memorable name makes it easier to manage containers (e.g., view logs, stop, remove) instead of relying on auto-generated IDs.
```bash
docker run -d --name my_experiment_run_001 ...
docker logs my_experiment_run_001
docker stop my_experiment_run_001
```
Automatic Cleanup (`--rm`): For training jobs that are typically run once and don't need to be inspected after completion, the `--rm` flag is very useful. It automatically removes the container's filesystem when the container exits. This prevents cluttering your system with stopped containers.
```bash
# Container will be removed automatically upon completion or error
docker run --rm \
  --name transient_training \
  -v ... \
  your-ml-training-image:latest \
  python train.py ...
```
Note: since Docker 1.13, `--rm` can be combined with `-d`; the container is cleaned up automatically when it exits, even in the background. For containers started without `--rm`, remove them manually with `docker rm <container_name_or_id>`.
## The `docker run` Process for Training

The `docker run` command orchestrates several components to launch your training job within an isolated environment.

Diagram illustrating the flow when using `docker run` for a training job: the command initiates the process, instructing the Docker daemon to create a container from an image, mount requested data volumes or directories, inject environment variables, and execute the specified training script.
By mastering `docker run` with its various flags for data mounting, configuration, and lifecycle management, you gain precise control over how your ML training jobs execute, significantly improving consistency and simplifying the process of running experiments across different machines or cloud environments.
© 2025 ApX Machine Learning