Let's put the techniques discussed in this chapter into practice. We will containerize a simple machine learning training script, build a Docker image for it, and execute the training process within a container, managing data inputs and model outputs using bind mounts. This exercise solidifies the process of creating reproducible training environments.
First, let's create a basic Python script that trains a Scikit-learn Logistic Regression model on the Iris dataset. We'll design it to accept input/output directory paths and a hyperparameter (regularization strength `C`) via command-line arguments.

Save the following code as `src/train.py`:
```python
# src/train.py
import argparse
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
import os

def train_model(input_dir, output_dir, C):
    """Loads data, trains a model, and saves it."""
    print(f"Loading data from: {input_dir}")
    # Assuming iris.csv is in the input directory
    data_path = os.path.join(input_dir, 'iris.csv')
    try:
        iris_df = pd.read_csv(data_path)
    except FileNotFoundError:
        print(f"Error: Could not find {data_path}. Make sure iris.csv is mounted correctly.")
        return

    print("Data loaded successfully.")
    X = iris_df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
    y = iris_df['species']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    print(f"Training Logistic Regression model with C={C}...")
    model = LogisticRegression(C=C, max_iter=200, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Model accuracy on test set: {acc:.4f}")

    # Save the model
    os.makedirs(output_dir, exist_ok=True)  # Ensure output directory exists
    model_path = os.path.join(output_dir, 'iris_model.joblib')
    joblib.dump(model, model_path)
    print(f"Model saved to: {model_path}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train a Logistic Regression model on Iris data.")
    parser.add_argument('--input-dir', type=str, required=True, help='Directory containing iris.csv')
    parser.add_argument('--output-dir', type=str, required=True, help='Directory to save the trained model')
    parser.add_argument('--C', type=float, default=1.0, help='Inverse of regularization strength')
    args = parser.parse_args()
    train_model(args.input_dir, args.output_dir, args.C)
```
This script uses `argparse` to handle command-line arguments, loads data with pandas, trains a model with Scikit-learn, and saves the trained model using `joblib`. It explicitly expects paths for input data and output models.
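Before containerizing anything, it can help to smoke-test the function directly on the host. Here is a minimal sketch (a hypothetical `smoke_test.py`, not part of the exercise), assuming you run it from the project root described next and have already placed `data/iris.csv`:

```python
# smoke_test.py -- hypothetical local check; run from the project root
# Assumes data/iris.csv exists (see the layout and dataset notes below)
from src.train import train_model

# Should print progress messages and write output/iris_model.joblib
train_model(input_dir="data", output_dir="output", C=1.0)
```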
Organize your project files like this:
```
ml-training-project/
├── Dockerfile
├── requirements.txt
├── src/
│   └── train.py
├── data/
│   └── iris.csv   # You'll need to download/create this
└── output/        # This directory will be created or used for model output
```
You can easily find the Iris dataset online (e.g., from Kaggle or the UCI Machine Learning Repository) or create a sample `iris.csv` file and place it in the `data/` directory. Make sure it has the columns `sepal_length`, `sepal_width`, `petal_length`, `petal_width`, and `species`.
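If you'd rather generate the file than download it, Scikit-learn ships a copy of the dataset. Here is a minimal sketch (a hypothetical `make_iris_csv.py`, run from the project root) that writes `data/iris.csv` with the column names above:

```python
# make_iris_csv.py -- hypothetical helper that writes data/iris.csv
import os
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# Rename Scikit-learn's default column names to those expected by train.py
df = iris.frame.rename(columns={
    "sepal length (cm)": "sepal_length",
    "sepal width (cm)": "sepal_width",
    "petal length (cm)": "petal_length",
    "petal width (cm)": "petal_width",
})

# Map the numeric target (0, 1, 2) to species names like 'setosa'
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
df = df.drop(columns=["target"])

os.makedirs("data", exist_ok=True)
df.to_csv("data/iris.csv", index=False)
print(f"Wrote {len(df)} rows to data/iris.csv")
```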
Create a `requirements.txt` file listing the necessary Python libraries:
```
pandas
scikit-learn==1.2.2  # Pinning version for reproducibility
joblib
```
Note: Pinning versions (like `scikit-learn==1.2.2`) is a good practice for ensuring reproducibility.
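Pinning controls what gets installed; to make individual runs auditable, you can also record the versions actually in use at training time. A small sketch of the idea (a hypothetical helper, not part of `train.py` above):

```python
# log_versions.py -- hypothetical reproducibility helper
import sys
import joblib
import pandas
import sklearn

def print_environment():
    """Print the interpreter and library versions used for this run."""
    print(f"python       {sys.version.split()[0]}")
    print(f"pandas       {pandas.__version__}")
    print(f"scikit-learn {sklearn.__version__}")
    print(f"joblib       {joblib.__version__}")

if __name__ == "__main__":
    print_environment()
```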
Now, create the `Dockerfile` in the project root directory (`ml-training-project/`):
```dockerfile
# Use an official Python runtime as a parent image
FROM python:3.9-slim

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file first to leverage the Docker cache
COPY requirements.txt .

# Install any needed packages specified in requirements.txt
# Use --no-cache-dir to reduce image size
RUN pip install --no-cache-dir -r requirements.txt

# Copy the source code into the container
COPY src/ ./src/

# Define the entrypoint for the container
# This makes the container behave like an executable
ENTRYPOINT ["python", "src/train.py"]

# Default command (can be overridden at runtime)
# If no arguments are provided, the help flag is passed
CMD ["--help"]
```
Let's break down this `Dockerfile`:

- `FROM python:3.9-slim`: Starts from a lightweight Python 3.9 base image.
- `WORKDIR /app`: Sets the default directory inside the container to `/app`. Subsequent instructions (`COPY`, `RUN`, `CMD`, `ENTRYPOINT`) run relative to this directory.
- `COPY requirements.txt .`: Copies only the requirements file.
- `RUN pip install ...`: Installs dependencies. This layer is cached as long as `requirements.txt` doesn't change, speeding up subsequent builds.
- `COPY src/ ./src/`: Copies our training script directory into the image under `/app/src/`.
- `ENTRYPOINT ["python", "src/train.py"]`: Specifies that containers started from this image execute `python src/train.py` by default. Any arguments provided to `docker run` after the image name are appended to this command.
- `CMD ["--help"]`: Provides a default argument to the `ENTRYPOINT`. If you run the container without arguments, it executes `python src/train.py --help`.

Navigate to your project root directory (`ml-training-project/`) in your terminal and run the build command:
```bash
docker build -t ml-training-app:latest .
```
- `-t ml-training-app:latest`: Tags the image with the name `ml-training-app` and the tag `latest`.
- `.`: Sets the build context (where Docker looks for the `Dockerfile` and files to copy) to the current directory.

Docker will execute the steps in your `Dockerfile`, downloading the base image, installing dependencies, and copying your code.
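One practical note: the entire build context is sent to the Docker daemon, so large `data/` or `output/` directories slow builds down unnecessarily. A `.dockerignore` file in the project root (a suggested sketch, not required for this exercise) keeps them out of the context:

```
data/
output/
__pycache__/
*.joblib
```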
Now, let's run the training script inside a container. We need to:

- Mount our local `data` directory to the container's `/app/data` path so the script can read `iris.csv`.
- Mount our local `output` directory to the container's `/app/output` path so the script can save the `iris_model.joblib` file back to our host machine.
- Pass the command-line arguments (`--input-dir`, `--output-dir`, and optionally `--C`) to the container, where they will be appended to the `ENTRYPOINT`.

Execute the following `docker run` command from your project root directory:
```bash
docker run --rm \
  -v "$(pwd)/data:/app/data" \
  -v "$(pwd)/output:/app/output" \
  ml-training-app:latest \
  --input-dir /app/data \
  --output-dir /app/output \
  --C 0.5
```
Let's examine this command:

- `docker run`: Creates and starts a new container.
- `--rm`: Automatically removes the container when it exits. This is useful for one-off tasks like training.
- `-v "$(pwd)/data:/app/data"`: Mounts the `data` directory under the host's current directory (`$(pwd)`) to `/app/data` inside the container, making `iris.csv` available to the script. Use `${PWD}` on Windows PowerShell, or replace `$(pwd)` with the full path if needed.
- `-v "$(pwd)/output:/app/output"`: Mounts the host's `output` directory (Docker will create it if it doesn't exist) to `/app/output` inside the container. This is where the script will save the model.
- `ml-training-app:latest`: The image to use for the container.
- `--input-dir /app/data --output-dir /app/output --C 0.5`: The arguments passed to the `ENTRYPOINT` (`python src/train.py`). Notice we use the container paths (`/app/data`, `/app/output`) here, not the host paths. We also set the hyperparameter `C=0.5`.
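If you prefer to orchestrate runs from Python rather than the shell, the same container launch can be scripted with the Docker SDK for Python (`pip install docker`). This is a minimal sketch, assuming the image built above and a local Docker daemon; the volume mapping mirrors the `-v` flags:

```python
# run_training.py -- hypothetical sketch using the Docker SDK for Python
import os
import docker

client = docker.from_env()  # connects to the local Docker daemon

# Bind-mount host ./data and ./output, mirroring the -v flags above
volumes = {
    os.path.abspath("data"): {"bind": "/app/data", "mode": "ro"},
    os.path.abspath("output"): {"bind": "/app/output", "mode": "rw"},
}

# Arguments after the image name are appended to the ENTRYPOINT
logs = client.containers.run(
    "ml-training-app:latest",
    ["--input-dir", "/app/data", "--output-dir", "/app/output", "--C", "0.5"],
    volumes=volumes,
    remove=True,  # equivalent to --rm
)
print(logs.decode())
```

Whichever way you launch it, you should see output similar to this in your terminal: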
```
Loading data from: /app/data
Data loaded successfully.
Training Logistic Regression model with C=0.5...
Model accuracy on test set: 1.0000
Model saved to: /app/output/iris_model.joblib
```
After the container finishes, check your local `output` directory. You should find the `iris_model.joblib` file there, successfully saved from within the container.
```
$ ls output/
iris_model.joblib
```
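As a final sanity check, you can load the saved model back on the host and make a prediction. A quick sketch (a hypothetical `check_model.py`), assuming the same pinned Scikit-learn version is installed locally, since pickled estimators are generally only portable across matching versions:

```python
# check_model.py -- hypothetical sketch for verifying the saved model
import joblib
import pandas as pd

model = joblib.load("output/iris_model.joblib")

# One sample with the same column names used during training (values in cm)
sample = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2]],
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
print(model.predict(sample))  # e.g. ['setosa']
```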
In this practical exercise, you successfully:

- Wrote a `Dockerfile` to define the environment, install dependencies, and copy the training code.
- Built the image and ran the containerized training script with `docker run`.
- Used volume mounts (`-v`) to link host directories with container directories.
) to link host directories with container directories.This process demonstrates the core workflow for containerizing ML training. By packaging the code and dependencies together, you ensure that the training environment is consistent and reproducible, regardless of where the Docker image is run. Using volumes allows interaction with the host filesystem for data and results, bridging the gap between the isolated container and the external environment.