Containerize a simple machine learning training script, build a Docker image for it, and execute the training process within a container. Data inputs and model outputs are managed using bind mounts. This approach solidifies the process of creating reproducible training environments.

## Prerequisites

- Docker installed and running on your system.
- Python 3 installed locally (for preparing the example).
- A text editor or IDE.

## The Training Script (Example: train.py)

First, let's create a basic Python script that trains a Scikit-learn Logistic Regression model on the Iris dataset. We'll design it to accept input/output directory paths and a hyperparameter (regularization strength `C`) via command-line arguments.

Save the following code as `src/train.py`:

```python
# src/train.py
import argparse
import os

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_model(input_dir, output_dir, C):
    """Loads data, trains a model, and saves it."""
    print(f"Loading data from: {input_dir}")
    # Assuming iris.csv is in the input directory
    data_path = os.path.join(input_dir, 'iris.csv')
    try:
        iris_df = pd.read_csv(data_path)
    except FileNotFoundError:
        print(f"Error: Could not find {data_path}. Make sure iris.csv is mounted correctly.")
        return
    print("Data loaded successfully.")

    X = iris_df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
    y = iris_df['species']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    print(f"Training Logistic Regression model with C={C}...")
    model = LogisticRegression(C=C, max_iter=200, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Model accuracy on test set: {acc:.4f}")

    # Save the model
    os.makedirs(output_dir, exist_ok=True)  # Ensure output directory exists
    model_path = os.path.join(output_dir, 'iris_model.joblib')
    joblib.dump(model, model_path)
    print(f"Model saved to: {model_path}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train a Logistic Regression model on Iris data.")
    parser.add_argument('--input-dir', type=str, required=True, help='Directory containing iris.csv')
    parser.add_argument('--output-dir', type=str, required=True, help='Directory to save the trained model')
    parser.add_argument('--C', type=float, default=1.0, help='Inverse of regularization strength')
    args = parser.parse_args()
    train_model(args.input_dir, args.output_dir, args.C)
```

This script uses `argparse` to handle command-line arguments, loads data using pandas, trains a model with Scikit-learn, and saves the trained model using `joblib`. It explicitly expects paths for input data and output models.

## Project Structure

Organize your project files like this:

```
ml-training-project/
├── Dockerfile
├── requirements.txt
├── src/
│   └── train.py
├── data/
│   └── iris.csv   # You'll need to download/create this
└── output/        # This directory will be created or used for model output
```

You can easily find the Iris dataset online (e.g., from Kaggle or the UCI Machine Learning Repository) or create a sample `iris.csv` file and place it in the `data/` directory.
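If you'd rather generate the file than download it, a small helper script can build it from scikit-learn's bundled copy of the dataset. This is an optional sketch (the script name `make_iris_csv.py` is just a suggestion); it renames the columns to match what `train.py` expects:

```python
# make_iris_csv.py -- optional helper to create data/iris.csv locally.
import os

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    'sepal length (cm)': 'sepal_length',
    'sepal width (cm)': 'sepal_width',
    'petal length (cm)': 'petal_length',
    'petal width (cm)': 'petal_width',
})

# Replace the integer target with the species name train.py expects.
df['species'] = df['target'].map(lambda i: iris.target_names[i])
df = df.drop(columns='target')

os.makedirs('data', exist_ok=True)
df.to_csv('data/iris.csv', index=False)
print(f"Wrote {len(df)} rows to data/iris.csv")
```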
Make sure it has columns named `sepal_length`, `sepal_width`, `petal_length`, `petal_width`, and `species`.

## Dependencies (requirements.txt)

Create a `requirements.txt` file listing the necessary Python libraries:

```
pandas
scikit-learn==1.2.2  # Pinning version for reproducibility
joblib
```

Note: Pinning versions (like `scikit-learn==1.2.2`) is a good practice for ensuring reproducibility.

## Creating the Dockerfile

Now, create the `Dockerfile` in the project root directory (`ml-training-project/`):

```dockerfile
# Use an official Python runtime as a parent image
FROM python:3.9-slim

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file first to leverage the Docker cache
COPY requirements.txt .

# Install any needed packages specified in requirements.txt
# Use --no-cache-dir to reduce image size
RUN pip install --no-cache-dir -r requirements.txt

# Copy the source code into the container
COPY src/ ./src/

# Define the entrypoint for the container
# This makes the container behave like an executable
ENTRYPOINT ["python", "src/train.py"]

# Default command (can be overridden at runtime)
# Here, we show the help text if no args are provided
CMD ["--help"]
```

Let's break down this Dockerfile:

- `FROM python:3.9-slim`: Starts with a lightweight Python 3.9 image.
- `WORKDIR /app`: Sets the default directory inside the container to `/app`. Subsequent commands (`COPY`, `RUN`, `CMD`, `ENTRYPOINT`) run relative to this directory.
- `COPY requirements.txt .`: Copies only the requirements file.
- `RUN pip install ...`: Installs dependencies. This layer is cached as long as `requirements.txt` doesn't change, speeding up subsequent builds.
- `COPY src/ ./src/`: Copies our training script directory into the image under `/app/src/`.
- `ENTRYPOINT ["python", "src/train.py"]`: Specifies that containers run from this image will execute `python src/train.py` by default. Any arguments provided to `docker run` after the image name are appended to this command.
- `CMD ["--help"]`: Provides a default argument to the ENTRYPOINT.
If you run the container without arguments, it will execute `python src/train.py --help`.

## Building the Docker Image

Navigate to your project root directory (`ml-training-project/`) in your terminal and run the build command:

```bash
docker build -t ml-training-app:latest .
```

- `-t ml-training-app:latest`: Tags the image with the name `ml-training-app` and the tag `latest`.
- `.`: Specifies that the build context (where Docker looks for the Dockerfile and files to copy) is the current directory.

Docker will execute the steps in your Dockerfile, downloading the base image, installing dependencies, and copying your code.

## Running the Containerized Training

Now, let's run the training script inside a container. We need to:

1. Mount the local `data` directory into the container's `/app/data` path so the script can read `iris.csv`.
2. Mount the local `output` directory into the container's `/app/output` path so the script can save the `iris_model.joblib` file back to our host machine.
3. Pass the required command-line arguments (`--input-dir`, `--output-dir`, and optionally `--C`) to the container; they will be appended to the ENTRYPOINT.

Execute the following `docker run` command from your project root directory:

```bash
docker run --rm \
  -v "$(pwd)/data:/app/data" \
  -v "$(pwd)/output:/app/output" \
  ml-training-app:latest \
  --input-dir /app/data \
  --output-dir /app/output \
  --C 0.5
```

Let's examine this command:

- `docker run`: The command to create and start a new container.
- `--rm`: Automatically removes the container when it exits. This is useful for one-off tasks like training.
- `-v "$(pwd)/data:/app/data"`: Mounts the `data` subdirectory of the host's current directory (`$(pwd)`) to the `/app/data` directory inside the container. Use `${PWD}` on Windows PowerShell, or replace `$(pwd)` with the full path if needed. This makes `iris.csv` available to the script.
- `-v "$(pwd)/output:/app/output"`: Mounts the host's `output` directory (Docker will create it if it doesn't exist) to `/app/output` inside the container.
  This is where the script will save the model.
- `ml-training-app:latest`: The image to use for the container.
- `--input-dir /app/data --output-dir /app/output --C 0.5`: The arguments passed to the ENTRYPOINT (`python src/train.py`). Notice we use the container paths (`/app/data`, `/app/output`) here, not the host paths. We also specify the hyperparameter `C=0.5`.

You should see output similar to this in your terminal:

```
Loading data from: /app/data
Data loaded successfully.
Training Logistic Regression model with C=0.5...
Model accuracy on test set: 1.0000
Model saved to: /app/output/iris_model.joblib
```

## Verifying the Output

After the container finishes, check your local `output` directory. You should find the `iris_model.joblib` file there, successfully saved from within the container.

```bash
ls output/
# iris_model.joblib
```

## Summary

In this practical exercise, you successfully:

- Created a Python training script designed to work within a containerized environment by accepting input/output paths.
- Wrote a Dockerfile to define the environment, install dependencies, and copy the training code.
- Built a Docker image containing the training script and its environment.
- Executed the training process within an isolated container using `docker run`.
- Managed data input and model output using bind mounts (`-v`) to link host directories with container directories.
- Passed configuration (hyperparameters) to the script via command-line arguments.

This process demonstrates the core workflow for containerizing ML training. By packaging the code and dependencies together, you ensure that the training environment is consistent and reproducible, regardless of where the Docker image is run. Bind mounts allow interaction with the host filesystem for data and results, bridging the gap between the isolated container and the external environment.
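Beyond checking that the file exists, you can load the artifact back on the host and run a prediction with it. The sketch below is self-contained so it can run anywhere: it first trains a stand-in model the same way `train.py` does (using integer labels from scikit-learn's built-in dataset rather than the CSV's string labels). After a real container run, you would set `model_path = 'output/iris_model.joblib'` and skip the stand-in training lines:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in for the artifact produced inside the container; after a real run,
# point model_path at 'output/iris_model.joblib' and drop these lines.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(C=0.5, max_iter=200, random_state=42).fit(X, y)
model_path = 'iris_model.joblib'
joblib.dump(model, model_path)

# The verification pattern: load the serialized model and predict one sample.
loaded = joblib.load(model_path)
sample = [[5.1, 3.5, 1.4, 0.2]]  # sepal/petal measurements, same column order as training
print("Predicted class:", loaded.predict(sample)[0])
```

Because the trained model object round-trips through `joblib` unchanged, this also confirms that the file written through the bind mount is a usable artifact, not just a file that happens to exist.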