Theory provides the foundation, but practical application solidifies understanding. In this hands-on exercise, we'll apply the concepts of bind mounts and Docker volumes to a common Machine Learning workflow: training a simple model using a dataset from the host machine and saving the resulting model artifact back to the host or a persistent volume.
We'll work through a scenario where you have a dataset locally and a Python script designed to train a model. Our goal is to run this script inside a container, feeding it the data and retrieving the trained model without permanently embedding either within the container image itself.
Before starting, ensure you have:
Docker Desktop or Docker Engine installed and running.
A simple dataset file. Create a file named data.csv in a local directory (e.g., project/data) with the following content:
feature1,feature2,target
1.0,2.0,0
1.5,2.5,0
3.0,4.0,1
3.5,4.5,1
A basic Python training script. Create a file named train.py in your project directory (e.g., project/train.py):
import argparse
import pandas as pd
from sklearn.linear_model import LogisticRegression
import joblib
import os

# Set up argument parser
parser = argparse.ArgumentParser(description='Simple scikit-learn model training script.')
parser.add_argument('--data_path', type=str, required=True, help='Path to the input CSV dataset.')
parser.add_argument('--model_dir', type=str, required=True, help='Directory to save the trained model.')

# Parse arguments
args = parser.parse_args()

# Ensure model directory exists
os.makedirs(args.model_dir, exist_ok=True)
model_save_path = os.path.join(args.model_dir, 'model.joblib')

print(f"Loading data from: {args.data_path}")
try:
    # Load data
    df = pd.read_csv(args.data_path)
    X = df[['feature1', 'feature2']]
    y = df['target']

    # Train a simple model
    print("Training model...")
    model = LogisticRegression()
    model.fit(X, y)

    # Save the model
    print(f"Saving model to: {model_save_path}")
    joblib.dump(model, model_save_path)
    print("Training complete and model saved.")
except FileNotFoundError:
    print(f"Error: Data file not found at {args.data_path}")
    exit(1)
except Exception as e:
    print(f"An error occurred: {e}")
    exit(1)
A Dockerfile. Create a file named Dockerfile in your project directory (e.g., project/Dockerfile):
# Use a standard Python base image
FROM python:3.9-slim
# Set the working directory inside the container
WORKDIR /app
# Install necessary Python libraries
RUN pip install --no-cache-dir scikit-learn==1.0.2 pandas==1.3.5 joblib==1.1.0
# Copy the training script into the container
COPY train.py .
# Define the entrypoint for the container
ENTRYPOINT ["python", "train.py"]
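Because the ENTRYPOINT is ["python", "train.py"], anything you append after the image name in docker run is forwarded to the script as command-line arguments. A minimal stdlib-only sketch of how argparse receives them (the example paths match the ones used later in this exercise):

```python
import argparse

# Mirror the two arguments that train.py declares
parser = argparse.ArgumentParser(description='Demo of ENTRYPOINT argument forwarding.')
parser.add_argument('--data_path', type=str, required=True)
parser.add_argument('--model_dir', type=str, required=True)

# Simulates: docker run ... ml-data-practice --data_path /app/data/data.csv --model_dir /app/output
args = parser.parse_args(['--data_path', '/app/data/data.csv', '--model_dir', '/app/output'])

print(args.data_path)  # /app/data/data.csv
print(args.model_dir)  # /app/output
```

If a required argument is missing, argparse exits with a usage message, which is why a bare docker run against this image fails with an error rather than training silently on defaults.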
Your project directory should look something like this:
project/
├── Dockerfile
├── train.py
└── data/
└── data.csv
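If you prefer to script the setup, here is a stdlib-only sketch that creates the data directory and writes the data.csv shown earlier (run it from the parent of the project directory; the paths mirror the layout above):

```python
import csv
import os

# Recreate project/data/data.csv with the rows shown earlier
os.makedirs('project/data', exist_ok=True)
rows = [
    ['feature1', 'feature2', 'target'],
    ['1.0', '2.0', '0'],
    ['1.5', '2.5', '0'],
    ['3.0', '4.0', '1'],
    ['3.5', '4.5', '1'],
]
with open('project/data/data.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

print(open('project/data/data.csv').read().splitlines()[0])  # feature1,feature2,target
```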
First, navigate to your project directory in your terminal and build the Docker image:
cd /path/to/project
docker build -t ml-data-practice .
This command builds an image tagged ml-data-practice based on your Dockerfile, including Python, the required libraries, and the train.py script.
Bind mounts directly map a directory from your host machine into the container. This is often convenient during development as changes on the host are immediately reflected inside the container.
Create an output directory: On your host machine, create a directory where the model will be saved, for example, project/models.
mkdir /path/to/project/models
Run the container with bind mounts: Execute the training script within the container, mounting the local data directory to /app/data inside the container and the local models directory to /app/output.
docker run --rm \
-v "$(pwd)/data":/app/data \
-v "$(pwd)/models":/app/output \
ml-data-practice \
--data_path /app/data/data.csv \
--model_dir /app/output
- --rm: Automatically removes the container when it exits.
- -v "$(pwd)/data":/app/data: Mounts the data subdirectory from your current working directory (host) to /app/data (container).
- -v "$(pwd)/models":/app/output: Mounts the models subdirectory from your current working directory (host) to /app/output (container).
- ml-data-practice: The name of the image to use.
- --data_path /app/data/data.csv: Argument passed to train.py, specifying the path inside the container where the data file is mounted.
- --model_dir /app/output: Argument passed to train.py, specifying the path inside the container where the model should be saved.

Verify the output: After the container finishes, check your local project/models directory. You should find the model.joblib file saved there.
ls /path/to/project/models
# Output should include: model.joblib
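As an alternative to ls, a stdlib-only Python check you could run from the project directory on the host (models/model.joblib is the path from the step above):

```python
from pathlib import Path

# Host-side check for the artifact the container wrote through the bind mount
model_path = Path('models/model.joblib')
if model_path.is_file() and model_path.stat().st_size > 0:
    print(f"Model artifact found: {model_path} ({model_path.stat().st_size} bytes)")
else:
    print(f"No model artifact at {model_path}; check the container's logs")
```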
Bind mounts provide a direct link between the host and container, making data access straightforward for local development. However, they create a dependency on the host's file structure and can lead to permission issues: on Linux, for example, files the container writes through the mount are owned by the container's user (often root) rather than by you.
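One way the permission issue shows up is in file ownership. A stdlib-only sketch of how you might inspect ownership from Python on a Unix host (demonstrated on the current directory; after a bind-mounted run you would point it at models/model.joblib instead):

```python
import os

# Compare a file's owner with the current user; a mismatch (e.g. uid 0)
# is the typical symptom of a root container writing through a bind mount
info = os.stat('.')
print(f"file owner uid={info.st_uid}, gid={info.st_gid}")
print(f"current user uid={os.getuid()}")
```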
Docker volumes are managed by Docker itself and are the preferred way to handle persistent data in containers, especially in production or when you want to decouple the data lifecycle from the host machine.
Create Docker volumes: We need one volume for the input data and another for the output model.
docker volume create ml-input-data
docker volume create ml-output-models
Populate the input volume: Unlike bind mounts, volumes don't automatically see host files. We need to copy our dataset into the ml-input-data volume. A common way is using a temporary helper container:
docker run --rm \
-v ml-input-data:/volume_data \
-v "$(pwd)/data":/host_data \
alpine \
cp /host_data/data.csv /volume_data/
This helper command:

- Runs a temporary alpine container.
- Mounts the ml-input-data volume to /volume_data.
- Bind-mounts your local data directory to /host_data.
- The cp command copies the dataset from the host bind mount path to the volume path inside this temporary container. Once the container exits (--rm), the data persists in the ml-input-data volume.

Run the container with volumes: Now, run the training container, mounting the Docker volumes.
docker run --rm \
-v ml-input-data:/app/data \
-v ml-output-models:/app/output \
ml-data-practice \
--data_path /app/data/data.csv \
--model_dir /app/output
- -v ml-input-data:/app/data: Mounts the Docker volume ml-input-data to /app/data inside the container.
- -v ml-output-models:/app/output: Mounts the Docker volume ml-output-models to /app/output inside the container.

Verify the output: The model is now saved inside the ml-output-models volume, not directly on your host filesystem. To verify, you can inspect the volume's contents using another temporary container:
docker run --rm \
-v ml-output-models:/volume_data \
alpine \
ls /volume_data
This mounts the ml-output-models volume to /volume_data in a temporary alpine container and lists its contents. You should see model.joblib.

Volumes provide better isolation and are managed by Docker, making them more portable and less prone to host-specific issues. The initial step of populating the volume adds a bit more complexity compared to bind mounts.
You can remove the Docker volumes if you no longer need them:
docker volume rm ml-input-data ml-output-models
You can also remove the Docker image:
docker image rm ml-data-practice
This practical exercise demonstrated how to use both bind mounts and Docker volumes to supply input data to a containerized ML script and retrieve the output model artifact. Choosing between them depends on your specific needs: bind mounts offer convenience for development by directly linking to host files, while volumes provide robust, Docker-managed persistence suitable for more structured workflows and deployment scenarios. Understanding how to effectively manage data is fundamental to containerizing ML applications.
© 2025 ApX Machine Learning