Theory provides the foundation, but practical application solidifies understanding. In this hands-on exercise, we'll apply the concepts of bind mounts and Docker volumes to a common Machine Learning workflow: training a simple model using a dataset from the host machine and saving the resulting model artifact back to the host or a persistent volume.
We'll work through a scenario where you have a dataset locally and a Python script designed to train a model. Our goal is to run this script inside a container, feeding it the data and retrieving the trained model without permanently embedding either within the container image itself.
Before starting, ensure you have:
Docker Desktop or Docker Engine installed and running.
A simple dataset file. Create a file named data.csv in a local directory (e.g., project/data) with the following content:
feature1,feature2,target
1.0,2.0,0
1.5,2.5,0
3.0,4.0,1
3.5,4.5,1
A basic Python training script. Create a file named train.py in your project directory (e.g., project/train.py):
import argparse
import pandas as pd
from sklearn.linear_model import LogisticRegression
import joblib
import os

# Set up argument parser
parser = argparse.ArgumentParser(description='Simple scikit-learn model training script.')
parser.add_argument('--data_path', type=str, required=True, help='Path to the input CSV dataset.')
parser.add_argument('--model_dir', type=str, required=True, help='Directory to save the trained model.')

# Parse arguments
args = parser.parse_args()

# Ensure model directory exists
os.makedirs(args.model_dir, exist_ok=True)
model_save_path = os.path.join(args.model_dir, 'model.joblib')

print(f"Loading data from: {args.data_path}")
try:
    # Load data
    df = pd.read_csv(args.data_path)
    X = df[['feature1', 'feature2']]
    y = df['target']

    # Train a simple model
    print("Training model...")
    model = LogisticRegression()
    model.fit(X, y)

    # Save the model
    print(f"Saving model to: {model_save_path}")
    joblib.dump(model, model_save_path)
    print("Training complete and model saved.")
except FileNotFoundError:
    print(f"Error: Data file not found at {args.data_path}")
    exit(1)
except Exception as e:
    print(f"An error occurred: {e}")
    exit(1)
A Dockerfile. Create a file named Dockerfile in your project directory (e.g., project/Dockerfile):
# Use a standard Python base image
FROM python:3.9-slim
# Set the working directory inside the container
WORKDIR /app
# Install necessary Python libraries
RUN pip install --no-cache-dir scikit-learn==1.0.2 pandas==1.3.5 joblib==1.1.0
# Copy the training script into the container
COPY train.py .
# Define the entrypoint for the container
ENTRYPOINT ["python", "train.py"]
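Because the ENTRYPOINT is ["python", "train.py"], anything you append after the image name in docker run is forwarded to the script as command-line arguments. A minimal stdlib-only sketch of how argparse receives them (the example paths match the ones used later in this exercise):

```python
import argparse

# Mirror the two arguments that train.py declares
parser = argparse.ArgumentParser(description='Demo of ENTRYPOINT argument forwarding.')
parser.add_argument('--data_path', type=str, required=True)
parser.add_argument('--model_dir', type=str, required=True)

# Simulates: docker run ... ml-data-practice --data_path /app/data/data.csv --model_dir /app/output
args = parser.parse_args(['--data_path', '/app/data/data.csv', '--model_dir', '/app/output'])

print(args.data_path)  # /app/data/data.csv
print(args.model_dir)  # /app/output
```

If a required argument is missing, argparse exits with a usage message, which is why a bare docker run against this image fails with an error rather than training silently on defaults.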
Your project directory should look something like this:
project/
├── Dockerfile
├── train.py
└── data/
└── data.csv
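If you prefer to script the setup, here is a stdlib-only sketch that creates the data directory and writes the data.csv shown earlier (run it from the parent of the project directory; the paths mirror the layout above):

```python
import csv
import os

# Recreate project/data/data.csv with the rows shown earlier
os.makedirs('project/data', exist_ok=True)
rows = [
    ['feature1', 'feature2', 'target'],
    ['1.0', '2.0', '0'],
    ['1.5', '2.5', '0'],
    ['3.0', '4.0', '1'],
    ['3.5', '4.5', '1'],
]
with open('project/data/data.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

print(open('project/data/data.csv').read().splitlines()[0])  # feature1,feature2,target
```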
First, navigate to your project directory in your terminal and build the Docker image:
cd /path/to/project
docker build -t ml-data-practice .
This command builds an image tagged ml-data-practice based on your Dockerfile, including Python, the required libraries, and the train.py script.
Bind mounts directly map a directory from your host machine into the container. This is often convenient during development as changes on the host are immediately reflected inside the container.
Create an output directory: On your host machine, create a directory where the model will be saved, for example, project/models.
mkdir /path/to/project/models
Run the container with bind mounts: Execute the training script within the container, mounting the local data directory to /app/data inside the container and the local models directory to /app/output.
docker run --rm \
-v "$(pwd)/data":/app/data \
-v "$(pwd)/models":/app/output \
ml-data-practice \
--data_path /app/data/data.csv \
--model_dir /app/output
- --rm: Automatically removes the container when it exits.
- -v "$(pwd)/data":/app/data: Mounts the data subdirectory from your current working directory (host) to /app/data (container).
- -v "$(pwd)/models":/app/output: Mounts the models subdirectory from your current working directory (host) to /app/output (container).
- ml-data-practice: The name of the image to use.
- --data_path /app/data/data.csv: Argument passed to train.py, specifying the path inside the container where the data file is mounted.
- --model_dir /app/output: Argument passed to train.py, specifying the path inside the container where the model should be saved.

Verify the output: After the container finishes, check your local project/models directory. You should find the model.joblib file saved there.
ls /path/to/project/models
# Output should include: model.joblib
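As an alternative to ls, a stdlib-only Python check you could run from the project directory on the host (models/model.joblib is the path from the step above):

```python
from pathlib import Path

# Host-side check for the artifact the container wrote through the bind mount
model_path = Path('models/model.joblib')
if model_path.is_file() and model_path.stat().st_size > 0:
    print(f"Model artifact found: {model_path} ({model_path.stat().st_size} bytes)")
else:
    print(f"No model artifact at {model_path}; check the container's logs")
```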
Bind mounts provide a direct link between the host and container, making data access straightforward for local development. However, they create a dependency on the host's file structure and can lead to permission issues: on Linux, for example, files the container writes through the mount are owned by the container's user (often root) rather than by you.
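One way the permission issue shows up is in file ownership. A stdlib-only sketch of how you might inspect ownership from Python on a Unix host (demonstrated on the current directory; after a bind-mounted run you would point it at models/model.joblib instead):

```python
import os

# Compare a file's owner with the current user; a mismatch (e.g. uid 0)
# is the typical symptom of a root container writing through a bind mount
info = os.stat('.')
print(f"file owner uid={info.st_uid}, gid={info.st_gid}")
print(f"current user uid={os.getuid()}")
```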
Docker volumes are managed by Docker itself and are the preferred way to handle persistent data in containers, especially in production or when you want to decouple the data lifecycle from the host machine.
Create Docker volumes: We need one volume for the input data and another for the output model.
docker volume create ml-input-data
docker volume create ml-output-models
Populate the input volume: Unlike bind mounts, volumes don't automatically see host files. We need to copy our dataset into the ml-input-data volume. A common way is using a temporary helper container:
docker run --rm \
-v ml-input-data:/volume_data \
-v "$(pwd)/data":/host_data \
alpine \
cp /host_data/data.csv /volume_data/
This helper command:

- Runs a temporary alpine container.
- Mounts the ml-input-data volume to /volume_data.
- Bind-mounts your local data directory to /host_data.
- The cp command copies the dataset from the host bind mount path to the volume path inside this temporary container. Once the container exits (--rm), the data persists in the ml-input-data volume.

Run the container with volumes: Now, run the training container, mounting the Docker volumes.
docker run --rm \
-v ml-input-data:/app/data \
-v ml-output-models:/app/output \
ml-data-practice \
--data_path /app/data/data.csv \
--model_dir /app/output
- -v ml-input-data:/app/data: Mounts the Docker volume ml-input-data to /app/data inside the container.
- -v ml-output-models:/app/output: Mounts the Docker volume ml-output-models to /app/output inside the container.

Verify the output: The model is now saved inside the ml-output-models volume, not directly on your host filesystem. To verify, you can inspect the volume's contents using another temporary container:
docker run --rm \
-v ml-output-models:/volume_data \
alpine \
ls /volume_data
This mounts the ml-output-models volume to /volume_data in a temporary alpine container and lists its contents. You should see model.joblib.

Volumes provide better isolation and are managed by Docker, making them more portable and less prone to host-specific issues. The initial step of populating the volume adds a bit more complexity compared to bind mounts.
You can remove the Docker volumes if you no longer need them:
docker volume rm ml-input-data ml-output-models
You can also remove the Docker image:
docker image rm ml-data-practice
This practical exercise demonstrated how to use both bind mounts and Docker volumes to supply input data to a containerized ML script and retrieve the output model artifact. Choosing between them depends on your specific needs: bind mounts offer convenience for development by directly linking to host files, while volumes provide robust, Docker-managed persistence suitable for more structured workflows and deployment scenarios. Understanding how to effectively manage data is fundamental to containerizing ML applications.
© 2025 ApX Machine Learning