While containerizing your application code provides consistency, the ephemeral nature of containers presents a challenge for Machine Learning tasks: where do the datasets come from, and where do the trained models go? Containers, by default, lose any data written inside them when they are removed. For ML, where datasets can be large and trained models represent significant computational investment, this won't work. We need a way to persist data independently of the container's lifecycle.
Docker Volumes are the preferred mechanism for persisting data generated by and used by Docker containers. Think of a volume as a dedicated, Docker-managed directory on the host machine's filesystem. The significant advantage is that Docker manages the volume's storage area, lifecycle, and permissions, decoupling the data entirely from any specific container.
Unlike bind mounts (which we'll discuss next), where you map a specific directory from your host machine into the container, volumes are created and managed by Docker itself. You interact with them by name. This abstraction makes them easier to manage across different environments and platforms.
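For comparison, here is a sketch of the two syntaxes side by side (the host path, volume name, and image name are illustrative):
# Bind mount: you pick and manage an explicit host path
docker run -v /home/user/datasets:/app/data my-image
# Named volume: Docker manages where the data physically lives
docker run -v my-ml-data:/app/data my-image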
You can explicitly create a volume using the Docker command line:
docker volume create my-ml-data
This command creates a new volume named my-ml-data. Docker handles where on the host filesystem this volume physically resides. You typically don't need to know the exact host path, just the volume's name.
You can list existing volumes:
docker volume ls
Output:
DRIVER    VOLUME NAME
local     my-ml-data
To get more details about a specific volume, including its mount point on the host (though you rarely interact with this directly):
docker volume inspect my-ml-data
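The output is a JSON description of the volume. The exact values vary by machine and Docker version, but on a Linux host it looks roughly like this:
[
    {
        "CreatedAt": "2023-10-27T10:15:00Z",
        "Driver": "local",
        "Labels": null,
        "Mountpoint": "/var/lib/docker/volumes/my-ml-data/_data",
        "Name": "my-ml-data",
        "Options": null,
        "Scope": "local"
    }
]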
And to remove a volume when you no longer need its data (use with caution!):
docker volume rm my-ml-data
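Relatedly, if you want to clean up every volume that is no longer referenced by any container, Docker offers a bulk command (it prompts for confirmation before deleting anything):
# Removes all volumes not currently used by at least one container
docker volume prune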
Importantly, you don't always need to create volumes explicitly beforehand. If you specify a named volume when running a container and the volume doesn't exist, Docker will create it for you automatically.
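As a quick sketch of this behavior (the volume name scratch-data is hypothetical, and alpine is just a convenient small image):
# 'scratch-data' does not exist yet; Docker creates it on first use
docker run --rm -v scratch-data:/work alpine sh -c "echo hello > /work/hello.txt"
# The new volume now shows up in the list
docker volume ls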
To make the data within a volume accessible to a container, you mount it using the -v or --mount flag with the docker run command. The most common syntax using -v specifies the volume name and the path inside the container where it should be mounted:
# Syntax: -v <volume-name>:<path-in-container>
docker run -d --name my-training-app -v my-ml-data:/app/data my-image
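The --mount flag expresses the same mount with explicit key-value pairs; a sketch of the equivalent command:
# Equivalent mount using the more verbose --mount syntax
docker run -d --name my-training-app \
    --mount type=volume,source=my-ml-data,target=/app/data \
    my-image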
In both commands:
- my-ml-data is the name of the Docker volume (Docker will create it if it doesn't exist).
- /app/data is the absolute path inside the container where the volume's contents will appear. Any data the application writes to /app/data will be stored in the my-ml-data volume on the host.

If you stop and remove the my-training-app container:
container:
docker stop my-training-app
docker rm my-training-app
The my-ml-data volume and all the data stored within it remain untouched. You can then start a new container and mount the same volume to access the persisted data.
# Start a new container, maybe for inference, mounting the same volume
docker run --name my-inference-app -v my-ml-data:/app/input-models my-other-image
Now, the second container can read the models or data saved by the first container from its /app/input-models directory.
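If the inference container should only consume the artifacts, you can append :ro to the -v specification to mount the volume read-only, so the container cannot accidentally modify the stored models:
# Read-only mount: writes to /app/input-models inside the container will fail
docker run --name my-inference-app -v my-ml-data:/app/input-models:ro my-other-image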
Using volumes offers several advantages, particularly for ML tasks. Chief among them is durability: trained model artifacts (.h5, .pkl, SavedModel formats), checkpoints, logs, and evaluation metrics stored in a volume survive container restarts, removals, and updates.

This diagram shows how Docker Volumes exist within a Docker-managed area on the host filesystem. Containers, like the 'Training Container' and 'Inference Container', mount these volumes to specific paths inside them (e.g., /data_in_cont, /model_in_cont), allowing persistent storage and data sharing independent of the container lifecycles.
Volumes are well-suited for managing various types of persistent data in ML projects, such as input datasets, trained models and checkpoints, and training logs or evaluation metrics.
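For instance, a training run might keep its dataset and its outputs in separate volumes; the volume and image names below (training-data, model-artifacts, my-training-image) are illustrative:
# Read-only dataset volume plus a writable volume for model outputs
docker run --rm \
    -v training-data:/app/data:ro \
    -v model-artifacts:/app/models \
    my-training-image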
Let's illustrate with a simple example. Imagine a Python script (save_model.py) inside a container that simulates saving a model artifact.

save_model.py:
import os
import time
import datetime

# Define the output directory INSIDE the container
output_dir = "/app/output"

# Create a unique filename based on time
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
model_filename = os.path.join(output_dir, f"model_{timestamp}.txt")

# Ensure output directory exists (best practice inside container)
os.makedirs(output_dir, exist_ok=True)

print("Simulating model training...")
time.sleep(1)  # Simulate work

print(f"Saving simulated model artifact to: {model_filename}")

# Write some content to the file
with open(model_filename, "w") as f:
    f.write("This is a simulated model artifact.\n")
    f.write(f"Saved at: {datetime.datetime.now()}\n")

print("Model artifact saved.")
Dockerfile:
FROM python:3.9-slim
WORKDIR /app
COPY save_model.py .
# Default command to run the script
CMD ["python", "save_model.py"]
Now, let's build the image and run it, using a volume named model-storage to persist the output.
# 1. Build the Docker image
docker build -t simple-saver .
# 2. Create a named volume (optional, docker run can create it too)
# docker volume create model-storage
# 3. Run the container, mounting the volume to /app/output
# The --rm flag automatically removes the container when it exits.
echo "Running container to save model..."
docker run --rm \
    -v model-storage:/app/output \
    --name saver-instance \
    simple-saver
# The container runs, prints messages, saves the file to /app/output, then exits.
# The container is removed, but the data persists in the 'model-storage' volume.
# 4. Verify the data exists by inspecting the volume from another container
echo "Checking volume contents..."
docker run --rm \
    -v model-storage:/data \
    python:3.9-slim ls -l /data
You should see output similar to this (the exact filename will differ):
Running container to save model...
Simulating model training...
Saving simulated model artifact to: /app/output/model_20231027_103045.txt
Model artifact saved.
Checking volume contents...
total 4
-rw-r--r-- 1 root root 81 Oct 27 10:30 model_20231027_103045.txt
This demonstrates that even though the saver-instance container was removed, the model file it created inside /app/output was persisted in the model-storage volume and could be accessed later (in this case, by a temporary python:3.9-slim container mounting the same volume).
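If you later want a copy of the volume's contents on the host, for example to archive a trained model, one common pattern is to mount both the volume and a host directory into a throwaway container and create a tar archive there; a sketch, assuming a Linux or macOS shell:
# Archive the volume's contents into ./model-storage.tar.gz on the host
docker run --rm \
    -v model-storage:/data:ro \
    -v "$(pwd)":/backup \
    python:3.9-slim \
    tar czf /backup/model-storage.tar.gz -C /data .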
In summary, Docker volumes provide a robust and manageable way to handle persistent data requirements in containerized ML applications. They allow you to separate your data concerns from your application code, ensuring that datasets, models, and logs endure beyond the lifespan of individual containers, which is essential for effective ML development and deployment workflows.