As we established, Machine Learning workflows heavily rely on data. You need datasets to train your models, you might generate intermediate checkpoints during long training runs, and ultimately, you need to save the trained model artifacts. When working with containers, understanding how data is handled is fundamental, because the default behavior isn't suited for persistence.
Let's look at what happens when a container runs. Docker creates a writable layer on top of the read-only image layers. Think of the image as a blueprint and the container as a running instance built from that blueprint. Any changes the running application makes, such as creating files, modifying existing ones, or downloading data, are written to this specific container's writable layer.
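You can observe this writable layer directly with `docker diff`, which lists files added (`A`), changed (`C`), or deleted (`D`) relative to the image. A minimal sketch, assuming Docker is installed and the daemon is running (the container name `layer-demo` and file path are illustrative):

```shell
# Skip gracefully if Docker is not available in this environment
docker info >/dev/null 2>&1 || { echo "docker not available; skipping"; exit 0; }

# Run a container from a read-only image and write a file inside it
docker run --name layer-demo alpine sh -c 'echo "checkpoint" > /model.ckpt'

# docker diff shows what changed in this container's writable layer
# relative to the image (here: A /model.ckpt)
docker diff layer-demo

# Clean up the stopped container
docker rm layer-demo >/dev/null
```

The image layers themselves are never modified; every change the container makes is recorded only in that container's own writable layer.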
The significant point here is that this writable layer is ephemeral. It is tightly coupled to the life of that single container instance. When the container is stopped and removed (which happens frequently, for example, when updating an application or simply cleaning up resources), this writable layer, along with all the data it contains, is permanently deleted. Imagine downloading a large dataset or completing hours of model training, only to have the results vanish when the container is removed. This clearly won't work for most ML tasks.
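The data loss is easy to demonstrate. In this sketch (assuming Docker is available; the container name and file paths are illustrative), a file written inside one container simply does not exist for the next container started from the same image:

```shell
# Skip gracefully if Docker is not available in this environment
docker info >/dev/null 2>&1 || { echo "docker not available; skipping"; exit 0; }

# Write an artifact into the container's writable layer
docker run --name ephemeral-demo alpine sh -c 'echo "weights" > /model.bin'

# Removing the container deletes its writable layer, artifact included
docker rm ephemeral-demo >/dev/null

# A fresh container starts from the clean image layers: the file is absent
docker run --rm alpine ls /model.bin 2>/dev/null || echo "/model.bin is gone"
```

Nothing about the image changed between the two `docker run` commands; the artifact lived only in the first container's writable layer and was deleted along with it.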
We need ways to store data persistently, independent of the container's lifecycle. Specifically, we need mechanisms that allow:

- Data to survive after the container that created it is stopped and removed.
- Data to be shared between the host machine and a container, or between multiple containers.

Docker provides two primary mechanisms to achieve this, which form the core of managing data in containerized ML applications:

- Bind mounts, which map a directory on the host filesystem directly into the container.
- Volumes, which are storage areas created and managed by Docker itself.
The following diagram illustrates the relationship between the container's ephemeral layer and these persistent storage options.
Data within the Container Writable Layer is lost when the container is removed. Bind mounts link directly to the host filesystem, while volumes provide Docker-managed persistent storage.
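The two persistent options in the diagram correspond to flags on `docker run`. A brief sketch, assuming Docker is available (the volume name `ml-data` and the host path `/tmp/host-data` are illustrative choices, not required names):

```shell
# Skip gracefully if Docker is not available in this environment
docker info >/dev/null 2>&1 || { echo "docker not available; skipping"; exit 0; }

# 1) Named volume: Docker-managed storage that outlives any one container
docker volume create ml-data >/dev/null
docker run --rm -v ml-data:/data alpine sh -c 'echo "dataset" > /data/train.csv'
docker run --rm -v ml-data:/data alpine cat /data/train.csv   # data persists

# 2) Bind mount: a host directory mapped directly into the container
mkdir -p /tmp/host-data
docker run --rm -v /tmp/host-data:/data alpine sh -c 'echo "log" > /data/run.log'
cat /tmp/host-data/run.log   # the file is visible on the host

# Clean up the illustrative volume and directory
docker volume rm ml-data >/dev/null
rm -rf /tmp/host-data
```

Note that the same `-v` flag handles both cases: a bare name refers to a Docker-managed volume, while a path starting with `/` is treated as a bind mount to that host location.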
Understanding this distinction between the container's internal, ephemeral storage and externally mounted persistent storage is the first step towards effectively managing datasets, models, and other artifacts in your containerized ML projects. In the following sections, we will examine bind mounts and volumes in detail, discussing their use cases, advantages, and disadvantages for different ML scenarios.
© 2025 ApX Machine Learning