Transitioning from building Docker images and managing data, we now focus on adapting your machine learning training code to run effectively within those containers. While you could simply copy an existing script into an image, treating the container environment like any other server often leads to brittle setups. Scripts hardcoding file paths or relying on specific user configurations will break when moved into the isolated, standardized environment of a container.
To successfully containerize training, your scripts need to be designed or refactored with the container lifecycle and execution context in mind. This involves making them more flexible, configurable, and aware of how they receive inputs and produce outputs in a containerized setting.
Think of your containerized training script as a self-contained executable unit. It receives instructions and data, performs its task, and outputs results. To achieve this reliably, consider these principles:
- Accept configuration externally: Pass inputs such as data paths, output locations, and hyperparameters via command-line arguments or environment variables rather than hardcoding them in the script.
- Log to standard streams: Write all output to standard output (stdout) and standard error (stderr). This allows Docker's logging drivers to capture the output easily, making monitoring and debugging straightforward without requiring the script to manage log file locations explicitly.
- Treat each run as stateless: Each docker run command typically starts a fresh container instance. Unless you are specifically designing for incremental training using persistent volumes, your script should not rely on state left over from previous runs within the same container image. It should produce the same result given the same inputs and configuration each time it runs.

Let's look at practical ways to implement these principles, focusing on Python scripts, a common choice for ML training.
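One concrete step toward run-to-run reproducibility is pinning random seeds from configuration. The sketch below uses only the standard library; the RANDOM_SEED variable name and the default of 42 are illustrative choices, not a fixed convention:

```python
import os
import random

# Read the seed from an environment variable so the same configuration
# reproduces the same run; 42 is an arbitrary illustrative default.
seed = int(os.environ.get("RANDOM_SEED", 42))
random.seed(seed)

# With the seed fixed, "random" choices repeat across container runs.
sample = [random.randint(0, 100) for _ in range(3)]
print(sample)
```

Real training code would also seed the ML framework's own generators (NumPy, PyTorch, etc.) in the same place.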
The standard way to pass parameters to a script in many environments, including Docker containers, is through command-line arguments. Python's argparse module is excellent for this.
Consider a script that needs an input data path, an output model path, and a learning rate.
# training_script.py
import argparse
import os

import pandas as pd
# Assume scikit-learn or other ML library is installed in the image
# from sklearn.linear_model import LogisticRegression

def train_model(data_path, model_path, learning_rate):
    print(f"Loading data from: {data_path}")
    # Dummy data loading
    # df = pd.read_csv(data_path)

    print(f"Training model with learning rate: {learning_rate}")
    # Dummy model training
    # model = LogisticRegression(C=1.0/learning_rate)  # Example usage
    # model.fit(df[['feature1']], df['target'])

    print(f"Saving model to: {model_path}")
    # Ensure the output directory exists before writing
    output_dir = os.path.dirname(model_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

    # Dummy model saving
    with open(model_path, 'w') as f:
        f.write(f"dummy model trained with lr={learning_rate}")
    print("Training complete.")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train a simple ML model.")
    parser.add_argument('--data-path',
                        type=str,
                        required=True,
                        help='Path to the input training data file (e.g., /data/train.csv)')
    parser.add_argument('--model-path',
                        type=str,
                        required=True,
                        help='Path to save the trained model artifact (e.g., /output/model.pkl)')
    parser.add_argument('--lr',
                        type=float,
                        default=0.01,
                        help='Learning rate for training.')

    args = parser.parse_args()
    train_model(args.data_path, args.model_path, args.lr)
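Before baking the script into an image, you can verify the parsing logic by handing parse_args an explicit argument list instead of relying on sys.argv. The paths below are illustrative:

```python
import argparse

# A parser mirroring the script's interface, exercised with an explicit
# argument list rather than sys.argv.
parser = argparse.ArgumentParser(description="Train a simple ML model.")
parser.add_argument('--data-path', type=str, required=True)
parser.add_argument('--model-path', type=str, required=True)
parser.add_argument('--lr', type=float, default=0.01)

args = parser.parse_args([
    '--data-path', '/data/train.csv',
    '--model-path', '/output/model.pkl',
])

print(args.data_path)  # /data/train.csv
print(args.lr)         # 0.01 (the default, since --lr was omitted)
```

Omitting a required argument in the list makes parse_args exit with a usage error, which is exactly the feedback a user gets from a misconfigured docker run command.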
When building your Docker image, you would typically set the ENTRYPOINT or CMD to execute this script. For example, in your Dockerfile:
# ... (Base image, dependencies installation) ...
WORKDIR /app
COPY training_script.py .
# Option 1: Using CMD (allows overriding the command easily)
# CMD ["python", "training_script.py"]
# User would supply arguments like: docker run my-image --data-path /data/input.csv --model-path /output/final_model.pkl --lr 0.005
# Option 2: Using ENTRYPOINT (makes the container behave like the script executable)
ENTRYPOINT ["python", "training_script.py"]
# User supplies arguments directly: docker run my-image --data-path /data/input.csv --model-path /output/final_model.pkl --lr 0.005
By using argparse, the script clearly defines its required inputs and how to provide them. Running the container then involves passing these arguments after the image name, mapping host directories or volumes to the expected container paths (like /data and /output).
Environment variables offer another way to pass configuration, often suitable for settings like API keys, database connection strings, or flags that might be sensitive or differ across deployment environments (dev, staging, prod). Python's os module can access them.
# training_script_env.py
import os
import argparse

def load_config():
    config = {}
    # Get paths from environment variables, fall back to defaults or raise error
    config['data_path'] = os.environ.get('TRAINING_DATA_PATH')
    config['model_path'] = os.environ.get('MODEL_OUTPUT_PATH')
    if not config['data_path'] or not config['model_path']:
        raise ValueError("Missing required environment variables: TRAINING_DATA_PATH, MODEL_OUTPUT_PATH")

    # Get hyperparameters from environment variables, with defaults
    config['learning_rate'] = float(os.environ.get('LEARNING_RATE', 0.01))
    config['epochs'] = int(os.environ.get('EPOCHS', 10))
    return config

def train_model(config):
    print(f"Loading data from: {config['data_path']}")
    print(f"Training model with learning rate: {config['learning_rate']} for {config['epochs']} epochs.")
    print(f"Saving model to: {config['model_path']}")
    # ... (rest of training logic) ...
    print("Training complete.")

if __name__ == "__main__":
    # You might still use argparse for things less likely to change via env vars,
    # or just rely purely on environment variables.
    # parser = argparse.ArgumentParser()
    # parser.add_argument('--some-other-arg', default='value')
    # args = parser.parse_args()

    configuration = load_config()
    train_model(configuration)
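The fallback behavior of os.environ.get used in load_config can be checked directly in a few lines. The variable names match the script; the values are illustrative:

```python
import os

# Simulate the container environment for the two required variables.
os.environ['TRAINING_DATA_PATH'] = '/data/train.csv'
os.environ['MODEL_OUTPUT_PATH'] = '/output/model.joblib'

print(os.environ.get('TRAINING_DATA_PATH'))  # /data/train.csv

# Ensure LEARNING_RATE is unset so the default is exercised.
os.environ.pop('LEARNING_RATE', None)
lr = float(os.environ.get('LEARNING_RATE', 0.01))
print(lr)  # 0.01
```

Note that environment variables are always strings when set, so numeric values must be converted with float() or int(), as load_config does.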
You would then run the container, passing the environment variables using the -e or --env flag:
docker run \
    -e TRAINING_DATA_PATH=/data/train.csv \
    -e MODEL_OUTPUT_PATH=/output/model.joblib \
    -e LEARNING_RATE=0.005 \
    -e EPOCHS=50 \
    -v /path/on/host/data:/data \
    -v /path/on/host/models:/output \
    my-training-image
# Assumes the image's CMD or ENTRYPOINT runs training_script_env.py
Choosing between arguments and environment variables:

- Command-line arguments make the configuration explicit and self-documenting: every parameter is visible directly in the docker run command.
- Environment variables suit sensitive values and settings that differ between deployment environments, and can be supplied in bulk from a file (--env-file).

Regardless of whether paths are passed via arguments or environment variables, the script needs to use them correctly.
- Build paths with os.path.join() to ensure cross-platform compatibility (though less critical inside a Linux container, it's good practice).
- Call os.makedirs(path, exist_ok=True) before writing files to ensure the target directory is present.

Configure Python's standard logging module (or just use print for simple cases) to output messages. Avoid configuring file handlers within the script unless absolutely necessary.
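A minimal sketch of these path practices, with a temporary directory standing in for a mounted volume such as /output (the directory layout is illustrative):

```python
import os
import tempfile

# Base directory is illustrative; in a containerized script this would be
# a mounted volume such as /output, supplied via argument or env var.
base = tempfile.mkdtemp()
model_path = os.path.join(base, "models", "run-001", "model.pkl")

# Create the parent directory tree before writing; safe to call repeatedly
# thanks to exist_ok=True.
os.makedirs(os.path.dirname(model_path), exist_ok=True)

with open(model_path, "w") as f:
    f.write("dummy model artifact")

print(os.path.exists(model_path))  # True
```

Without the makedirs call, the open() would fail with FileNotFoundError on the first run, since a fresh container (or fresh volume) starts without the nested directories.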
import logging
import sys

# Configure basic logging to stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    stream=sys.stdout)  # Explicitly direct to stdout

# Use the logger
logging.info("Starting training process...")
# ... perform training steps ...
logging.warning("Encountered a minor issue...")
logging.info("Training finished.")
Docker automatically captures stdout and stderr, making these logs accessible via docker logs <container_id>.
By structuring your training scripts according to these principles, you create code that is portable, configurable, and integrates smoothly with Docker's mechanisms for data management and execution. This forms the foundation for building reproducible and scalable ML training workflows using containers, which we will build upon in subsequent sections discussing configuration, execution, and GPU usage.
© 2025 ApX Machine Learning