Transitioning from building Docker images and managing data, we now focus on adapting your machine learning training code to run effectively within those containers. While you could simply copy an existing script into an image, treating the container environment like any other server often leads to brittle setups. Scripts hardcoding file paths or relying on specific user configurations will break when moved into the isolated, standardized environment of a container.
To successfully containerize training, your scripts need to be designed or refactored with the container lifecycle and execution context in mind. This involves making them more flexible, configurable, and aware of how they receive inputs and produce outputs in a containerized setting.
Think of your containerized training script as a self-contained executable unit. It receives instructions and data, performs its task, and outputs results. To achieve this reliably, consider these principles:
- Accept configuration externally: Pass inputs such as data paths, output locations, and hyperparameters via command-line arguments or environment variables rather than hardcoding them in the script.
- Log to standard streams: Write all output to standard output (stdout) and standard error (stderr). This allows Docker's logging drivers to capture the output easily, making monitoring and debugging straightforward without requiring the script to manage log file locations explicitly.
- Treat each run as stateless: Each docker run command typically starts a fresh container instance. Unless you are specifically designing for incremental training using persistent volumes, your script should not rely on state left over from previous runs within the same container image. It should produce the same result given the same inputs and configuration each time it runs.

Let's look at practical ways to implement these principles, focusing on Python scripts, a common choice for ML training.
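One concrete step toward run-to-run reproducibility is pinning random seeds from configuration. The sketch below uses only the standard library; the RANDOM_SEED variable name and the default of 42 are illustrative choices, not a fixed convention:

```python
import os
import random

# Read the seed from an environment variable so the same configuration
# reproduces the same run; 42 is an arbitrary illustrative default.
seed = int(os.environ.get("RANDOM_SEED", 42))
random.seed(seed)

# With the seed fixed, "random" choices repeat across container runs.
sample = [random.randint(0, 100) for _ in range(3)]
print(sample)
```

Real training code would also seed the ML framework's own generators (NumPy, PyTorch, etc.) in the same place.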
The standard way to pass parameters to a script in many environments, including Docker containers, is through command-line arguments. Python's argparse module is excellent for this.
Consider a script that needs an input data path, an output model path, and a learning rate.
# training_script.py
import argparse
import os

import pandas as pd
# Assume scikit-learn or other ML library is installed in the image
# from sklearn.linear_model import LogisticRegression

def train_model(data_path, model_path, learning_rate):
    print(f"Loading data from: {data_path}")
    # Dummy data loading
    # df = pd.read_csv(data_path)

    print(f"Training model with learning rate: {learning_rate}")
    # Dummy model training
    # model = LogisticRegression(C=1.0/learning_rate)  # Example usage
    # model.fit(df[['feature1']], df['target'])

    print(f"Saving model to: {model_path}")
    # Ensure the output directory exists before writing
    output_dir = os.path.dirname(model_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

    # Dummy model saving
    with open(model_path, 'w') as f:
        f.write(f"dummy model trained with lr={learning_rate}")
    print("Training complete.")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train a simple ML model.")
    parser.add_argument('--data-path',
                        type=str,
                        required=True,
                        help='Path to the input training data file (e.g., /data/train.csv)')
    parser.add_argument('--model-path',
                        type=str,
                        required=True,
                        help='Path to save the trained model artifact (e.g., /output/model.pkl)')
    parser.add_argument('--lr',
                        type=float,
                        default=0.01,
                        help='Learning rate for training.')

    args = parser.parse_args()
    train_model(args.data_path, args.model_path, args.lr)
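Before baking the script into an image, you can verify the parsing logic by handing parse_args an explicit argument list instead of relying on sys.argv. The paths below are illustrative:

```python
import argparse

# A parser mirroring the script's interface, exercised with an explicit
# argument list rather than sys.argv.
parser = argparse.ArgumentParser(description="Train a simple ML model.")
parser.add_argument('--data-path', type=str, required=True)
parser.add_argument('--model-path', type=str, required=True)
parser.add_argument('--lr', type=float, default=0.01)

args = parser.parse_args([
    '--data-path', '/data/train.csv',
    '--model-path', '/output/model.pkl',
])

print(args.data_path)  # /data/train.csv
print(args.lr)         # 0.01 (the default, since --lr was omitted)
```

Omitting a required argument in the list makes parse_args exit with a usage error, which is exactly the feedback a user gets from a misconfigured docker run command.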
When building your Docker image, you would typically set the ENTRYPOINT or CMD to execute this script. For example, in your Dockerfile:
# ... (Base image, dependencies installation) ...
WORKDIR /app
COPY training_script.py .
# Option 1: Using CMD (allows overriding the command easily)
# CMD ["python", "training_script.py"]
# User would supply arguments like: docker run my-image --data-path /data/input.csv --model-path /output/final_model.pkl --lr 0.005
# Option 2: Using ENTRYPOINT (makes the container behave like the script executable)
ENTRYPOINT ["python", "training_script.py"]
# User supplies arguments directly: docker run my-image --data-path /data/input.csv --model-path /output/final_model.pkl --lr 0.005
By using argparse, the script clearly defines its required inputs and how to provide them. Running the container then involves passing these arguments after the image name, mapping host directories or volumes to the expected container paths (like /data and /output).
Environment variables offer another way to pass configuration, often suitable for settings like API keys, database connection strings, or flags that might be sensitive or differ across deployment environments (dev, staging, prod). Python's os module can access them.
# training_script_env.py
import os
import argparse

def load_config():
    config = {}
    # Get paths from environment variables, fall back to defaults or raise error
    config['data_path'] = os.environ.get('TRAINING_DATA_PATH')
    config['model_path'] = os.environ.get('MODEL_OUTPUT_PATH')
    if not config['data_path'] or not config['model_path']:
        raise ValueError("Missing required environment variables: TRAINING_DATA_PATH, MODEL_OUTPUT_PATH")

    # Get hyperparameters from environment variables, with defaults
    config['learning_rate'] = float(os.environ.get('LEARNING_RATE', 0.01))
    config['epochs'] = int(os.environ.get('EPOCHS', 10))
    return config

def train_model(config):
    print(f"Loading data from: {config['data_path']}")
    print(f"Training model with learning rate: {config['learning_rate']} for {config['epochs']} epochs.")
    print(f"Saving model to: {config['model_path']}")
    # ... (rest of training logic) ...
    print("Training complete.")

if __name__ == "__main__":
    # You might still use argparse for things less likely to change via env vars,
    # or just rely purely on environment variables.
    # parser = argparse.ArgumentParser()
    # parser.add_argument('--some-other-arg', default='value')
    # args = parser.parse_args()

    configuration = load_config()
    train_model(configuration)
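The fallback behavior of os.environ.get used in load_config can be checked directly in a few lines. The variable names match the script; the values are illustrative:

```python
import os

# Simulate the container environment for the two required variables.
os.environ['TRAINING_DATA_PATH'] = '/data/train.csv'
os.environ['MODEL_OUTPUT_PATH'] = '/output/model.joblib'

print(os.environ.get('TRAINING_DATA_PATH'))  # /data/train.csv

# Ensure LEARNING_RATE is unset so the default is exercised.
os.environ.pop('LEARNING_RATE', None)
lr = float(os.environ.get('LEARNING_RATE', 0.01))
print(lr)  # 0.01
```

Note that environment variables are always strings when set, so numeric values must be converted with float() or int(), as load_config does.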
You would then run the container, passing the environment variables using the -e or --env flag:
docker run \
    -e TRAINING_DATA_PATH=/data/train.csv \
    -e MODEL_OUTPUT_PATH=/output/model.joblib \
    -e LEARNING_RATE=0.005 \
    -e EPOCHS=50 \
    -v /path/on/host/data:/data \
    -v /path/on/host/models:/output \
    my-training-image
# Assumes the image's CMD or ENTRYPOINT runs training_script_env.py
Choosing between arguments and environment variables:

- Command-line arguments make the configuration explicit and self-documenting: every parameter is visible directly in the docker run command.
- Environment variables suit sensitive values and settings that differ between deployment environments, and can be supplied in bulk from a file (--env-file).

Regardless of whether paths are passed via arguments or environment variables, the script needs to use them correctly.
- Build paths with os.path.join() to ensure cross-platform compatibility (though less critical inside a Linux container, it's good practice).
- Call os.makedirs(path, exist_ok=True) before writing files to ensure the target directory is present.

Configure Python's standard logging module (or just use print for simple cases) to output messages. Avoid configuring file handlers within the script unless absolutely necessary.
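A minimal sketch of these path practices, with a temporary directory standing in for a mounted volume such as /output (the directory layout is illustrative):

```python
import os
import tempfile

# Base directory is illustrative; in a containerized script this would be
# a mounted volume such as /output, supplied via argument or env var.
base = tempfile.mkdtemp()
model_path = os.path.join(base, "models", "run-001", "model.pkl")

# Create the parent directory tree before writing; safe to call repeatedly
# thanks to exist_ok=True.
os.makedirs(os.path.dirname(model_path), exist_ok=True)

with open(model_path, "w") as f:
    f.write("dummy model artifact")

print(os.path.exists(model_path))  # True
```

Without the makedirs call, the open() would fail with FileNotFoundError on the first run, since a fresh container (or fresh volume) starts without the nested directories.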
import logging
import sys

# Configure basic logging to stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    stream=sys.stdout)  # Explicitly direct to stdout

# Use the logger
logging.info("Starting training process...")
# ... perform training steps ...
logging.warning("Encountered a minor issue...")
logging.info("Training finished.")
Docker automatically captures stdout and stderr, making these logs accessible via docker logs <container_id>.
By structuring your training scripts according to these principles, you create code that is portable, configurable, and integrates smoothly with Docker's mechanisms for data management and execution. This forms the foundation for building reproducible and scalable ML training workflows using containers, which we will build upon in subsequent sections discussing configuration, execution, and GPU usage.
© 2025 ApX Machine Learning