To achieve full traceability in your machine learning projects, it's not enough to version data with DVC and track experiments with MLflow separately. You need to connect them. Specifically, you must record which version of your data was used for which experiment run. MLflow provides flexible mechanisms to log this DVC-related information alongside your standard parameters and metrics.
Imagine training two models that produce different results. Was the difference due to a change in hyperparameters, the code, or the underlying data? Without explicitly linking the data version used in each MLflow run, answering this question becomes difficult and relies on manual record-keeping or guesswork. Logging DVC metadata directly into MLflow establishes a clear, automated link, ensuring you can always trace an experiment's results back to the exact data snapshot that produced them. This enhances reproducibility and simplifies debugging and comparison.
Several pieces of DVC metadata are valuable to capture within an MLflow run:
- The data path: the relative path of the tracked dataset (e.g., data/prepared or data/processed_features). This confirms which dataset artifact was intended for use.
- The data version hash: the hash (typically MD5) that DVC records in the .dvc file, or computes for directories. This is arguably the most important piece of information for ensuring data reproducibility, as it uniquely identifies the state of the data files.

You can integrate this information into your MLflow runs using several methods. Choose the one that best fits your workflow complexity and automation needs.
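As an aside, for a single tracked file the recorded hash is normally just the MD5 of the file's contents, so you can cross-check a .dvc entry with a short standard-library sketch (the file_md5 helper name here is illustrative, not a DVC API; tracked directories use a separate .dir hash that this does not reproduce):

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file's contents, read in chunks to handle large files.

    For a single tracked file this normally matches the value DVC records
    in the corresponding .dvc file (directories use a separate .dir hash).
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```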
The most straightforward approach is to add calls to mlflow.log_param() or mlflow.set_tag() within your training script, passing the relevant DVC information explicitly. Tags are often more suitable for identifiers like paths or hashes, while parameters are typically reserved for values that influence the model's behavior (like hyperparameters).
For example, if your script accesses data from a DVC-tracked directory data/prepared, you could log its path as a tag:
import mlflow

# Assume 'data_path' points to your DVC-tracked data
data_path = "data/prepared"

with mlflow.start_run():
    # Log the path as a tag for informational purposes
    mlflow.set_tag("data_path", data_path)

    # Log other parameters
    mlflow.log_param("learning_rate", 0.01)

    # ... rest of your training code using data from data_path ...
    print(f"MLflow Run ID: {mlflow.active_run().info.run_id}")
    print(f"Logged data_path tag: {data_path}")
While simple, this manual approach requires you to know the hash beforehand or to extract it by hand by inspecting the .dvc file, which isn't ideal for automated workflows where data versions change frequently.
A more automated and robust method involves using the dvc.api
module. This allows your Python script to programmatically query DVC for information about tracked files or directories without needing to shell out to the command line.
First, ensure DVC is installed. The base dvc package already includes the dvc.api module; remote-specific extras are only needed if your environment interacts with a particular remote:

# Install base DVC if not already present
pip install dvc

# Optional: remote-specific extras if needed
# pip install "dvc[s3]"  # Example for S3 remote interaction
While dvc.api provides functions like get_url (to get the cache path or remote URL) and read (to read file content), getting the specific version hash as recorded in the .dvc file programmatically requires a bit more work. A common practical approach is to read and parse the .dvc file directly, as these are small text files (YAML by default; the code below also tolerates JSON).

Let's illustrate by reading the hash from data/features.csv.dvc, assuming it's a single-output file tracked by DVC:
import mlflow
import os
import yaml  # To parse the .dvc file (DVC writes YAML)
import json  # Fallback in case your .dvc files are JSON-formatted

# Path to the .dvc file representing your dataset version
dvc_file_path = "data/features.csv.dvc"
# Actual data path used by the script
data_path = "data/features.csv"

data_version_hash = None

# Check if the .dvc file exists
if os.path.exists(dvc_file_path):
    try:
        with open(dvc_file_path, "r") as f:
            # Attempt to load as YAML, falling back to JSON if needed
            try:
                dvc_content = yaml.safe_load(f)
            except yaml.YAMLError:
                f.seek(0)  # Rewind file pointer
                try:
                    dvc_content = json.load(f)
                except json.JSONDecodeError:
                    print(f"Warning: Could not parse {dvc_file_path} as YAML or JSON.")
                    dvc_content = {}  # Assign empty dict to avoid errors below

        # Extract the hash: structure depends on DVC version and config
        if isinstance(dvc_content, dict) and isinstance(dvc_content.get("outs"), list) and dvc_content["outs"]:
            output_info = dvc_content["outs"][0]
            if isinstance(output_info, dict):
                # DVC >= 3.0 stores the algorithm name under 'hash' (e.g. 'md5')
                # and the digest under a key of that name; older versions store
                # the digest directly under 'md5'.
                hash_name = output_info.get("hash", "md5")
                data_version_hash = output_info.get(hash_name)

        if not data_version_hash:
            print(f"Warning: Could not find a hash value in {dvc_file_path}")
    except Exception as e:
        print(f"Warning: Error reading DVC file {dvc_file_path}: {e}")

# Start MLflow run and log the information
with mlflow.start_run():
    mlflow.set_tag("data_path", data_path)  # Log the data path used
    if data_version_hash:
        # Log the hash as a tag - it's an identifier
        mlflow.set_tag("dvc_data_version_hash", data_version_hash)
    else:
        mlflow.set_tag("dvc_data_version_status", "Hash unavailable or not found")

    # Log other parameters as usual
    mlflow.log_param("learning_rate", 0.01)

    # ... rest of your training code using data_path ...
    run_id = mlflow.active_run().info.run_id
    print(f"MLflow Run ID: {run_id}")
    if data_version_hash:
        print(f"Logged DVC data version hash: {data_version_hash}")
    else:
        print("DVC data version hash was not logged.")
Note: The structure of .dvc files can evolve. The parsing logic above covers common YAML/JSON formats but might need adjustments for different DVC versions or configurations (e.g., multi-output files, or different hash types like etag). Always inspect your .dvc files to confirm the structure. Using tags (mlflow.set_tag) is generally preferred for identifiers like hashes and paths.
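If you parse .dvc files in more than one script, the extraction step can be factored into a small helper that works on the already-parsed content. The function name extract_dvc_hash is illustrative, and it assumes the two layouts DVC has used: newer versions name the hash algorithm under 'hash' with the digest stored under a key of that name, while older versions put the digest directly under 'md5'.

```python
def extract_dvc_hash(dvc_content, out_index=0):
    """Pull the version hash of one output from parsed .dvc file content.

    Handles two layouts: newer DVC stores the algorithm name under 'hash'
    (e.g. 'md5') with the digest under a key of that name, while older
    versions store the digest directly under 'md5'. Returns None when the
    structure is not recognized.
    """
    if not isinstance(dvc_content, dict):
        return None
    outs = dvc_content.get("outs")
    if not isinstance(outs, list) or out_index >= len(outs):
        return None
    out = outs[out_index]
    if not isinstance(out, dict):
        return None
    hash_name = out.get("hash", "md5")  # algorithm name, defaulting to md5
    value = out.get(hash_name)
    return value if isinstance(value, str) else None
```

Because it takes a plain dictionary, the helper is easy to unit-test without touching the filesystem or a DVC repository.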
Another strategy involves passing the data path or even a descriptive version identifier (like a Git tag associated with the desired data version) to your script via configuration files (e.g., config.yaml
) or environment variables. Your script then simply reads this configuration value and logs it using mlflow.log_param
or mlflow.set_tag
. This approach decouples the specific version information from the core training logic, making the script more reusable.
# Example: Read data path and a version tag from environment variables
import mlflow
import os

# Read configuration from environment variables, providing defaults
data_path = os.getenv("INPUT_DATA_PATH", "data/features.csv")
data_version_tag = os.getenv("DATA_VERSION_TAG", "unknown")  # e.g., "v1.2-processed"

with mlflow.start_run():
    # Log the configuration used for this run
    mlflow.set_tag("configured_data_path", data_path)
    mlflow.set_tag("configured_data_version_tag", data_version_tag)

    # Log other parameters
    mlflow.log_param("batch_size", 64)

    # ... training code loads data from data_path ...
    print(f"Using data from: {data_path}")
    print(f"Assumed data version tag: {data_version_tag}")
    # ... rest of the training ...
You would then set these environment variables before executing the script:
# Set environment variables for the run
export INPUT_DATA_PATH="data/processed_features_v2"
export DATA_VERSION_TAG="release-2024-q1"
# Run the training script
python train.py
This method relies on the process running the script (e.g., a CI/CD pipeline, a DVC stage, or a manual execution) to provide the correct environment variables corresponding to the checked-out data version.
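That contract can itself be scripted. As a sketch, a small launcher (the launch_training helper below is hypothetical, not part of DVC or MLflow) resolves the data version and hands it to the training process through its environment, so the training script never hard-codes a version:

```python
import os
import subprocess
import sys


def launch_training(script, data_path, version_tag):
    """Run a training script with DVC-related environment variables set.

    The child process reads INPUT_DATA_PATH and DATA_VERSION_TAG via
    os.getenv, matching the environment-variable example above.
    """
    env = os.environ.copy()
    env["INPUT_DATA_PATH"] = data_path
    env["DATA_VERSION_TAG"] = version_tag
    return subprocess.run(
        [sys.executable, script], env=env, capture_output=True, text=True
    )
```

A CI/CD job or DVC stage would call launch_training("train.py", "data/processed_features_v2", "release-2024-q1") after checking out the matching data version.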
The ideal place to add this DVC metadata logging is early in your training or processing script, typically immediately after initializing the MLflow run (mlflow.start_run()
) and often as part of your data loading or parameter setup phase. This ensures the crucial link between the experiment run and the data version is captured before the main computational work begins.
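One way to make this habit consistent across scripts is a tiny helper invoked right after mlflow.start_run(). In this sketch the set_tag callable is injected (pass mlflow.set_tag in practice), which keeps the helper testable without a tracking server; the function name log_dvc_metadata is illustrative, and the tag names mirror the examples above:

```python
def log_dvc_metadata(set_tag, data_path, data_version_hash=None):
    """Record DVC provenance tags at the start of an MLflow run.

    set_tag is any callable(key, value); in a real run, pass mlflow.set_tag.
    Tag names follow the earlier examples in this section.
    """
    set_tag("data_path", data_path)
    if data_version_hash:
        set_tag("dvc_data_version_hash", data_version_hash)
    else:
        set_tag("dvc_data_version_status", "Hash unavailable or not found")
```

Inside `with mlflow.start_run():`, calling log_dvc_metadata(mlflow.set_tag, data_path, data_version_hash) before any training work guarantees the provenance link exists even if the run later fails.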
By consistently logging the path and, more importantly, the version hash or a meaningful tag of your DVC-tracked data within your MLflow runs, you create an explicit and verifiable link between your experiments and the exact data artifacts used. This significantly strengthens reproducibility, allowing you and your team to confidently revisit past experiments, understand all inputs (code, parameters, and data), and reliably reproduce results or debug discrepancies. This practice transforms your MLflow experiment tracking from solely focusing on model performance metrics to providing a comprehensive provenance record for your entire ML workflow.
© 2025 ApX Machine Learning