You've learned how to manage your data versions using DVC alongside your code in Git, and how to track your experiment details using MLflow. Now, let's bridge the gap. For true reproducibility, knowing which version of your data was used for a specific MLflow experiment run is essential. Imagine needing to debug a model that produced unexpected results weeks later, or wanting to precisely replicate the conditions of your best-performing run. Without a clear link between the MLflow run and the DVC-managed data version, this becomes a difficult, if not impossible, task.
The core challenge lies in associating an MLflow run, which exists in the MLflow tracking server or backend files, with a specific state of your data, which is defined by `.dvc` files tracked within your Git history. We need a systematic way to record this association when an experiment is executed.
Several approaches exist to establish this connection, ranging from simple manual steps to more automated methods integrated into your training scripts.
The most basic approach is to manually record information about the data version when you log an experiment with MLflow. You could:

- **Log the Git commit hash:** Ensure your `.dvc` files are committed, then find the current Git commit hash (`git rev-parse HEAD`) and log it as a parameter or tag in your MLflow run.
- **Log a Git tag:** If you use Git tags (e.g., `git tag v1.0-data`) to mark significant data versions, you can log this tag name to MLflow.
- **Log the data hash:** Find the hash in the `.dvc` file (the `md5` or `etag` field) corresponding to your dataset and log it.

While simple, manual logging is prone to human error. Forgetting to log the information, logging the wrong hash, or having uncommitted changes in your working directory can easily break the connection and undermine reproducibility. Therefore, automated methods are generally preferred for reliable workflows.
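The manual steps above amount to a handful of Git commands whose output you copy into MLflow by hand. The sketch below creates a throwaway repository so the commands run anywhere; in practice you would run them inside your own project repository:

```shell
# Manual workflow sketch: gather the identifiers you would log by hand
# into MLflow. A throwaway repo stands in for your project repository.
cd "$(mktemp -d)" && git init -q
git -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m "track data version"

git rev-parse HEAD        # the commit hash to record in your MLflow run
git status --porcelain    # empty output: no uncommitted changes to break the link
```

An empty `git status --porcelain` output is the important precondition: if anything is listed, the commit hash no longer describes what is actually on disk.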
A more robust approach is to automatically capture the Git commit hash of your repository at the time the training script is executed and log it to MLflow. Since your `.dvc` files are tracked by Git, the commit hash serves as a pointer to the specific versions of these files used, indirectly linking to the data version.
You can achieve this within your Python training script using libraries like `gitpython` or by calling the Git command directly:
```python
import mlflow
import subprocess

# Function to get the current Git commit hash
def get_git_commit_hash():
    try:
        # Ensure we are in a git repository
        if subprocess.call(['git', 'rev-parse', '--is-inside-work-tree'],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) != 0:
            print("Not inside a Git repository. Cannot log commit hash.")
            return None

        # Check for uncommitted changes
        status_output = subprocess.check_output(['git', 'status', '--porcelain']).decode().strip()
        if status_output:
            print("Warning: Uncommitted changes detected. Logging commit hash of HEAD.")
            # Optionally, you could choose to fail or log a tag indicating a dirty state

        commit_hash = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()
        return commit_hash
    except Exception as e:
        print(f"Could not get Git commit hash: {e}")
        return None

# Example MLflow run
with mlflow.start_run() as run:
    print(f"MLflow Run ID: {run.info.run_id}")

    # Log parameters, metrics, etc.
    mlflow.log_param("learning_rate", 0.01)
    # ... training code ...
    mlflow.log_metric("accuracy", 0.95)

    # Automatically log the Git commit hash
    git_commit = get_git_commit_hash()
    if git_commit:
        mlflow.set_tag("git_commit", git_commit)
        # Or log as a parameter: mlflow.log_param("git_commit", git_commit)
        print(f"Logged Git commit: {git_commit}")

    # Log artifacts such as models
    # mlflow.sklearn.log_model(...)
```
In this example, `get_git_commit_hash` retrieves the current commit hash. We use `mlflow.set_tag` to store it as a tag associated with the run (tags are typically used for metadata, while parameters are usually hyperparameters). This automatically links the experiment run to the state of your codebase and your DVC pointers (`.dvc` files) at the time the run started. We also added a check for uncommitted changes, as running experiments from a "dirty" Git state can complicate reproducibility.
While the Git commit hash provides an indirect link, you might want to log information more directly related to the data itself. This can be useful if multiple datasets are involved or if you want a more explicit pointer.
**`.dvc` file hash:** You can extract the data hash directly from the relevant `.dvc` file. This hash uniquely identifies the content of the data tracked by that specific file version. You might need to parse the `.dvc` file (which is typically YAML) or use DVC commands.

```python
import mlflow
import os
import yaml  # Requires PyYAML: pip install pyyaml

# Function to get the hash from a .dvc file
def get_dvc_file_hash(dvc_file_path):
    try:
        if not os.path.exists(dvc_file_path):
            print(f"DVC file not found: {dvc_file_path}")
            return None

        with open(dvc_file_path, 'r') as f:
            dvc_content = yaml.safe_load(f)

        # The DVC hash is usually under 'outs' -> first item -> 'md5' or 'hash'
        if 'outs' in dvc_content and len(dvc_content['outs']) > 0:
            # Check common hash keys ('md5', 'hash', 'etag')
            hash_key = next((k for k in ['md5', 'hash', 'etag'] if k in dvc_content['outs'][0]), None)
            if hash_key:
                return dvc_content['outs'][0][hash_key]

        print(f"Could not extract hash from {dvc_file_path}")
        return None
    except Exception as e:
        print(f"Error reading DVC file {dvc_file_path}: {e}")
        return None

# --- Inside your MLflow run context ---
with mlflow.start_run() as run:
    # ... other logging ...

    # Log the hash of the primary dataset's .dvc file
    data_dvc_file = "data/processed_data.dvc"
    data_hash = get_dvc_file_hash(data_dvc_file)
    if data_hash:
        mlflow.log_param("data_version_hash", data_hash)
        print(f"Logged data hash from {data_dvc_file}: {data_hash}")

    # Log the Git commit as well for the code version
    git_commit = get_git_commit_hash()  # Assumes the function from the previous example
    if git_commit:
        mlflow.set_tag("git_commit", git_commit)
```
This approach logs the specific content hash of the data output defined in `data/processed_data.dvc`. Logging both the Git commit (for code and `.dvc` file versions) and the specific data hash provides complementary information.
The following diagram illustrates how these components connect:

This diagram shows how an MLflow run logs both the Git commit hash (linking to the code and `.dvc` file state) and, optionally, the specific data hash derived from the `.dvc` file, connecting the experiment directly to the versioned data stored via DVC.
By consistently logging the Git commit hash and potentially specific data hashes, you create a traceable link between your experiment results in MLflow and the exact state of your code and data managed by Git and DVC.
To reproduce a specific experiment run:

1. Look up the run's logged `git_commit` tag and any specific `data_version_hash` parameters.
2. Run `git checkout <commit_hash>` to restore the repository state (code and `.dvc` files) corresponding to the experiment.
3. Run `dvc pull` to download the data files associated with the `.dvc` files present in that commit. If you logged a specific `data_version_hash`, you can double-check that the hash in the pulled `.dvc` file matches the logged value.

Establishing this connection is a fundamental step towards building truly reproducible machine learning workflows. It ensures that you can always trace back your results to the precise code and data that produced them. In the following sections, we will explore how to formalize these steps further using DVC pipelines.
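The double-check against the pulled `.dvc` file can itself be scripted. The sketch below reads the hash with a naive line scan using only the standard library (a simplified stand-in for the PyYAML-based parsing shown earlier); the file contents and hash values are illustrative:

```python
import os
import tempfile

def read_dvc_hash(dvc_file_path):
    """Naive line scan for the first md5/etag field in a .dvc file.

    Simplified stand-in for proper YAML parsing; assumes the common
    single-output .dvc layout.
    """
    with open(dvc_file_path) as f:
        for line in f:
            key, _, value = line.strip().lstrip("- ").partition(":")
            if key in ("md5", "etag") and value.strip():
                return value.strip()
    return None

def verify_data_version(dvc_file_path, logged_hash):
    """Compare the hash in the pulled .dvc file with the logged value."""
    return read_dvc_hash(dvc_file_path) == logged_hash

# Example with an illustrative .dvc file written to a temporary location
sample = "outs:\n- md5: d41d8cd98f00b204e9800998ecf8427e\n  path: processed_data\n"
with tempfile.NamedTemporaryFile("w", suffix=".dvc", delete=False) as f:
    f.write(sample)
    dvc_path = f.name

print(verify_data_version(dvc_path, "d41d8cd98f00b204e9800998ecf8427e"))
os.remove(dvc_path)
```

A mismatch at this point usually means the checkout and the logged run have drifted apart, and is worth investigating before trusting any reproduction.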
© 2025 ApX Machine Learning