Now that you understand how to define automated workflows using DVC pipelines (`dvc.yaml` or `dvc run`), let's explore how to embed MLflow tracking directly within these pipeline stages. This integration provides a powerful way to automatically capture detailed experiment metadata every time your DVC pipeline executes, linking the pipeline's structural reproducibility with MLflow's rich tracking capabilities.
The core idea is straightforward: the scripts or commands executed within your DVC pipeline stages will include standard MLflow API calls for logging. DVC handles the orchestration, ensuring stages run in the correct order with the right dependencies, while the scripts themselves report their parameters, metrics, and artifacts to your configured MLflow tracking server.
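The resulting day-to-day loop is correspondingly simple. A sketch of the typical commands (file names here are illustrative; the concrete example follows below):

```bash
# Edit code, data, or parameters, then let DVC rerun only the affected stages.
# Each rerun stage's script logs its parameters, metrics, and artifacts to MLflow.
dvc repro

# Commit the updated pipeline state so the MLflow runs it produced
# can be traced back to this exact repository snapshot.
git add dvc.yaml dvc.lock
git commit -m "Retrain model via DVC pipeline"
```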
Consider a typical machine learning pipeline managed by DVC, perhaps defined in a `dvc.yaml` file. You might have stages for data processing, training, and evaluation. Let's focus on the training stage.
Previously, you learned how to define a stage using `dvc run` or directly in `dvc.yaml`. This stage typically executes a script, like `train.py`. To integrate MLflow, you modify this script (`train.py`) to include MLflow logging calls.
Here's a simplified example of what a `train.py` script, designed to be run as part of a DVC pipeline, might look like:
```python
# train.py
import argparse

import joblib
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Set up argument parsing for parameters coming from dvc.yaml
parser = argparse.ArgumentParser()
parser.add_argument('--n_estimators', type=int, default=100)
parser.add_argument('--max_depth', type=int, default=10)
parser.add_argument('--input_data', type=str, required=True)
parser.add_argument('--output_model', type=str, required=True)
args = parser.parse_args()

# Load data (dependency managed by DVC)
data = pd.read_csv(args.input_data)
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Start an MLflow run
# MLflow automatically detects the Git commit if run within a repo
with mlflow.start_run():
    # Log parameters received from the DVC stage definition
    mlflow.log_param("n_estimators", args.n_estimators)
    mlflow.log_param("max_depth", args.max_depth)

    # Log information about the input data (tracked by DVC)
    mlflow.log_param("input_data_path", args.input_data)

    # Train the model
    model = RandomForestClassifier(n_estimators=args.n_estimators,
                                   max_depth=args.max_depth,
                                   random_state=42)
    model.fit(X_train, y_train)

    # Evaluate the model
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    print(f"Accuracy: {accuracy:.4f}")

    # Save the model (output managed by DVC)
    joblib.dump(model, args.output_model)
    print(f"Model saved to {args.output_model}")

    # Log the model artifact to MLflow as well
    # This provides richer model management via the MLflow UI/Registry
    mlflow.sklearn.log_model(model, "random-forest-model")

    # Log other relevant artifacts if needed,
    # e.g., feature importance plots, confusion matrix
    # mlflow.log_artifact("feature_importance.png")

print("Training script finished.")
```
Now, let's see how this script is incorporated into a DVC pipeline stage within `dvc.yaml`:
```yaml
# dvc.yaml
stages:
  prepare:
    # ... data preparation stage definition ...
    cmd: python src/prepare.py --input data/raw/data.csv --output data/prepared/features.csv
    deps:
      - data/raw/data.csv
      - src/prepare.py
    outs:
      - data/prepared/features.csv
  train:
    # This stage runs our script with MLflow logging
    cmd: >-
      python src/train.py
      --input_data data/prepared/features.csv
      --output_model models/rf_model.joblib
      --n_estimators 150
      --max_depth 15
    deps:
      - data/prepared/features.csv
      - src/train.py
    params: # DVC tracks these parameters
      - n_estimators
      - max_depth
    outs: # DVC tracks this output file
      - models/rf_model.joblib
    metrics: # DVC can also track primary metrics
      - metrics.json: # assuming train.py writes metrics here too (optional)
          cache: false
```
In this setup:

- The `dvc.yaml` file defines the `train` stage.
- `cmd` specifies how to execute the `train.py` script, passing parameters like `n_estimators` and `max_depth` as command-line arguments. These could also be defined in a `params.yaml` file and referenced here (see the sketch after this list).
- `deps` lists the dependencies: the prepared data file and the training script itself. If either changes, `dvc repro` knows to rerun this stage.
- `params` explicitly tells DVC to track specific parameters (e.g., `n_estimators` and `max_depth`, potentially defined in `params.yaml`). Changes in these parameters also trigger a rerun.
- `outs` lists the primary output file (`models/rf_model.joblib`) whose hash DVC will track.
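As a refinement, DVC's templating syntax (`${...}`), which reads values from `params.yaml` by default, lets each hyperparameter value live in exactly one place instead of being hardcoded in `cmd`. A minimal sketch of that variant (the files shown are illustrative, not part of the example above):

```yaml
# params.yaml
n_estimators: 150
max_depth: 15
```

```yaml
# dvc.yaml (train stage, templated variant)
train:
  cmd: >-
    python src/train.py
    --input_data data/prepared/features.csv
    --output_model models/rf_model.joblib
    --n_estimators ${n_estimators}
    --max_depth ${max_depth}
  params:
    - n_estimators
    - max_depth
```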
When you run `dvc repro train` (or just `dvc repro`):

1. DVC executes the `cmd` for the `train` stage.
2. The `train.py` script runs and executes the `mlflow.start_run()`, `mlflow.log_param()`, `mlflow.log_metric()`, and `mlflow.sklearn.log_model()` calls.
3. MLflow records this information to your configured tracking backend (a local `mlruns` directory or a remote server).
4. DVC updates the `dvc.lock` file with the hash of the output `models/rf_model.joblib`.
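One aside on the optional `metrics` entry in the stage above: it assumes `train.py` also writes a small `metrics.json` file, which lets DVC compare metrics across versions with `dvc metrics show` and `dvc metrics diff` alongside MLflow's tracking. A minimal, illustrative addition to the end of the training script:

```python
import json

# Write the primary metric where the dvc.yaml `metrics` entry expects it,
# so `dvc metrics show` and `dvc metrics diff` can read it.
with open("metrics.json", "w") as f:
    json.dump({"accuracy": accuracy}, f)
```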
This combination gives you a powerful connection:
- **Reproducibility (DVC):** `dvc repro` uses the correct versions of code (`src/train.py`) and data (`data/prepared/features.csv`) as defined by your Git commit and DVC tracking. It manages the execution flow and caches outputs.
- **Tracking (MLflow):** During `dvc repro`, MLflow logs the specific parameters used (even those passed via `cmd`), the resulting metrics, and associated artifacts like the trained model.

When you examine your experiment history in the MLflow UI, you will see runs corresponding to each execution of the DVC pipeline stage. Because MLflow often automatically logs the Git commit hash associated with a run, you can directly link an MLflow run back to the specific state of your DVC-managed repository (code, `dvc.yaml`, `dvc.lock`, `params.yaml`) that produced it.
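Because of that automatic tag, you can also query MLflow for the runs produced by the exact repository state you have checked out. A small sketch (assuming the runs were started inside a Git repository, so MLflow set the `mlflow.source.git.commit` tag, and assuming a recent MLflow version):

```python
import subprocess

import mlflow

# The commit currently checked out in the DVC-managed repository.
commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

# Find every run whose source commit matches; returns a pandas DataFrame.
runs = mlflow.search_runs(
    search_all_experiments=True,
    filter_string=f"tags.`mlflow.source.git.commit` = '{commit}'",
)
# The param/metric columns reflect whatever the script actually logged.
print(runs[["run_id", "params.n_estimators", "metrics.accuracy"]])
```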
This integration brings several practical benefits:

- Every time you run `dvc repro`, experiment details are logged automatically without manual intervention.
- New MLflow runs correspond to actual changes in parameters (`params.yaml`), code (`src/train.py`), or data dependencies.
- Your day-to-day workflow stays simple (`git commit`, `dvc repro`), and the experiment tracking happens as a natural side effect of the pipeline execution.

By embedding MLflow calls within the scripts executed by your DVC pipeline stages, you create a robust system where the reproducibility managed by DVC is augmented by the detailed tracking and comparison capabilities of MLflow, leading to more manageable and understandable machine learning projects.