Integrating Data Version Control (DVC) and MLflow Tracking creates a powerful combination for managing machine learning projects. While previous sections demonstrated the mechanics of integration, adopting consistent practices ensures your workflows remain understandable, maintainable, and truly reproducible over time. Here, we consolidate recommendations for effectively using DVC and MLflow together.
A well-defined project structure is fundamental for clarity and automation. While flexibility exists, adopting a standard layout helps team members navigate the project and simplifies scripting for integration.
Consider a structure like this:
project-root/
├── .dvc/ # DVC internal files
├── .git/ # Git internal files
├── data/
│ ├── raw/ # Raw, immutable data (potentially DVC-tracked)
│ ├── processed/ # Processed data (output of DVC stages)
│ └── features/ # Feature data (output of DVC stages)
├── models/ # Trained models (potentially DVC-tracked)
├── notebooks/ # Exploratory analysis notebooks
├── src/ # Source code for data processing, training, etc.
│ ├── process_data.py
│ └── train_model.py
├── tests/ # Unit and integration tests
├── .dvcignore # Specify files/directories DVC should ignore
├── .gitignore # Specify files/directories Git should ignore
├── dvc.yaml # DVC pipeline definitions
├── mlflow_utils.py # Helper functions for MLflow logging
├── params.yaml # Project parameters (hyperparameters, paths)
└── requirements.txt # Python package dependencies
Key points:

- Keep raw data immutable under data/raw/, and produce data/processed/ and data/features/ through pipeline stages defined in dvc.yaml to manage the transition.
- Keep source code (src/) separate from exploratory notebooks (notebooks/). Code in src/ is typically executed by DVC pipelines or scripts.
- Centralize parameters in params.yaml (a minimal sketch appears after this list). This file is tracked by Git, and its values can be used by both DVC stages and MLflow logging.
- Use .gitignore for files Git should ignore (e.g., virtual environments, large data files not tracked by DVC) and .dvcignore for files DVC should ignore (e.g., temporary files within tracked directories).
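For instance, params.yaml might look like the following. The keys mirror those referenced in the dvc.yaml shown later (processing.param1, training.learning_rate, training.epochs); the values are illustrative.

# params.yaml (values are illustrative)
processing:
  param1: 0.5

training:
  learning_rate: 0.01
  epochs: 20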
Reproducibility hinges on knowing the exact state of your code, data, configuration, and environment for any given experiment run. Structure your commits to capture these elements together.
A typical workflow:

1. Modify code (e.g., src/train_model.py) or parameters (params.yaml).
2. Run dvc repro <stage_name> to update derived data/features. This updates dvc.lock.
3. Stage the related changes together: git add src/train_model.py params.yaml dvc.lock data/.dvc (or the relevant .dvc files/directories).
4. git commit -m "feat: Tune hyperparameters for model X"
5. git push
6. dvc push
Each Git commit should represent a coherent change, linking the code version, the data versions (via .dvc files and dvc.lock), and the configuration (params.yaml) used. MLflow automatically logs the Git commit hash associated with a run, creating a link back to this snapshot.
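When a run is started from within a Git repository, MLflow records the commit under the mlflow.source.git.commit run tag. A quick way to inspect it after the fact (the run ID placeholder is illustrative):

import mlflow

# Fetch a finished run and read the commit hash MLflow recorded for it
run = mlflow.get_run("<run_id>")  # replace with an actual run ID
print(run.data.tags.get("mlflow.source.git.commit"))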
While MLflow logs the Git commit, which indirectly points to DVC-tracked data via .dvc files, explicitly logging data version information in MLflow adds clarity.
Strategies include:

Log the DVC hash as a parameter: read the hash recorded in the relevant .dvc file and log it to the run:
import dvc.api
import mlflow
import yaml

# Read the .dvc file content; it is a small YAML file that
# records the hash of the tracked data
dvc_file_content = dvc.api.read('data/processed/features.csv.dvc')
data_version = yaml.safe_load(dvc_file_content)
data_md5 = data_version['outs'][0]['md5']

with mlflow.start_run():
    # Log the DVC hash and the path of the input data
    mlflow.log_param("dvc_data_md5", data_md5)
    mlflow.log_param("dvc_input_data", "data/processed/features.csv")
    # Log parameters from params.yaml
    # ...
    # Train model
    # ...
    # Log metrics
    # ...
Note: parsing the specific hash from dvc.api.read, or using alternative methods, might be needed depending on your DVC version and setup.

Log dvc.lock as an artifact: the dvc.lock file contains precise hashes for all pipeline outputs. Logging this file as an MLflow artifact provides a complete snapshot of data dependencies for pipeline-driven experiments.

mlflow.log_artifact("dvc.lock")

Use tags: record data version identifiers on the run with mlflow.set_tag(), as sketched below.
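For example, combining both ideas in one run (the tag name and hash value are illustrative):

import mlflow

with mlflow.start_run():
    # Attach the exact pipeline state and tag the data version
    mlflow.log_artifact("dvc.lock")
    mlflow.set_tag("dvc_data_version", "a1b2c3d4...")  # illustrative hash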
Define your workflow steps (data processing, feature engineering, training) as stages in dvc.yaml rather than relying solely on individual scripts or notebooks run manually.
Benefits:

- dvc repro automatically re-runs only the necessary stages when dependencies (code, data, parameters) change.
- dvc.yaml and dvc.lock explicitly define the workflow and its results.

Integrate MLflow logging directly into the commands executed by DVC stages:
# dvc.yaml
stages:
  process_data:
    cmd: python src/process_data.py --config=params.yaml
    deps:
      - data/raw/input.csv
      - src/process_data.py
    params:
      - processing.param1
    outs:
      - data/processed/features.csv
  train:
    cmd: python src/train_model.py --config=params.yaml
    deps:
      - data/processed/features.csv
      - src/train_model.py
    params:
      - training.learning_rate
      - training.epochs
    metrics:
      - metrics.json: # DVC can track metrics from files
          cache: false
    plots:
      - plots/confusion_matrix.csv: # DVC can render a confusion matrix from data
          cache: false
          template: confusion
          x: actual
          y: predicted
    outs:
      - models/model.pkl # Track the final model with DVC
Your src/train_model.py script would contain the necessary mlflow.start_run(), mlflow.log_param(), mlflow.log_metric(), and mlflow.log_artifact() calls. It should also write metrics to metrics.json and plots to plots/ for DVC tracking if desired.
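A minimal sketch of such a script, assuming the illustrative params.yaml layout shown earlier; the training logic and metric values are placeholders:

# src/train_model.py -- illustrative sketch
import argparse
import json
import pathlib

import mlflow
import yaml


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="params.yaml")
    args = parser.parse_args()

    with open(args.config) as f:
        params = yaml.safe_load(f)

    with mlflow.start_run():
        # Log the training configuration to MLflow
        mlflow.log_params(params["training"])

        # ... load data/processed/features.csv and train the model ...
        accuracy = 0.92  # placeholder for a real evaluation metric

        # Log metrics to MLflow
        mlflow.log_metric("accuracy", accuracy)

        # Write metrics.json so DVC can track the same metric
        with open("metrics.json", "w") as f:
            json.dump({"accuracy": accuracy}, f)

        # Write plot data (actual vs. predicted labels) for DVC plots
        pathlib.Path("plots").mkdir(exist_ok=True)
        with open("plots/confusion_matrix.csv", "w") as f:
            f.write("actual,predicted\n")
            # ... one row per evaluated sample ...

        # ... save the trained model to models/model.pkl (DVC-tracked) ...


if __name__ == "__main__":
    main()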
Use a configuration file like params.yaml (tracked by Git) to store hyperparameters, file paths, and other settings.
- Reference values from params.yaml within your dvc.yaml stages using the params section.
- Load params.yaml within your Python scripts (src/*.py) to access parameters during execution.
- Log parameters from params.yaml to MLflow using mlflow.log_params().

This keeps configuration consistent across DVC pipeline definitions, script execution, and MLflow tracking.
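A short sketch of the script side, assuming the illustrative params.yaml shown earlier:

import mlflow
import yaml

# Load the Git-tracked configuration file
with open("params.yaml") as f:
    params = yaml.safe_load(f)

with mlflow.start_run():
    # Log each section as a flat parameter dictionary
    mlflow.log_params(params["processing"])
    mlflow.log_params(params["training"])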
Both DVC and MLflow can store artifacts, but they serve slightly different purposes:
Recommendations:
- Use DVC to track large files that are inputs or outputs of your pipeline, such as datasets and models (defined as outs in dvc.yaml).
- Use MLflow artifacts for smaller, run-specific files such as plots, logs, and evaluation reports.

Reproducibility extends to the software environment.
- Pin dependencies in requirements.txt (for pip) or an environment file (for Conda). Track this file with Git.
- Log the requirements.txt file itself as an MLflow artifact for each run (see the snippet below).
- For stronger isolation, define the environment in a Dockerfile tracked by Git.
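Capturing the environment alongside each run is straightforward; a minimal sketch:

import mlflow

with mlflow.start_run():
    # Attach the pinned dependency list to the run
    mlflow.log_artifact("requirements.txt")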
By adhering to these practices, you create a robust system where DVC manages your data lifecycle and pipeline execution, while MLflow provides detailed records of your experiments. This integrated approach significantly enhances the reproducibility, collaboration, and maintainability of your machine learning projects.