Integrating Data Version Control (DVC) and MLflow Tracking creates a powerful combination for managing machine learning projects. Adopting consistent practices ensures your workflows remain understandable, maintainable, and truly reproducible over time. This section consolidates recommendations for using DVC and MLflow together effectively.
A well-defined project structure is fundamental for clarity and automation. While flexibility exists, adopting a standard layout helps team members navigate the project and simplifies scripting for integration.
Consider a structure like this:
project-root/
├── .dvc/ # DVC internal files
├── .git/ # Git internal files
├── data/
│ ├── raw/ # Raw, immutable data (potentially DVC-tracked)
│ ├── processed/ # Processed data (output of DVC stages)
│ └── features/ # Feature data (output of DVC stages)
├── models/ # Trained models (potentially DVC-tracked)
├── notebooks/ # Exploratory analysis notebooks
├── src/ # Source code for data processing, training, etc.
│ ├── process_data.py
│ └── train_model.py
├── tests/ # Unit and integration tests
├── .dvcignore # Specify files/directories DVC should ignore
├── .gitignore # Specify files/directories Git should ignore
├── dvc.yaml # DVC pipeline definitions
├── mlflow_utils.py # Helper functions for MLflow logging
├── params.yaml # Project parameters (hyperparameters, paths)
└── requirements.txt # Python package dependencies
Important points:
- Use notebooks for exploration, then migrate stable logic into scripts and stages in dvc.yaml to manage the transition.
- Keep source code (src/) separate from notebooks (notebooks/). Code in src/ is typically executed by DVC pipelines or scripts.
- Centralize parameters in params.yaml. This file is tracked by Git, and its values can be used by both DVC stages and MLflow logging.
- Use .gitignore for files Git should ignore (e.g., virtual environments, large data files not tracked by DVC) and .dvcignore for files DVC should ignore (e.g., temporary files within tracked directories).

Reproducibility depends on knowing the exact state of your code, data, configuration, and environment for any given experiment run. Structure your commits to capture these elements together.
A typical workflow:
1. Modify code (src/train_model.py) or parameters (params.yaml).
2. Run dvc repro <stage_name> to update derived data/features. This updates dvc.lock.
3. git add src/train_model.py params.yaml dvc.lock data/.dvc (or relevant .dvc files/directories).
4. git commit -m "feat: Tune hyperparameters for model X"
5. git push
6. dvc push

Each Git commit should represent a coherent change, linking the code version, the data versions (via .dvc files and dvc.lock), and the configuration (params.yaml) used. MLflow automatically logs the Git commit hash associated with a run, creating a link back to this snapshot.
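Putting the loop together, a typical command sequence might look like the sketch below. The stage name train matches the dvc.yaml example later in this section; adapt the paths and commit message to your project.

# Re-run only the stages whose dependencies changed; this updates dvc.lock
dvc repro train

# Commit code, configuration, and DVC pointer files together
git add src/train_model.py params.yaml dvc.lock
git commit -m "feat: Tune hyperparameters for model X"

# Share code and pointers via Git, and the matching data/model versions via DVC remote storage
git push
dvc push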
While MLflow logs the Git commit, which indirectly points to DVC-tracked data via .dvc files, explicitly logging data version information in MLflow adds clarity.
Strategies include:

Log the data hash as a parameter: read the hash recorded in the relevant .dvc file with the DVC Python API and log it with mlflow.log_param(). For example:
import dvc.api
import mlflow
import yaml

# Read the .dvc pointer file for the processed data; its YAML content
# records the md5 hash of the tracked file.
dvc_file_text = dvc.api.read('data/processed/features.csv.dvc')
data_md5 = yaml.safe_load(dvc_file_text)['outs'][0]['md5']

with mlflow.start_run():
    mlflow.log_param("dvc_input_data", "data/processed/features.csv")  # Log path
    mlflow.log_param("dvc_data_md5", data_md5)  # Log the DVC hash of the input data
    # Log parameters from params.yaml
    # ...
    # Train model
    # ...
    # Log metrics
    # ...
Note: Parsing the specific hash from the dvc.api.read output, or using alternative methods, might be needed depending on your DVC version and setup.

Log dvc.lock as an Artifact: the dvc.lock file contains precise hashes for all pipeline outputs. Logging this file as an MLflow artifact provides a complete snapshot of data dependencies for pipeline-driven experiments:
mlflow.log_artifact("dvc.lock")
Use tags: alternatively, record the data version or data path as a run tag with mlflow.set_tag(), so it is visible and searchable in the MLflow UI.

Define your workflow steps (data processing, feature engineering, training) as stages in dvc.yaml rather than relying solely on individual scripts or notebooks run manually.
Benefits:
- dvc repro automatically re-runs only the necessary stages when dependencies (code, data, parameters) change.
- dvc.yaml and dvc.lock explicitly define the workflow and its results.

Integrate MLflow logging directly into the commands executed by DVC stages:
# dvc.yaml
stages:
  process_data:
    cmd: python src/process_data.py --config=params.yaml
    deps:
      - data/raw/input.csv
      - src/process_data.py
    params:
      - processing.param1
    outs:
      - data/processed/features.csv
  train:
    cmd: python src/train_model.py --config=params.yaml
    deps:
      - data/processed/features.csv
      - src/train_model.py
    params:
      - training.learning_rate
      - training.epochs
    metrics:
      - metrics.json: # DVC can track metrics from files
          cache: false
    plots:
      - plots/confusion_matrix.csv: # DVC can render plots from data files
          cache: false
          template: confusion
          x: actual
          y: predicted
    outs:
      - models/model.pkl # Track the final model with DVC
Your src/train_model.py script would contain the necessary mlflow.start_run(), mlflow.log_param(), mlflow.log_metric(), and mlflow.log_artifact() calls. It should also write metrics to metrics.json and plots to plots/ for DVC tracking if desired.
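A minimal sketch of such a script is shown below. The training step itself is elided, and the file paths and parameter keys are assumed to match the dvc.yaml and params.yaml examples in this section; the --config argument from the stage command is omitted for brevity.

# src/train_model.py (illustrative sketch)
import json
import pickle
from pathlib import Path

import mlflow
import yaml

with open("params.yaml") as f:
    training_params = yaml.safe_load(f)["training"]

with mlflow.start_run():
    mlflow.log_params(training_params)  # e.g. learning_rate, epochs
    mlflow.log_param("dvc_input_data", "data/processed/features.csv")

    # ... load data/processed/features.csv and train the model here ...
    model = None      # placeholder for the trained estimator
    accuracy = 0.0    # placeholder metric from a validation split

    mlflow.log_metric("accuracy", accuracy)

    # Write the outputs that the DVC stage declares (outs and metrics)
    Path("models").mkdir(exist_ok=True)
    with open("models/model.pkl", "wb") as f:
        pickle.dump(model, f)
    with open("metrics.json", "w") as f:
        json.dump({"accuracy": accuracy}, f)

    mlflow.log_artifact("metrics.json")  # keep a copy of the metrics with the MLflow run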
Use a configuration file like params.yaml (tracked by Git) to store hyperparameters, file paths, and other settings.
- Reference parameters from params.yaml within your dvc.yaml stages using the params section.
- Load params.yaml within your Python scripts (src/*.py) to access parameters during execution.
- Log the relevant parameters from params.yaml to MLflow using mlflow.log_params().

This keeps configuration consistent across DVC pipeline definitions, script execution, and MLflow tracking.
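As a sketch, assuming params.yaml contains the keys referenced in the dvc.yaml example above (the values shown in the comment are illustrative), a script could load and log them like this:

# params.yaml (illustrative contents):
#   processing:
#     param1: 0.5
#   training:
#     learning_rate: 0.01
#     epochs: 20
import mlflow
import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)

with mlflow.start_run():
    # Log the same values that the DVC stages reference in their params sections
    mlflow.log_params(params["training"])    # training.learning_rate, training.epochs
    mlflow.log_params(params["processing"])  # processing.param1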
Both DVC and MLflow can store artifacts, but they serve slightly different purposes: DVC is designed for large, pipeline-produced files such as datasets and trained models, while MLflow's artifact store suits smaller, run-specific outputs such as plots, logs, and reports.
Recommendations:
- Track large pipeline outputs, such as processed datasets and trained models, with DVC (as outs in dvc.yaml).
- Log smaller, run-specific artifacts, such as evaluation plots, logs, and the dvc.lock snapshot, with MLflow.

Reproducibility extends to the software environment.
- Define Python dependencies in requirements.txt (for pip) or an environment file (for Conda). Track this file with Git.
- Log the requirements.txt file itself as an MLflow artifact for each run.
- For stronger isolation, capture the full environment in a Dockerfile tracked by Git.

By adhering to these practices, you create a system where DVC manages your data lifecycle and pipeline execution, while MLflow provides detailed records of your experiments. This integrated approach significantly enhances the reproducibility, collaboration, and maintainability of your machine learning projects.