A well-organized project structure is foundational for successfully integrating data versioning with DVC and experiment tracking with MLflow. While there's no single mandatory layout, adopting a conventional structure significantly improves clarity, maintainability, and collaboration, especially as projects grow in complexity. This structure helps separate concerns like data, source code, configuration, and outputs, making it easier for both humans and automation tools to navigate.
Let's explore a common and effective way to structure your machine learning projects when using DVC and MLflow together.
A typical project layout often includes the following components:
data/
: This directory houses your datasets. It might contain subdirectories like data/raw for original, untouched data and data/processed for cleaned or transformed data ready for modeling. DVC is primarily used to track the contents of this directory, often by creating .dvc files that act as pointers to the actual data stored elsewhere (like cloud storage). The large data files themselves are typically added to .gitignore. (A sample pointer file appears just after this list.)

src/
: Contains your source code, usually Python (.py) files. This includes scripts or modules for data loading, preprocessing, feature engineering, model training, evaluation, and potentially utility functions. This code is versioned using Git.

models/
: A potential location for saving trained model artifacts (e.g., serialized model files like .pkl or .h5). These models can be tracked using DVC if they are large, or logged directly as artifacts using MLflow during experiment runs. The choice depends on your workflow needs.

notebooks/
: Often used for exploratory data analysis (EDA), initial experimentation, and visualization using Jupyter notebooks (.ipynb). While useful for exploration, it's generally recommended to refactor reusable or production code from notebooks into Python scripts within the src/ directory for better testing, modularity, and use in automated pipelines.

dvc.yaml
: This file is central to defining DVC pipelines. It outlines the stages of your workflow (e.g., processing data, training a model), their dependencies (input data, code), commands to execute, and outputs (processed data, models, metrics). A brief sketch appears later in this section, and we'll explore this file in detail in subsequent sections.

params.yaml
: A common practice is to store project parameters, especially hyperparameters for model training or configuration settings for data processing, in a dedicated YAML file. This makes parameters explicit and easy to track. Both DVC pipelines and MLflow can read from this file, ensuring consistency. Changes to params.yaml are tracked by Git.

requirements.txt or environment.yml
: Standard files for defining Python package dependencies. Specifying dependencies is essential for ensuring that the project environment can be recreated accurately.

mlruns/
: The default local directory where MLflow stores experiment tracking data (parameters, metrics, artifacts) if you haven't configured a remote tracking server. This directory should almost always be included in .gitignore.

.dvc/
: Contains DVC's internal files, including configuration, cache directory structure, etc. This directory is managed by DVC and should also be in .gitignore.

.gitignore
: A critical file for any project using Git. It tells Git which files or directories to ignore. When using DVC and MLflow, it's essential to ignore large data files tracked by DVC, the DVC cache, MLflow's local tracking directory (mlruns/), virtual environment directories, and other generated files not meant for Git versioning.
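Because the small .dvc pointer files are what Git actually versions, it helps to see one. Below is an illustrative sketch of the pointer that a command like dvc add data/raw/your_data.csv generates; the checksum and size values are placeholders, and the exact fields can vary between DVC versions:

# data/raw/your_data.csv.dvc -- small YAML pointer committed to Git
# in place of the large data file itself
outs:
- md5: 1a2b3c4d5e6f7890aabbccddeeff0011   # placeholder checksum
  size: 1048576                           # placeholder size in bytes
  path: your_data.csv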
Here’s a visual representation of a typical project layout, separating data, source code, models, notebooks, configuration (params.yaml), DVC pipeline definitions (dvc.yaml), and ignored directories (.dvc/, mlruns/):
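One plain-text way to draw that layout is the tree below; the top-level name project/ is a placeholder:

project/
├── data/
│   ├── raw/            # original data, tracked by DVC
│   └── processed/      # transformed data, tracked by DVC
├── src/                # Python source code, tracked by Git
├── models/             # trained model artifacts
├── notebooks/          # exploratory Jupyter notebooks
├── dvc.yaml            # DVC pipeline definition
├── params.yaml         # parameters and configuration
├── requirements.txt    # Python dependencies
├── .gitignore
├── .dvc/               # DVC internals
└── mlruns/             # local MLflow tracking data (gitignored)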
This structure facilitates the integrated workflow:
1. Data: Raw data lands in data/raw. DVC tracks changes (e.g., dvc add data/raw/your_data.csv). Processed data might land in data/processed, also tracked by DVC. The actual large files are ignored by Git, but the small .dvc pointer files are committed.
2. Code: Scripts in src/ perform tasks like processing, training, and evaluation. Changes to these scripts are tracked by Git.
3. Parameters: Values in params.yaml are read by scripts in src/. Changes to parameters are tracked by Git.
4. Experiments: When src/train.py runs, it uses MLflow to log parameters (perhaps read from params.yaml), metrics, and artifacts (like models, which might be saved to models/ or logged directly). MLflow records this information, linking it to the specific Git commit of the code used. (A minimal sketch of such a script appears after the .gitignore example below.)
5. Pipelines: dvc.yaml defines stages that execute scripts from src/, consume data specified by .dvc files (or params.yaml), and produce outputs (like processed data or models tracked by DVC, or metrics files). The sketch just after this list shows how params.yaml and dvc.yaml fit together.
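To make the parameter and pipeline steps concrete, here is a minimal, hypothetical sketch of params.yaml and dvc.yaml working together. The stage names, script paths, parameter names, and output paths are illustrative assumptions rather than fixed conventions:

# params.yaml -- hypothetical training parameters
train:
  learning_rate: 0.01
  epochs: 20

# dvc.yaml -- a two-stage pipeline sketch wired to those parameters
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw
    outs:
      - data/processed
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed
    params:
      - train.learning_rate
      - train.epochs
    outs:
      - models/model.pkl

Because the train stage declares its parameters explicitly, dvc repro re-runs it only when those values, or its other dependencies, change.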
.gitignore is Essential: Properly configuring .gitignore is fundamental to prevent committing large data files, temporary MLflow logs, or DVC cache contents into your Git repository. It ensures Git tracks only the code, configuration, and DVC metadata files. Here's a sample snippet for your .gitignore in such a project:
# DVC specific
.dvc/cache
.dvc/tmp
.dvc/lock
/data/raw/ # Ignore raw data if managed by DVC
/data/processed/ # Ignore processed data if managed by DVC
/models/ # Ignore models if managed by DVC or logged via MLflow
# MLflow specific
mlruns/
# Python specific
__pycache__/
*.pyc
*.pyo
*.pyd
.env
venv/
env/
*.egg-info/
dist/
build/
# Notebook specific
.ipynb_checkpoints
# OS specific
.DS_Store
Thumbs.db
Example .gitignore rules tailored for a project using DVC and MLflow. Adjust paths like /data/raw/ based on which specific files or directories you track with dvc add.
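Finally, to illustrate the experiment-tracking step from the workflow above, here is a minimal sketch of what a training script such as src/train.py might contain. The parameter names, metric, and model path are illustrative assumptions, not prescribed by MLflow:

# src/train.py -- illustrative sketch of MLflow logging in a training script
import yaml
import mlflow

# Read hyperparameters from params.yaml so DVC and MLflow see identical values
with open("params.yaml") as f:
    params = yaml.safe_load(f)["train"]

with mlflow.start_run():
    # Record the parameters that drove this run
    mlflow.log_params(params)

    # ... load data from data/processed/ and train a model here ...

    # Record evaluation metrics (placeholder value shown)
    mlflow.log_metric("accuracy", 0.93)

    # Store the serialized model file as a run artifact
    mlflow.log_artifact("models/model.pkl")

Run as the train stage of dvc.yaml, a script like this keeps Git (code), DVC (data and model files), and MLflow (run metadata) aligned for each experiment.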
Adopting a structure like this from the beginning of your project provides a solid framework for integrating DVC and MLflow, leading to more organized, reproducible, and maintainable machine learning systems. While this template serves as a strong starting point, feel free to adapt it based on the specific requirements and scale of your project.