Now, let's apply the concepts we've discussed by building a practical, integrated workflow. This hands-on example demonstrates how DVC and MLflow can work together to manage data, track experiments, and automate pipeline execution for improved reproducibility.
We will create a simple machine learning pipeline consisting of two stages: data preparation and model training. DVC will manage the data artifacts and orchestrate the pipeline, while MLflow will track the parameters, metrics, and model produced during the training stage.
Ensure you have the following installed and configured:

- DVC: pip install dvc[s3] (or the equivalent extra for your chosen remote storage)
- MLflow and the supporting libraries: pip install mlflow scikit-learn pandas
- Git, and basic familiarity with commands such as git init, git add, and git commit

First, let's set up our project directory structure.
Create a new directory for the project and navigate into it:
mkdir integrated-pipeline-example
cd integrated-pipeline-example
Initialize Git and DVC:
git init
dvc init
This creates the .git and .dvc directories. Remember to commit the initial DVC configuration files:
git add .dvc .dvcignore
git commit -m "Initialize DVC"
Create the necessary subdirectories:
mkdir data src models data/raw data/processed
Create placeholder files for our code and parameters:
touch src/prepare.py src/train.py params.yaml requirements.txt
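Optionally, fill requirements.txt now so the environment itself is reproducible. A minimal, unpinned set mirroring the prerequisites might be (pin versions as you prefer):

# requirements.txt (suggested contents; pyyaml is imported by the scripts below)
dvc[s3]
mlflow
scikit-learn
pandas
pyyaml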
Add a simple raw dataset. For this example, create a dummy CSV file data/raw/data.csv with the following contents:
feature1,feature2,target
1.0,2.1,0
1.5,2.5,0
1.8,2.9,0
3.2,4.5,1
3.5,5.1,1
4.0,6.0,1
0.8,1.9,0
3.8,5.5,1
Add all project structure files to Git:
git add data/raw/data.csv src/ params.yaml requirements.txt models/ .gitignore
# (Create a .gitignore file if needed, e.g., add __pycache__/, *.pyc, etc.)
git commit -m "Initial project structure and raw data"
This stage will take the raw data, perform a simple preparation step (like splitting), and save the processed data. DVC will manage the processed data files.
Edit src/prepare.py:
# src/prepare.py
import pandas as pd
from sklearn.model_selection import train_test_split
import os
import yaml
# Ensure processed directory exists
os.makedirs('data/processed', exist_ok=True)
# Load parameters
with open('params.yaml', 'r') as f:
    params = yaml.safe_load(f)
split_ratio = params['prepare']['split']
seed = params['base']['seed']
# Load raw data
raw_data_path = 'data/raw/data.csv'
df = pd.read_csv(raw_data_path)
# Split data
train_df, test_df = train_test_split(df, test_size=split_ratio, random_state=seed)
# Save processed data
train_output_path = 'data/processed/train.csv'
test_output_path = 'data/processed/test.csv'
train_df.to_csv(train_output_path, index=False)
test_df.to_csv(test_output_path, index=False)
print(f"Processed data saved:")
print(f"- Train set: {train_output_path}")
print(f"- Test set: {test_output_path}")
Edit params.yaml: Define parameters used in the preparation and upcoming training stages.
# params.yaml
base:
  seed: 42

prepare:
  split: 0.3  # Test set ratio

train:
  model_type: LogisticRegression
  solver: 'liblinear'  # Example parameter for LogisticRegression
  C: 1.0  # Regularization strength
Define the DVC Stage: Use dvc stage add to define this preparation step in dvc.yaml. This command tells DVC how to run the script, what its dependencies are, and what outputs it produces.
dvc stage add -n prepare \
-p base.seed,prepare.split \
-d src/prepare.py -d data/raw/data.csv \
-o data/processed/train.csv -o data/processed/test.csv \
python src/prepare.py
- -n prepare: Names the stage 'prepare'.
- -p base.seed,prepare.split: Declares parameters from params.yaml used by this stage. DVC will track changes to these specific parameters.
- -d src/prepare.py -d data/raw/data.csv: Specifies script and data dependencies. If these change, the stage needs rerunning.
- -o data/processed/train.csv -o data/processed/test.csv: Declares the outputs produced by the stage. DVC will start tracking these files.
- python src/prepare.py: The command to execute for this stage.

Commit Changes: The dvc stage add command creates or updates dvc.yaml with this stage definition, registers the declared outputs (data/processed/*.csv) for DVC tracking, and adds them to a .gitignore so Git ignores them. (dvc.lock is written later, when the stage actually runs via dvc repro.) Commit these changes to Git.
git add dvc.yaml data/processed/.gitignore src/prepare.py params.yaml
git commit -m "Add DVC stage: prepare data"
Note: DVC automatically adds the output paths (data/processed/train.csv, data/processed/test.csv) to .gitignore (within data/processed/.gitignore) so Git ignores the large data files themselves.
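For reference, that generated file should contain roughly the following (the exact form can vary between DVC versions):

# data/processed/.gitignore (written by DVC)
/train.csv
/test.csv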
You can optionally push the DVC-tracked data to remote storage if you have configured one (dvc push).
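If you have not configured a remote yet, a minimal sketch looks like this; the S3 URL is a placeholder, and any storage type supported by dvc remote add (SSH, GCS, a local path, and so on) works the same way:

dvc remote add -d storage s3://my-bucket/dvc-store   # placeholder bucket/path
git add .dvc/config
git commit -m "Configure DVC remote storage"
dvc push   # uploads DVC-cached outputs once the pipeline has produced them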
This stage trains a model using the processed data, logs the experiment with MLflow, and saves the model and metrics. DVC manages the dependencies (processed data, script, parameters) and outputs (model file, metrics file).
Edit src/train.py: This script now incorporates MLflow logging.
# src/train.py
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import mlflow
import mlflow.sklearn
import os
import yaml
import pickle
import json
# Load parameters
with open('params.yaml', 'r') as f:
    params = yaml.safe_load(f)
# --- MLflow Setup ---
# Optional: Set tracking URI if using a remote server
# mlflow.set_tracking_uri("http://...")
mlflow.set_experiment("Simple Classification")
# Start an MLflow run
with mlflow.start_run():
    seed = params['base']['seed']
    model_params = params['train']

    # --- Log Parameters ---
    mlflow.log_param("seed", seed)
    mlflow.log_params(model_params)  # Log all training params

    # --- Load Data ---
    train_data_path = 'data/processed/train.csv'
    test_data_path = 'data/processed/test.csv'
    train_df = pd.read_csv(train_data_path)
    test_df = pd.read_csv(test_data_path)
    X_train = train_df[['feature1', 'feature2']]
    y_train = train_df['target']
    X_test = test_df[['feature1', 'feature2']]
    y_test = test_df['target']

    # --- Log DVC Data Info (Example) ---
    # Log the path or hash of input data as a tag
    # This requires parsing dvc.lock or using dvc api, simplified here
    mlflow.set_tag("train_data_path", train_data_path)
    mlflow.set_tag("test_data_path", test_data_path)
    # A more robust way involves getting the hash from dvc.lock
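    # Optional sketch: tag the exact input hash as well, assuming the usual
    # dvc.lock layout (stages -> prepare -> outs -> [{path, md5, ...}]).
    # Skipped quietly if dvc.lock has not been written yet.
    if os.path.exists('dvc.lock'):
        with open('dvc.lock', 'r') as lock_file:
            lock = yaml.safe_load(lock_file) or {}
        for out in lock.get('stages', {}).get('prepare', {}).get('outs', []):
            if out.get('path') == train_data_path:
                mlflow.set_tag("train_data_md5", out.get('md5', 'unknown'))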
    # --- Train Model ---
    # Using Logistic Regression as defined in params.yaml
    if model_params['model_type'] == 'LogisticRegression':
        model = LogisticRegression(
            solver=model_params['solver'],
            C=model_params['C'],
            random_state=seed
        )
    else:
        raise ValueError(f"Unsupported model type: {model_params['model_type']}")

    model.fit(X_train, y_train)

    # --- Evaluate Model ---
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # --- Log Metrics ---
    mlflow.log_metric("accuracy", accuracy)
    print(f"Model Accuracy: {accuracy:.4f}")

    # --- Save and Log Model ---
    os.makedirs('models', exist_ok=True)
    model_output_path = 'models/model.pkl'
    with open(model_output_path, 'wb') as f:
        pickle.dump(model, f)

    # Log the model using MLflow's scikit-learn integration
    mlflow.sklearn.log_model(model, "sklearn-model")
    print(f"Model saved to: {model_output_path}")
    print(f"Model logged to MLflow run: {mlflow.active_run().info.run_id}")

    # --- Save Metrics File (for DVC tracking) ---
    metrics_output_path = 'metrics.json'
    metrics_data = {'accuracy': float(accuracy)}  # cast to plain float for JSON serialization
    with open(metrics_output_path, 'w') as f:
        json.dump(metrics_data, f, indent=4)
    print(f"Metrics saved to: {metrics_output_path}")

print("MLflow Run Completed.")
Define the DVC Stage: Add the training stage to dvc.yaml. It depends on the prepare stage's outputs, the training script, and relevant parameters. It produces the model file and a metrics file.
dvc stage add -n train \
-p base.seed,train \
-d src/train.py -d data/processed/train.csv -d data/processed/test.csv \
-o models/model.pkl \
-M metrics.json \
python src/train.py
- -n train: Names the stage 'train'.
- -p base.seed,train: Tracks the general seed and all parameters under the train section in params.yaml.
- -d ...: Specifies dependencies: the script (src/train.py) and the outputs of the previous stage (data/processed/*.csv).
- -o models/model.pkl: Declares the model file as an output tracked by DVC.
- -M metrics.json: Declares metrics.json as a metrics file that is kept in Git rather than the DVC cache. DVC can parse and display metrics from such files.
- python src/train.py: The command to execute.

Commit Changes: Commit the updated dvc.yaml, the new training script, and the .gitignore DVC created for the model output. (dvc.lock and metrics.json appear once the pipeline runs and are committed afterwards.)
git add dvc.yaml src/train.py models/.gitignore
# DVC creates models/.gitignore to ignore the actual model file for Git
git commit -m "Add DVC stage: train model with MLflow logging"
Now we have a two-stage pipeline defined in dvc.yaml.
Execute the Pipeline: Use dvc repro to run the entire pipeline from start to finish. DVC checks dependencies and executes stages as needed.
dvc repro
You will see output from both prepare.py and train.py, including the MLflow logging messages and the final accuracy. DVC will run prepare first, then train.
Check Status:

- git status: Should show new or updated dvc.lock and metrics.json files produced by the pipeline run. Commit these updates.
git add dvc.lock metrics.json
git commit -m "Run integrated pipeline"
- dvc status: Should report that the pipeline is up to date.
- dvc metrics show: Displays the metrics tracked from metrics.json.

Inspect MLflow UI: Launch the MLflow UI to see the tracked experiment run:
mlflow ui
Navigate to http://localhost:5000 (or your configured address) in your browser. Find the "Simple Classification" experiment. You should see a run logged with:
- Parameters: seed, model_type, solver, C.
- Tags: train_data_path, test_data_path.
- Metrics: accuracy.
- Artifacts: the logged sklearn-model (which includes model.pkl, conda.yaml, python_env.yaml, requirements.txt).

[Diagram: relationship between Git, DVC-tracked artifacts/stages, and MLflow experiment tracking within the project structure.]
Now, let's see the power of this integrated setup when we make changes.
Modify Parameters: Edit params.yaml and change a training parameter, for example, the regularization strength C:
# params.yaml
base:
  seed: 42

prepare:
  split: 0.3

train:
  model_type: LogisticRegression
  solver: 'liblinear'
  C: 0.1  # Changed from 1.0 to 0.1
Reproduce: Run dvc repro again.
dvc repro
Notice that DVC detects the change in params.yaml affecting the train stage. It skips the prepare stage (because its dependencies haven't changed) and only reruns the train stage. The training script executes again, logging a new run to MLflow with C=0.1.
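Before committing, you can also ask DVC to summarize what changed relative to the last Git commit; a quick check, with output formats that vary slightly between DVC versions:

dvc params diff    # should report train.C changing from 1.0 to 0.1
dvc metrics diff   # should report how accuracy moved between the two runs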
Verify:

Commit the updated params.yaml, dvc.lock, and metrics.json to Git:
git add dvc.lock metrics.json params.yaml
git commit -m "Experiment: Change C to 0.1"
Check the MLflow UI again. You will see a second run under the "Simple Classification" experiment, recording the changed parameter (C) and the resulting accuracy.
This hands-on practical illustrates how DVC pipelines automate the execution flow based on changes in dependencies (code, data, parameters), while MLflow captures the specifics of each execution (parameters, metrics, artifacts). By committing the DVC metadata (dvc.yaml, dvc.lock) alongside your code and parameters in Git, you create a fully versioned and reproducible machine learning workflow. Anyone with access to your Git repository and the DVC remote storage can check out a specific commit and reproduce your exact pipeline and results.
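As a rough sketch of what that looks like for a collaborator (the repository URL is a placeholder, and they need read access to both the Git repository and the DVC remote):

git clone <your-repo-url> && cd integrated-pipeline-example
pip install -r requirements.txt
dvc pull     # fetch the DVC-tracked data and model from remote storage
dvc repro    # re-runs any stage whose inputs changed; otherwise confirms everything is up to date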