Theory provides the map, but hands-on practice builds the road. Versioning code, data, and models is fundamental for reproducible machine learning. You will build a versioned ML project from the ground up.
We will use a classic dataset, a simple model, and two essential tools: Git for code and DVC (Data Version Control) for data and models. This exercise will walk you through setting up a project, tracking your assets, running an experiment, and then turning back the clock to reproduce a previous result.
First, let's create a directory for our project and lay out a standard structure. This organization helps keep the different components of an ML project tidy.
Create a project directory and navigate into it:
mkdir versioned-ml-project
cd versioned-ml-project
Create the subdirectories for our source code, data, and models:
mkdir src data models
Initialize a Git repository to start tracking our code.
git init
You should see a message like Initialized empty Git repository in .../.git/.
Next, create a .gitignore file. This file tells Git which files or directories it should ignore. We do not want to track large data files, model artifacts, or Python cache files directly with Git.
Create a file named .gitignore with the following content:
# Python
__pycache__/
*.pyc
# Data and Models tracked by DVC
/data/*
!/data/.gitkeep
!/data/*.dvc
/models/*
!/models/.gitkeep
!/models/*.dvc
# DVC cache
.dvc/cache
The ! entries are exceptions to the ignore rules: the .gitkeep files are a common trick to make Git track the otherwise empty data and models directories (create them with touch data/.gitkeep models/.gitkeep), and the *.dvc patterns ensure that the small DVC pointer files we create later are committed to Git rather than ignored along with the data and model artifacts.
Our first asset to version is the source code. Let's create a simple Python script for our model training.
Create a requirements.txt file to list our project's dependencies.
pandas
scikit-learn
dvc
Install these dependencies using pip: pip install -r requirements.txt.
Create a training script at src/train.py. For now, it will be a simple placeholder.
# src/train.py
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pickle
import json
print("Training script started.")
# In a real project, you'd load data, train, and save.
# We will fill this out in a later step.
print("Training script finished.")
Now, let's make our first commit to save the initial project structure and code in Git.
git add .
git commit -m "Initial project structure and placeholder script"
Our model needs data. We will use a simplified version of the Iris dataset. Because datasets can be large, we will use DVC to track it instead of Git.
Create a file named data/iris.csv with the following content:
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
... (imagine 140 more rows here) ...
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
5.9,3.0,5.1,1.8,virginica
6.7,3.3,5.7,2.1,virginica
You can download a complete sample iris.csv or use your own. For this exercise, the exact contents are less important than the process of versioning the file.
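If you would rather generate the file than type or download it, a short helper script can build an equivalent CSV from scikit-learn's bundled copy of the dataset. This is only a sketch, and the file name src/make_dataset.py is chosen here purely for illustration; the column renaming matches the header shown above so that train.py can read the result unchanged.
# src/make_dataset.py - optional helper to build data/iris.csv from scikit-learn's bundled Iris data
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    'sepal length (cm)': 'sepal_length',
    'sepal width (cm)': 'sepal_width',
    'petal length (cm)': 'petal_length',
    'petal width (cm)': 'petal_width',
})
# Replace the numeric target with the species name expected by train.py
df['species'] = df['target'].map(dict(enumerate(iris.target_names)))
df = df.drop(columns=['target'])
df.to_csv('data/iris.csv', index=False)
print(f"Wrote {len(df)} rows to data/iris.csv")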
Initialize DVC in our project repository.
dvc init
This command creates a .dvc directory where DVC stores its configuration and internal information. It's similar to the .git directory.
Now, let's tell DVC to start tracking our dataset.
dvc add data/iris.csv
This command does two things:
1. It stores the contents of data/iris.csv in DVC's internal cache (.dvc/cache).
2. It creates a small pointer file, data/iris.csv.dvc. This text file contains the information DVC needs to find the correct version of the data, including an MD5 hash of its content.
If you look at the contents of data/iris.csv.dvc, you will see something like this:
outs:
- md5: 234234abcfd2342342dd234234e1f2a3
  size: 4551
  path: iris.csv
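The md5 value is how DVC identifies this exact version of the file's content, and it also determines where the cached copy lives. The snippet below is only a sketch to illustrate the idea: in recent DVC releases the hash of a single file is the plain MD5 of its bytes, and the cache layout has changed between major versions, so one of the paths checked here may not apply to your installation.
# check_hash.py - illustrate how DVC identifies file versions by content hash
import hashlib
from pathlib import Path

with open('data/iris.csv', 'rb') as f:
    md5 = hashlib.md5(f.read()).hexdigest()
print("MD5 of data/iris.csv:", md5)  # should match the md5 field in data/iris.csv.dvc

# The cached copy is stored under a path derived from the hash.
# Recent DVC releases use .dvc/cache/files/md5/<first 2 chars>/<rest>;
# older releases used .dvc/cache/<first 2 chars>/<rest>.
for candidate in (Path('.dvc/cache/files/md5'), Path('.dvc/cache')):
    path = candidate / md5[:2] / md5[2:]
    print(path, "->", "found" if path.exists() else "not found")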
This pointer file is lightweight and perfect for storing in Git. Let's commit it.
git add data/iris.csv.dvc .gitignore
git commit -m "Track iris.csv with DVC"
At this stage, your code is in Git and your data is managed by DVC. The link between them is the .dvc file, which is also tracked by Git.
This diagram illustrates how the tools interact. Your workspace files are committed to Git. Large files are added to the DVC cache, and their pointers are committed to Git. DVC pushes the cached files to remote storage for backup and collaboration.
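This exercise works entirely from the local cache, but configuring a remote takes only two commands. As an illustration only, the example below uses a local directory as the remote; in practice you would point it at shared storage such as an S3 bucket or a network drive.
dvc remote add -d localstorage /tmp/dvc-storage
dvc push
dvc remote add writes the remote's configuration to .dvc/config, which you should commit to Git, and dvc push uploads the cached data and model files so collaborators can retrieve them with dvc pull after cloning the repository.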
Now let's update our script to train a model and save the output. The output will be two files: the model artifact and a file with performance metrics.
Modify src/train.py to include the full training logic.
# src/train.py
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pickle
import json
import os
# Load the dataset
df = pd.read_csv('data/iris.csv')
# Prepare data
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = df['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
model = LogisticRegression(solver='lbfgs', max_iter=200) # Simple hyperparameters
model.fit(X_train, y_train)
# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy}")
# Save the model and metrics
os.makedirs('models', exist_ok=True)
with open('models/model.pkl', 'wb') as f:
pickle.dump(model, f)
with open('metrics.json', 'w') as f:
json.dump({'accuracy': accuracy}, f)
Run the script from the root of your project directory.
python src/train.py
This creates two new files: models/model.pkl (the trained model artifact) and metrics.json (a text file with the accuracy).
The model artifact (.pkl file) is a binary file that can be large, so we will track it with DVC, just like our data. The metrics file is small and text-based, so it is fine to track directly with Git.
dvc add models/model.pkl
Commit the results of our experiment. This commit now links a version of our code (train.py), the model pointer (model.pkl.dvc), and the resulting metrics (metrics.json).
git add src/train.py models/model.pkl.dvc metrics.json
git commit -m "Train initial logistic regression model"
You now have a fully reproducible snapshot of your first experiment.
The true benefit of this system becomes clear when you start making changes. Let's run a new experiment by changing a model hyperparameter.
Modify src/train.py to use a different max_iter value.
# In src/train.py, change this line:
model = LogisticRegression(solver='lbfgs', max_iter=500) # Changed max_iter
Rerun the training script.
python src/train.py
This overwrites models/model.pkl and metrics.json with new results. Your accuracy might change slightly.
Let's version this new experiment. First, add the updated model to DVC.
dvc add models/model.pkl
DVC is smart enough to detect that models/model.pkl has changed and will update its pointer file accordingly.
Commit the new results to Git.
git add src/train.py models/model.pkl.dvc metrics.json
git commit -m "Experiment with max_iter=500"
You now have two complete experiments recorded in your Git history. But what if you need to retrieve the first model you trained?
This is where reproducibility comes in.
First, find the commit hash of your initial model training commit.
git log --oneline
You should see an output like this:
a1b2c3d (HEAD -> master) Experiment with max_iter=500
e4f5a6b Train initial logistic regression model
...
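Because metrics.json is tracked directly by Git, you can also compare the two experiments without checking anything out. The sketch below reads the file as it existed at each commit; the hashes are placeholders, so substitute the ones from your own log.
# compare_metrics.py - read metrics.json as it existed at two different commits
import json
import subprocess

def metrics_at(commit):
    # 'git show <commit>:<path>' prints the file's contents at that commit
    raw = subprocess.check_output(['git', 'show', f'{commit}:metrics.json'])
    return json.loads(raw)

print("first experiment: ", metrics_at('e4f5a6b'))
print("second experiment:", metrics_at('a1b2c3d'))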
Let's travel back in time to the first model commit. Use the hash from your log.
git checkout e4f5a6b
Git restores src/train.py, metrics.json, and models/model.pkl.dvc to their state at that commit. However, the actual models/model.pkl file in your workspace is still the one from the latest experiment.
This is the final step. Tell DVC to sync your workspace with the information in the current .dvc pointer files.
dvc checkout
DVC sees that models/model.pkl.dvc points to the old model version and retrieves it from its cache, overwriting the file in your workspace.
If you inspect src/train.py and metrics.json, and load models/model.pkl in Python, you will find that every component has been restored to the exact state of your first experiment.
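A quick way to confirm this is to load the restored artifacts and re-score the model. The snippet below is a sketch: it assumes you run it from the project root while the old commit is checked out, and it rebuilds the same train/test split used in train.py.
# verify_restore.py - sanity-check the restored model and metrics
import json
import pickle

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

with open('metrics.json') as f:
    print("Recorded metrics:", json.load(f))

with open('models/model.pkl', 'rb') as f:
    model = pickle.load(f)

# Recreate the exact test split from train.py (same test_size and random_state)
df = pd.read_csv('data/iris.csv')
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = df['species']
_, X_test, _, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Re-computed accuracy:", accuracy_score(y_test, model.predict(X_test)))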
By combining Git for code and DVC for data and models, you have created a system that guarantees you can always inspect, validate, and rebuild any result you have ever produced. This structured approach is a building block for creating reliable and professional machine learning systems.