Version Control for Code with Git

While your machine learning model is the star of the show, the source code that preprocesses data, defines the architecture, and runs the training process is its essential supporting cast. Just as a film's script goes through many revisions, your code will constantly evolve. To manage this evolution without causing chaos, we turn to version control, and the industry standard for version control is Git.

Think of Git as a meticulous lab notebook for your code. Instead of saving different versions of your files with names like train_v1.py, train_v2_fixed.py, and train_final_for_real.py, Git provides a structured system to take "snapshots" of your entire project at any point in time. These snapshots, called commits, create a complete history of every change, making it possible to revisit any past state of your codebase.

For machine learning projects, this is not just a convenience; it's a necessity. It provides the foundation for reproducibility by answering critical questions like, "Which version of the code was used to train the model that is currently in production?"
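One way to make that question answerable later is to record the commit hash next to each training run. The sketch below is only an illustration under stated assumptions: it works in a throwaway repository, and the file names train.py and run_metadata.txt are hypothetical. The individual Git commands are explained later in this section.

```shell
# Throwaway repository so the sketch runs anywhere (assumes git is installed)
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

# Hypothetical training script, committed so there is a version to point at
echo 'print("training...")' > train.py
git add train.py
git commit -q -m "Add training script"

# Record the exact commit that the training run was launched from
commit_hash=$(git rev-parse HEAD)
echo "code_version=$commit_hash" > run_metadata.txt
cat run_metadata.txt
```

Storing this hash with the model artifact (in a metadata file, a log line, or an experiment tracker) lets you check out exactly the code that produced it.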

Why Git is Essential for ML Projects

Using a version control system like Git offers several immediate benefits that are particularly valuable in the machine learning lifecycle.

A Complete History of Your Project: Every time you save a snapshot (commit), you are required to write a message describing what you changed and why. This creates a detailed, searchable logbook of your project's development. If a change introduces a bug or degrades model performance, you can easily pinpoint where things went wrong and revert to a previous, working version.
Safe Experimentation with Branches: Machine learning is inherently experimental. You might want to try a new feature engineering technique, a different algorithm, or new hyperparameter values. Git allows you to create a "branch," an independent line of development that behaves like a separate copy of your codebase (internally it is only a lightweight pointer to a commit, so branches are cheap to create). You can freely experiment on this branch without any risk to your main, stable code. If the experiment is successful, you can merge your changes back into the main project. If it fails, you can simply discard the branch.
Collaboration Made Simple: When you work on a team, Git allows multiple people to work on the same codebase simultaneously. It provides mechanisms to merge changes from different developers and manage conflicts when two people have modified the same part of a file. This is managed through remote repositories hosted on platforms like GitHub, GitLab, or Bitbucket.
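To make the first benefit concrete, the following sketch browses a short history and then undoes a bad change. It is a minimal example in a throwaway repository; the file config.yaml and its contents are hypothetical.

```shell
# Throwaway repository for the demo (assumes git is installed)
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

# First commit: a working baseline
echo "learning_rate: 0.01" > config.yaml
git add config.yaml
git commit -q -m "Add baseline training config"

# Second commit: a change we later regret
echo "learning_rate: 10.0" > config.yaml
git commit -q -am "Raise learning rate"

# Browse the history, one line per commit
git log --oneline

# Undo the latest commit by adding a new commit that reverses it
git revert --no-edit HEAD
cat config.yaml
```

Note that git revert does not erase the bad commit. It adds a third commit that reverses it, so the history still shows what was tried and why it was undone.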

Core Git Concepts

To use Git effectively, you need to understand a few core ideas. While Git is a powerful tool with many features, you can get very far by mastering just a handful of its components.

The Repository: Your Project's Home

A Git repository (or "repo") is a folder that contains your project's code and a hidden subfolder named .git. This .git folder is where Git stores the entire history of your project, including all the commits and branches. You can have a local repository on your computer and a remote repository stored on a server (like GitHub), which allows for backup and collaboration.
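The relationship between a local and a remote repository can be sketched without any hosting account. In the example below, a bare repository in a temporary folder stands in for a server such as GitHub; this stand-in, the remote name origin, and the README.md file are assumptions made for illustration.

```shell
# Stand-in for a hosted remote: a bare repository in a temporary folder
remote_dir="$(mktemp -d)"
git init -q --bare "$remote_dir"

# Local repository with its hidden .git folder
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
ls -a                                      # .git appears in the listing

# Create some history to upload
echo "# My ML project" > README.md
git add README.md
git commit -q -m "Add README"
git branch -M main                         # make sure the branch is named main

# Link the local repository to the remote and upload the history
git remote add origin "$remote_dir"
git push -q -u origin main
```

With a real hosting platform, the only difference is that the path in git remote add is replaced by the repository URL the platform gives you.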

The Commit: A Snapshot in Time

A commit is a snapshot of all the tracked files in your repository at a specific point in time. Each commit has a unique identifier (a hash) and is linked to the commit that came before it, forming a historical chain. Committing is a two-step process: first, you select the changes you want to include (this is called "staging"), and then you save them as a commit with a descriptive message. This process encourages you to group related changes into logical units.
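The two-step process can be seen in a small sketch. Two hypothetical files are created, but only one is staged, so only that one ends up in the commit. The repository is a throwaway created for the demo.

```shell
# Throwaway repository for the demo (assumes git is installed)
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

# Two new files, but we only want to commit one of them for now
echo "# load and clean data" > preprocess.py
echo "# train the model" > train.py

# Step 1: stage only the preprocessing script
git add preprocess.py
git status --short      # preprocess.py is staged (A), train.py is untracked (??)

# Step 2: save the staged changes as a commit
git commit -q -m "Add preprocessing script"

# Every commit gets a unique hash; --short prints an abbreviated form
git rev-parse --short HEAD
git status --short      # train.py is still waiting to be staged
```

Because train.py was never staged, it stays out of the first commit and can go into its own commit with its own message.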

Branches: Parallel Lines of Development

A branch represents an independent line of development. By default, your repository starts with a single branch, usually named main or master. This branch typically holds the stable, production-ready version of your code. When you want to work on a new feature or experiment, you create a new branch that diverges from main. This lets you work in isolation. Once your work is complete and tested, you can merge your feature branch back into the main branch, integrating your new code.

A typical Git branching workflow. Development on a new feature happens on an isolated branch (feature/add-metric) and is later merged back into the stable main branch.

A Basic Git Workflow for ML

Here are the fundamental commands you'll use to version your code in a typical solo workflow.

Initialize a repository: Navigate to your project folder in the terminal and run this command. This creates the hidden .git directory and turns your project into a Git repository.
```
git init
```
Stage your changes: After you've created or modified some files (e.g., preprocess.py), you need to tell Git which changes to include in the next commit. This is called staging.
```
# Stage a specific file
git add preprocess.py

# Or stage all changes (new, modified, and deleted files) under the current directory
git add .
```
Commit your changes: Once your changes are staged, save them to the project history with a commit. The -m flag lets you provide a descriptive message.
```
git commit -m "Add initial data preprocessing script"
```
Create and switch to a new branch: Before starting a new experiment, create a new branch. The checkout -b command creates a new branch and immediately switches to it.
```
git checkout -b experiment/new-feature-scaling
```
Now, any commits you make will be on this new branch, leaving the main branch untouched.

Merge the branch: After your experiment is successful, switch back to the main branch and merge the changes from your experiment branch.

```
# Switch back to the main branch
git checkout main

# Merge the changes from the experiment branch into main
git merge experiment/new-feature-scaling
```
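The workflow above covers the successful case. For an experiment that does not work out, the branch can be thrown away instead of merged. The sketch below shows this in a throwaway repository; the branch name experiment/robust-scaling and the file config.yaml are hypothetical.

```shell
# Throwaway repository with one commit on main (assumes git is installed)
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "scaler: standard" > config.yaml
git add config.yaml
git commit -q -m "Add baseline config"
git branch -M main                         # make sure the branch is named main

# Try an idea on its own branch
git checkout -q -b experiment/robust-scaling
echo "scaler: robust" > config.yaml
git commit -q -am "Try robust scaling"

# The experiment did not help: return to main and delete the branch
git checkout -q main
git branch -D experiment/robust-scaling    # -D forces deletion of an unmerged branch
cat config.yaml                            # main still has the baseline config
```

The lowercase form, git branch -d, refuses to delete a branch that has not been merged, which is a useful safety net when you are not sure whether the work was integrated.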

What Not to Track with Git

Git excels at versioning text-based files like Python scripts (.py), configuration files (.yaml), and text documents (.md). However, it is poorly suited for tracking large files or files that are generated automatically. Files you should generally keep out of Git include:

Large datasets (CSV, Parquet files, images)
Trained model files (.pkl, .h5, .pt)
Temporary files and environment-specific folders (e.g., __pycache__)

Attempting to store these large files in Git will quickly bloat your repository, making it slow to download and manage. To tell Git to ignore certain files and folders, you create a special file in your project's root directory called .gitignore. Each line in this file specifies a pattern for files or folders to ignore.

A typical .gitignore for a Python ML project might look like this:

```
# Ignore Python's cache
__pycache__/

# Ignore virtual environment folders
.venv/
env/

# Ignore large data files
data/
*.csv

# Ignore trained models
models/
*.pkl
```
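To confirm that ignore patterns behave as intended, you can ask Git directly. The sketch below uses a shortened version of the patterns above in a throwaway repository; the project files it creates are hypothetical and empty.

```shell
# Throwaway repository for the demo (assumes git is installed)
cd "$(mktemp -d)"
git init -q

# A shortened version of the .gitignore shown above
printf '%s\n' '__pycache__/' 'data/' '*.csv' 'models/' '*.pkl' > .gitignore

# Hypothetical project files
mkdir data models
touch train.py data/train.csv models/model.pkl

# Ask Git which pattern, if any, matches each path
git check-ignore -v data/train.csv models/model.pkl

# Only files Git is willing to track are listed as untracked
git status --short
```

Keep in mind that .gitignore only affects files Git is not already tracking. A file that was committed earlier stays tracked until you remove it from the index with git rm --cached.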

By explicitly ignoring data and models, you are acknowledging that they need a different versioning strategy. Git handles the "code" part of reproducibility, setting the stage for other tools to handle the "data" and "model" parts, which we will cover in the following sections.

