A machine learning project is composed of multiple distinct parts: data, source code, and the trained models themselves. If a model's performance changes unexpectedly, how do you determine if the cause was a new dataset, a modification to the feature engineering code, or a different set of hyperparameters? Without a systematic way to track these components, answering such questions is difficult and often impossible. This is where versioning provides the necessary structure.
This chapter introduces the practices required to manage these components and keep your work reproducible. We will begin by establishing why reproducibility is a requirement for building maintainable and trustworthy systems. You will then learn techniques for versioning the three main artifacts of an ML project: code, data, and models.
We will also address experiment tracking, which is the process of logging the parameters, metrics, and artifacts associated with each model training run. The chapter concludes with a practical exercise where you will apply these versioning techniques to a simple machine learning project.
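As a preview of what experiment tracking involves before Section 3.5 covers it in depth, the sketch below records one training run's parameters and metrics as a timestamped JSON file. The helper name log_run, the runs directory, and the hyperparameter names are all illustrative assumptions, not part of any particular tool; dedicated tracking tools provide the same idea with richer features.

```python
import json
import time
from pathlib import Path


def log_run(params: dict, metrics: dict, run_dir: str = "runs") -> Path:
    """Write one training run's parameters and metrics to a timestamped JSON record.

    Hypothetical helper for illustration only; not a library API.
    """
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
        "metrics": metrics,
    }
    out_dir = Path(run_dir)
    out_dir.mkdir(exist_ok=True)
    out_path = out_dir / f"run_{int(time.time())}.json"
    out_path.write_text(json.dumps(record, indent=2))
    return out_path


# Example: record the hyperparameters and evaluation metric of a single run
# (the specific values shown here are placeholders).
log_run(
    params={"learning_rate": 0.01, "max_depth": 6, "n_estimators": 200},
    metrics={"accuracy": 0.87},
)
```

Even a simple record like this makes it possible to answer, after the fact, which parameters produced which metrics for a given run.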
3.1 The Importance of Reproducibility
3.2 Version Control for Code with Git
3.3 Introduction to Data Versioning
3.4 Techniques for Model Versioning
3.5 Managing Experiment Tracking
3.6 Hands-on Practical: Versioning a Simple ML Project