Chapter 1 highlighted the difficulties of managing machine learning projects, particularly when dealing with large datasets that do not fit well within standard Git workflows. This chapter introduces Data Version Control (DVC), an open-source tool designed to version data alongside your code, which helps address these challenges.
We will start by examining different approaches to versioning data before concentrating on DVC's mechanics and how it integrates with Git. You will learn how to:
- Use dvc add to start tracking data files and directories.
- Use dvc push and dvc pull to synchronize your data between your local machine and remote storage.

The chapter includes practical steps and concludes with a hands-on exercise where you will apply these commands to version a sample dataset. By the end of this chapter, you will be equipped to implement effective data versioning in your machine learning projects using DVC.
2.1 Data Versioning Strategies
2.2 Introducing Data Version Control (DVC)
2.3 Setting Up DVC in a Project
2.4 Tracking Data Files and Directories
2.5 Storing and Retrieving Data Versions
2.6 Connecting DVC to Remote Storage (S3, GCS, Azure Blob)
2.7 Switching Between Data Versions
2.8 Hands-on Practical: Versioning a Dataset
© 2025 ApX Machine Learning