As we established in the previous chapter, managing data effectively is fundamental to creating reproducible machine learning workflows. Standard version control systems like Git excel at tracking changes in text-based code files, but they struggle with the scale and nature of typical ML datasets. So, how do teams manage evolving datasets? Let's examine some common strategies, ranging from simple manual methods to more sophisticated approaches, highlighting their strengths and weaknesses.
Perhaps the most basic approach is simply renaming files or directories to indicate different versions. You might have encountered folders like `data_processed_v1` and `data_processed_v2`, or files such as `features_final.csv` and `features_final_really_final.csv`.
While simple, manual versioning lacks the rigor needed for serious ML development and reproducibility: there is no record of what changed or why, no guarantee that a name corresponds to unique content, and no link between a dataset version and the code or experiment that produced it.
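One concrete symptom of name-based versioning is that nothing ties a name to its contents. Content hashing, which the specialized tools discussed later in this chapter rely on, makes a version verifiable. A minimal Python sketch (the file names are hypothetical):

```python
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    """Stream the file through MD5 so large datasets never need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Two differently named "versions" can hold identical bytes.
v1 = Path("data_processed_v1.csv")
v2 = Path("data_processed_v2.csv")
v1.write_text("id,label\n1,cat\n")
v2.write_text("id,label\n1,cat\n")
print(file_md5(v1) == file_md5(v2))  # True: the names suggest two versions, the content says one
```

A hash identifies content regardless of what the file is called, which is exactly the property the naming scheme above cannot provide.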
Cloud storage providers (AWS S3, Google Cloud Storage, Azure Blob Storage) often offer their own versioning features. You can enable versioning on a storage bucket, and the provider will keep previous versions of objects when they are overwritten or deleted.
Cloud storage versioning is useful for backup and disaster recovery but doesn't directly address the tight coupling needed between code, data, and experiments in ML.
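The semantics providers implement are easy to model: an overwrite appends a new object version rather than replacing the old bytes, and older versions remain retrievable by id. The following is a toy in-memory model of that behavior, not any provider's API:

```python
from collections import defaultdict

class VersionedBucket:
    """Toy model of provider-side object versioning: every overwrite
    appends a new version instead of replacing the old bytes."""
    def __init__(self):
        self._versions = defaultdict(list)

    def put(self, key, data):
        self._versions[key].append(data)
        return len(self._versions[key]) - 1  # version id of the new object

    def get(self, key, version=None):
        versions = self._versions[key]
        return versions[-1] if version is None else versions[version]

bucket = VersionedBucket()
bucket.put("data/train.csv", b"v1 bytes")
bucket.put("data/train.csv", b"v2 bytes")       # overwrite keeps the old version
print(bucket.get("data/train.csv"))             # b'v2 bytes' (latest)
print(bucket.get("data/train.csv", version=0))  # b'v1 bytes' (still retrievable)
```

Notice what is missing from this model: nothing records which Git commit, experiment, or pipeline produced version 0 versus version 1, which is the gap the rest of this chapter addresses.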
Git LFS is a Git extension designed to handle large files more efficiently. Instead of storing large binary files directly in the Git repository history (which bloats the repository quickly), Git LFS stores pointers (small text files) in Git. The actual large files are stored on a separate LFS server (which could be self-hosted or provided by services like GitHub, GitLab, Bitbucket). When you check out a commit, Git LFS downloads the required large files based on the pointers.
Developers continue to use the familiar Git workflow (`git add`, `git commit`, `git push`, `git pull`), with LFS handling the tracked files behind the scenes. Git LFS is a definite improvement over storing large files directly in Git, but it's a general-purpose solution for large files, not a tailored solution for the specific needs of data versioning in machine learning: it has no notion of datasets, pipelines, or experiments.
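The pointer mechanism can be sketched in a few lines. The pointer text below follows the published Git LFS pointer format (`version`/`oid`/`size` lines); the store directory and file names are invented for illustration:

```python
import hashlib
from pathlib import Path

def make_pointer(path: Path, store: Path) -> str:
    """Stash the real bytes in a separate store keyed by SHA-256,
    and replace the file's content with a small LFS-style pointer."""
    data = path.read_bytes()
    oid = hashlib.sha256(data).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    (store / oid).write_bytes(data)            # plays the role of the LFS server
    pointer = (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )
    path.write_text(pointer)                   # Git now tracks only this small stub
    return oid

big = Path("model_weights.bin")
big.write_bytes(b"\x00" * 4096)                # stand-in for a large binary
oid = make_pointer(big, Path(".lfs-store"))
print(big.stat().st_size)                      # a few hundred bytes at most: the pointer, not the data
```

The repository history stays small because only the stub changes hands through Git; the heavy bytes travel separately.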
Recognizing the limitations of the above methods, specialized tools have emerged specifically for versioning data and models within ML projects. Data Version Control (DVC), the focus of this chapter, is a prime example. These tools typically work alongside Git, leveraging Git for code versioning while providing dedicated mechanisms for data.
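A rough sketch of the mechanism these tools build on: a small metafile that Git can track, pointing at content cached by hash, plus a way to restore the exact version the metafile names. Real DVC `.dvc` files are YAML with more fields; the JSON metafile, cache directory name, and function names here are invented for illustration:

```python
import hashlib
import json
from pathlib import Path

CACHE = Path(".dvc-style-cache")

def add(data: Path) -> Path:
    """Cache the data by content hash; return the small metafile Git would track."""
    raw = data.read_bytes()
    md5 = hashlib.md5(raw).hexdigest()
    CACHE.mkdir(exist_ok=True)
    (CACHE / md5).write_bytes(raw)
    meta = Path(str(data) + ".dvc")
    meta.write_text(json.dumps({"md5": md5, "path": data.name}))
    return meta

def checkout(meta: Path) -> Path:
    """Restore exactly the data version the metafile points to."""
    info = json.loads(meta.read_text())
    out = Path(info["path"])
    out.write_bytes((CACHE / info["md5"]).read_bytes())
    return out

data = Path("train.csv")
data.write_text("id,label\n1,cat\n")
meta = add(data)        # metafile is tiny; the data sits in the cache
data.unlink()           # simulate a fresh clone that has the metafile but not the data
checkout(meta)          # bring back exactly the version the metafile names
print(data.read_text())
```

Because the metafile is plain text, committing it alongside the code pins a specific dataset version to a specific commit, which is the coupling the earlier approaches lack.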
For data, you use analogous commands (`dvc add`, `dvc push`, `dvc pull`) alongside the usual Git commands for code. Here's a simplified comparison:
| Feature | Manual Copying | Cloud Versioning | Git LFS | Specialized Tools (DVC) |
|---|---|---|---|---|
| Git Integration | None | Poor | Good | Excellent (alongside Git) |
| Storage | Local copies | Cloud provider | Separate LFS server | Flexible (cloud/local) |
| Reproducibility | Very low | Low | Moderate | High |
| ML Pipeline Aware | No | No | No | Yes |
| Scalability | Poor | Good | Moderate | Good |
| Granularity | Manual | File-level | File-level | File/directory/dataset |
Given the challenges outlined, specialized tools offer the most comprehensive solution for managing data in a reproducible ML setting. They bridge the gap left by Git and generic file storage systems. In the following sections, we will explore how DVC implements these principles, providing a practical and powerful way to version your data alongside your code.
© 2025 ApX Machine Learning