As we've discussed, machine learning projects are dynamic. Code changes, data evolves, hyperparameters are tuned, and even the underlying software libraries get updated. This constant flux makes it surprisingly difficult to answer questions like: "How did we get this specific result?" or "Can we recreate the model we deployed three months ago?" This is where the concept of reproducibility becomes essential.
In traditional software development, reproducibility often means that, given the same version of the code and the same input data, running the program produces the exact same output. While this is a good starting point, machine learning adds layers of complexity. Simply having the code (e.g., your Python training script) isn't enough.
So, what does reproducibility mean in the context of machine learning? It signifies the ability to recreate a specific outcome or result from the ML workflow using the same components that originally produced it. This typically requires controlling and recording several interconnected elements. The core components required to reproduce a machine learning result are specific versions of the code, the data, the configuration (including hyperparameters), and the computational environment.
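As a concrete illustration, the snippet below sketches one way to capture these components at the end of a training run. It is a minimal sketch, not a tool introduced by this course; the function name, file paths, and configuration keys are all hypothetical, and it assumes the project lives in a git repository.

```python
import hashlib
import json
import platform
import subprocess
import sys

def record_run_metadata(config, data_path, metrics, output_path="run_metadata.json"):
    """Record the components needed to reproduce a run: code, data, config, environment."""
    # Code version: the current git commit (assumes the project is a git repository).
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()

    # Data version: a content hash of the dataset file used for this run.
    with open(data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()

    # Environment: Python version and the installed packages.
    packages = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"], capture_output=True, text=True
    ).stdout.splitlines()

    metadata = {
        "code_commit": commit,
        "data_sha256": data_hash,
        "config": config,                 # hyperparameters and other settings
        "python_version": platform.python_version(),
        "packages": packages,
        "metrics": metrics,               # the result this run produced
    }
    with open(output_path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata

# Hypothetical usage after a training run:
# record_run_metadata(
#     config={"learning_rate": 0.01, "epochs": 20},
#     data_path="data/train.csv",
#     metrics={"validation_accuracy": 0.93},
# )
```

Dedicated data versioning and experiment tracking tools automate this kind of bookkeeping, but the information they capture is essentially the same.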
Achieving reproducibility in ML means that if you (or a colleague) have access to these recorded components for a past experiment, you should be able to rerun the workflow, obtain the same result, and trace exactly how that result was produced.
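Continuing the hypothetical sketch above, a later rerun could then be checked against the originally recorded result. The file name and metric key are the illustrative ones used earlier, and the small tolerance is an assumption made because some operations (for example, GPU kernels or parallel data loading) are not always bitwise deterministic.

```python
import json

def reproduces_original(metadata_path, rerun_accuracy, tolerance=1e-6):
    """Compare a rerun's metric with the one recorded for the original run."""
    with open(metadata_path) as f:
        recorded = json.load(f)
    original = recorded["metrics"]["validation_accuracy"]
    # Exact equality is the ideal; a small tolerance accommodates
    # sources of nondeterminism that are hard to eliminate entirely.
    return abs(original - rerun_accuracy) <= tolerance

# Hypothetical usage:
# reproduces_original("run_metadata.json", rerun_accuracy=0.93)
```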
Without actively managing these components, you risk ending up with "works-on-my-machine" scenarios, untraceable performance regressions, and an inability to reliably validate or deploy your models. The goal isn't just academic purity; it's about building robust, maintainable, and trustworthy machine learning systems. The following sections will introduce foundational concepts for versioning data and tracking experiments, which are the practical mechanisms for achieving this reproducibility.