While Git has solved code versioning, the machine learning lifecycle introduces a parallel challenge: versioning the data, models, and intermediate artifacts that are just as important as the code itself. Without a system to manage these assets, reproducibility becomes nearly impossible. A git checkout of a specific commit does not guarantee that a model can be rebuilt if the underlying dataset has changed since that commit. Addressing this challenge requires robust data versioning and lineage, and this section presents tools and practices for building auditable, reliable ML systems around them.
Data versioning provides a mechanism to track and retrieve specific states of your datasets, similar to how Git tracks source code. Data lineage goes a step further by creating an auditable trail that connects a final model back to its origins: the exact version of the source code, the specific dataset, and the hyperparameters used to produce it.
In a traditional software project, the build is deterministic: the same code version always produces the same binary. In machine learning, the "build" process (model training) depends on three elements: code, data, and configuration.
Model = f(code, data, configuration)
A version control system like Git only manages the code and, to some extent, the configuration (e.g., in a YAML file). The data component, often consisting of gigabytes or terabytes of files, lives outside this system. If data is modified without being versioned, the link is broken, and you can no longer reliably answer questions like: Which exact version of the dataset produced the model currently in production? Can last month's training run be reproduced?
To solve this, we need tools that bring version control principles to the data and pipeline layers. We will examine two popular and philosophically different tools: DVC (Data Version Control) and Pachyderm.
DVC is designed to integrate smoothly into a standard Git-based workflow. It operates on a simple but effective principle: store large files outside of Git, but track pointers to them within Git. This allows you to manage large datasets with the same familiar commands like git checkout, git log, and git push, without bloating your Git repository.
DVC uses a remote storage backend (e.g., an S3 bucket, Google Cloud Storage, or an SSH server) to hold the actual data. Inside your Git repository, DVC creates small metadata files (ending in .dvc). These are lightweight text files containing an MD5 hash of the data and the location of the file in remote storage.
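For example, a remote can be configured once per repository; the bucket name and path below are illustrative:
# Configure an S3 bucket as the default DVC remote
dvc remote add -d storage s3://my-ml-artifacts/dvc-store
# The remote settings land in .dvc/config, which is committed to Git
git add .dvc/config
git commit -m "chore: configure DVC remote storage"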
The workflow is straightforward:
# Add a large dataset directory to DVC tracking
dvc add data/raw_images
# This creates data/raw_images.dvc
# The original directory is added to .gitignore
Next, commit the resulting .dvc file to Git so the pointer is versioned alongside your code.
# Stage the .dvc file and the .gitignore update
git add data/raw_images.dvc .gitignore
# Commit the pointer file
git commit -m "feat: track initial raw image dataset"
# This uploads files referenced in raw_images.dvc to your S3 bucket
dvc push
Now, another team member can clone the Git repository and run dvc pull to download the correct version of the data corresponding to their Git commit. Checking out a previous branch with git checkout and running dvc pull will restore the dataset to its exact state at that point in time.
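A minimal sketch of that round trip, with a placeholder repository URL and revision:
# Clone the project and fetch the data referenced by the current commit
git clone https://github.com/example/ml-project.git
cd ml-project
dvc pull
# Restore the dataset exactly as it was at an earlier revision
git checkout <earlier-commit>
dvc pull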
Beyond simple file versioning, DVC can define and execute multi-stage ML pipelines. You use dvc stage add (or the older dvc run) to specify the dependencies, commands, and outputs for each step. DVC uses this information to build a Directed Acyclic Graph (DAG) of your entire workflow.
# Example of defining a two-step pipeline
# Step 1: Preprocess data
dvc stage add -n preprocess \
  -d src/preprocess.py -d data/raw_images \
  -o data/processed_features \
  "python src/preprocess.py --in data/raw_images --out data/processed_features"
# Step 2: Train model
dvc stage add -n train \
  -d src/train.py -d data/processed_features \
  -p params.yaml:train \
  -o models/model.pkl \
  "python src/train.py --in data/processed_features --out models/model.pkl"
This generates a dvc.yaml file that stores the pipeline structure. DVC can now visualize the lineage, showing exactly how your data, code, and parameters are connected to your final model.
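For the two stages defined above, the generated dvc.yaml looks roughly like this (field ordering can vary slightly between DVC versions):
stages:
  preprocess:
    cmd: python src/preprocess.py --in data/raw_images --out data/processed_features
    deps:
      - data/raw_images
      - src/preprocess.py
    outs:
      - data/processed_features
  train:
    cmd: python src/train.py --in data/processed_features --out models/model.pkl
    deps:
      - data/processed_features
      - src/train.py
    params:
      - train
    outs:
      - models/model.pkl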
A DVC pipeline DAG, illustrating the lineage from raw data and source code to a trained model artifact. Changes to any dependency, such as src/preprocess.py or data/raw_images, are detected by DVC.
By running dvc repro, DVC will intelligently re-execute only the stages of the pipeline that have been affected by changes, saving significant computation time.
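For example, after editing the train section of params.yaml, only the train stage is re-executed, and the lineage can be inspected directly in the terminal:
# Re-run only the stages whose code, data, or parameters changed
dvc repro
# Print the pipeline DAG to inspect the lineage
dvc dag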
Pachyderm offers a different approach. It is a data-centric platform built on Kubernetes, designed for automating large-scale data transformations and ML pipelines. Where DVC extends the Git workflow for individual developers, Pachyderm provides an automated, cluster-level system for data processing.
Pachyderm's architecture is based on two primary objects: repositories and pipelines. Repositories are versioned, Git-like stores of data in which every change is recorded as a commit. Pipelines are declarative specifications that read from one or more input repositories, run your code in containers, and write results to an output repository.
The most important feature is that pipelines are triggered automatically by new data commits. When you commit new data to an input repository, Pachyderm automatically executes the pipeline, processes the new data, and places the results in an output repository as a new commit.
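A sketch of this interaction with the pachctl CLI, assuming a Pachyderm 2.x cluster with an images input repository and a pipeline that writes to resized-images (as in the figure below):
# Commit a new file to the input repository
pachctl put file images@master:/img_0001.png -f img_0001.png
# The subscribed pipeline runs automatically; its results appear
# as a new commit in the output repository
pachctl list commit resized-images@master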
This data-centric triggering mechanism ensures that lineage is automatically captured and enforced. Every piece of output data in Pachyderm can be traced back to the exact input data commits and the pipeline version that produced it. This provides "global" lineage across all data and pipelines in the cluster.
Data-driven pipeline execution in Pachyderm. A commit to the images repository triggers the resize-images pipeline, which produces a new commit in the resized-images repository. This output commit, along with a commit to the labels repository, triggers the train-model pipeline.
This architecture is exceptionally well-suited for production environments where data arrives continuously and pipelines need to be executed reliably and automatically.
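As a sketch, a minimal specification for the resize-images pipeline from the figure could look like the following; the container image and command are placeholders:
pipeline:
  name: resize-images
input:
  pfs:
    repo: images
    glob: "/*"
transform:
  image: registry.example.com/resize-worker:1.0
  cmd: ["python", "/app/resize.py", "/pfs/images", "/pfs/out"]
Registering it with pachctl create pipeline -f resize-images.yaml is all that is needed; from then on, every commit to images produces a corresponding commit in resized-images.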
Choosing between DVC and Pachyderm depends on your team's workflow, scale, and infrastructure.
| Aspect | DVC | Pachyderm |
|---|---|---|
| Workflow | Git-centric. Integrates with existing Git workflows. | Data-centric. Pipelines trigger on data commits. |
| Execution | Imperative. User runs commands like dvc repro. | Declarative. User defines a pipeline spec; execution is automatic. |
| Environment | Local-first. Runs on a developer's machine or CI/CD runner. | Cluster-first. Runs as a platform on Kubernetes. |
| Orchestration | External. Relies on scripts, Makefiles, or CI/CD systems. | Built-in. Orchestration is a core feature of the platform. |
| Best For | Individuals and teams prioritizing a developer-friendly, Git-integrated experience for experiments and projects. | Organizations building a centralized, automated ML platform on Kubernetes for production workloads. |
Both tools solve the versioning and lineage problem, but they do so from different architectural standpoints. DVC helps the developer, while Pachyderm supports the platform. In many advanced environments, they can even be used together. For example, a data scientist might use DVC locally for experimentation, and once a model is ready, the pipeline logic is translated into a Pachyderm pipeline for production automation.
By incorporating these tools, you transform data from a transient, unmanaged asset into a versioned, auditable component of your ML system. This provides the foundation for reproducibility, easier debugging, and the governance required for enterprise-grade AI.