While Git has solved code versioning, the machine learning lifecycle introduces a parallel challenge: versioning the data, models, and intermediate artifacts that are just as important as the code itself. Without a system to manage these assets, reproducibility becomes nearly impossible. A git checkout of a specific commit does not guarantee that a model can be rebuilt if the underlying dataset has changed since that commit. Addressing this challenge requires robust data versioning and lineage, and this section presents tools and practices for building auditable, reliable ML systems around them.
Data versioning provides a mechanism to track and retrieve specific states of your datasets, similar to how Git tracks source code. Data lineage goes a step further by creating an auditable trail that connects a final model back to its origins: the exact version of the source code, the specific dataset, and the hyperparameters used to produce it.
In a traditional software project, the build is deterministic: the same code version always produces the same binary. In machine learning, the "build" process (model training) depends on three elements: code, data, and configuration.
Model = f(code, data, configuration)
A version control system like Git only manages the code and, to some extent, the configuration (e.g., in a YAML file). The data component, often consisting of gigabytes or terabytes of files, lives outside this system. If data is modified without being versioned, the link is broken, and you can no longer reliably answer questions like: Which exact version of the dataset produced the model currently in production? Can last month's training run be reproduced?
To solve this, we need tools that bring version control principles to the data and pipeline layers. We will examine two popular and philosophically different tools: DVC (Data Version Control) and Pachyderm.
DVC is designed to integrate smoothly into a standard Git-based workflow. It operates on a simple but effective principle: store large files outside of Git, but track pointers to them within Git. This allows you to manage large datasets with the same familiar commands like git checkout, git log, and git push, without bloating your Git repository.
DVC uses a remote storage backend (e.g., an S3 bucket, Google Cloud Storage, or an SSH server) to hold the actual data. Inside your Git repository, DVC creates small metadata files (ending in .dvc). These are lightweight text files containing an MD5 hash of the data and the location of the file in remote storage.
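For example, a remote can be configured once per repository; the bucket name and path below are illustrative:
# Configure an S3 bucket as the default DVC remote
dvc remote add -d storage s3://my-ml-artifacts/dvc-store
# The remote settings land in .dvc/config, which is committed to Git
git add .dvc/config
git commit -m "chore: configure DVC remote storage"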
The workflow is straightforward:
# Add a large dataset directory to DVC tracking
dvc add data/raw_images
# This creates data/raw_images.dvc
# The original directory is added to .gitignore
Next, commit the resulting .dvc file to Git so the pointer is versioned alongside your code.
# Stage the .dvc file and the .gitignore update
git add data/raw_images.dvc .gitignore
# Commit the pointer file
git commit -m "feat: track initial raw image dataset"
# This uploads files referenced in raw_images.dvc to your S3 bucket
dvc push
Now, another team member can clone the Git repository and run dvc pull to download the correct version of the data corresponding to their Git commit. Checking out a previous branch with git checkout and running dvc pull will restore the dataset to its exact state at that point in time.
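A minimal sketch of that round trip, with a placeholder repository URL and revision:
# Clone the project and fetch the data referenced by the current commit
git clone https://github.com/example/ml-project.git
cd ml-project
dvc pull
# Restore the dataset exactly as it was at an earlier revision
git checkout <earlier-commit>
dvc pull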
Beyond simple file versioning, DVC can define and execute multi-stage ML pipelines. You use dvc stage add (or the older dvc run) to specify the dependencies, commands, and outputs for each step. DVC uses this information to build a Directed Acyclic Graph (DAG) of your entire workflow.
# Example of defining a two-step pipeline
# Step 1: Preprocess data
dvc stage add -n preprocess \
  -d src/preprocess.py -d data/raw_images \
  -o data/processed_features \
  "python src/preprocess.py --in data/raw_images --out data/processed_features"
# Step 2: Train model
dvc stage add -n train \
  -d src/train.py -d data/processed_features \
  -p params.yaml:train \
  -o models/model.pkl \
  "python src/train.py --in data/processed_features --out models/model.pkl"
This generates a dvc.yaml file that stores the pipeline structure. DVC can now visualize the lineage, showing exactly how your data, code, and parameters are connected to your final model.
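For the two stages defined above, the generated dvc.yaml looks roughly like this (field ordering can vary slightly between DVC versions):
stages:
  preprocess:
    cmd: python src/preprocess.py --in data/raw_images --out data/processed_features
    deps:
      - data/raw_images
      - src/preprocess.py
    outs:
      - data/processed_features
  train:
    cmd: python src/train.py --in data/processed_features --out models/model.pkl
    deps:
      - data/processed_features
      - src/train.py
    params:
      - train
    outs:
      - models/model.pkl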
A DVC pipeline DAG, illustrating the lineage from raw data and source code to a trained model artifact. Changes to any dependency, such as src/preprocess.py or data/raw_images, are detected by DVC.
By running dvc repro, DVC will intelligently re-execute only the stages of the pipeline that have been affected by changes, saving significant computation time.
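For example, after editing the train section of params.yaml, only the train stage is re-executed, and the lineage can be inspected directly in the terminal:
# Re-run only the stages whose code, data, or parameters changed
dvc repro
# Print the pipeline DAG to inspect the lineage
dvc dag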
Pachyderm offers a different approach. It is a data-centric platform built on Kubernetes, designed for automating large-scale data transformations and ML pipelines. Where DVC extends the Git workflow for individual developers, Pachyderm provides an automated, cluster-level system for data processing.
Pachyderm's architecture is based on two primary objects: repositories and pipelines. Repositories are versioned, Git-like stores of data in which every change is recorded as a commit. Pipelines are declarative specifications that read from one or more input repositories, run your code in containers, and write results to an output repository.
The most important feature is that pipelines are triggered automatically by new data commits. When you commit new data to an input repository, Pachyderm automatically executes the pipeline, processes the new data, and places the results in an output repository as a new commit.
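A sketch of this interaction with the pachctl CLI, assuming a Pachyderm 2.x cluster with an images input repository and a pipeline that writes to resized-images (as in the figure below):
# Commit a new file to the input repository
pachctl put file images@master:/img_0001.png -f img_0001.png
# The subscribed pipeline runs automatically; its results appear
# as a new commit in the output repository
pachctl list commit resized-images@master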
This data-centric triggering mechanism ensures that lineage is automatically captured and enforced. Every piece of output data in Pachyderm can be traced back to the exact input data commits and the pipeline version that produced it. This provides "global" lineage across all data and pipelines in the cluster.
Data-driven pipeline execution in Pachyderm. A commit to the images repository triggers the resize-images pipeline, which produces a new commit in the resized-images repository. This output commit, along with a commit to the labels repository, triggers the train-model pipeline.
This architecture is exceptionally well-suited for production environments where data arrives continuously and pipelines need to be executed reliably and automatically.
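As a sketch, a minimal specification for the resize-images pipeline from the figure could look like the following; the container image and command are placeholders:
pipeline:
  name: resize-images
input:
  pfs:
    repo: images
    glob: "/*"
transform:
  image: registry.example.com/resize-worker:1.0
  cmd: ["python", "/app/resize.py", "/pfs/images", "/pfs/out"]
Registering it with pachctl create pipeline -f resize-images.yaml is all that is needed; from then on, every commit to images produces a corresponding commit in resized-images.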
Choosing between DVC and Pachyderm depends on your team's workflow, scale, and infrastructure.
| Aspect | DVC | Pachyderm |
|---|---|---|
| Workflow | Git-centric. Integrates with existing Git workflows. | Data-centric. Pipelines trigger on data commits. |
| Execution | Imperative. User runs commands like dvc repro. | Declarative. User defines a pipeline spec; execution is automatic. |
| Environment | Local-first. Runs on a developer's machine or CI/CD runner. | Cluster-first. Runs as a platform on Kubernetes. |
| Orchestration | External. Relies on scripts, Makefiles, or CI/CD systems. | Built-in. Orchestration is a core feature of the platform. |
| Best For | Individuals and teams prioritizing a developer-friendly, Git-integrated experience for experiments and projects. | Organizations building a centralized, automated ML platform on Kubernetes for production workloads. |
Both tools solve the versioning and lineage problem, but they do so from different architectural standpoints. DVC helps the developer, while Pachyderm supports the platform. In many advanced environments, they can even be used together. For example, a data scientist might use DVC locally for experimentation, and once a model is ready, the pipeline logic is translated into a Pachyderm pipeline for production automation.
By incorporating these tools, you transform data from a transient, unmanaged asset into a versioned, auditable component of your ML system. This provides the foundation for reproducibility, easier debugging, and the governance required for enterprise-grade AI.