Now that we've discussed the concepts behind DVC, let's put them into practice. This hands-on exercise will guide you through the fundamental DVC workflow: initializing DVC in a project, tracking a dataset, configuring remote storage, pushing data, modifying it, and switching between versions.
Before you begin, ensure you have Python, Git, and DVC installed. DVC can be installed with pip (quote the package spec so your shell doesn't try to expand the brackets):
pip install "dvc[s3,gcs,azure]"  # Or just: pip install dvc
First, create a new directory for our practice project and navigate into it. Then, initialize both Git and DVC.
mkdir dvc-practice
cd dvc-practice
# Initialize Git repository
git init
Initialized empty Git repository in /path/to/dvc-practice/.git/
# Initialize DVC
dvc init
Initialized DVC repository.
You can now commit the changes to git.
+---------------------------------------------------------------------+
| |
| DVC has enabled anonymous aggregate usage analytics. |
| Read the analytics documentation (and how to opt-out) here: |
| <https://dvc.org/doc/user-guide/analytics> |
| |
+---------------------------------------------------------------------+
What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>
Running dvc init performs several actions:

- Creates a .dvc directory to store DVC's internal information and configuration.
- Creates a .dvcignore file, which works like .gitignore but for DVC-specific patterns.
- Adds .dvc/cache and other internal paths to your .gitignore file to prevent Git from tracking the data cache.

Let's commit these initial setup files to Git:
git add .dvc .dvcignore .gitignore
git commit -m "Initialize DVC"
We need some data to version. Let's create a data directory and add a simple CSV file inside it. You can create data/samples.csv manually or use this small Python script:
```python
# create_data.py
import os
import csv

os.makedirs('data', exist_ok=True)

header = ['id', 'value']
data = [
    [1, 10.5],
    [2, 15.2],
    [3, 20.0],
]

with open('data/samples.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(data)

print("Created data/samples.csv")
```
Save this as create_data.py and run it:
python create_data.py
Created data/samples.csv
You should now have a data directory containing samples.csv.
Now, let's tell DVC to start tracking this dataset.
dvc add data/samples.csv
You'll see output indicating that DVC is processing the file. This command does three main things:

- Copies data/samples.csv into DVC's cache (located inside .dvc/cache). The cached file is named based on its content hash (typically MD5).
- Creates data/samples.csv.dvc. This small file acts as a pointer or metadata file containing information about the original data, including its hash.
- Updates .gitignore to ensure the actual data file (data/samples.csv) isn't tracked by Git.

Let's examine the .dvc file:
file:
cat data/samples.csv.dvc
The output will look something like this (the hash will differ based on the exact file content):
```yaml
outs:
- md5: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6 # Example hash
  path: samples.csv
```
This file tells DVC that samples.csv (relative to the .dvc file's location) is associated with the specified MD5 hash.
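To make the pointer-file idea concrete, here is a minimal sketch that extracts the md5 and path fields from .dvc-style text. The hand-rolled parser is purely illustrative (real tools would use a YAML parser); the function name read_pointer is our own, not part of DVC.

```python
# Minimal sketch: pull the md5 and path fields out of .dvc-style pointer text.
# Illustrative only -- a real tool would use a proper YAML parser.

def read_pointer(text):
    """Return a dict of the key: value fields found in pointer-file text."""
    fields = {}
    for line in text.splitlines():
        line = line.strip().lstrip('- ')  # drop list markers and indentation
        if ':' in line:
            key, _, value = line.partition(':')
            # Drop trailing comments such as "# Example hash"
            fields[key.strip()] = value.split('#')[0].strip()
    return fields

pointer = """\
outs:
- md5: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6
  path: samples.csv
"""
info = read_pointer(pointer)
print(info['md5'], info['path'])
# a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6 samples.csv
```

Git versions this tiny text file, while the data it points to lives in the cache and the remote.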
The crucial step now is to commit the .dvc placeholder file and the updated .gitignore to Git. The actual data (data/samples.csv) remains untracked by Git because it's listed in .gitignore.
git add data/samples.csv.dvc .gitignore
git commit -m "Track initial dataset using DVC"
Your Git history now records the state of your data (via the .dvc file) at this point, without storing the large data file itself.
To share data or back it up, we need to configure remote storage. For simplicity in this exercise, we'll use a directory on your local filesystem outside the project directory to simulate a remote location. In a real project, you'd typically use S3, GCS, Azure Blob, or another supported backend.
# Create a directory to act as remote storage (adjust path if needed)
mkdir /tmp/dvc-practice-storage
# Configure this directory as a DVC remote named 'myremote'
# The -d flag makes it the default remote
dvc remote add -d myremote /tmp/dvc-practice-storage
This command updates the DVC configuration file located at .dvc/config. Let's commit this configuration change:
git add .dvc/config
git commit -m "Configure local DVC remote storage"
Now, your project knows where to push and pull data from.
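After the remote add command, .dvc/config will contain something like the following (a sketch; exact formatting can vary between DVC versions):

```ini
[core]
    remote = myremote
['remote "myremote"']
    url = /tmp/dvc-practice-storage
```

Because this file is committed to Git, anyone who clones the repository inherits the remote configuration.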
With the remote configured, we can push the tracked data (currently residing in the local cache) to the remote storage.
dvc push
DVC checks the .dvc files associated with the current Git commit, finds the corresponding data files in the local cache (.dvc/cache), and uploads them to the myremote location (/tmp/dvc-practice-storage). If you inspect the /tmp/dvc-practice-storage directory, you'll find subdirectories named after the first two characters of the MD5 hash, containing the actual data file (also named by its hash).
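This content-addressed layout is easy to reproduce yourself. The sketch below shows the general scheme (hash the bytes, use the first two hex characters as a directory); the helper name cache_path is ours, and the exact layout under .dvc/cache can differ between DVC versions:

```python
import hashlib
import os

# Sketch of DVC-style content addressing: the storage path for a file is
# derived from the MD5 hash of its contents -- the first two hex characters
# become a directory, the remaining thirty become the file name.

def cache_path(content: bytes, cache_dir='.dvc/cache'):
    digest = hashlib.md5(content).hexdigest()
    return os.path.join(cache_dir, digest[:2], digest[2:])

csv_bytes = b'id,value\n1,10.5\n2,15.2\n3,20.0\n'
print(cache_path(csv_bytes))
```

Naming files by their content hash is what makes deduplication work: identical content always maps to the same path, so it is stored only once.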
Let's simulate updating our dataset. Modify data/samples.csv by adding a new row or changing a value. For instance, add the line 4,25.8 to the end of the file.
id,value
1,10.5
2,15.2
3,20.0
4,25.8 # New row
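If you prefer to script the edit rather than open an editor, a snippet like this has the same effect:

```python
import os

# Append the new row (4,25.8) to the dataset -- the same effect as
# editing data/samples.csv by hand.
os.makedirs('data', exist_ok=True)
with open('data/samples.csv', 'a', newline='') as f:
    f.write('4,25.8\n')
```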
Now, ask DVC about the status of your data:
dvc status
DVC compares the current data/samples.csv file with the hash stored in data/samples.csv.dvc and reports that it has been modified.
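Conceptually, the check behind dvc status is just a hash comparison: re-hash the workspace file and compare against the hash recorded in the pointer file. A minimal sketch (the function is hypothetical, not DVC's API):

```python
import hashlib

# Sketch of the idea behind `dvc status`: a file counts as modified when
# the hash of its current contents differs from the recorded hash.

def is_modified(current_content: bytes, recorded_md5: str) -> bool:
    return hashlib.md5(current_content).hexdigest() != recorded_md5

original = b'id,value\n1,10.5\n2,15.2\n3,20.0\n'
recorded = hashlib.md5(original).hexdigest()

print(is_modified(original, recorded))                 # False: unchanged
print(is_modified(original + b'4,25.8\n', recorded))   # True: row appended
```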
To track this new version, use dvc add again:
dvc add data/samples.csv
This updates the hash inside data/samples.csv.dvc to reflect the new content and copies the modified file to the cache.
Commit the updated .dvc file to Git to record this new version:
git add data/samples.csv.dvc
git commit -m "Update dataset (v2)"
Finally, push the new version of the data to the remote storage:
dvc push
Only the new data file (corresponding to the new hash) will be uploaded.
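Because remote storage is keyed by content hash, a push only needs to transfer hashes the remote doesn't already hold. A small sketch of that planning step (plan_push is a hypothetical helper, not a DVC function):

```python
import hashlib

# Sketch of why `dvc push` uploads only new content: any hash already
# present on the remote can be skipped.

def plan_push(local_hashes, remote_hashes):
    """Return the hashes that actually need uploading."""
    return sorted(set(local_hashes) - set(remote_hashes))

v1 = hashlib.md5(b'v1 data').hexdigest()
v2 = hashlib.md5(b'v2 data').hexdigest()

remote = {v1}                       # v1 was pushed earlier
print(plan_push([v1, v2], remote))  # only v2's hash needs uploading
```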
Imagine you've cloned this repository on a new machine, or you've accidentally deleted your local data or cache. Let's simulate this:
# Remove the data file from the working directory
rm data/samples.csv
# (Optional but illustrative) Remove the local DVC cache
rm -rf .dvc/cache
If you run git status, Git won't notice the missing data/samples.csv because it's ignored. However, dvc status will show that the data file tracked by data/samples.csv.dvc is missing from the workspace.
To restore the data corresponding to the current Git commit (which points to v2 of the data), use dvc pull:
dvc pull
DVC consults data/samples.csv.dvc, finds the required hash, downloads the corresponding file from myremote storage into the local cache (if not already there), and places a copy (or link, depending on configuration) at data/samples.csv. Verify that data/samples.csv is now restored with the v2 content (including the row 4,25.8).
This is where the synergy between Git and DVC becomes apparent. Your Git history tracks different versions of the .dvc pointer files. You can check out an older commit to work with an older version of the data.
First, find the commit hash for the initial dataset version (you can use git log). Let's assume the commit message was "Track initial dataset using DVC".
# Check out the previous commit (adjust 'HEAD~1' if needed)
git checkout HEAD~1
Note: switching to 'HEAD~1'.
...
HEAD is now at <commit_hash> Track initial dataset using DVC
Now, your data/samples.csv.dvc file contains the original hash. However, the actual data/samples.csv file in your workspace might still be the v2 version (or missing if you followed the previous step closely). Use dvc checkout to synchronize your workspace data with the .dvc files in the currently checked-out Git commit:
dvc checkout
M data/samples.csv # DVC indicates it's modifying the file
Check the content of data/samples.csv. It should now contain the original data (without the row 4,25.8).
To get back to the latest version:
# Switch back to the main branch (or your working branch)
git checkout main # Or your branch name, e.g., master
# Synchronize data with the latest .dvc file
dvc checkout
M data/samples.csv
Verify that data/samples.csv again contains the v2 data.
In this hands-on exercise, you successfully used DVC's core commands:

- dvc init: Initialized DVC in a Git repository.
- dvc add: Started tracking data files, creating .dvc metadata files.
- git commit: Saved versions of the .dvc files (data pointers) in Git history.
- dvc remote add: Configured a location (a local directory in this case) to store data.
- dvc push: Uploaded data files from the local cache to remote storage.
- dvc pull: Downloaded data files from remote storage based on .dvc files.
- git checkout: Switched between different code/data pointer versions.
- dvc checkout: Synchronized the workspace data to match the version specified by the .dvc files in the current Git commit.

You've seen how Git tracks the code and the references to data versions, while DVC manages the actual data files and their synchronization with remote storage. This combination allows you to version large datasets effectively alongside your code, forming a foundation for reproducible machine learning projects.
© 2025 ApX Machine Learning