Okay, you've initialized DVC in your project repository. Now, how do you actually tell DVC which data files or directories it should manage? This is where the dvc add command comes into play. It's the primary way to bring your datasets, models, or other large artifacts under DVC's control, replacing the large files themselves with small placeholder files that Git can track efficiently.
Think of dvc add as the equivalent of git add, but specifically designed for data files that you don't want to store directly in your Git history. When you run dvc add on a file or directory, DVC performs several actions behind the scenes:
1. Hashing: It calculates a content hash (checksum) of the file or directory.
2. Caching: It stores the data in the DVC cache, located by default in .dvc/cache. The data is stored using the hash as part of its identifier, preventing duplication even if the same file exists in multiple places in your project.
3. .dvc File: It creates a small text file in the original location of your data, adding .dvc to the original name (e.g., data/images.zip becomes data/images.zip.dvc). This file contains metadata, including the hash of the original data, its size, and its path. This .dvc file acts as a pointer or placeholder.
4. .gitignore: DVC automatically adds a pattern matching the original data file or directory to a .gitignore file in the same directory as the data (e.g., data/.gitignore), creating that file if it does not exist. This is essential because it instructs Git to ignore the large data file(s), preventing accidental commits to your Git repository.
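The caching step can be illustrated with a minimal shell sketch of content-addressed storage. This is not DVC's implementation: the demo_cache directory, the store function, and the two-character subdirectory layout are assumptions chosen for illustration, and the sketch relies on the coreutils md5sum command.

```shell
# Illustrative only: a toy content-addressed cache, not DVC's real code.
workdir=$(mktemp -d)
cache="$workdir/demo_cache"      # hypothetical stand-in for .dvc/cache
mkdir -p "$cache"

# store FILE: copy FILE into the cache under a name derived from its MD5
# digest, then print that digest.
store() {
    digest=$(md5sum "$1" | cut -d ' ' -f 1)
    prefix=$(printf '%s' "$digest" | cut -c 1-2)
    rest=$(printf '%s' "$digest" | cut -c 3-)
    mkdir -p "$cache/$prefix"
    # Identical content always maps to the same path, so a second copy
    # of the same data adds nothing new to the cache.
    [ -e "$cache/$prefix/$rest" ] || cp "$1" "$cache/$prefix/$rest"
    printf '%s\n' "$digest"
}

printf 'sepal_length,species\n5.1,setosa\n' > "$workdir/a.csv"
cp "$workdir/a.csv" "$workdir/b.csv"   # same content, different name

hash_a=$(store "$workdir/a.csv")
hash_b=$(store "$workdir/b.csv")
cached_files=$(find "$cache" -type f | wc -l | tr -d ' ')

echo "a.csv -> $hash_a"
echo "b.csv -> $hash_b"
echo "files in cache: $cached_files"
```

Both files produce the same digest, so only one copy ends up in the toy cache. That is the deduplication property described in the caching step.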
The .dvc File
The .dvc file is the cornerstone of how DVC integrates with Git. It's a small, human-readable file in YAML format, containing metadata about the data DVC is tracking. Let's look at an example. If you run dvc add data/raw/iris.csv, the generated data/raw/iris.csv.dvc file might look something like this:
# data/raw/iris.csv.dvc
outs:
- md5: a304afb96070e7f03cecfa36f6517373
  size: 3858
  path: iris.csv
Here's what the fields mean:
outs: Defines the output(s) tracked by this .dvc file.
md5: The unique content hash (checksum) calculated by DVC for iris.csv. If the content of iris.csv changes, this hash will change.
size: The size of the original data file in bytes.
path: The path to the original data file, relative to the location of the .dvc file.
Because .dvc files are small text files, they are perfectly suited for versioning with Git. When you commit a .dvc file, you are essentially recording a pointer to a specific version of your data (identified by the hash) without storing the data itself in Git.
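To make the pointer idea concrete, here is a small sketch that builds a .dvc-style file by hand and checks its recorded hash against the data. The temporary paths and file contents are made up for the demo, and writing the pointer manually is only a stand-in for what dvc add generates. In a real project, dvc status is the command that reports whether tracked data has changed.

```shell
# Illustrative only: hand-build a .dvc-style pointer and check it against the data.
workdir=$(mktemp -d)
data="$workdir/iris.csv"
pointer="$workdir/iris.csv.dvc"

printf 'sepal_length,species\n5.1,setosa\n' > "$data"

# Write a pointer with the same fields as the example above.
# (DVC generates this file itself; writing it by hand is only for the demo.)
{
    echo "outs:"
    echo "- md5: $(md5sum "$data" | cut -d ' ' -f 1)"
    echo "  size: $(wc -c < "$data" | tr -d ' ')"
    echo "  path: iris.csv"
} > "$pointer"

# Read the recorded hash back out of the pointer file.
recorded=$(awk '$1 == "-" && $2 == "md5:" { print $3 }' "$pointer")

# Compare it with the hash of the data as it is right now.
current=$(md5sum "$data" | cut -d ' ' -f 1)
if [ "$recorded" = "$current" ]; then status_before="unchanged"; else status_before="modified"; fi

# Change the data: the recorded hash no longer matches.
echo '4.9,setosa' >> "$data"
current=$(md5sum "$data" | cut -d ' ' -f 1)
if [ "$recorded" = "$current" ]; then status_after="unchanged"; else status_after="modified"; fi

echo "before edit: $status_before"
echo "after edit:  $status_after"
```

The pointer stays fixed while the data changes, which is why a committed .dvc file pins one specific version of the data.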
Let's track a couple of individual files:
# Assume you have a dataset file: data/features.csv
# And a trained model file: models/model.pkl
# Track the dataset
dvc add data/features.csv
# Track the model
dvc add models/model.pkl
After running these commands, you will see new files (data/features.csv.dvc, models/model.pkl.dvc) and modifications to .gitignore.
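To see what the .gitignore change accomplishes, here is a small sketch that writes by hand the kind of entry dvc add generates and asks Git whether each file is ignored. The temporary repository, the file contents, and the placement of the entry in data/.gitignore are assumptions for illustration. The block skips itself if git is not installed.

```shell
# Illustrative only: reproduce by hand the kind of ignore entry dvc add writes.
if command -v git >/dev/null 2>&1; then
    repo=$(mktemp -d)
    git -C "$repo" init -q
    mkdir -p "$repo/data"
    printf 'x,y\n1,2\n' > "$repo/data/features.csv"
    # Placeholder pointer file; a real one would be generated by dvc add.
    printf 'outs:\n- path: features.csv\n' > "$repo/data/features.csv.dvc"

    # Assumed entry: ignore the data file, anchored to the data/ directory.
    echo '/features.csv' > "$repo/data/.gitignore"

    # git check-ignore exits with 0 when the path is ignored.
    if git -C "$repo" check-ignore -q data/features.csv; then
        data_ignored=yes
    else
        data_ignored=no
    fi
    if git -C "$repo" check-ignore -q data/features.csv.dvc; then
        pointer_ignored=yes
    else
        pointer_ignored=no
    fi
else
    data_ignored=skipped
    pointer_ignored=skipped
fi

echo "data file ignored by Git:    $data_ignored"
echo "pointer file ignored by Git: $pointer_ignored"
```

The data file is ignored while the .dvc pointer is not, so Git versions the pointer and never sees the large file.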
Now, let's track an entire directory containing images:
# Assume you have a directory: data/raw_images/ containing many jpg files
# Track the entire directory
dvc add data/raw_images
This creates a single data/raw_images.dvc file. Rather than listing every file, it records one hash for the whole directory (with a .dir suffix), which identifies a manifest stored in the cache. That manifest lists the hash and relative path of each file within the data/raw_images directory at the time dvc add was executed. DVC optimizes storage by caching each file inside the directory individually based on its content hash. The .gitignore file will also be updated to ignore the data/raw_images/ directory itself.
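The per-file hashing of a directory can be sketched in plain shell. The example below builds a simple text manifest of hashes and relative paths, then hashes the manifest to get one identifier for the whole directory. The manifest layout, the file names, and the fake image contents are assumptions for illustration. DVC's actual manifest format differs.

```shell
# Illustrative only: a plain-text stand-in for a directory manifest.
workdir=$(mktemp -d)
images="$workdir/raw_images"
mkdir -p "$images/cats"
printf 'fake-jpeg-1' > "$images/img1.jpg"
printf 'fake-jpeg-2' > "$images/img2.jpg"
cp "$images/img1.jpg" "$images/cats/copy_of_img1.jpg"   # duplicate content

manifest="$workdir/manifest.txt"
# One line per file: "<md5>  <relative path>", sorted by path for a stable result.
( cd "$images" && find . -type f -exec md5sum {} + | sort -k 2 ) > "$manifest"

# Hashing the manifest yields a single identifier for the directory's state.
dir_hash=$(md5sum "$manifest" | cut -d ' ' -f 1)

file_count=$(wc -l < "$manifest" | tr -d ' ')
unique_blobs=$(cut -d ' ' -f 1 "$manifest" | sort -u | wc -l | tr -d ' ')

cat "$manifest"
echo "directory hash: $dir_hash"
echo "files: $file_count, unique contents to cache: $unique_blobs"
```

Three files are listed but only two distinct contents need to be stored, since the duplicate image shares a hash with the original.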
The standard workflow after adding data with DVC is to commit the changes to Git:
1. Track data: Run dvc add <your_data_file_or_directory>.
2. Stage the .dvc files and .gitignore with Git: Run git add <your_data_file_or_directory>.dvc .gitignore, using the path of the .gitignore file that DVC modified (dvc add prints the exact git add command to run). Using git add . often works too, just ensure you understand what's being staged.
3. Commit: Run git commit -m "Track initial dataset version".
This sequence ensures that your Git commit history includes the .dvc pointer file, linking your code version to the specific data version managed by DVC.
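Put together, the sequence might look like the following sketch, which runs in a throwaway repository so it can be tried safely. The dataset contents, the demo Git identity, and the temporary directory are made up for illustration. The block runs only if both git and dvc are installed and otherwise reports that it was skipped.

```shell
# Sketch of the add -> stage -> commit sequence in a throwaway repository.
# Runs only if both git and dvc are installed; otherwise it is skipped.
if command -v git >/dev/null 2>&1 && command -v dvc >/dev/null 2>&1; then
    repo=$(mktemp -d)
    (
        set -e
        cd "$repo"
        git init -q
        dvc init -q
        mkdir -p data
        printf 'x,y\n1,2\n' > data/features.csv          # hypothetical dataset

        dvc add -q data/features.csv                      # 1. track with DVC
        git add data/features.csv.dvc data/.gitignore     # 2. stage pointer and ignore file
        git -c user.name="Demo" -c user.email="demo@example.com" \
            commit -q -m "Track initial dataset version"  # 3. commit
    )
    # List what Git actually versions after the commit.
    tracked=$(git -C "$repo" ls-files)
    echo "$tracked"
else
    tracked="skipped"
    echo "git or dvc not found; skipping the demo"
fi
```

In the resulting listing, data/features.csv.dvc should appear while data/features.csv itself should not, because the dataset stays in the workspace and the DVC cache.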
The process of tracking data with dvc add: The command takes a large data file from the workspace, stores its content in the DVC cache, and creates a small .dvc metadata file. This .dvc file, along with changes to .gitignore, is then tracked by Git using standard git add and git commit commands.
By using dvc add, you establish the crucial link between your project's code (managed by Git) and its associated data (managed by DVC). This separation keeps your Git repository lean and fast while ensuring that your data's history is reliably tracked. The next step is to learn how to share and retrieve this versioned data using remote storage.
© 2025 ApX Machine Learning