The dvc add command tells DVC which data files or directories it should manage within a project repository. It is the primary way to bring datasets, models, or other large artifacts under DVC's control, replacing the large files themselves with small placeholder files that Git can track efficiently.
Think of dvc add as the equivalent of git add, but specifically designed for data files that you don't want to store directly in your Git history. When you run dvc add on a file or directory, DVC performs several actions behind the scenes:
1. Caches the data: It calculates a content hash of the data and stores the data in the DVC cache, located by default in .dvc/cache. The data is stored using the hash as part of its identifier, preventing duplication even if the same file exists in multiple places in your project.
2. Creates a .dvc file: It creates a small text file next to your data, adding .dvc to the original name (e.g., data/images.zip becomes data/images.zip.dvc). This file contains metadata, including the hash of the original data, its size, and its path. This .dvc file acts as a pointer or placeholder.
3. Updates .gitignore: DVC automatically adds an entry for the original data file or directory to a .gitignore file in the same directory as the data, creating that file if it does not exist. This is essential because it instructs Git to ignore the large data file(s), preventing accidental commits to your Git repository.
The .dvc File
The .dvc file is the foundation of how DVC integrates with Git. It's a small, human-readable file in YAML format, containing metadata about the data DVC is tracking. Let's look at an example. If you run dvc add data/raw/iris.csv, the generated data/raw/iris.csv.dvc file might look something like this:
# data/raw/iris.csv.dvc
outs:
- md5: a304afb96070e7f03cecfa36f6517373
  size: 3858
  path: iris.csv
Here's what the fields mean:
outs: Defines the output(s) tracked by this .dvc file.
md5: The unique content hash (checksum) calculated by DVC for iris.csv. If the content of iris.csv changes, this hash will change.
size: The size of the original data file in bytes.
path: The path to the original data file, relative to the location of the .dvc file.
Because .dvc files are small text files, they are perfectly suited for versioning with Git. When you commit a .dvc file, you are essentially recording a pointer to a specific version of your data (identified by the hash) without storing the data itself in Git.
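To make the relationship between content and hash concrete, here is a minimal sketch that imitates content-addressed storage with standard shell tools. It is an illustration only, not how DVC is implemented: the scratch directory, the file names, the file_hash helper, and the flat cache/ layout are all hypothetical, and it assumes md5sum is available (as on most Linux systems).

```shell
#!/bin/sh
# Illustration only: imitates content-addressed storage with standard tools.
# The scratch directory, file names, and flat cache/ layout are hypothetical;
# DVC's real cache is organized differently.
set -eu

work=$(mktemp -d)
mkdir -p "$work/data" "$work/cache"

# Two files with identical content in different places.
printf 'sepal_length,species\n5.1,setosa\n' > "$work/data/iris.csv"
cp "$work/data/iris.csv" "$work/data/iris_copy.csv"

# Hypothetical helper: print only the MD5 digest of a file (assumes md5sum).
file_hash() { md5sum "$1" | cut -d' ' -f1; }

hash_a=$(file_hash "$work/data/iris.csv")
hash_b=$(file_hash "$work/data/iris_copy.csv")

# Store each file under its hash: identical content lands on the same entry.
cp "$work/data/iris.csv" "$work/cache/$hash_a"
cp "$work/data/iris_copy.csv" "$work/cache/$hash_b"
cache_entries=$(ls "$work/cache" | wc -l | tr -d ' ')

# Changing the content changes the hash, so a new version gets a new identity.
printf '4.9,setosa\n' >> "$work/data/iris.csv"
hash_c=$(file_hash "$work/data/iris.csv")

echo "original: $hash_a"
echo "copy:     $hash_b"
echo "edited:   $hash_c"
echo "cache entries after storing both copies: $cache_entries"
```

The two identical files produce the same digest and therefore occupy a single cache entry, while the edited file produces a different digest. This is the property that lets a small pointer file identify one exact version of the data.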
Let's track individual files, in this case a dataset and a trained model:
# Assume you have a dataset file: data/features.csv
# And a trained model file: models/model.pkl
# Track the dataset
dvc add data/features.csv
# Track the model
dvc add models/model.pkl
After running these commands, you will see new files (data/features.csv.dvc, models/model.pkl.dvc) and a new or updated .gitignore file in each of the data/ and models/ directories.
Now, let's track an entire directory containing images:
# Assume you have a directory: data/raw_images/ containing many jpg files
# Track the entire directory
dvc add data/raw_images
This creates a single data/raw_images.dvc file. Rather than listing every file, it records one hash for the directory as a whole (with a .dir suffix). That hash points to a listing stored in the cache, which holds the hash and relative path of each file in data/raw_images at the time dvc add was executed. DVC optimizes storage by caching each file inside the directory individually based on its content hash. The .gitignore file in data/ will also be updated to ignore the raw_images directory itself.
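The benefit of hashing each file in a directory individually can be sketched with plain shell tools. This is a simplified illustration rather than DVC's own code: the scratch directory, the placeholder "image" contents, and the file_hash helper are hypothetical, and md5sum is assumed to be installed.

```shell
#!/bin/sh
# Illustration only: per-file hashing inside a directory.
# Paths, file contents, and the helper below are hypothetical.
set -eu

work=$(mktemp -d)
mkdir -p "$work/raw_images"
printf 'first image bytes' > "$work/raw_images/a.jpg"
printf 'second image bytes' > "$work/raw_images/b.jpg"

# Hypothetical helper: print only the MD5 digest of a file (assumes md5sum).
file_hash() { md5sum "$1" | cut -d' ' -f1; }

# Hashes at the time the directory is first recorded.
a_before=$(file_hash "$work/raw_images/a.jpg")
b_before=$(file_hash "$work/raw_images/b.jpg")

# Modify only one file, then hash everything again.
printf 'second image bytes, edited' > "$work/raw_images/b.jpg"
a_after=$(file_hash "$work/raw_images/a.jpg")
b_after=$(file_hash "$work/raw_images/b.jpg")

if [ "$a_before" = "$a_after" ]; then
    echo "a.jpg unchanged: its existing cache entry can be reused"
fi
if [ "$b_before" != "$b_after" ]; then
    echo "b.jpg changed: only this file needs a new cache entry"
fi
```

Because unchanged files keep their hashes, recording a new version of a large directory only requires storing the files whose content actually changed.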
The standard workflow after adding data with DVC is to commit the changes to Git:
1. Track the data with DVC: Run dvc add <your_data_file_or_directory>.
2. Stage the .dvc files and .gitignore with Git: Run git add <your_data_file_or_directory>.dvc along with the .gitignore file that DVC updated, which sits in the same directory as the data. (Using git add . often works too, just ensure you understand what's being staged.)
3. Commit the changes: Run git commit -m "Track initial dataset version".
This sequence ensures that your Git commit history includes the .dvc pointer file, linking your code version to the specific data version managed by DVC.
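The sequence above can be tried end to end in a throwaway repository. The following is a sketch under stated assumptions: the temporary directory, the sample CSV content, and the Git identity are placeholders, and the script skips itself when git or dvc is not installed.

```shell
#!/bin/sh
# Sketch: runs the add / stage / commit sequence in a throwaway repository.
# The directory, file contents, and Git identity are placeholders.
set -eu

if ! command -v git >/dev/null 2>&1 || ! command -v dvc >/dev/null 2>&1; then
    echo "git and dvc must both be installed; skipping"
    result="skipped"
else
    repo=$(mktemp -d)
    cd "$repo"

    # Set up an empty Git repository with DVC initialized inside it.
    git init --quiet
    dvc init --quiet
    git -c user.name="Example" -c user.email="example@example.com" \
        commit --quiet -m "Initialize DVC"

    # Create a small stand-in for a real dataset.
    mkdir -p data
    printf 'x,y\n1,2\n' > data/features.csv

    # 1. Track the data with DVC.
    dvc add data/features.csv

    # 2. Stage the pointer file and the .gitignore next to the data.
    git add data/features.csv.dvc data/.gitignore

    # 3. Commit the pointer, not the data itself.
    git -c user.name="Example" -c user.email="example@example.com" \
        commit --quiet -m "Track initial dataset version"

    # Show which files under data/ Git now tracks.
    git ls-files data
    result="committed"
fi
```

The final git ls-files call should list the .dvc pointer and the .gitignore file but not features.csv, which is the intended outcome: Git versions the pointer while DVC holds the data.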
The process of tracking data with dvc add: The command takes a large data file from the workspace, stores its content in the DVC cache, and creates a small .dvc metadata file. This .dvc file, along with changes to .gitignore, is then tracked by Git using standard git add and git commit commands.
By using dvc add, you establish the important link between your project's code (managed by Git) and its associated data (managed by DVC). This separation keeps your Git repository lean and fast while ensuring that your data's history is reliably tracked. The next step is to learn how to share and retrieve this versioned data using remote storage.