After using dvc add to start tracking your data files or directories, you have instructed DVC which data it should manage. DVC creates a corresponding .dvc file, a small text file containing metadata such as the hash of the data, which you then commit to Git. The actual large data file, however, is not stored in Git; it resides in DVC's cache (typically your project's .dvc/cache directory) and is ignored by Git via the .gitignore file that dvc init updates automatically.
At this point, the data exists only on your local machine. To share this data with collaborators or access it from different environments (like a cloud server for training), you need to upload it to a shared location, known as DVC remote storage.
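As a brief preview (remote configuration is covered in detail in the next section), a remote is registered with dvc remote add. The remote name "storage" and the bucket path below are placeholders, not values from this project:

```shell
# Register an S3 bucket as the default (-d) DVC remote.
# "storage" and the bucket path are placeholder names.
dvc remote add -d storage s3://my-bucket/dvc-store

# The setting is written to .dvc/config; commit it so collaborators share it.
git add .dvc/config
git commit -m "Configure DVC remote"
```

With a default remote in place, dvc push and dvc pull need no extra arguments.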
dvc push
The dvc push command uploads data from your local DVC cache to the configured remote storage. It examines the .dvc files tracked by the current Git commit (or by your working directory, if changes have not been committed) and uploads the corresponding data files from your local cache (.dvc/cache) to the remote, skipping any that are already present there.
Think of it like this:

1. Run dvc add <your_data> to update the .dvc file with the new data hash.
2. Run git add <your_data.dvc> and git commit -m "Update dataset" to record the pointer to this data version in Git.
3. Run dvc push to upload the actual data content associated with that hash to the remote storage.

# Example workflow after adding data
git add data/images.dvc
git commit -m "Add processed images v1.1"

# Now, push the actual data files to the configured remote
dvc push
This command will typically report the number of files being pushed. DVC is efficient: it uploads only files whose hashes are not already present in the remote storage, preventing redundant transfers. If you run dvc push again without changes, it will quickly determine that everything is already synchronized.
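The reason repeated pushes are cheap is that both the cache and the remote are content-addressed: a file is stored under a path derived from its hash, so identical content always maps to the same location. The following is a simplified sketch of that idea using md5sum and plain directories; it is an illustration of the principle, not DVC's actual implementation:

```shell
# Create a sample data file.
printf 'hello\n' > data.txt

# Hash the content; content-addressed caches key files by checksum.
hash=$(md5sum data.txt | cut -d' ' -f1)

# Store under cache/<first two hex chars>/<remaining chars>,
# mirroring the layout used inside .dvc/cache.
prefix=$(printf '%s' "$hash" | cut -c1-2)
rest=$(printf '%s' "$hash" | cut -c3-)
mkdir -p cache/"$prefix"

# "Pushing" is a no-op when the target already exists:
# same content implies the same hash, hence the same path.
if [ ! -f cache/"$prefix"/"$rest" ]; then
    cp data.txt cache/"$prefix"/"$rest"
    echo "uploaded $hash"
else
    echo "already in cache"
fi
```

Running this twice prints "uploaded ..." the first time and "already in cache" the second, which is exactly the behavior you see from a repeated dvc push.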
dvc pull
Conversely, the dvc pull command downloads data files from the remote storage into your local DVC cache and links them into your workspace, based on the .dvc files currently present.
This is essential when a collaborator clones the repository and needs its data, when you check out a branch or commit that references a different data version, or when you set up the project in a new environment such as a cloud training server.
The typical workflow for getting the data associated with a specific code version is:

1. Run git checkout <branch_name_or_commit_hash>. This updates the .dvc files in your workspace to match that specific point in history.
2. Run dvc pull. DVC reads the updated .dvc files, identifies the required data hashes, checks whether the corresponding data exists in the local cache, and downloads any missing files from the remote storage. It then links these files into your workspace.

# Switch to a different branch that might use different data
git checkout experiment-new-feature

# Pull the data corresponding to the .dvc files on this branch
dvc pull
dvc pull ensures that your workspace contains the exact data version that was in use when that particular Git commit was made. Like dvc push, it only transfers files that are missing locally, making it efficient.
The commands dvc push and dvc pull are the primary mechanisms for synchronizing the large data files managed by DVC between your local cache and remote storage. They work in tandem with Git: Git manages the lightweight .dvc pointer files, defining which data version belongs to a specific code commit, while DVC handles the heavy lifting of storing and transferring the actual data content referenced by those pointers.
Diagram: workflow showing the interaction between the local workspace, Git repository, DVC cache, remote Git repository, and DVC remote storage, using Git and DVC commands.
By configuring remote storage (covered in the next section), consistently using dvc push after committing changes to .dvc files, and running dvc pull after checking out different code versions, you establish a reliable workflow for managing data versions alongside your code. Anyone checking out a specific commit can then retrieve the exact data associated with it, significantly improving the reproducibility of your machine learning projects.
© 2025 ApX Machine Learning