DVC tracks data files and directories, creating a corresponding .dvc file for each. This small text file contains metadata, such as the hash of the data, and is committed to Git. The actual large data file is not stored in Git; instead, it resides in DVC's cache (typically your project's .dvc/cache directory). Git ignores the data file itself through a .gitignore entry that DVC adds automatically when you track the file.
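For example, after tracking a directory with dvc add data/images, the generated pointer file contains little more than a content hash and a path. The values below are illustrative, and the exact fields can differ slightly between DVC versions:

# data/images.dvc (illustrative contents)
outs:
- md5: 3863d0e317d99e01aa7c56978dfcca2a.dir
  size: 104857600
  nfiles: 1200
  path: images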
At this point, the data exists only on your local machine. To share this data with collaborators or access it from different environments (like a cloud server for training), you need to upload it to a shared location, known as DVC remote storage.
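Remote storage can be any supported backend, such as Amazon S3, Google Cloud Storage, Azure Blob Storage, SSH, or a shared directory. As a brief preview of the configuration covered in the next section, registering a default remote looks roughly like this; the remote name and bucket path are placeholders:

# Register a default remote (name and URL are placeholders)
dvc remote add -d storage s3://my-bucket/dvc-store
# The remote configuration lives in .dvc/config, which is committed to Git
git add .dvc/config
git commit -m "Configure DVC remote storage"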
dvc push
The dvc push command uploads data from your local DVC cache to the configured remote storage. It reads the .dvc files in your workspace (as defined by the current Git commit, or by uncommitted changes if you have any), determines which data files they reference, and uploads those files from your local cache (.dvc/cache) to the remote if they are not already present there.
Think of it like this:
1. dvc add <your_data> to update the .dvc file with the new data hash.
2. git add <your_data.dvc> and git commit -m "Update dataset" to record the pointer to this data version in Git.
3. dvc push to upload the actual data content associated with that hash to the remote storage.

# Example workflow after adding data
git add data/images.dvc
git commit -m "Add processed images v1.1"
# Now, push the actual data files to the configured remote
dvc push
This command will typically output the number of files being pushed. DVC is efficient; it only uploads files whose hashes are not already present in the remote storage, preventing redundant uploads. If you run dvc push again without changes, it will quickly determine that everything is already synchronized.
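If you want to see what is out of sync before transferring anything, dvc status with the --cloud flag compares the hashes in your local cache against the configured remote:

# Compare the local cache against the default remote
dvc status --cloud
# The short form does the same thing
dvc status -c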
dvc pull
Conversely, the dvc pull command downloads data files from the remote storage into your local DVC cache and places them correctly in your workspace, based on the .dvc files currently present.
This is essential when:

- You have just cloned a project repository and need the data it references.
- You switch to a branch or commit whose .dvc files point to a different data version.
- A collaborator has updated the data, pushed it with dvc push, and committed the new .dvc files.
The typical workflow for getting data associated with a specific code version is:
1. git checkout <branch_name_or_commit_hash>. This updates the .dvc files in your workspace to match that specific point in history.
2. dvc pull. DVC reads the updated .dvc files, identifies the required data hashes, checks whether the corresponding data already exists in the local cache, and downloads any missing files from the remote storage. It then links these files into your workspace.

# Switch to a different branch that might use different data
git checkout experiment-new-feature
# Pull the data corresponding to the .dvc files on this branch
dvc pull
dvc pull ensures that your workspace contains the exact data version that was used when that particular Git commit was made. Like dvc push, it only downloads files that are missing locally, making it efficient.
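Both commands also accept explicit targets, which helps in larger projects when you only need part of the data. Using the pointer file from the earlier example:

# Download only the data referenced by one .dvc file
dvc pull data/images.dvc
# Likewise, upload only that target
dvc push data/images.dvc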
The commands dvc push and dvc pull are the primary mechanisms for synchronizing the large data files managed by DVC between your local cache and remote storage. They work in tandem with Git: Git manages the lightweight .dvc pointer files, defining which data version belongs to a specific code commit, while DVC manages the heavy lifting of storing and transferring the actual data content referenced by those pointers.
Diagram: the interaction between the local workspace, the Git repository, the DVC cache, the remote Git repository, and DVC remote storage when using Git and DVC commands together.
By configuring remote storage (covered in the next section) and consistently using dvc push after committing changes to .dvc files, and dvc pull after checking out different code versions, you establish a reliable workflow for managing data versions alongside your code. This ensures that anyone checking out a specific commit can retrieve the exact data associated with it, significantly improving the reproducibility of your machine learning projects.
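Put together, a collaborator or a CI job reproducing a particular version of the project would run something like the following; the repository URL and tag are placeholders:

# Get the code and the lightweight .dvc pointer files from Git
git clone https://github.com/example/ml-project.git
cd ml-project
git checkout v1.1
# Download the data that matches those pointers from the DVC remote
dvc pull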