While dvc add tells DVC which files or directories to track, and .dvc files act as pointers stored in Git, the actual large data files need a home outside your Git repository. This is where remote storage comes in. DVC supports various storage backends, including popular cloud providers like AWS S3, Google Cloud Storage (GCS), and Azure Blob Storage, as well as network drives or even just another directory on your local machine.

Think of DVC remote storage as the designated location where the versioned data content, managed by DVC, resides. Your Git repository holds the small .dvc files (metadata and pointers), while the remote storage holds the actual data chunks identified by hashes. This setup allows your Git repository to remain small and fast, while still providing access to large datasets associated with specific code versions.
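To make the pointer idea concrete, a .dvc file is just a small YAML snippet along these lines (the hash, size, and file name here are illustrative, and the exact fields vary between DVC versions):

outs:
- md5: 1f3bd1adcdc53e8aa0fc53511dc66f66
  size: 14445097
  path: data.csv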
To tell DVC where to push data to and pull data from, you use the dvc remote add command. The basic syntax is:
dvc remote add <remote_name> <remote_url>
<remote_name>: This is a short, memorable name you choose for the remote storage configuration (e.g., my-s3, gcp-storage, azure-data). By convention, origin is often used if you only have one primary remote, similar to Git.

<remote_url>: This specifies the type and location of the storage. The format depends on the storage provider.

Let's look at how to configure remotes for the most common cloud providers.
If you use AWS S3, the URL typically looks like s3://<bucket_name>/<optional_path>. For example, to configure an S3 bucket named my-ml-data-bucket and store data under the path project-alpha/datasets, you would run:
dvc remote add my_s3_storage s3://my-ml-data-bucket/project-alpha/datasets
You can also make this the default remote, meaning commands like dvc push and dvc pull will use it automatically unless another remote is specified. Use the -d or --default flag:
dvc remote add -d origin s3://my-ml-data-bucket/project-alpha/datasets
Authentication: DVC uses the standard AWS credentials mechanisms. It will automatically look for credentials in the following order:

1. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, etc.)
2. The shared credentials file (~/.aws/credentials)
3. The AWS config file (~/.aws/config)

Ensure your environment is configured correctly using tools like the AWS CLI (aws configure) or by setting environment variables.
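For example, a quick way to check access is to export credentials in your shell and push to the remote directly (the key values are placeholders):

export AWS_ACCESS_KEY_ID=<your_access_key_id>
export AWS_SECRET_ACCESS_KEY=<your_secret_access_key>
dvc push -r my_s3_storage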
For GCS, the URL format is gs://<bucket_name>/<optional_path>. To configure a GCS bucket named my-gcp-project-bucket and store data under ml-experiments/data, you'd run:
dvc remote add gcs_data gs://my-gcp-project-bucket/ml-experiments/data
To set it as the default:
dvc remote add -d gcs_default gs://my-gcp-project-bucket/ml-experiments/data
Authentication: DVC typically relies on the authentication set up for the gcloud command-line tool or service account credentials specified via the GOOGLE_APPLICATION_CREDENTIALS environment variable.
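For instance, to authenticate with a service account key file and then retrieve data (the key file path is a placeholder):

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
dvc pull -r gcs_data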
For Azure Blob Storage, the URL format is azure://<container_name>/<optional_path>. You need to have an Azure Storage account and a Blob container within it. Assuming your container is named ml-datasets within your storage account, you can configure it like this:
dvc remote add azure_data azure://ml-datasets/project-x/raw
And as the default:
dvc remote add -d azure_default azure://ml-datasets/project-x/raw
Authentication: DVC leverages Azure's standard authentication methods. This often involves being logged in via the Azure CLI (az login) or setting environment variables like AZURE_STORAGE_CONNECTION_STRING, or AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY. Using managed identities or service principals is also supported.
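As a sketch, either of the following lets DVC reach the container (the account name and key are placeholders):

# Option 1: interactive login via the Azure CLI
az login

# Option 2: provide the account name and key through the environment
export AZURE_STORAGE_ACCOUNT_NAME=<your_storage_account>
export AZURE_STORAGE_ACCOUNT_KEY=<your_account_key>

dvc push -r azure_data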
DVC supports other storage types as well. For example, a directory on a local or mounted drive:

dvc remote add local_backup /path/to/external/dvc-storage

Or a server reachable over SSH:

dvc remote add my_server ssh://user@example.com/home/user/dvc-storage
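Note that support for cloud and SSH remotes may require optional dependencies, depending on how DVC was installed. With pip, for example:

pip install "dvc[s3]"      # AWS S3
pip install "dvc[gs]"      # Google Cloud Storage
pip install "dvc[azure]"   # Azure Blob Storage
pip install "dvc[ssh]"     # SSH remotes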
Once you add a remote, the configuration is saved in your project's .dvc/config file. This is a plain text file that looks something like this:
[core]
    remote = origin
['remote "origin"']
    url = s3://my-ml-data-bucket/project-alpha/datasets
['remote "gcs_data"']
    url = gs://my-gcp-project-bucket/ml-experiments/data
Since this configuration file dictates where your data lives, it's important to commit .dvc/config to your Git repository. This ensures that anyone cloning your repository can configure their access to the same remote storage (though they'll need their own credentials) and use dvc pull to retrieve the correct data versions.
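For example:

git add .dvc/config
git commit -m "Configure DVC remote storage"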
You can view your configured remotes anytime using:
dvc remote list
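With the configuration above, the output lists each remote's name and URL, roughly like this (the exact formatting depends on your DVC version):

origin      s3://my-ml-data-bucket/project-alpha/datasets
gcs_data    gs://my-gcp-project-bucket/ml-experiments/data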
If you need to change settings for a remote (e.g., update credentials or modify specific options), you can use dvc remote modify. For instance, to set a specific AWS profile for the origin remote (though usually relying on the environment is preferred), or to set an Azure connection string explicitly:
# Example: Setting specific profile (less common, usually rely on environment)
dvc remote modify origin profile my_aws_profile
# Example: Setting Azure connection string explicitly
dvc remote modify azure_data connection_string "your_connection_string"
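A word of caution: since .dvc/config is committed to Git, avoid writing secrets like connection strings into it directly. DVC's --local flag stores such options in .dvc/config.local instead, which DVC keeps out of version control:

# Store the secret in .dvc/config.local rather than the committed config
dvc remote modify --local azure_data connection_string "your_connection_string"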
Consult the DVC documentation for specific options available for each remote type.
To remove a remote configuration, use dvc remote remove <remote_name>.
With your remote storage configured, you now have the complete picture: Git tracks your code and the small .dvc pointer files, while DVC manages the large data files, synchronizing them between your local workspace and the configured remote storage using dvc push and dvc pull. This separation is fundamental to managing data effectively in machine learning projects alongside code version control.
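Putting it all together, a typical end-to-end cycle might look like this (the data path, commit message, and repository URL are illustrative):

# Track a dataset; DVC writes the pointer file and gitignores the data
dvc add data/raw_dataset.csv
git add data/raw_dataset.csv.dvc .gitignore
git commit -m "Track raw dataset with DVC"

# Upload the data content to the default remote
dvc push

# Later, on another machine: clone the repo, then fetch the data
git clone <repository_url>
cd <repository_name>
dvc pull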