You've defined your machine learning workflow as a series of interconnected stages using dvc stage add or by editing dvc.yaml directly. Each stage has defined dependencies (input data, code scripts, parameters) and outputs (processed data, models, metrics). The beauty of defining this Directed Acyclic Graph (DAG) is not just visualization, but automation. DVC provides a powerful command, dvc repro, to automatically detect changes and re-run the necessary parts of your pipeline to bring everything up-to-date.
dvc repro
Imagine you've tweaked a parameter in your params.yaml file, updated a data cleaning script, or received a new version of your raw dataset. Manually re-running all subsequent steps is tedious and error-prone. Did you remember to run the feature engineering step after cleaning? Did you retrain the model with the new features?
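dvc repro handles this automatically. When you execute it, DVC performs the following actions:
1. It reads the pipeline definition in dvc.yaml to determine the stages and how they depend on each other.
2. For each stage, it calculates the current hash (e.g., MD5) of its dependencies (files, directories, parameter values, command).
3. It compares these hashes with the ones recorded in the dvc.lock file the last time the stage was successfully executed.
4. It re-runs every stage whose hashes no longer match, in dependency order.
5. It updates the dvc.lock file with the new hashes of its dependencies and outputs.
This process ensures that your pipeline outputs are always consistent with the current state of your inputs, code, and parameters.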
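The change detection described above can be imitated in a few lines of shell. This is only a sketch of the idea, not DVC's implementation: it assumes GNU md5sum is available, tracks a single hypothetical dependency, and uses a plain text file (stage.lock, an invented name) as a stand-in for dvc.lock, which in reality is a YAML file covering every stage.

```shell
# Illustrative only: a stage is "outdated" when the hash of its dependency
# differs from the hash recorded after the last successful run.
workdir=$(mktemp -d)
dep="$workdir/params.yaml"   # hypothetical dependency file
lock="$workdir/stage.lock"   # hypothetical stand-in for dvc.lock

# Succeeds (exit 0) when the recorded hash is missing or differs.
is_outdated() {
    current=$(md5sum "$dep" | cut -d' ' -f1)
    recorded=$(cat "$lock" 2>/dev/null || true)
    [ "$current" != "$recorded" ]
}

# Pretend to run the stage command, then record the dependency hash.
run_stage() {
    md5sum "$dep" | cut -d' ' -f1 > "$lock"
}

echo "epochs: 10" > "$dep"
if is_outdated; then first=run; run_stage; else first=skip; fi    # no lock yet
if is_outdated; then second=run; run_stage; else second=skip; fi  # nothing changed
echo "epochs: 20" > "$dep"
if is_outdated; then third=run; run_stage; else third=skip; fi    # parameter edited

echo "$first $second $third"   # prints: run skip run
rm -rf "$workdir"
```

The second check is skipped because the recorded hash still matches, which is the same reason dvc repro leaves an unchanged stage alone.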
How dvc repro Works: Under the Hood
The magic lies in the interplay between dvc.yaml and dvc.lock.
dvc.yaml: Defines the structure of your pipeline. It lists the stages, their commands, dependencies (deps), parameters (params), and outputs (outs).
dvc.lock: Records the state of your pipeline the last time it was successfully run. It stores the exact hashes of all dependencies and outputs for each stage as they were when the stage last completed.
When you run dvc repro, DVC essentially asks: "Does the current state of the project (files, params defined in dvc.yaml) match the recorded state in dvc.lock?" If not, it re-executes the necessary commands to synchronize them and updates dvc.lock.
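To make the deps, params, and outs fields concrete, here is a minimal sketch of what a two-stage dvc.yaml might look like. The script names, data paths, and the train.epochs parameter are hypothetical placeholders chosen to match the train and evaluate stages mentioned in this section; the file is written to a temporary directory so that it does not touch a real project.

```shell
# Hypothetical pipeline definition; every path and parameter name below is
# a placeholder, not taken from a real project.
demo=$(mktemp -d)
cat > "$demo/dvc.yaml" <<'EOF'
stages:
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features.csv
    params:
      - train.epochs
    outs:
      - model.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
EOF
cat "$demo/dvc.yaml"
```

Because model.pkl is an output of train and a dependency of evaluate, the two stages form a small DAG. After a successful run, dvc.lock would hold one entry per stage with the recorded hash of each item listed here.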
Let's visualize this flow:
[Figure: Flowchart illustrating the decision process when dvc repro is executed for a single stage.]
dvc repro
Executing the command is straightforward. Navigate to your project's root directory (where .dvc resides) in your terminal and run:
dvc repro
DVC will analyze the pipeline defined in dvc.yaml and execute any outdated stages. You'll see output indicating which stages are being checked and which are being run.
Sometimes, you might only want to reproduce a specific part of your pipeline. For example, maybe you only changed the training script (train.py) and you know it only affects the train stage and the subsequent evaluate stage. You can tell DVC to target specific stages or even specific output files.
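To reproduce a single stage, along with any outdated stages it depends on:
# Reproduce the 'train' stage, first re-running any outdated upstream stages
dvc repro train
Targeting a stage this way does not re-run the stages that consume its outputs. To start at a stage and also reproduce everything downstream from it, add the --downstream flag:
# Reproduce 'train' and the stages that depend on its outputs (e.g. 'evaluate')
dvc repro --downstream train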
To reproduce the pipeline only up to the point where a specific file is generated:
# Ensure 'model.pkl' is up-to-date by running its parent stage ('train')
# and any necessary upstream stages if they are outdated.
dvc repro model.pkl
This targeted reproduction can save significant time, especially in complex pipelines where only a small part has been modified.
--dry
Before committing to potentially lengthy computations, you can perform a "dry run". The --dry flag tells dvc repro to perform the dependency checking and report which stages would be executed, but without actually running their commands.
dvc repro --dry
This is useful for verifying that your changes have the expected impact on the pipeline execution plan.
--force
Occasionally, you might want to force a stage to re-run even if DVC considers it up-to-date. This could be necessary if a stage has non-deterministic behavior you want to capture again, or if an external factor not tracked by DVC (like a software library update) might affect the output. Use the --force flag carefully.
To force reproduction of a specific stage:
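# Force the 'featurize' stage to run, regardless of dependency changes.
# The stages it depends on are forced to re-run as well; add --single-item (-s)
# to limit the run to this stage alone.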
dvc repro --force featurize
To force reproduction of the entire pipeline:
# Use with caution!
dvc repro --force
After dvc repro successfully completes, the dvc.lock file will be updated to reflect the new state of the pipeline. Since dvc.lock tracks the state and ensures reproducibility, it's essential to commit this file to Git along with any changes you made to code, parameters (params.yaml), or the pipeline definition (dvc.yaml):
# After running 'dvc repro'
git add dvc.lock params.yaml src/train.py dvc.yaml
git commit -m "Update training parameters and retrain model"
# If new data outputs were generated and tracked
dvc push
By using dvc repro, you ensure that anyone checking out this Git commit can easily reproduce the exact same outputs by running dvc pull followed by dvc repro (though if they check out the commit where dvc.lock is already updated, dvc repro will typically report everything is up-to-date). This command is fundamental to achieving automated and reliable reproducibility in your DVC-managed projects. In the next sections, we will see how to integrate MLflow logging within these reproducible DVC pipelines.
© 2025 ApX Machine Learning