Machine learning workflows can be defined as a series of interconnected stages, either with dvc stage add or by editing dvc.yaml directly. Each stage declares its dependencies (input data, code scripts, parameters) and its outputs (processed data, models, metrics). Defining the pipeline as a Directed Acyclic Graph (DAG) in this way enables automation as well as visualization: DVC provides the dvc repro command to detect changes and re-run only the parts of your pipeline needed to bring everything up to date.
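As a rough illustration, a three-stage pipeline might be declared as in the sketch below. The script paths, data files, and parameter keys are hypothetical; only the stage names featurize, train, and evaluate and the files src/train.py and model.pkl echo names used later in this section. The file is written to a scratch directory so that no real project is touched.

```shell
# Sketch only: write a hypothetical dvc.yaml into a scratch directory.
workdir="$(mktemp -d)"

cat > "$workdir/dvc.yaml" <<'EOF'
stages:
  featurize:
    cmd: python src/featurize.py
    deps:
      - src/featurize.py
      - data/raw.csv
    params:
      - featurize.max_features
    outs:
      - data/features.csv
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features.csv
    params:
      - train.learning_rate
    outs:
      - model.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
EOF

echo "Wrote $workdir/dvc.yaml"
```

Because each stage's deps include the outs of the previous stage, the three stages form a chain: featurize, then train, then evaluate. Inside a real DVC project, dvc dag prints this graph.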
Why dvc repro
Imagine you've tweaked a parameter in your params.yaml file, updated a data cleaning script, or received a new version of your raw dataset. Manually re-running all subsequent steps is tedious and error-prone. Did you remember to run the feature engineering step after cleaning? Did you retrain the model with the new features?
dvc repro handles this automatically. When you execute it, DVC performs the following actions:
1. It reads the pipeline definition from dvc.yaml.
2. For each stage, it calculates the current hash (e.g., MD5) of its dependencies (files, directories, parameter values, command).
3. It compares these hashes with those recorded in the dvc.lock file the last time the stage was successfully executed.
4. If any hash differs, it re-runs the stage's command and updates the dvc.lock file with the new hashes of its dependencies and outputs.
This process ensures that your pipeline outputs are always consistent with the current state of your inputs, code, and parameters.
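The hash check described above can be mimicked in a few lines of shell. This is a conceptual sketch, not DVC's actual implementation: it tracks a single file dependency, and the helper names file_md5 and run_stage_if_changed, the stage.lock file, and the cp "stage command" are all made up for illustration.

```shell
# Conceptual sketch of the per-stage check; this is NOT how DVC is implemented.
workdir="$(mktemp -d)"

# Portable MD5 helper: md5sum on Linux, md5 -q on macOS.
file_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | cut -d' ' -f1
  else
    md5 -q "$1"
  fi
}

# Run the stage command only if the dependency's hash differs from the recorded one.
# Usage: run_stage_if_changed <dependency> <lock-file> <command...>
run_stage_if_changed() {
  dep="$1"; lock="$2"; shift 2
  current="$(file_md5 "$dep")"
  recorded="$(cat "$lock" 2>/dev/null || true)"   # empty if never run before
  if [ "$current" = "$recorded" ]; then
    echo "up to date"
  else
    "$@"                                 # execute the stage command
    printf '%s\n' "$current" > "$lock"   # record the new state
    echo "re-ran"
  fi
}

echo "a,b" > "$workdir/raw.csv"
first="$(run_stage_if_changed "$workdir/raw.csv" "$workdir/stage.lock" cp "$workdir/raw.csv" "$workdir/clean.csv")"
second="$(run_stage_if_changed "$workdir/raw.csv" "$workdir/stage.lock" cp "$workdir/raw.csv" "$workdir/clean.csv")"
echo "c,d" >> "$workdir/raw.csv"         # simulate a new version of the raw data
third="$(run_stage_if_changed "$workdir/raw.csv" "$workdir/stage.lock" cp "$workdir/raw.csv" "$workdir/clean.csv")"

echo "$first / $second / $third"         # prints: re-ran / up to date / re-ran
```

The second call is skipped because nothing changed, while the third runs again because the dependency's content (and therefore its hash) is different. DVC applies the same idea across every dependency of every stage, including parameter values and the command itself.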
How dvc repro Works: Under the Hood
The magic lies in the connection between dvc.yaml and dvc.lock.
dvc.yaml: Defines the structure of your pipeline. It lists the stages, their commands, dependencies (deps), parameters (params), and outputs (outs).
dvc.lock: Records the state of your pipeline the last time it was successfully run. It stores the exact hashes of all dependencies and outputs for each stage as they were when the stage last completed.
When you run dvc repro, DVC essentially asks: "Does the current state of the project (files, params defined in dvc.yaml) match the recorded state in dvc.lock?" If not, it re-executes the necessary commands to synchronize them and updates dvc.lock.
Let's visualize this flow:
Flowchart illustrating the decision process when dvc repro is executed for a single stage.
Running dvc repro
Executing the command is straightforward. Navigate to your project's root directory (where .dvc resides) in your terminal and run:
dvc repro
DVC will analyze the pipeline defined in dvc.yaml and execute any outdated stages. You'll see output indicating which stages are being checked and which are being run.
Sometimes, you might only want to reproduce a specific part of your pipeline. For example, maybe you only changed the training script (train.py) and you know it only affects the train stage and the subsequent evaluate stage. You can tell DVC to target specific stages or even specific output files.
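To reproduce a single stage, along with any outdated upstream stages it depends on:
# Reproduce the 'train' stage, first re-running any outdated stages it depends on
dvc repro train
By default, targeting a stage does not re-run the stages downstream of it. To reproduce a stage and everything that depends on it, add the --downstream flag:
# Reproduce the 'train' stage and any stages depending on its output
dvc repro --downstream train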
To reproduce the pipeline only up to the point where a specific file is generated:
# Ensure 'model.pkl' is up-to-date by running its parent stage ('train')
# and any necessary upstream stages if they are outdated.
dvc repro model.pkl
This targeted reproduction can save significant time, especially in complex pipelines where only a small part has been modified.
Dry Runs with --dry
Before committing to potentially lengthy computations, you can perform a "dry run". The --dry flag tells dvc repro to perform the dependency checking and report which stages would be executed, but without actually running their commands.
dvc repro --dry
This is useful for verifying that your changes have the expected impact on the pipeline execution plan.
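One way to build the dry run into a habit is a small wrapper that previews the plan and only proceeds after confirmation. The function name preview_then_repro is hypothetical, and the sketch assumes it is called from inside a DVC project with dvc on the PATH.

```shell
# Sketch: preview the execution plan, then run it only after confirmation.
# Any arguments (for example a stage name) are passed to both dvc repro calls.
preview_then_repro() {
  dvc repro --dry "$@" || return 1
  printf 'Run these stages now? [y/N] '
  read -r answer
  case "$answer" in
    y|Y) dvc repro "$@" ;;
    *)   echo "Skipped." ;;
  esac
}

# Example:
#   preview_then_repro train
```

Defining the function has no side effects; nothing is executed until it is called.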
Forcing Reproduction with --force
Occasionally, you might want to force a stage to re-run even if DVC considers it up-to-date. This could be necessary if a stage has non-deterministic behavior you want to capture again, or if an external factor not tracked by DVC (like a software library update) might affect the output. Use the --force flag carefully.
To force reproduction of a specific stage:
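# Force the 'featurize' stage to run, regardless of dependency changes.
# The stages it depends on are forced too; add --single-item (-s) to limit the run to 'featurize'.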
dvc repro --force featurize
To force reproduction of the entire pipeline:
# Use with caution!
dvc repro --force
After dvc repro successfully completes, the dvc.lock file will be updated to reflect the new state of the pipeline. Since dvc.lock tracks the state and ensures reproducibility, it's essential to commit this file to Git along with any changes you made to code, parameters (params.yaml), or the pipeline definition (dvc.yaml):
# After running 'dvc repro'
git add dvc.lock params.yaml src/train.py dvc.yaml
git commit -m "Update training parameters and retrain model"
# If new data outputs were generated and tracked
dvc push
By using dvc repro, you ensure that anyone checking out this Git commit can easily reproduce the exact same outputs by running dvc pull followed by dvc repro (though if they check out the commit where dvc.lock is already updated, dvc repro will typically report everything is up-to-date). This command is fundamental to achieving automated and reliable reproducibility in your DVC-managed projects. In the next sections, we will see how to integrate MLflow logging within these reproducible DVC pipelines.
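From a collaborator's side, the steps above might be wrapped as in the following sketch. The function name reproduce_at and the tag in the usage comment are hypothetical, and the sketch assumes it is run inside a clone of the project with a DVC remote already configured.

```shell
# Sketch: restore and verify the pipeline state recorded at a given Git revision.
reproduce_at() {
  rev="$1"
  git checkout "$rev" || return 1
  dvc pull || return 1    # fetch the data and models recorded for this revision
  dvc repro               # re-runs only stages whose recorded hashes no longer match
}

# Example (hypothetical tag name):
#   reproduce_at v1.0
```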
© 2026 ApX Machine Learning