While tracking individual data files and experiment runs is valuable, most machine learning projects involve multiple steps: fetching data, cleaning it, transforming features, training a model, and evaluating it. Manually running each step and ensuring the correct inputs and outputs are used becomes tedious and error prone. DVC pipelines provide a way to automate and manage these multi-stage workflows, making them reproducible and easier to manage.
A DVC pipeline defines the sequence of operations, their dependencies (inputs like code scripts and data files), and their outputs. DVC tracks these relationships, allowing it to intelligently determine which stages need to be re-run when something changes.
dvc stage add
The primary command for defining pipeline steps is dvc stage add. It allows you to encapsulate a single command or script execution as a "stage" within your workflow. DVC records the stage definition, including its dependencies and outputs, in a file named dvc.yaml.
The basic syntax looks like this:
dvc stage add -n <stage_name> \
    -d <dependency1> [-d <dependency2> ...] \
    -o <output1> [-o <output2> ...] \
    -p <params_file>:<param_section> \
    <command_to_run>
Let's break down the options:
-n <stage_name>: Assigns a unique, human-readable name to this stage (e.g., prepare_data, train_model). This name identifies the stage in dvc.yaml and in DVC commands.
-d <dependency>: Specifies an input dependency for this stage. This can be a source code file (e.g., src/prepare.py), a data file tracked by DVC (e.g., data/raw/data.csv), or even a DVC-tracked output from a previous stage. DVC monitors dependencies; if one changes, the stage needs to be re-run. You can specify multiple -d flags.
-o <output>: Specifies an output file or directory generated by this stage. DVC will start tracking this output. As with dependencies, you can specify multiple -o flags. Outputs are often used as dependencies for subsequent stages.
-p <params_file>:<param_section>: Specifies parameters used by the stage, typically stored in a separate configuration file (commonly params.yaml). DVC tracks changes to these parameters. If a relevant parameter changes, DVC knows the stage might produce different results and needs re-running. See more on parameter tracking below.
<command_to_run>: The actual shell command that executes the logic for this stage. This usually involves running a script, such as python src/prepare.py.
Example: Data Preparation Stage
Imagine you have a script src/prepare.py that takes raw data data/raw/iris.csv (already tracked with dvc add data/raw/iris.csv) and produces a processed dataset data/prepared/train.csv. You might define this stage as follows:
# Ensure data/prepared directory exists
mkdir -p data/prepared
# Add the stage
dvc stage add -n prepare \
    -d src/prepare.py \
    -d data/raw/iris.csv \
    -o data/prepared/train.csv \
    "python src/prepare.py --input data/raw/iris.csv --output data/prepared/train.csv"
After running this command, DVC creates or updates the dvc.yaml file with an entry like this:
# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py --input data/raw/iris.csv --output data/prepared/train.csv
    deps:
      - src/prepare.py
      - data/raw/iris.csv
    outs:
      - data/prepared/train.csv
This YAML entry clearly defines the prepare stage: the command to execute, its code and data dependencies, and the output it generates. Unlike dvc add, DVC does not create a separate .dvc file for a pipeline output; the output's hash is recorded in dvc.lock when the pipeline runs, and DVC adds the output path to .gitignore so Git ignores the data itself. Remember to add dvc.yaml and the updated .gitignore to Git:
git add dvc.yaml .gitignore
git commit -m "Add data preparation stage to DVC pipeline"
The real power comes from connecting multiple stages. The outputs of one stage become the dependencies of the next. Let's add a feature engineering stage that depends on the prepared data:
# Assume src/featurize.py exists
mkdir -p data/features
# Add the featurize stage
dvc stage add -n featurize \
    -d src/featurize.py \
    -d data/prepared/train.csv \
    -o data/features/train.pkl \
    "python src/featurize.py --input data/prepared/train.csv --output data/features/train.pkl"
# Add changes to Git
git add dvc.yaml .gitignore
git commit -m "Add feature engineering stage"
Now, dvc.yaml contains both stages. DVC understands that featurize depends on the output of prepare. We can visualize this dependency graph, for example with dvc dag.
A simple pipeline showing dependencies between code, data, and stages. Outputs of one stage (like data/prepared/train.csv) become inputs for the next.
You can continue adding stages for training, evaluation, etc., building a complete graph of your ML workflow.
Machine learning experiments often involve tuning hyperparameters or changing configuration settings. DVC pipelines allow you to track these parameters explicitly.
First, create a parameter file, typically params.yaml:
# params.yaml
prepare:
  split: 0.2   # train/test split ratio
  seed: 42
featurize:
  max_features: 100
  ngram_range: [1, 1]
train:
  n_estimators: 100
  min_split: 2
  seed: 42
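Your scripts also need some way to actually receive these values. Two common patterns are reading params.yaml directly inside the script (a minimal sketch below, assuming PyYAML is available) or letting DVC substitute the values into the stage command, which is the approach the following examples use.
# Reading parameters straight from params.yaml inside a script (optional pattern; assumes PyYAML)
import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)

max_features = params["featurize"]["max_features"]       # 100
ngram_range = tuple(params["featurize"]["ngram_range"])   # (1, 1)
print(max_features, ngram_range)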
Now, modify your stage definition using the -p flag to declare which parameters the stage depends on. You reference parameters using the syntax <filename>:<section>.<parameter_name>, or just <filename>:<section> to depend on all parameters within that section (when the file is the default params.yaml, the filename prefix can be omitted).
Let's modify the featurize stage definition to depend on the max_features parameter:
# Use dvc stage add --force to overwrite the existing stage.
# Single quotes keep ${...} literal so DVC's templating can substitute the
# value from params.yaml when the stage runs.
dvc stage add -n featurize --force \
    -d src/featurize.py \
    -d data/prepared/train.csv \
    -p params.yaml:featurize.max_features \
    -o data/features/train.pkl \
    'python src/featurize.py --input data/prepared/train.csv --output data/features/train.pkl --max-features ${featurize.max_features}'
Alternatively, depend on the entire featurize section:
dvc stage add -n featurize --force \
    -d src/featurize.py \
    -d data/prepared/train.csv \
    -p params.yaml:featurize \
    -o data/features/train.pkl \
    'python src/featurize.py --input data/prepared/train.csv --output data/features/train.pkl --max-features ${featurize.max_features}'
# Assumes src/featurize.py reads ngram_range from params.yaml itself
The dvc.yaml file will now include a params section:
# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py --input data/raw/iris.csv --output data/prepared/train.csv
    deps:
      - src/prepare.py
      - data/raw/iris.csv
    outs:
      - data/prepared/train.csv
  featurize:
    cmd: python src/featurize.py --input data/prepared/train.csv --output data/features/train.pkl --max-features ${featurize.max_features}
    deps:
      - src/featurize.py
      - data/prepared/train.csv
    params:                     # new section; entries refer to params.yaml
      - featurize.max_features
      - featurize.ngram_range
    outs:
      - data/features/train.pkl
Now, if you modify max_features or ngram_range in params.yaml and commit the change, DVC will know that the featurize stage (and any subsequent stages depending on its output) needs to be re-run when you execute the pipeline. This explicit tracking of parameters alongside code and data is fundamental for reproducibility.
Don't forget to add params.yaml to Git:
git add params.yaml dvc.yaml
git commit -m "Add parameter tracking for featurize stage"
By defining your workflow stages, dependencies, outputs, and parameters in dvc.yaml, you create a blueprint for your project. This blueprint enables DVC to automate execution and ensure consistency, which we will explore in the next section on reproducing pipelines. This structured approach also forms the basis for integrating MLflow tracking within each automated stage.