While tracking individual data files and experiment runs is valuable, most machine learning projects involve multiple steps: fetching data, cleaning it, transforming features, training a model, and evaluating it. Manually running each step and ensuring the correct inputs and outputs are used becomes tedious and error prone. DVC pipelines provide a way to automate and manage these multi-stage workflows, making them reproducible and easier to manage.
A DVC pipeline defines the sequence of operations, their dependencies (inputs like code scripts and data files), and their outputs. DVC tracks these relationships, allowing it to intelligently determine which stages need to be re-run when something changes.
dvc stage add
The primary command for defining pipeline steps is dvc stage add. It allows you to encapsulate a single command or script execution as a "stage" within your workflow. DVC records the stage definition, including its dependencies and outputs, in a file named dvc.yaml.
The basic syntax looks like this:
dvc stage add -n <stage_name> \
    -d <dependency1> [-d <dependency2> ...] \
    -o <output1> [-o <output2> ...] \
    -p <params_file>:<param_section> \
    <command_to_run>
Let's break down the options:
-n <stage_name>: Assigns a unique, human-readable name to this stage (e.g., prepare_data, train_model). This name identifies the stage in dvc.yaml and in DVC commands.
-d <dependency>: Specifies an input dependency for this stage. This can be a source code file (e.g., src/prepare.py), a data file tracked by DVC (e.g., data/raw/data.csv), or even a DVC-tracked output from a previous stage. DVC monitors dependencies; if one changes, the stage needs to be re-run. You can specify multiple -d flags.
-o <output>: Specifies an output file or directory generated by this stage. DVC will start tracking this output. As with dependencies, you can specify multiple -o flags. Outputs are often used as dependencies for subsequent stages.
-p <params_file>:<param_section>: Specifies parameters used by the stage, typically stored in a separate configuration file (commonly params.yaml). DVC tracks changes to these parameters. If a relevant parameter changes, DVC knows the stage might produce different results and needs re-running. See more on parameter tracking below.
<command_to_run>: The actual shell command that executes the logic for this stage. This usually involves running a script, such as python src/prepare.py.
Example: Data Preparation Stage
Imagine you have a script src/prepare.py that takes raw data data/raw/iris.csv (already tracked with dvc add data/raw/iris.csv) and produces a processed dataset data/prepared/train.csv. You might define this stage as follows:
# Ensure data/prepared directory exists
mkdir -p data/prepared
# Add the stage
dvc stage add -n prepare \
    -d src/prepare.py \
    -d data/raw/iris.csv \
    -o data/prepared/train.csv \
    "python src/prepare.py --input data/raw/iris.csv --output data/prepared/train.csv"
After running this command, DVC creates or updates the dvc.yaml file with an entry like this:
# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py --input data/raw/iris.csv --output data/prepared/train.csv
    deps:
      - src/prepare.py
      - data/raw/iris.csv
    outs:
      - data/prepared/train.csv
This YAML entry clearly defines the prepare stage: the command to execute, its code and data dependencies, and the output it generates. Unlike dvc add, DVC does not create a separate .dvc file for a pipeline output; the output's hash is recorded in dvc.lock when the pipeline runs, and DVC adds the output path to .gitignore so Git ignores the data itself. Remember to add dvc.yaml and the updated .gitignore to Git:
git add dvc.yaml .gitignore
git commit -m "Add data preparation stage to DVC pipeline"
The real power comes from connecting multiple stages. The outputs of one stage become the dependencies of the next. Let's add a feature engineering stage that depends on the prepared data:
# Assume src/featurize.py exists
mkdir -p data/features
# Add the featurize stage
dvc stage add -n featurize \
    -d src/featurize.py \
    -d data/prepared/train.csv \
    -o data/features/train.pkl \
    "python src/featurize.py --input data/prepared/train.csv --output data/features/train.pkl"
# Add changes to Git
git add dvc.yaml .gitignore
git commit -m "Add feature engineering stage"
Now, dvc.yaml contains both stages. DVC understands that featurize depends on the output of prepare. We can visualize this dependency graph, for example with dvc dag.
A simple pipeline showing dependencies between code, data, and stages. Outputs of one stage (like data/prepared/train.csv) become inputs for the next.
You can continue adding stages for training, evaluation, etc., building a complete graph of your ML workflow.
Machine learning experiments often involve tuning hyperparameters or changing configuration settings. DVC pipelines allow you to track these parameters explicitly.
First, create a parameter file, typically params.yaml:
# params.yaml
prepare:
  split: 0.2   # train/test split ratio
  seed: 42
featurize:
  max_features: 100
  ngram_range: [1, 1]
train:
  n_estimators: 100
  min_split: 2
  seed: 42
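Your scripts also need some way to actually receive these values. Two common patterns are reading params.yaml directly inside the script (a minimal sketch below, assuming PyYAML is available) or letting DVC substitute the values into the stage command, which is the approach the following examples use.
# Reading parameters straight from params.yaml inside a script (optional pattern; assumes PyYAML)
import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)

max_features = params["featurize"]["max_features"]       # 100
ngram_range = tuple(params["featurize"]["ngram_range"])   # (1, 1)
print(max_features, ngram_range)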
Now, modify your stage definition using the -p flag to declare which parameters the stage depends on. You reference parameters using the syntax <filename>:<section>.<parameter_name>, or just <filename>:<section> to depend on all parameters within that section (when the file is the default params.yaml, the filename prefix can be omitted).
Let's modify the featurize stage definition to depend on the max_features parameter:
# Use dvc stage add --force to overwrite the existing stage.
# Single quotes keep ${...} literal so DVC's templating can substitute the
# value from params.yaml when the stage runs.
dvc stage add -n featurize --force \
    -d src/featurize.py \
    -d data/prepared/train.csv \
    -p params.yaml:featurize.max_features \
    -o data/features/train.pkl \
    'python src/featurize.py --input data/prepared/train.csv --output data/features/train.pkl --max-features ${featurize.max_features}'
Alternatively, depend on the entire featurize section:
dvc stage add -n featurize --force \
    -d src/featurize.py \
    -d data/prepared/train.csv \
    -p params.yaml:featurize \
    -o data/features/train.pkl \
    'python src/featurize.py --input data/prepared/train.csv --output data/features/train.pkl --max-features ${featurize.max_features}'
# Assumes src/featurize.py reads ngram_range from params.yaml itself
The dvc.yaml file will now include a params section:
# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py --input data/raw/iris.csv --output data/prepared/train.csv
    deps:
      - src/prepare.py
      - data/raw/iris.csv
    outs:
      - data/prepared/train.csv
  featurize:
    cmd: python src/featurize.py --input data/prepared/train.csv --output data/features/train.pkl --max-features ${featurize.max_features}
    deps:
      - src/featurize.py
      - data/prepared/train.csv
    params:                     # new section; entries refer to params.yaml
      - featurize.max_features
      - featurize.ngram_range
    outs:
      - data/features/train.pkl
Now, if you modify max_features or ngram_range in params.yaml and commit the change, DVC will know that the featurize stage (and any subsequent stages depending on its output) needs to be re-run when you execute the pipeline. This explicit tracking of parameters alongside code and data is fundamental for reproducibility.
Don't forget to add params.yaml to Git:
git add params.yaml dvc.yaml
git commit -m "Add parameter tracking for featurize stage"
By defining your workflow stages, dependencies, outputs, and parameters in dvc.yaml, you create a blueprint for your project. This blueprint enables DVC to automate execution and ensure consistency, which we will explore in the next section on reproducing pipelines. This structured approach also forms the basis for integrating MLflow tracking within each automated stage.