Now, let's apply the concepts we've discussed by building a practical, integrated workflow. This hands-on example demonstrates how DVC and MLflow can work together to manage data, track experiments, and automate pipeline execution for improved reproducibility.

We will create a simple machine learning pipeline consisting of two stages: data preparation and model training. DVC will manage the data artifacts and orchestrate the pipeline, while MLflow will track the parameters, metrics, and model produced during the training stage.

### Prerequisites

Ensure you have the following installed and configured:

- Python (3.7+)
- Git
- DVC (`pip install "dvc[s3]"` or the equivalent for your chosen remote storage)
- MLflow (`pip install mlflow scikit-learn pandas`)
- A basic understanding of Git commands (`git init`, `git add`, `git commit`)
- Familiarity with concepts from previous chapters (DVC initialization, adding data, MLflow logging basics)

### Project Setup

First, let's set up our project directory structure.

**Create a new directory for the project and navigate into it:**

```bash
mkdir integrated-pipeline-example
cd integrated-pipeline-example
```

**Initialize Git and DVC:**

```bash
git init
dvc init
```

This creates the `.git` and `.dvc` directories. Remember to commit the initial DVC configuration files:

```bash
git add .dvc .dvcignore
git commit -m "Initialize DVC"
```

**Create the necessary subdirectories:**

```bash
mkdir data src models data/raw data/processed
```

**Create placeholder files for our code and parameters:**

```bash
touch src/prepare.py src/train.py params.yaml requirements.txt
```

**Add a simple raw dataset.** For this example, create a dummy CSV file `data/raw/data.csv`:

```csv
feature1,feature2,target
1.0,2.1,0
1.5,2.5,0
1.8,2.9,0
3.2,4.5,1
3.5,5.1,1
4.0,6.0,1
0.8,1.9,0
3.8,5.5,1
```

**Add the project structure files to Git:**

```bash
# Create a .gitignore file first if needed (e.g., add __pycache__/, *.pyc, etc.)
git add data/raw/data.csv src/ params.yaml requirements.txt .gitignore
git commit -m "Initial project structure and raw data"
```
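At this point it is worth confirming the layout. Assuming the commands above were run as shown, the working directory should look roughly like this (the `data/processed/` and `models/` directories are still empty; the pipeline will fill them):

```text
integrated-pipeline-example/
├── .dvc/
├── .git/
├── .dvcignore
├── data/
│   ├── raw/
│   │   └── data.csv
│   └── processed/
├── models/
├── src/
│   ├── prepare.py
│   └── train.py
├── params.yaml
└── requirements.txt
```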
### Step 1: Data Preparation Stage (DVC)

This stage takes the raw data, performs a simple preparation step (splitting it into train and test sets), and saves the processed data. DVC will manage the processed data files.

**Edit `src/prepare.py`:**

```python
# src/prepare.py
import os

import pandas as pd
import yaml
from sklearn.model_selection import train_test_split

# Ensure the processed-data directory exists
os.makedirs('data/processed', exist_ok=True)

# Load parameters
with open('params.yaml', 'r') as f:
    params = yaml.safe_load(f)

split_ratio = params['prepare']['split']
seed = params['base']['seed']

# Load raw data
raw_data_path = 'data/raw/data.csv'
df = pd.read_csv(raw_data_path)

# Split data
train_df, test_df = train_test_split(df, test_size=split_ratio, random_state=seed)

# Save processed data
train_output_path = 'data/processed/train.csv'
test_output_path = 'data/processed/test.csv'
train_df.to_csv(train_output_path, index=False)
test_df.to_csv(test_output_path, index=False)

print("Processed data saved:")
print(f"- Train set: {train_output_path}")
print(f"- Test set: {test_output_path}")
```

**Edit `params.yaml`:** Define the parameters used in the preparation and upcoming training stages.

```yaml
# params.yaml
base:
  seed: 42

prepare:
  split: 0.3   # Test set ratio

train:
  model_type: LogisticRegression
  solver: 'liblinear'   # Example parameter for LogisticRegression
  C: 1.0                # Regularization strength
```

**Define the DVC stage.** Use `dvc stage add` to define this preparation step in `dvc.yaml`. This command tells DVC how to run the script, what its dependencies are, and what outputs it produces.

```bash
dvc stage add -n prepare \
  -p base.seed,prepare.split \
  -d src/prepare.py -d data/raw/data.csv \
  -o data/processed/train.csv -o data/processed/test.csv \
  python src/prepare.py
```

- `-n prepare`: Names the stage `prepare`.
- `-p base.seed,prepare.split`: Declares the parameters from `params.yaml` used by this stage. DVC will track changes to these specific parameters.
- `-d src/prepare.py -d data/raw/data.csv`: Specifies the script and data dependencies. If these change, the stage needs rerunning.
- `-o data/processed/train.csv -o data/processed/test.csv`: Declares the outputs produced by the stage. DVC will track these files.
- `python src/prepare.py`: The command to execute for this stage.

**Commit the changes.** The `dvc stage add` command creates (or updates) `dvc.yaml` and adds the declared output paths to `.gitignore` (here within `data/processed/.gitignore`) so Git ignores the data files themselves. The outputs are versioned through the pipeline lock file, `dvc.lock`, which DVC writes when the stage is actually run with `dvc repro`. Commit the pipeline definition to Git:

```bash
git add dvc.yaml data/processed/.gitignore src/prepare.py params.yaml
git commit -m "Add DVC stage: prepare data"
```

You can optionally push the DVC-tracked data to remote storage if you have configured one (`dvc push`).

### Step 2: Model Training Stage (DVC + MLflow)

This stage trains a model using the processed data, logs the experiment with MLflow, and saves the model and metrics. DVC manages the dependencies (processed data, script, parameters) and outputs (model file, metrics file).

**Edit `src/train.py`:** This script now incorporates MLflow logging.

```python
# src/train.py
import json
import os
import pickle

import mlflow
import mlflow.sklearn
import pandas as pd
import yaml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load parameters
with open('params.yaml', 'r') as f:
    params = yaml.safe_load(f)

# --- MLflow Setup ---
# Optional: set the tracking URI if using a remote server
# mlflow.set_tracking_uri("http://...")
mlflow.set_experiment("Simple Classification")

# Start an MLflow run
with mlflow.start_run():
    seed = params['base']['seed']
    model_params = params['train']

    # --- Log Parameters ---
    mlflow.log_param("seed", seed)
    mlflow.log_params(model_params)  # Log all training params

    # --- Load Data ---
    train_data_path = 'data/processed/train.csv'
    test_data_path = 'data/processed/test.csv'
    train_df = pd.read_csv(train_data_path)
    test_df = pd.read_csv(test_data_path)

    X_train = train_df[['feature1', 'feature2']]
    y_train = train_df['target']
    X_test = test_df[['feature1', 'feature2']]
    y_test = test_df['target']

    # --- Log DVC Data Info (Example) ---
    # Log the paths of the input data as tags. Logging the exact data version
    # would require parsing dvc.lock or using the DVC Python API; it is
    # simplified here.
    mlflow.set_tag("train_data_path", train_data_path)
    mlflow.set_tag("test_data_path", test_data_path)

    # --- Train Model ---
    # Using Logistic Regression as defined in params.yaml
    if model_params['model_type'] == 'LogisticRegression':
        model = LogisticRegression(
            solver=model_params['solver'],
            C=model_params['C'],
            random_state=seed
        )
    else:
        raise ValueError(f"Unsupported model type: {model_params['model_type']}")

    model.fit(X_train, y_train)

    # --- Evaluate Model ---
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # --- Log Metrics ---
    mlflow.log_metric("accuracy", accuracy)
    print(f"Model Accuracy: {accuracy:.4f}")

    # --- Save and Log Model ---
    os.makedirs('models', exist_ok=True)
    model_output_path = 'models/model.pkl'
    with open(model_output_path, 'wb') as f:
        pickle.dump(model, f)

    # Log the model using MLflow's scikit-learn integration
    mlflow.sklearn.log_model(model, "sklearn-model")

    print(f"Model saved to: {model_output_path}")
    print(f"Model logged to MLflow run: {mlflow.active_run().info.run_id}")

    # --- Save Metrics File (for DVC tracking) ---
    metrics_output_path = 'metrics.json'
    metrics_data = {'accuracy': accuracy}
    with open(metrics_output_path, 'w') as f:
        json.dump(metrics_data, f, indent=4)
    print(f"Metrics saved to: {metrics_output_path}")

print("MLflow Run Completed.")
```
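The script only records the data *paths* as tags. If you also want each MLflow run to record exactly which data version it consumed, one option is to read the hashes DVC writes to `dvc.lock` for the `prepare` stage's outputs and attach them as tags. The helper below is a hypothetical sketch, not part of the pipeline above; it assumes `dvc.lock` already exists, which is the case when the `train` stage runs under `dvc repro` because the `prepare` stage has completed by then:

```python
# Hypothetical helper: attach the md5 hashes DVC recorded for the prepare outputs
# as tags on the currently active MLflow run.
import mlflow
import yaml


def log_dvc_data_versions(lock_path: str = "dvc.lock", stage: str = "prepare") -> None:
    with open(lock_path) as f:
        lock = yaml.safe_load(f)
    for out in lock["stages"][stage].get("outs", []):
        # Produces tags such as "dvc_md5.data/processed/train.csv" -> "<hash>"
        mlflow.set_tag(f"dvc_md5.{out['path']}", out["md5"])
```

Calling such a helper inside the `with mlflow.start_run():` block would make every run traceable to the exact processed dataset it trained on.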
**Define the DVC stage.** Add the training stage to `dvc.yaml`. It depends on the `prepare` stage's outputs, the training script, and the relevant parameters, and it produces the model file and a metrics file.

```bash
dvc stage add -n train \
  -p base.seed,train \
  -d src/train.py -d data/processed/train.csv -d data/processed/test.csv \
  -o models/model.pkl \
  -m metrics.json \
  python src/train.py
```

- `-n train`: Names the stage `train`.
- `-p base.seed,train`: Tracks the shared seed and all parameters under the `train` section of `params.yaml`.
- `-d ...`: Specifies the dependencies: the script (`src/train.py`) and the outputs of the previous stage (`data/processed/*.csv`).
- `-o models/model.pkl`: Declares the model file as an output tracked by DVC.
- `-m metrics.json`: Declares `metrics.json` as a metrics file. DVC can parse and display metrics from such files.
- `python src/train.py`: The command to execute.

**Commit the changes.** Commit the updated `dvc.yaml`, the new training script, and the `.gitignore` that DVC creates in `models/` so that Git ignores the actual model file. (`dvc.lock` and `metrics.json` will only exist after the pipeline has run; they are committed in the next step.)

```bash
git add dvc.yaml src/train.py models/.gitignore
git commit -m "Add DVC stage: train model with MLflow logging"
```

### Running and Reproducing the Pipeline

We now have a two-stage pipeline defined in `dvc.yaml`.

**Execute the pipeline.** Use `dvc repro` to run the entire pipeline from start to finish. DVC checks the dependencies and executes stages as needed.

```bash
dvc repro
```

You will see output from both `prepare.py` and `train.py`, including the MLflow logging messages and the final accuracy. DVC runs `prepare` first, then `train`, and records the results in `dvc.lock`.

**Check the status.** `git status` will show that the pipeline run created (or updated) `dvc.lock` and `metrics.json`. Commit these updates:

```bash
git add dvc.lock metrics.json
git commit -m "Run integrated pipeline"
```

`dvc status` should now report that the pipeline is up to date, and `dvc metrics show` displays the metrics tracked in `metrics.json`.

**Inspect the MLflow UI.** Launch the MLflow UI to see the tracked experiment run:

```bash
mlflow ui
```

Navigate to http://localhost:5000 (or your configured address) in your browser and find the "Simple Classification" experiment. You should see a run logged with:

- Parameters: `seed`, `model_type`, `solver`, `C`
- Tags: `train_data_path`, `test_data_path`
- Metrics: `accuracy`
- Artifacts: the saved `sklearn-model` (which includes `model.pkl`, `conda.yaml`, `python_env.yaml`, and `requirements.txt`)
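If you prefer to check results without the UI (for example in a notebook or a CI job), the same information can be queried through MLflow's Python API. A minimal sketch, using the experiment name set in the training script:

```python
# List the runs of the "Simple Classification" experiment as a pandas DataFrame.
import mlflow

experiment = mlflow.get_experiment_by_name("Simple Classification")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

# Logged parameters and metrics appear as "params.*" and "metrics.*" columns.
print(runs[["run_id", "params.C", "params.solver", "metrics.accuracy"]])
```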
The diagram below summarizes how Git, DVC, and MLflow each track different parts of the project.

```dot
digraph G {
    rankdir=LR;
    node [shape=box, style=filled, fillcolor="#e9ecef", fontname="sans-serif"];
    edge [fontname="sans-serif"];

    subgraph cluster_git {
        label = "Git Repository";
        style=filled;
        fillcolor="#f8f9fa";
        Git [label="Git History", shape=folder, fillcolor="#dee2e6"];
        Code [label="src/prepare.py\nsrc/train.py", fillcolor="#a5d8ff"];
        Params [label="params.yaml", fillcolor="#ffec99"];
        DVC_Meta [label="dvc.yaml\ndvc.lock", fillcolor="#bac8ff"];
        Metrics_File [label="metrics.json", fillcolor="#b2f2bb"];
        Git -> Code;
        Git -> Params;
        Git -> DVC_Meta;
        Git -> Metrics_File;
    }

    subgraph cluster_dvc {
        label = "DVC Tracking";
        style=filled;
        fillcolor="#f8f9fa";
        RawData [label="data/raw/data.csv", shape=cylinder, fillcolor="#ced4da"];
        ProcessedData [label="data/processed/*.csv", shape=cylinder, fillcolor="#ced4da"];
        Model [label="models/model.pkl", shape=cylinder, fillcolor="#ced4da"];
        Remote [label="Remote Storage\n(S3, GCS, etc.)", shape=cylinder, fillcolor="#868e96"];
        RawData -> ProcessedData [style=dashed, label="prepare stage"];
        ProcessedData -> Model [style=dashed, label="train stage"];
        DVC_Meta -> ProcessedData [label="controls version"];
        DVC_Meta -> Model [label="controls version"];
        ProcessedData -> Remote [label="dvc push/pull"];
        Model -> Remote [label="dvc push/pull"];
    }

    subgraph cluster_mlflow {
        label = "MLflow Tracking";
        style=filled;
        fillcolor="#f8f9fa";
        MLflowUI [label="MLflow UI / Server", shape=component, fillcolor="#ffd8a8"];
        Run [label="Experiment Run", fillcolor="#ffe066"];
        Run -> MLflowUI [label="logs to"];
        Code -> Run [label="generates"];
        Params -> Run [label="logs parameters"];
        Metrics_File -> Run [label="logs metrics"];            // Also logged directly
        Model -> Run [label="logs artifact"];                  // Logged directly
        ProcessedData -> Run [label="logs tag (data info)"];   // Logged directly via script
    }

    Code -> ProcessedData [label="prepare.py creates"];
    Code -> Model [label="train.py creates"];
    Code -> Metrics_File [label="train.py creates"];
    Params -> Code [label="influences"];
    DVC_Meta -> Code [label="defines stages for"];
}
```

*Diagram showing the relationship between Git, DVC-tracked artifacts/stages, and MLflow experiment tracking within the project structure.*

### Iterating on the Pipeline

Now, let's see the power of this integrated setup when we make changes.

**Modify the parameters.** Edit `params.yaml` and change a training parameter, for example the regularization strength `C`:

```yaml
# params.yaml
base:
  seed: 42

prepare:
  split: 0.3

train:
  model_type: LogisticRegression
  solver: 'liblinear'
  C: 0.1   # Changed from 1.0 to 0.1
```

**Reproduce.** Run `dvc repro` again:

```bash
dvc repro
```

Notice that DVC detects that the change in `params.yaml` affects the `train` stage. It skips the `prepare` stage (because its dependencies haven't changed) and reruns only the `train` stage. The training script executes again, logging a new run to MLflow with `C=0.1`.

**Verify.** Commit the updated `params.yaml`, `dvc.lock`, and `metrics.json` to Git:

```bash
git add dvc.lock metrics.json params.yaml
git commit -m "Experiment: Change C to 0.1"
```

Check the MLflow UI again. You will see a second run in the "Simple Classification" experiment. You can select both runs and use the "Compare" feature to see the difference in parameters (`C`) and the resulting accuracy.
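The same comparison is available from the DVC side. Assuming both pipeline runs were committed as above, `dvc params diff` and `dvc metrics diff` can report what changed between the previous commit and the current workspace:

```bash
# Compare parameters and metrics against the previous commit
dvc params diff HEAD~1
dvc metrics diff HEAD~1
```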
This hands-on example illustrates how DVC pipelines automate the execution flow based on changes in dependencies (code, data, parameters), while MLflow captures the specifics of each execution (parameters, metrics, artifacts). By committing the DVC metadata (`dvc.yaml`, `dvc.lock`) alongside your code and parameters in Git, you create a fully versioned and reproducible machine learning workflow. Anyone with access to your Git repository and the DVC remote storage can check out a specific commit and reproduce your exact pipeline and results.
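Concretely, a collaborator reproducing a particular result might run something like the following. The repository URL and revision are placeholders, and a configured DVC remote is assumed:

```bash
git clone https://example.com/your-org/integrated-pipeline-example.git
cd integrated-pipeline-example
git checkout <commit-or-tag>       # the exact version to reproduce
pip install -r requirements.txt    # or otherwise install dvc, mlflow, scikit-learn, pandas
dvc pull                           # fetch the DVC-tracked data and model from remote storage
dvc repro                          # re-runs only what is not already cached and up to date
```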