Hands-on Practical: Create a Simple CI Pipeline with GitHub Actions

In this practical, you will build a simple pipeline that puts Continuous Integration (CI) principles into practice. The pipeline uses GitHub Actions, the automation service built into GitHub, to test machine learning code automatically whenever changes are pushed.

This hands-on exercise will guide you through setting up a workflow that checks code quality and runs basic tests, helping to keep our project stable and reliable.

Prerequisites for the Exercise

To follow along, you will need a GitHub account and a new, empty repository. We will create a small but complete machine learning project inside this repository.

First, create these three files in your project's root directory:

A data file (data.csv): A small sample of data. For this example, we can use a few lines from the famous Iris dataset.

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
6.2,2.9,4.3,1.3,versicolor
5.9,3.0,5.1,1.8,virginica

A Python training script (train.py): This script will train a simple model using our data. Note that it doesn't do much; its main purpose is to give our CI pipeline a runnable piece of code to check.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def run_training():
    """
    A simple function to load data and train a model.
    """
    # Load the dataset
    df = pd.read_csv('data.csv')

    # Define features and target
    X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
    y = df['species']

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Initialize and train the model
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)

    print("Training completed successfully.")

if __name__ == '__main__':
    run_training()

A dependencies file (requirements.txt): This file lists the Python libraries our project needs. Our CI pipeline will use this file to create a consistent environment. We are including flake8 for code linting and pytest for testing.
```
pandas
scikit-learn
flake8
pytest
```
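Before committing, it can help to see just how small the sample in data.csv is and what the split in train.py does with it. The following is a minimal sketch using only Python's standard library. The helper names `summarize_species` and `split_sizes` are hypothetical and not part of the project files, and `split_sizes` assumes that the test share is rounded up, which is how scikit-learn's train_test_split behaves to our knowledge.

```python
import csv
import io
import math
from collections import Counter

# The same rows as data.csv, inlined so the sketch runs on its own.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
6.2,2.9,4.3,1.3,versicolor
5.9,3.0,5.1,1.8,virginica
"""


def summarize_species(csv_text):
    """Count how many rows belong to each species (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["species"] for row in reader)


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


counts = summarize_species(SAMPLE_CSV)
print(dict(counts))
print(split_sizes(sum(counts.values()), 0.3))
```

With the four rows above, this reports two setosa rows and one row each for versicolor and virginica, and a split of two training rows and two test rows. That is enough for a CI smoke test, but it says nothing about model quality.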

Commit these three files to the main branch of your GitHub repository. With our project structure in place, we can now define our automation.

Understanding GitHub Actions Workflows

A GitHub Actions workflow is an automated process defined in a YAML file. You store these workflow files in a special directory within your repository: .github/workflows/. When an event occurs in your repository, like a code push, GitHub can trigger the corresponding workflow.

Our goal is to create a workflow that performs three main tasks:

Sets up a clean environment with Python and our required libraries.
Lints the code to check for style issues.
Runs a simple test to ensure the training script is functional.

Creating Your First CI Workflow

In your repository, create a new directory named .github, and inside it, another directory named workflows. Inside .github/workflows/, create a new file named ci-pipeline.yml.
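If you prefer to script this step rather than create the folders by hand, the directory layout can be sketched with Python's pathlib. The function name `create_workflow_file` is a hypothetical helper for illustration; only the path .github/workflows/ci-pipeline.yml comes from the text above.

```python
from pathlib import Path


def create_workflow_file(repo_root, filename="ci-pipeline.yml"):
    """Create .github/workflows/<filename> under repo_root if it is missing."""
    workflows_dir = Path(repo_root) / ".github" / "workflows"
    # parents=True also creates .github; exist_ok=True makes reruns harmless.
    workflows_dir.mkdir(parents=True, exist_ok=True)
    workflow_file = workflows_dir / filename
    # touch() creates an empty file and leaves existing content untouched.
    workflow_file.touch(exist_ok=True)
    return workflow_file
```

Calling `create_workflow_file(".")` from the repository root would leave an empty ci-pipeline.yml ready to be filled in.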

Copy the following YAML content into ci-pipeline.yml. We will then break down what each part does.

# A descriptive name for your workflow
name: Basic ML Code CI

# Trigger the workflow on pushes to the main branch
on:
  push:
    branches: [ "main" ]

# Define the jobs to be run
jobs:
  test-and-lint:
    # Use the latest version of Ubuntu as the runner environment
    runs-on: ubuntu-latest

    # Define the sequence of steps in the job
    steps:
      # Step 1: Check out your repository code
      - name: Check out repository code
        uses: actions/checkout@v4

      # Step 2: Set up a specific version of Python
      - name: Set up Python 3.9
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'

      # Step 3: Install project dependencies from requirements.txt
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      # Step 4: Lint code with flake8 to check for style issues
      - name: Lint with flake8
        run: |
          # stop the build if there are Python syntax errors or undefined names
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
          # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
          flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

      # Step 5: Run a basic test to ensure the script executes
      - name: Test with pytest
        run: |
          pytest

Breakdown of the Workflow File

name: This is a simple, human-readable name that will appear in the "Actions" tab of your GitHub repository.
on: push: branches: [ "main" ]: This is the trigger. It tells GitHub to run this workflow every time someone pushes a commit to the main branch.
jobs: Workflows are made up of one or more jobs. Our workflow has a single job named test-and-lint.
runs-on: ubuntu-latest: This specifies that the job will run on a fresh virtual machine hosted by GitHub, using the most recent Ubuntu image that GitHub provides for its hosted runners.
steps: This is the most important part. It defines a sequence of tasks that the job will execute.
- uses: actions/checkout@v4: This is a pre-built action that downloads a copy of your repository's code onto the runner, so the subsequent steps can access your files.
- uses: actions/setup-python@v5: Another pre-built action that installs a specific version of Python, in this case, version 3.9.
- run: pip install -r requirements.txt: The run command executes command-line instructions. Here, we use pip to install all the libraries listed in our requirements.txt file.
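- run: flake8 . --count ...: This step runs the flake8 linter. A linter analyzes code for potential errors and stylistic issues without actually running it. The first command fails the build on syntax errors and undefined names; the second reports the remaining style and complexity findings but, because of --exit-zero, never fails the build.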
- run: pytest: This final step executes the pytest testing framework. We haven't created any tests yet, and pytest exits with a non-zero status when it collects no tests, which would fail this step, so let's add one now.
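Taken together, the run steps above behave like a fail-fast sequence: by default, a job stops at the first step that exits with a non-zero status, and the steps after it are skipped. The sketch below imitates that behavior locally. It is an approximation under stated assumptions, not how GitHub runs jobs internally; `run_steps` and `LOCAL_STEPS` are hypothetical names, and the commands assume flake8 and pytest are installed in the current Python environment.

```python
import subprocess
import sys


def run_steps(steps):
    """Run (name, argv) pairs in order and stop at the first failure.

    Returns the name of the failing step, or None if every step
    exits with status 0.
    """
    for name, argv in steps:
        print(f"--- {name}")
        result = subprocess.run(argv)
        if result.returncode != 0:
            return name
    return None


# Local approximations of the workflow's run steps (hypothetical list).
LOCAL_STEPS = [
    ("Install dependencies",
     [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"]),
    ("Lint with flake8 (blocking checks only)",
     [sys.executable, "-m", "flake8", ".", "--count",
      "--select=E9,F63,F7,F82", "--show-source", "--statistics"]),
    ("Test with pytest", [sys.executable, "-m", "pytest"]),
]
```

Running `run_steps(LOCAL_STEPS)` from the repository root before pushing gives a rough preview of whether the workflow would pass.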

Adding a Simple Test

For our CI pipeline to be meaningful, it needs something to test. Let's create a very basic test that confirms our train.py script can be imported and executed without crashing.

Create a new file in your project's root directory named test_script.py:

import pytest
from train import run_training

def test_training_runs():
    """
    Tests if the training function executes without raising an exception.
    """
    try:
        run_training()
    except Exception as e:
        pytest.fail(f"run_training() raised an exception: {e}")

This test is simple but effective for a CI check. It imports the run_training function and calls it. If the call raises an exception, pytest.fail() reports it and the test step in our pipeline fails. pytest would also fail the test on an uncaught exception, so the explicit call only customizes the failure message.

Visualizing the Pipeline

The workflow we've defined follows a clear, linear sequence of steps. This process ensures that before any code is deemed "good," it has been checked out, its environment has been built, and it has passed both quality and functional checks.

Figure: The sequence of automated steps in our Continuous Integration workflow. The process begins with a code push and proceeds through setup, linting, and testing.

Running the Workflow and Seeing the Results

You are now ready to see your pipeline in action.

Add, commit, and push the ci-pipeline.yml and test_script.py files to your GitHub repository.
Navigate to your repository on the GitHub website and click on the "Actions" tab.
You will see your "Basic ML Code CI" workflow listed. GitHub automatically detected the new workflow file and triggered a run because you pushed to the main branch.

Click on the workflow run to see the details. You can expand the test-and-lint job to see the log output for each step. If all steps complete successfully, you will see a green checkmark next to them. If a step fails, for instance, if flake8 finds a syntax error or a pytest test fails, the step will be marked with a red "X," and the entire workflow run will be marked as failed.

You have now built a foundational CI pipeline. This simple automation adds a layer of safety and quality control to your project: every push to your main branch is linted and tested automatically, so problems surface within minutes instead of lingering unnoticed, and you are free to focus on developing new features. This is a fundamental practice in building reliable and maintainable machine learning systems.
