Running a single distributed training job is a significant engineering task, but production machine learning requires orchestrating a sequence of interconnected tasks. An entire workflow, from data ingestion and validation to model training, evaluation, and conditional deployment, must be automated, reproducible, and scalable. Manually executing these steps is not only inefficient but also prone to error, especially in a multi-user environment. This is where a workflow orchestrator becomes indispensable.
KubeFlow Pipelines is a Kubernetes-native workflow engine designed specifically for composing, deploying, and managing end-to-end machine learning pipelines. As part of the larger KubeFlow project, its primary function is to provide a structured framework for defining complex task graphs where each step runs as a container on the Kubernetes cluster. This approach allows you to leverage the full power of Kubernetes for resource management, scheduling, and isolation directly within your ML workflow definitions.
A KubeFlow Pipeline is a Directed Acyclic Graph (DAG), where each node in the graph represents a specific task, or component. Each component is a self-contained unit of work packaged as a container, such as ingesting a dataset, transforming it with pandas, or training a model with PyTorch. Because each component is a container, you can specify its exact resource requirements, such as "this training step needs 4 CPUs, 32Gi of RAM, and 1 NVIDIA A100 GPU."
The following diagram illustrates a standard pipeline structure. Data flows from one component to the next, with the potential for conditional execution based on the output of a prior step, such as model evaluation metrics.
A typical ML pipeline defined as a DAG. Each step is a containerized component, with transitions representing the flow of artifacts. The training step can be scheduled on specialized GPU nodes.
You define pipelines using the KubeFlow Pipelines SDK in Python. This allows you to use familiar programming constructs to build your workflows. A component is typically created by decorating a Python function.
Consider a simple component for preprocessing data. The function signature uses type annotations from the kfp SDK to declare its inputs and outputs. Here, it takes an input Dataset artifact and produces a new, processed Dataset artifact.
from kfp.dsl import component, Input, Output, Dataset

@component(
    base_image="python:3.9-slim",
    packages_to_install=["pandas==1.5.3", "scikit-learn==1.2.2"]
)
def preprocess_data(
    raw_data: Input[Dataset],
    processed_data: Output[Dataset]
):
    """Loads raw data, performs cleaning, and saves the result."""
    import pandas as pd

    df = pd.read_csv(raw_data.path)

    # Perform cleaning operations
    df.dropna(inplace=True)
    df = df[df['value'] > 0]

    df.to_csv(processed_data.path, index=False)
Notice how the @component decorator specifies the container environment, including the base image and the Python packages to install. KubeFlow uses this information to construct the step's runtime environment at execution time, either by installing the listed packages on top of the base image or by running an existing image that already contains the dependencies, keeping the execution environment isolated and reproducible.
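For steps with heavier dependencies, the second approach is often preferable: point the component at a prebuilt image and omit packages_to_install. The following is a minimal sketch, assuming a hypothetical image my-registry.example.com/preprocess:1.0 that already ships pandas.
from kfp.dsl import component, Input, Output, Dataset

# Assumes a prebuilt image (hypothetical name) that already contains pandas,
# so nothing needs to be installed when the container starts.
@component(base_image="my-registry.example.com/preprocess:1.0")
def preprocess_data_prebuilt(
    raw_data: Input[Dataset],
    processed_data: Output[Dataset]
):
    import pandas as pd

    df = pd.read_csv(raw_data.path)
    df.dropna(inplace=True)
    df.to_csv(processed_data.path, index=False)
Baking dependencies into the image avoids a pip install on every run, which shortens step startup time and pins the environment more strictly.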
With components defined, you assemble them into a pipeline, also using a decorated Python function. The pipeline function defines the flow of data by passing the output of one component as the input to another.
from kfp.dsl import pipeline

# Assume 'train_model_op' is another defined component
# from training_component import train_model_op

@pipeline(
    name="simple-training-pipeline",
    description="A pipeline that preprocesses data and trains a model."
)
def my_training_pipeline(data_path: str):
    # This would be a component to import data from a URL or bucket
    # For simplicity, we'll assume a component 'import_data_op' exists
    import_task = import_data_op(source_url=data_path)

    # Use the preprocessing component defined earlier
    preprocess_task = preprocess_data(
        raw_data=import_task.outputs["dataset"]
    )

    # The training component consumes the preprocessed data
    train_task = train_model_op(
        training_data=preprocess_task.outputs["processed_data"]
    )

    # You can also specify resource requests for a specific task
    train_task.set_gpu_limit(1).set_cpu_limit("8").set_memory_limit("32G")
In this pipeline definition, preprocess_task.outputs["processed_data"] is passed directly to the train_model_op component. KubeFlow handles the underlying mechanics of passing the data, typically by writing the output artifact to a shared object store and providing its path to the next container. The final line demonstrates how you can declaratively request specific hardware for a given step, a feature that integrates directly with the Kubernetes scheduler. The next section on GPU scheduling will cover how the cluster fulfills such a request.
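To execute the pipeline, you compile the Python definition into a static YAML specification and submit it to a KubeFlow Pipelines instance. The following is a minimal sketch, assuming the pipeline function above is in scope and using placeholder values for the API endpoint and the input data path.
from kfp import compiler, Client

# Compile the pipeline function into a portable YAML package
compiler.Compiler().compile(
    pipeline_func=my_training_pipeline,
    package_path="training_pipeline.yaml"
)

# Submit a run; the host URL and data path below are placeholders
client = Client(host="http://localhost:8080")
client.create_run_from_pipeline_package(
    "training_pipeline.yaml",
    arguments={"data_path": "gs://example-bucket/raw_data.csv"}
)
Because the compiled YAML is a self-contained artifact, it can be checked into version control and promoted across environments independently of the Python code that produced it.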
While other workflow orchestrators like Apache Airflow can also manage ML tasks, KubeFlow's advantage lies in its Kubernetes-native architecture. In Airflow, executing a task on Kubernetes often requires using the KubernetesPodOperator, adding a layer of configuration to define pods and resource requests. In KubeFlow, the entire system is built on Kubernetes concepts. This tight integration simplifies defining resource-aware pipelines and managing dependencies in a containerized environment. For ML teams already committed to Kubernetes for training and serving, KubeFlow Pipelines provides a more direct and cohesive orchestration solution.
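For comparison, a rough sketch of what a single containerized training step might look like in Airflow using the KubernetesPodOperator. The image, namespace, and resource values are illustrative placeholders, and the exact import path depends on the version of the CNCF Kubernetes provider installed.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# Illustrative only: image, command, and resource values are placeholders
train_model = KubernetesPodOperator(
    task_id="train_model",
    name="train-model",
    namespace="ml-workloads",
    image="my-registry.example.com/trainer:1.0",
    cmds=["python", "train.py"],
    container_resources=k8s.V1ResourceRequirements(
        limits={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"}
    ),
)
Here the pod specification, image, and resource limits are configured per operator, whereas in KubeFlow the same details are expressed directly on the pipeline task and artifact passing is handled by the framework.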
By using KubeFlow, you transform your ML processes from a series of manual scripts into a version-controlled, automated, and resource-aware system. This provides the foundation needed to manage the complex, multi-step workloads that are characteristic of production machine learning. The next sections will detail how Kubernetes fulfills the resource requests made by these pipelines, from scheduling GPUs to dynamically scaling the cluster itself.