You've learned about the individual components that make up a TFX pipeline, such as ExampleGen, Transform, and Trainer. Each component performs a specific step in the machine learning workflow. However, defining these components and their dependencies is only part of the story. To run these components as a cohesive, automated, and reliable system, especially in production, you need an orchestrator.
An orchestrator is a system responsible for scheduling, executing, and monitoring the tasks (TFX components) within your pipeline. It interprets the dependencies you've defined between components and ensures they run in the correct order. If a component fails, the orchestrator can handle retries or alert you to the problem. It manages the overall lifecycle of your pipeline runs.
Think of TFX components as the individual instructions in a complex recipe. The orchestrator is the chef who reads the recipe, gathers the ingredients (data artifacts), performs each step in sequence (executes components), and manages the kitchen (compute resources).
While TFX provides a LocalDagRunner for running pipelines directly in a local environment (useful for development and testing), production environments demand more sophisticated management. An orchestrator provides scheduling for recurring runs, automatic retries on failure, monitoring and alerting, and management of the compute resources each component needs.
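For reference, a local run takes only a couple of lines. A minimal sketch, assuming a pipeline_definition object like the one constructed later in this section:

from tfx.orchestration.local.local_dag_runner import LocalDagRunner

# Executes each component in-process, in dependency order.
LocalDagRunner().run(pipeline_definition)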
TFX is designed to be orchestrator-agnostic, meaning you can run your TFX pipelines on various platforms. The core TFX pipeline definition remains largely the same, but you use a specific Runner class or configuration to target your chosen orchestrator. Two popular choices in the open-source community are Apache Airflow and Kubeflow Pipelines.
Apache Airflow is a widely adopted, general-purpose workflow orchestrator. Pipelines in Airflow are defined as Directed Acyclic Graphs (DAGs) using Python. TFX provides tools to compile your Python-based TFX pipeline definition into an Airflow DAG.
Use tfx.orchestration.airflow.AirflowDagRunner to compile your TFX pipeline into an Airflow DAG file.
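A minimal sketch of such a DAG file, assuming Airflow is installed and pipeline_definition is constructed as shown later in this section (the schedule settings here are placeholders):

import datetime

from tfx.orchestration.airflow.airflow_dag_runner import AirflowDagRunner
from tfx.orchestration.airflow.airflow_dag_runner import AirflowPipelineConfig

# Airflow-specific settings: no recurring schedule, start from a fixed date.
airflow_config = AirflowPipelineConfig(airflow_dag_config={
    'schedule_interval': None,
    'start_date': datetime.datetime(2024, 1, 1),
})

# Placing this file in Airflow's dags/ folder lets the scheduler discover the DAG.
DAG = AirflowDagRunner(airflow_config).run(pipeline_definition)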
Kubeflow Pipelines is designed specifically for orchestrating machine learning workflows on Kubernetes. It runs each pipeline step in its own container, providing strong isolation and environment management. Use tfx.orchestration.kubeflow.KubeflowDagRunner (or KubeflowV2DagRunner for the newer KFP v2 API) to compile your pipeline into a YAML file that can be uploaded to and run on a Kubeflow Pipelines deployment.
The tfx.orchestration.LocalDagRunner executes components sequentially in your local Python environment and is suitable only for development or small-scale experiments.
Regardless of the orchestrator, you define your TFX pipeline in Python by instantiating components and connecting them through their inputs and outputs.
# Example snippet (conceptual)
from tfx.components import CsvExampleGen, StatisticsGen, SchemaGen, ExampleValidator, Transform, Trainer
from tfx.orchestration import pipeline
# Assume data_path, transform_module_file, and trainer_module_file are defined
# 1. Instantiate Components
example_gen = CsvExampleGen(input_base=data_path)
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'], infer_feature_shape=True)
example_validator = ExampleValidator(statistics=statistics_gen.outputs['statistics'], schema=schema_gen.outputs['schema'])
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=transform_module_file,
)
trainer = Trainer(
    module_file=trainer_module_file,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
)
# ... add Evaluator, Pusher etc.
# 2. Define the list of components
components = [
    example_gen,
    statistics_gen,
    schema_gen,
    example_validator,
    transform,
    trainer,
    # ...
]
# 3. Create the pipeline definition
pipeline_definition = pipeline.Pipeline(
    pipeline_name='my_tfx_pipeline',
    pipeline_root='/path/to/pipeline/artifacts',  # Root for pipeline artifacts, typically on shared or cloud storage
    components=components,
    enable_cache=True,
    metadata_connection_config=None,  # Configure the ML Metadata connection here
    # beam_pipeline_args is relevant for Dataflow/Spark runners
)
# 4. Compile for a specific orchestrator (in a separate script/CLI step)
# Example compilation command (conceptual):
# tfx pipeline compile --engine kubeflow --pipeline_path my_pipeline_definition.py --output_path my_pipeline.yaml
# or using a Runner in Python:
# from tfx.orchestration.kubeflow import KubeflowDagRunner
# KubeflowDagRunner().run(pipeline_definition)
The Python code defines the structure and logic. The compilation step translates this definition into the specific format required by the chosen orchestrator (e.g., an Airflow DAG Python file, a KFP YAML file). The orchestrator then takes this compiled definition and executes it.
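For example, a compilation script targeting Kubeflow Pipelines with the KFP v2 runner could look roughly like this (a sketch; the output file name and the default runner config are assumptions):

from tfx.orchestration.kubeflow.v2.kubeflow_v2_dag_runner import KubeflowV2DagRunner
from tfx.orchestration.kubeflow.v2.kubeflow_v2_dag_runner import KubeflowV2DagRunnerConfig

# Writes a pipeline spec file that can be uploaded to a Kubeflow Pipelines deployment.
runner = KubeflowV2DagRunner(
    config=KubeflowV2DagRunnerConfig(),
    output_filename='my_tfx_pipeline.yaml',  # older TFX versions emit a JSON spec instead
)
runner.run(pipeline_definition)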
A typical dependency graph for a TFX pipeline, as managed by an orchestrator. Components run once their inputs (artifacts shown as labels) are available.
Selecting an orchestrator depends on several factors, such as your existing infrastructure (for example, an established Airflow deployment or a Kubernetes cluster), the scale and frequency of your pipeline runs, and your team's familiarity with each tool.
By leveraging an orchestrator, you transform your TFX component definitions into a manageable, automated, and production-ready machine learning system. This allows you to focus on improving your models and data, while the orchestrator handles the operational complexities of running the pipeline consistently and reliably.