Standardization is the mechanism that allows distinct systems to communicate effectively. Data lineage, which tracks the origin and movement of data, can be categorized into static and dynamic forms. Static lineage relies on parsing code to determine dependencies, while dynamic lineage requires the data infrastructure itself to report activity as it happens. For this reporting to be useful, the diverse tools in a modern data stack (orchestrators like Airflow, computation engines like Spark, and warehouses like Snowflake) must speak a common language.
OpenLineage provides this language. It is an open API specification that defines how to track data lineage through the lifecycle of a job. Rather than building peer-to-peer integrations between every tool and every catalog, OpenLineage decouples the collection of metadata from its consumption. Tools emit lineage events in a standardized JSON format, and any backend compatible with the standard can consume, store, and visualize that information.
The OpenLineage specification models data processing through three primary entities: the Job, the Run, and the Dataset. Understanding the relationship between these entities is necessary for implementing lineage tracking correctly.

- Job: the definition of a process that consumes and produces datasets, identified by a name within a namespace (for example, an Airflow task or a Spark application).
- Run: a single execution of a job, identified by a globally unique run ID (a UUID).
- Dataset: an abstract representation of data, such as a table or a directory of files, identified by a namespace and name.
This model transforms the abstract flow of data into a directed graph. A run consumes one or more input datasets and produces one or more output datasets.
The relationship between core OpenLineage entities showing how execution binds inputs to outputs.
Communication in OpenLineage occurs through Run Events. When a data pipeline executes, the integration (such as an Airflow operator or a Spark listener) sends asynchronous events to a backend. A typical lifecycle involves sending a START event when the job begins and a COMPLETE or FAIL event when it finishes.
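As a sketch of this lifecycle, the snippet below constructs a START and a COMPLETE event as plain Python dictionaries. The job name, namespace, and producer URI are placeholders; the important detail is that both events carry the same runId, which is how the backend correlates them into a single run.

import uuid
from datetime import datetime, timezone

# A single run ID ties the START and COMPLETE events together.
run_id = str(uuid.uuid4())

def build_event(event_type: str) -> dict:
    """Build a minimal run event; the job and producer values are placeholders."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "production_warehouse", "name": "daily_revenue_aggregation"},
        "producer": "https://example.com/my-pipeline",
    }

start_event = build_event("START")
# ... the actual data processing runs here ...
complete_event = build_event("COMPLETE")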
A standard run event contains the current state of the job, the unique identifier for that specific run, and the inputs and outputs involved. We can define a simplified run event mathematically as a tuple:

$$E_{run} = (s, r, j, I, O, t)$$

Where:

- $s$ is the run state (such as START, COMPLETE, or FAIL),
- $r$ is the unique identifier of the run (a UUID),
- $j$ is the job, identified by its namespace and name,
- $I$ is the set of input datasets,
- $O$ is the set of output datasets,
- $t$ is the event time.
The JSON payload for a run event strictly follows the schema defined by the standard. Below is a structural example of a COMPLETE event indicating that a job has finished writing to a table.
{
  "eventType": "COMPLETE",
  "eventTime": "2023-10-27T14:23:01.52Z",
  "run": {
    "runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
  },
  "job": {
    "namespace": "production_warehouse",
    "name": "daily_revenue_aggregation"
  },
  "inputs": [
    {
      "namespace": "postgres://db.prod:5432",
      "name": "public.orders"
    }
  ],
  "outputs": [
    {
      "namespace": "snowflake://account.region",
      "name": "analytics.revenue_report"
    }
  ]
}
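As a minimal sketch of how such a payload reaches a backend, the snippet below POSTs it over HTTP using the requests library. The endpoint shown is the one Marquez exposes by default; other backends may use a different path, so treat the URL as an assumption. The short timeout and broad exception handling keep a metadata failure from interrupting the pipeline itself.

import requests

# Assumed backend endpoint; Marquez, for example, accepts run events at /api/v1/lineage.
LINEAGE_URL = "http://localhost:5000/api/v1/lineage"

def emit_event(event: dict) -> None:
    """POST a run event to the lineage backend; never let a failure block the job."""
    try:
        response = requests.post(LINEAGE_URL, json=event, timeout=2)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Lineage emission is best-effort: log the problem and continue.
        print(f"Could not emit lineage event: {exc}")

# emit_event(complete_event)  # e.g. the COMPLETE payload above, built as a dict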
The core model captures the shape of the graph (what connects to what). However, engineering teams often need more granular detail. They need to know the schema of the table, the number of rows written, the SQL query executed, or the version of the code used.
OpenLineage handles this through Facets. A facet is an atomic piece of metadata attached to a Job, Run, or Dataset. Facets are modular and extensible. If a specific integration cannot collect column-level lineage, it simply omits that facet while still reporting the dataset-level dependencies.
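For example, a dataset can carry the standard schema facet alongside its namespace and name. The sketch below shows the general shape as a Python dictionary; the column list is hypothetical, and the _producer and _schemaURL values are placeholders you would replace with your producer URI and the facet's published schema URL.

# An output dataset carrying a `schema` facet (column list and URLs are illustrative).
output_dataset = {
    "namespace": "snowflake://account.region",
    "name": "analytics.revenue_report",
    "facets": {
        "schema": {
            "_producer": "https://example.com/my-pipeline",
            "_schemaURL": "https://example.com/facets/SchemaDatasetFacet.json",
            "fields": [
                {"name": "order_date", "type": "DATE"},
                {"name": "total_revenue", "type": "DECIMAL(18,2)"},
            ],
        }
    },
}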
Facets are grouped by the entity they describe:
- Job facets: static properties of the job definition, such as the SQL query (sql) or the source code location (sourceCodeLocation).
- Run facets: details of a specific execution, such as the scheduled time (nominalTime), the batch ID, or the query plan.
- Dataset facets: descriptions of the data itself, such as the schema (schema), column statistics (stats), or data quality assertions.

This modularity allows the standard to evolve without breaking existing consumers. If you need to track custom metrics, such as "cost per query," you can define a custom facet without altering the core specification.
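A custom facet is just another keyed block under the entity it describes. The sketch below shows a hypothetical cost-per-query facet attached to a run; the facet key, fields, and URLs are all invented for illustration, with only the _producer and _schemaURL entries following the convention every facet shares.

# A hypothetical custom run facet reporting the cost of the executed query.
run_section = {
    "runId": "d46e465b-d358-4d32-83d4-df660ff614dd",
    "facets": {
        "myCompany_costPerQuery": {
            "_producer": "https://example.com/my-pipeline",
            "_schemaURL": "https://example.com/facets/CostPerQueryRunFacet.json",
            "currency": "USD",
            "amount": 0.042,
        }
    },
}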
Facets attach specific metadata blocks to the core lineage entities, providing context on schema, statistics, and timing.
Implementing OpenLineage requires adherence to strict naming conventions to ensure the lineage graph connects correctly. Since a dataset might be accessed by Spark using one connection string and by Presto using another, consistent naming is important for graph integrity.
The standard mandates a namespace and name pair for every dataset.
- Namespace: identifies the source system or environment that holds the data (for example, postgres://db-prod:5432 or s3://data-lake-bucket).
- Name: identifies the dataset within that namespace (for example, public.users or /raw/events/2023/).

When configuring your producers (the tools emitting the events), you must keep the namespace resolution logic consistent across the stack. If Airflow calls the warehouse "Snowflake-Prod" and dbt calls it "SNOWFLAKE_RAW," the lineage graph will show two disconnected nodes.
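One practical way to keep resolution consistent is to route every producer's configuration through a single mapping from tool-specific aliases to canonical namespaces. The helper below is purely illustrative; the aliases and namespaces are hypothetical.

# Illustrative helper: map the aliases different tools use for the same physical
# system onto one canonical OpenLineage namespace.
CANONICAL_NAMESPACES = {
    "Snowflake-Prod": "snowflake://account.region",   # alias used by Airflow (hypothetical)
    "SNOWFLAKE_RAW": "snowflake://account.region",    # alias used by dbt (hypothetical)
    "postgres-prod": "postgres://db-prod:5432",
}

def resolve_namespace(alias: str) -> str:
    """Return the canonical namespace for a tool-specific alias."""
    try:
        return CANONICAL_NAMESPACES[alias]
    except KeyError as exc:
        raise ValueError(f"No canonical namespace configured for alias '{alias}'") from exc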
The transmission architecture is push-based. The client (the data tool) implements an OpenLineage integration that listens for internal events and translates them into the OpenLineage JSON format. These are then POSTed to an HTTP endpoint exposed by the backend (such as Marquez, Atlan, or DataHub). This design minimizes the performance impact on the pipeline, as the metadata emission happens asynchronously and failure to send metadata does not block the data processing job.
The decoupling of producers and consumers via this standard is what enables the "observability" aspect of modern engineering. We no longer just log that a job failed; we capture the graph state at the moment of failure, including the input schemas and the specific data version that caused the error.