Standardization is the mechanism that allows distinct systems to communicate effectively. Data lineage, which tracks the origin and movement of data, can be categorized into static and dynamic forms. Static lineage relies on parsing code to determine dependencies, while dynamic lineage requires the data infrastructure itself to report activity as it happens. For this reporting to be useful, the diverse tools in a modern data stack (orchestrators like Airflow, computation engines like Spark, and warehouses like Snowflake) must speak a common language.
OpenLineage provides this language. It is an open API specification that defines how to track data lineage through the lifecycle of a job. Rather than building peer-to-peer integrations between every tool and every catalog, OpenLineage decouples the collection of metadata from its consumption. Tools emit lineage events in a standardized JSON format, and any backend compatible with the standard can consume, store, and visualize that information.
The OpenLineage specification models data processing through three primary entities: the Job, the Run, and the Dataset. Understanding the relationship between these entities is necessary for implementing lineage tracking correctly.

- Job: the definition of a process that consumes and produces datasets, identified by a name within a namespace (for example, an Airflow task or a Spark application).
- Run: a single execution of a job, identified by a globally unique run ID (a UUID).
- Dataset: an abstract representation of data, such as a table or a directory of files, identified by a namespace and name.
This model transforms the abstract flow of data into a directed graph. A run consumes one or more input datasets and produces one or more output datasets.
The relationship between core OpenLineage entities showing how execution binds inputs to outputs.
Communication in OpenLineage occurs through Run Events. When a data pipeline executes, the integration (such as an Airflow operator or a Spark listener) sends asynchronous events to a backend. A typical lifecycle involves sending a START event when the job begins and a COMPLETE or FAIL event when it finishes.
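As a sketch of this lifecycle, the snippet below constructs a START and a COMPLETE event as plain Python dictionaries. The job name, namespace, and producer URI are placeholders; the important detail is that both events carry the same runId, which is how the backend correlates them into a single run.

import uuid
from datetime import datetime, timezone

# A single run ID ties the START and COMPLETE events together.
run_id = str(uuid.uuid4())

def build_event(event_type: str) -> dict:
    """Build a minimal run event; the job and producer values are placeholders."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "production_warehouse", "name": "daily_revenue_aggregation"},
        "producer": "https://example.com/my-pipeline",
    }

start_event = build_event("START")
# ... the actual data processing runs here ...
complete_event = build_event("COMPLETE")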
A standard run event contains the current state of the job, the unique identifier for that specific run, and the inputs and outputs involved. We can define a simplified run event mathematically as a tuple:

$$E_{run} = (s, r, j, I, O, t)$$

Where:

- $s$ is the run state (such as START, COMPLETE, or FAIL),
- $r$ is the unique identifier of the run (a UUID),
- $j$ is the job, identified by its namespace and name,
- $I$ is the set of input datasets,
- $O$ is the set of output datasets,
- $t$ is the event time.
The JSON payload for a run event strictly follows the schema defined by the standard. Below is a structural example of a COMPLETE event indicating that a job has finished writing to a table.
{
  "eventType": "COMPLETE",
  "eventTime": "2023-10-27T14:23:01.52Z",
  "run": {
    "runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
  },
  "job": {
    "namespace": "production_warehouse",
    "name": "daily_revenue_aggregation"
  },
  "inputs": [
    {
      "namespace": "postgres://db.prod:5432",
      "name": "public.orders"
    }
  ],
  "outputs": [
    {
      "namespace": "snowflake://account.region",
      "name": "analytics.revenue_report"
    }
  ]
}
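As a minimal sketch of how such a payload reaches a backend, the snippet below POSTs it over HTTP using the requests library. The endpoint shown is the one Marquez exposes by default; other backends may use a different path, so treat the URL as an assumption. The short timeout and broad exception handling keep a metadata failure from interrupting the pipeline itself.

import requests

# Assumed backend endpoint; Marquez, for example, accepts run events at /api/v1/lineage.
LINEAGE_URL = "http://localhost:5000/api/v1/lineage"

def emit_event(event: dict) -> None:
    """POST a run event to the lineage backend; never let a failure block the job."""
    try:
        response = requests.post(LINEAGE_URL, json=event, timeout=2)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Lineage emission is best-effort: log the problem and continue.
        print(f"Could not emit lineage event: {exc}")

# emit_event(complete_event)  # e.g. the COMPLETE payload above, built as a dict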
The core model captures the shape of the graph (what connects to what). However, engineering teams often need more granular detail. They need to know the schema of the table, the number of rows written, the SQL query executed, or the version of the code used.
OpenLineage handles this through Facets. A facet is an atomic piece of metadata attached to a Job, Run, or Dataset. Facets are modular and extensible. If a specific integration cannot collect column-level lineage, it simply omits that facet while still reporting the dataset-level dependencies.
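For example, a dataset can carry the standard schema facet alongside its namespace and name. The sketch below shows the general shape as a Python dictionary; the column list is hypothetical, and the _producer and _schemaURL values are placeholders you would replace with your producer URI and the facet's published schema URL.

# An output dataset carrying a `schema` facet (column list and URLs are illustrative).
output_dataset = {
    "namespace": "snowflake://account.region",
    "name": "analytics.revenue_report",
    "facets": {
        "schema": {
            "_producer": "https://example.com/my-pipeline",
            "_schemaURL": "https://example.com/facets/SchemaDatasetFacet.json",
            "fields": [
                {"name": "order_date", "type": "DATE"},
                {"name": "total_revenue", "type": "DECIMAL(18,2)"},
            ],
        }
    },
}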
Facets are grouped by the entity they describe:
- Job facets: static properties of the job definition, such as the SQL query (sql) or the source code location (sourceCodeLocation).
- Run facets: details of a specific execution, such as the scheduled time (nominalTime), the batch ID, or the query plan.
- Dataset facets: descriptions of the data itself, such as the schema (schema), column statistics (stats), or data quality assertions.

This modularity allows the standard to evolve without breaking existing consumers. If you need to track custom metrics, such as "cost per query," you can define a custom facet without altering the core specification.
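A custom facet is just another keyed block under the entity it describes. The sketch below shows a hypothetical cost-per-query facet attached to a run; the facet key, fields, and URLs are all invented for illustration, with only the _producer and _schemaURL entries following the convention every facet shares.

# A hypothetical custom run facet reporting the cost of the executed query.
run_section = {
    "runId": "d46e465b-d358-4d32-83d4-df660ff614dd",
    "facets": {
        "myCompany_costPerQuery": {
            "_producer": "https://example.com/my-pipeline",
            "_schemaURL": "https://example.com/facets/CostPerQueryRunFacet.json",
            "currency": "USD",
            "amount": 0.042,
        }
    },
}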
Facets attach specific metadata blocks to the core lineage entities, providing context on schema, statistics, and timing.
Implementing OpenLineage requires adherence to strict naming conventions to ensure the lineage graph connects correctly. Since a dataset might be accessed by Spark using one connection string and by Presto using another, consistent naming is important for graph integrity.
The standard mandates a namespace and name pair for every dataset.
- Namespace: identifies the source system or environment that holds the data (for example, postgres://db-prod:5432 or s3://data-lake-bucket).
- Name: identifies the dataset within that namespace (for example, public.users or /raw/events/2023/).

When configuring your producers (the tools emitting the events), you must keep the namespace resolution logic consistent across the stack. If Airflow calls the warehouse "Snowflake-Prod" and dbt calls it "SNOWFLAKE_RAW," the lineage graph will show two disconnected nodes.
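One practical way to keep resolution consistent is to route every producer's configuration through a single mapping from tool-specific aliases to canonical namespaces. The helper below is purely illustrative; the aliases and namespaces are hypothetical.

# Illustrative helper: map the aliases different tools use for the same physical
# system onto one canonical OpenLineage namespace.
CANONICAL_NAMESPACES = {
    "Snowflake-Prod": "snowflake://account.region",   # alias used by Airflow (hypothetical)
    "SNOWFLAKE_RAW": "snowflake://account.region",    # alias used by dbt (hypothetical)
    "postgres-prod": "postgres://db-prod:5432",
}

def resolve_namespace(alias: str) -> str:
    """Return the canonical namespace for a tool-specific alias."""
    try:
        return CANONICAL_NAMESPACES[alias]
    except KeyError as exc:
        raise ValueError(f"No canonical namespace configured for alias '{alias}'") from exc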
The transmission architecture is push-based. The client (the data tool) implements an OpenLineage integration that listens for internal events and translates them into the OpenLineage JSON format. These are then POSTed to an HTTP endpoint exposed by the backend (such as Marquez, Atlan, or DataHub). This design minimizes the performance impact on the pipeline, as the metadata emission happens asynchronously and failure to send metadata does not block the data processing job.
The decoupling of producers and consumers via this standard is what enables the "observability" aspect of modern engineering. We no longer just log that a job failed; we capture the graph state at the moment of failure, including the input schemas and the specific data version that caused the error.