Raw lineage metadata provides a record of system events but is not sufficient for comprehensive health analysis. While a repository of JSON events describing inputs and outputs tracks what happened, it does not immediately answer critical questions about system health. For example, to determine "which dashboards will break if this table is delayed," these events must be organized into a structured model.
We model the data platform as a directed graph. In this structure, the components of your data stack (tables, streams, dashboards, and machine learning models) become nodes, while the pipelines and transformation jobs that move data between them become edges. This graph representation allows us to apply standard traversal algorithms to determine root causes and downstream impacts programmatically.
We define our data ecosystem as a directed graph $G = (V, E)$.
The set of vertices $V$ represents the data assets and the compute units. To maintain clarity in the graph, we distinguish between two types of nodes:

Dataset nodes: tables, files, streams, dashboards, and models that hold data at rest.
Job nodes: the pipelines and transformation tasks that read from and write to those datasets.
The set of edges $E$ represents the flow of information. An edge exists from a dataset to a job if the dataset is an input, and from a job to a dataset if the dataset is an output.
Formally, if a job $J$ reads from dataset $A$ and writes to dataset $B$, our graph contains two directed edges:

$$A \to J \quad \text{and} \quad J \to B$$
This bipartite structure, where edges strictly alternate between datasets and jobs, is necessary for accurate lineage. Connecting Table A directly to Table B ($A \to B$) obscures the transformation logic that connects them. By including the job node ($A \to J \to B$), we preserve metadata regarding how the data was transformed, including the specific version of the code and the runtime parameters used.
A bipartite dependency graph showing the flow from raw data through compute jobs to downstream dashboards.
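As a concrete sketch with illustrative node names, the adjacency structure for a single job contains two edges and never a direct dataset-to-dataset link:

# Illustrative edges for one job: dataset -> job -> dataset.
edges = {
    "raw_sales": {"job_build_fact_sales"},   # input dataset -> job
    "job_build_fact_sales": {"fact_sales"},  # job -> output dataset
}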
Building this graph requires processing the stream of lineage events discussed in the previous section. Whether using OpenLineage or a custom format, the parsing logic remains consistent. We read a stream of events, extract the inputs and outputs, and update the graph structure (often represented as an Adjacency List).
In a Python implementation, we can use a dictionary to store the adjacency list. The keys represent the upstream nodes, and the values are sets of downstream nodes.
Consider a simplified OpenLineage event:
event = {
    "job": {"name": "spark_process_sales"},
    "inputs": [{"name": "s3://landing/sales.csv"}],
    "outputs": [{"name": "snowflake://db/public/fact_sales"}]
}
To construct the graph from this event, we parse the relationships and upsert them into our structure. We iterate through inputs to link them to the job, and then link the job to the outputs.
class LineageGraph:
    def __init__(self):
        # Adjacency list: node -> set of downstream nodes
        self.graph = {}

    def add_edge(self, source, target):
        if source not in self.graph:
            self.graph[source] = set()
        self.graph[source].add(target)
        # Ensure target exists in graph even if it has no children
        if target not in self.graph:
            self.graph[target] = set()

    def process_event(self, event):
        job_name = event['job']['name']
        # Link inputs -> job
        for input_node in event.get('inputs', []):
            self.add_edge(input_node['name'], job_name)
        # Link job -> outputs
        for output_node in event.get('outputs', []):
            self.add_edge(job_name, output_node['name'])

# Usage
tracker = LineageGraph()
tracker.process_event(event)
This approach builds a static snapshot of the lineage based on the most recent events. In production systems, you often persist this data in a graph database (like Neo4j or Amazon Neptune) or a relational database with recursive query capabilities to handle the scale of thousands of nodes.
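As a minimal sketch of the relational option, the snippet below persists edges in SQLite, whose WITH RECURSIVE support stands in for a production warehouse. The lineage_edges table and the dashboard node name are illustrative assumptions.

import sqlite3

# In-memory database standing in for a production metadata store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage_edges (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO lineage_edges VALUES (?, ?)",
    [
        ("s3://landing/sales.csv", "spark_process_sales"),
        ("spark_process_sales", "snowflake://db/public/fact_sales"),
        ("snowflake://db/public/fact_sales", "dashboard_daily_revenue"),
    ],
)

# Recursive CTE: every node reachable downstream of a starting asset.
descendants = conn.execute(
    """
    WITH RECURSIVE downstream(node) AS (
        SELECT target FROM lineage_edges WHERE source = ?
        UNION
        SELECT e.target FROM lineage_edges e
        JOIN downstream d ON e.source = d.node
    )
    SELECT node FROM downstream
    """,
    ("s3://landing/sales.csv",),
).fetchall()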
Most data pipelines are designed as Directed Acyclic Graphs (DAGs), meaning data flows forward and never loops back to a previous step. However, cycles can occur in complex environments. For example, a "Customer 360" table might be built from raw logs, but the raw logs ingestion might query the previous day's "Customer 360" table to deduplicate users.
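Before choosing a strategy, it is useful to check whether the static graph contains a cycle at all. The helper below is a standard depth-first check over the adjacency list we built earlier; it is a generic algorithm, not part of any lineage library.

def has_cycle(graph):
    # Depth-first search tracking the nodes on the current path.
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back-edge: node is already on the current path
        visiting.add(node)
        for child in graph.get(node, ()):
            if visit(child):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(node) for node in list(graph))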
When constructing the graph, we must decide how to handle these cycles. For reliability engineering, we are usually interested in the execution DAG rather than the abstract definition. If Job A ran at 10:00 AM and Job B ran at 10:05 AM using Job A's output, the dependency is clear. If Job A runs again at 11:00 AM using Job B's output, it is a new instance.
To model this accurately, we often append a run identifier or timestamp to the job nodes in the graph, as sketched below.
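A minimal sketch of this versioning, assuming the event carries an OpenLineage-style run section with a runId (adjust the field name to match your event format):

class VersionedLineageGraph(LineageGraph):
    def process_event(self, event):
        # Append the run identifier so each execution is a distinct node,
        # e.g. "spark_process_sales#<run-uuid>". Assumes an OpenLineage-style
        # payload exposing event["run"]["runId"].
        run_id = event.get("run", {}).get("runId", "unknown")
        job_node = f"{event['job']['name']}#{run_id}"
        for input_node in event.get("inputs", []):
            self.add_edge(input_node["name"], job_node)
        for output_node in event.get("outputs", []):
            self.add_edge(job_node, output_node["name"])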
This creates a time-aware graph in which we can trace a dataset's current state back to the exact run that produced it, including the code version and execution time recorded for that run.
Once the graph is constructed, we use it to solve reliability problems. The two primary operations are upstream traversal (Root Cause Analysis) and downstream traversal (Impact Analysis).
When an anomaly is detected in a dataset (for example, a data quality test fails), we need to find the source. We perform a breadth-first or depth-first search with the edge directions reversed, walking from the affected dataset back toward its ancestors.
This traversal identifies every table and job that contributed to the current state of the affected dataset. By intersecting this list with recent alert logs, we can pinpoint whether a failure in an upstream raw ingestion job caused the metric deviation in the final report.
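A sketch of that reverse traversal, reusing the adjacency list from LineageGraph: we invert the edges once, then walk breadth-first from the affected dataset toward its ancestors.

from collections import deque

def get_upstream_nodes(graph, start_node):
    # Invert the adjacency list so edges point from downstream to upstream.
    reversed_graph = {}
    for source, targets in graph.items():
        for target in targets:
            reversed_graph.setdefault(target, set()).add(source)

    # Breadth-first search from the affected dataset toward its ancestors.
    visited = {start_node}
    queue = deque([start_node])
    while queue:
        node = queue.popleft()
        for parent in reversed_graph.get(node, ()):
            if parent not in visited:
                visited.add(parent)
                queue.append(parent)
    return visited - {start_node}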
Before deploying a schema change or deprecating a table, we perform downstream traversal. This identifies the "Blast Radius" of a change.
Implementing this effectively requires a graph traversal. The version below is iterative, using an explicit stack, which avoids hitting Python's recursion limit on deep dependency chains.
def get_impacted_nodes(graph, start_node):
    visited = set()
    stack = [start_node]
    while stack:
        node = stack.pop()
        if node not in visited:
            visited.add(node)
            # Add all children to the stack
            children = graph.get(node, [])
            stack.extend(children)
    return visited
This function returns start_node together with every asset that depends on it. In a governance context, this programmatic check can run in a CI/CD pipeline. If a developer attempts to drop a column that is referenced by any node in that descendant set, the pipeline can automatically reject the change request.
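A sketch of such a CI gate, using the get_impacted_nodes function above; the graph contents and the protected asset names are illustrative.

# Hypothetical CI gate: reject a change whose blast radius touches a
# protected asset. Node names here are illustrative.
graph = {
    "fact_sales": {"dashboard_daily_revenue"},
    "dashboard_daily_revenue": set(),
}
protected = {"dashboard_daily_revenue", "ml_churn_model"}

impacted = get_impacted_nodes(graph, "fact_sales")
if impacted & protected:
    raise SystemExit(f"Change rejected: impacts {sorted(impacted & protected)}")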
The examples above describe table-level lineage. While useful, it lacks precision. If you have a table with 200 columns and you change one column, table-level lineage will mark every downstream dashboard as "impacted," even if those dashboards only use the other 199 columns.
Column-level lineage treats every column as a node in the graph. The edges represent transformation logic within the SQL or DataFrame operations.
Constructing column-level lineage is significantly more complex as it requires parsing the abstract syntax tree (AST) of the SQL queries to understand which input columns contribute to which output columns. Tools like SQLGlot or OpenLineage's column-level facets are used to automate this extraction. Despite the complexity, the graph construction logic remains the same: we simply increase the number of nodes and edges to represent the finer granularity.
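To make this concrete with the structures we already have, the sketch below loads a hand-written column mapping into the same LineageGraph class; in practice, a parser such as SQLGlot would produce these pairs automatically. The table and column names are illustrative.

# Hypothetical column mapping for a query such as:
#   SELECT order_id, amount * fx_rate AS revenue FROM raw_orders
column_edges = [
    ("raw_orders.order_id", "fact_sales.order_id"),
    ("raw_orders.amount", "fact_sales.revenue"),
    ("raw_orders.fx_rate", "fact_sales.revenue"),
]

column_graph = LineageGraph()
for source_col, target_col in column_edges:
    # At this granularity, each edge connects an input column to the
    # output column it feeds; the job can still be recorded as edge metadata.
    column_graph.add_edge(source_col, target_col)

# Impact analysis on fx_rate now flags only revenue-related nodes,
# not every consumer of the raw_orders table.
print(get_impacted_nodes(column_graph.graph, "raw_orders.fx_rate"))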