Raw lineage metadata provides a record of system events but is not sufficient for comprehensive health analysis. While a repository of JSON events describing inputs and outputs tracks what happened, it does not immediately answer critical questions about system health. For example, to determine "which dashboards will break if this table is delayed," these events must be organized into a structured model.
We model the data platform as a directed graph. In this structure, the components of your data stack (tables, streams, dashboards, and machine learning models) become nodes, while the pipelines and transformation jobs that move data between them become edges. This graph representation allows us to apply standard traversal algorithms to determine root causes and downstream impacts programmatically.
We define our data ecosystem as a directed graph $G = (V, E)$.
The set of vertices $V$ represents the data assets and the compute units. To maintain clarity in the graph, we distinguish between two types of nodes:

Dataset nodes: tables, files, streams, dashboards, and models that hold data at rest.
Job nodes: the pipelines and transformation tasks that read from and write to those datasets.
The set of edges $E$ represents the flow of information. An edge exists from a dataset to a job if the dataset is an input, and from a job to a dataset if the dataset is an output.
Formally, if a job $J$ reads from dataset $A$ and writes to dataset $B$, our graph contains two directed edges:

$$A \to J \quad \text{and} \quad J \to B$$
This bipartite structure, where edges strictly alternate between datasets and jobs, is necessary for accurate lineage. Connecting Table A directly to Table B ($A \to B$) obscures the transformation logic that connects them. By including the job node ($A \to J \to B$), we preserve metadata regarding how the data was transformed, including the specific version of the code and the runtime parameters used.
A bipartite dependency graph showing the flow from raw data through compute jobs to downstream dashboards.
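As a concrete sketch with illustrative node names, the adjacency structure for a single job contains two edges and never a direct dataset-to-dataset link:

# Illustrative edges for one job: dataset -> job -> dataset.
edges = {
    "raw_sales": {"job_build_fact_sales"},   # input dataset -> job
    "job_build_fact_sales": {"fact_sales"},  # job -> output dataset
}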
Building this graph requires processing the stream of lineage events discussed in the previous section. Whether using OpenLineage or a custom format, the parsing logic remains consistent. We read a stream of events, extract the inputs and outputs, and update the graph structure (often represented as an Adjacency List).
In a Python implementation, we can use a dictionary to store the adjacency list. The keys represent the upstream nodes, and the values are sets of downstream nodes.
Consider a simplified OpenLineage event:
event = {
    "job": {"name": "spark_process_sales"},
    "inputs": [{"name": "s3://landing/sales.csv"}],
    "outputs": [{"name": "snowflake://db/public/fact_sales"}]
}
To construct the graph from this event, we parse the relationships and upsert them into our structure. We iterate through inputs to link them to the job, and then link the job to the outputs.
class LineageGraph:
    def __init__(self):
        # Adjacency list: node -> set of downstream nodes
        self.graph = {}

    def add_edge(self, source, target):
        if source not in self.graph:
            self.graph[source] = set()
        self.graph[source].add(target)
        # Ensure target exists in graph even if it has no children
        if target not in self.graph:
            self.graph[target] = set()

    def process_event(self, event):
        job_name = event['job']['name']
        # Link inputs -> job
        for input_node in event.get('inputs', []):
            self.add_edge(input_node['name'], job_name)
        # Link job -> outputs
        for output_node in event.get('outputs', []):
            self.add_edge(job_name, output_node['name'])

# Usage
tracker = LineageGraph()
tracker.process_event(event)
This approach builds a static snapshot of the lineage based on the most recent events. In production systems, you often persist this data in a graph database (like Neo4j or Amazon Neptune) or a relational database with recursive query capabilities to handle the scale of thousands of nodes.
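As a minimal sketch of the relational option, the snippet below persists edges in SQLite, whose WITH RECURSIVE support stands in for a production warehouse. The lineage_edges table and the dashboard node name are illustrative assumptions.

import sqlite3

# In-memory database standing in for a production metadata store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage_edges (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO lineage_edges VALUES (?, ?)",
    [
        ("s3://landing/sales.csv", "spark_process_sales"),
        ("spark_process_sales", "snowflake://db/public/fact_sales"),
        ("snowflake://db/public/fact_sales", "dashboard_daily_revenue"),
    ],
)

# Recursive CTE: every node reachable downstream of a starting asset.
descendants = conn.execute(
    """
    WITH RECURSIVE downstream(node) AS (
        SELECT target FROM lineage_edges WHERE source = ?
        UNION
        SELECT e.target FROM lineage_edges e
        JOIN downstream d ON e.source = d.node
    )
    SELECT node FROM downstream
    """,
    ("s3://landing/sales.csv",),
).fetchall()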
Most data pipelines are designed as Directed Acyclic Graphs (DAGs), meaning data flows forward and never loops back to a previous step. However, cycles can occur in complex environments. For example, a "Customer 360" table might be built from raw logs, but the raw logs ingestion might query the previous day's "Customer 360" table to deduplicate users.
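Before choosing a strategy, it is useful to check whether the static graph contains a cycle at all. The helper below is a standard depth-first check over the adjacency list we built earlier; it is a generic algorithm, not part of any lineage library.

def has_cycle(graph):
    # Depth-first search tracking the nodes on the current path.
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back-edge: node is already on the current path
        visiting.add(node)
        for child in graph.get(node, ()):
            if visit(child):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(node) for node in list(graph))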
When constructing the graph, we must decide how to handle these cycles. For reliability engineering, we are usually interested in the execution DAG rather than the abstract definition. If Job A ran at 10:00 AM and Job B ran at 10:05 AM using Job A's output, the dependency is clear. If Job A runs again at 11:00 AM using Job B's output, it is a new instance.
To model this accurately, we often append a run identifier or timestamp to the job nodes in the graph, as sketched below.
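A minimal sketch of this versioning, assuming the event carries an OpenLineage-style run section with a runId (adjust the field name to match your event format):

class VersionedLineageGraph(LineageGraph):
    def process_event(self, event):
        # Append the run identifier so each execution is a distinct node,
        # e.g. "spark_process_sales#<run-uuid>". Assumes an OpenLineage-style
        # payload exposing event["run"]["runId"].
        run_id = event.get("run", {}).get("runId", "unknown")
        job_node = f"{event['job']['name']}#{run_id}"
        for input_node in event.get("inputs", []):
            self.add_edge(input_node["name"], job_node)
        for output_node in event.get("outputs", []):
            self.add_edge(job_node, output_node["name"])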
This creates a time-aware graph in which we can trace a dataset's current state back to the exact run that produced it, including the code version and execution time recorded for that run.
Once the graph is constructed, we use it to solve reliability problems. The two primary operations are upstream traversal (Root Cause Analysis) and downstream traversal (Impact Analysis).
When an anomaly is detected in a dataset (for example, a data quality test fails), we need to find the source. We perform a breadth-first or depth-first search with the edge directions reversed, walking from the affected dataset back toward its ancestors.
This traversal identifies every table and job that contributed to the current state of the affected dataset. By intersecting this list with recent alert logs, we can pinpoint whether a failure in an upstream raw ingestion job caused the metric deviation in the final report.
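A sketch of that reverse traversal, reusing the adjacency list from LineageGraph: we invert the edges once, then walk breadth-first from the affected dataset toward its ancestors.

from collections import deque

def get_upstream_nodes(graph, start_node):
    # Invert the adjacency list so edges point from downstream to upstream.
    reversed_graph = {}
    for source, targets in graph.items():
        for target in targets:
            reversed_graph.setdefault(target, set()).add(source)

    # Breadth-first search from the affected dataset toward its ancestors.
    visited = {start_node}
    queue = deque([start_node])
    while queue:
        node = queue.popleft()
        for parent in reversed_graph.get(node, ()):
            if parent not in visited:
                visited.add(parent)
                queue.append(parent)
    return visited - {start_node}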
Before deploying a schema change or deprecating a table, we perform downstream traversal. This identifies the "Blast Radius" of a change.
Implementing this effectively requires a graph traversal. The version below is iterative, using an explicit stack, which avoids hitting Python's recursion limit on deep dependency chains.
def get_impacted_nodes(graph, start_node):
    visited = set()
    stack = [start_node]
    while stack:
        node = stack.pop()
        if node not in visited:
            visited.add(node)
            # Add all children to the stack
            children = graph.get(node, [])
            stack.extend(children)
    return visited
This function returns start_node together with every asset that depends on it. In a governance context, this programmatic check can run in a CI/CD pipeline. If a developer attempts to drop a column that is referenced by any node in that descendant set, the pipeline can automatically reject the change request.
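A sketch of such a CI gate, using the get_impacted_nodes function above; the graph contents and the protected asset names are illustrative.

# Hypothetical CI gate: reject a change whose blast radius touches a
# protected asset. Node names here are illustrative.
graph = {
    "fact_sales": {"dashboard_daily_revenue"},
    "dashboard_daily_revenue": set(),
}
protected = {"dashboard_daily_revenue", "ml_churn_model"}

impacted = get_impacted_nodes(graph, "fact_sales")
if impacted & protected:
    raise SystemExit(f"Change rejected: impacts {sorted(impacted & protected)}")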
The examples above describe table-level lineage. While useful, it lacks precision. If you have a table with 200 columns and you change one column, table-level lineage will mark every downstream dashboard as "impacted," even if those dashboards only use the other 199 columns.
Column-level lineage treats every column as a node in the graph. The edges represent transformation logic within the SQL or DataFrame operations.
Constructing column-level lineage is significantly more complex as it requires parsing the abstract syntax tree (AST) of the SQL queries to understand which input columns contribute to which output columns. Tools like SQLGlot or OpenLineage's column-level facets are used to automate this extraction. Despite the complexity, the graph construction logic remains the same: we simply increase the number of nodes and edges to represent the finer granularity.
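To make this concrete with the structures we already have, the sketch below loads a hand-written column mapping into the same LineageGraph class; in practice, a parser such as SQLGlot would produce these pairs automatically. The table and column names are illustrative.

# Hypothetical column mapping for a query such as:
#   SELECT order_id, amount * fx_rate AS revenue FROM raw_orders
column_edges = [
    ("raw_orders.order_id", "fact_sales.order_id"),
    ("raw_orders.amount", "fact_sales.revenue"),
    ("raw_orders.fx_rate", "fact_sales.revenue"),
]

column_graph = LineageGraph()
for source_col, target_col in column_edges:
    # At this granularity, each edge connects an input column to the
    # output column it feeds; the job can still be recorded as edge metadata.
    column_graph.add_edge(source_col, target_col)

# Impact analysis on fx_rate now flags only revenue-related nodes,
# not every consumer of the raw_orders table.
print(get_impacted_nodes(column_graph.graph, "raw_orders.fx_rate"))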