Managing a data lake without visibility into how data moves is effectively flying blind. When a dashboard in the Gold layer reports incorrect revenue figures, engineering teams often spend hours tracing the error back through layers of transformations. They need to answer specific questions: Did the error originate in the Silver layer aggregation logic, or did the raw Bronze ingestion pipeline fail? Data lineage provides the map required for understanding these dependencies.
Lineage is the recording of data origins and the transformations it undergoes over time. In a modern data lake architecture, this is not merely a documentation exercise but a technical requirement for debugging, compliance, and impact analysis. When you alter a schema in the Bronze layer, lineage allows you to programmatically determine which downstream Silver and Gold tables will break.
At a fundamental level, data lineage is represented as a Directed Acyclic Graph (DAG). In this graph, the nodes represent data assets (tables, files, views) and the processes (Spark jobs, SQL queries) that act upon them. The edges represent the flow of data.
We can define a transformation job $J$ as a function that accepts a set of input datasets $\{I_1, \dots, I_n\}$ and produces a set of output datasets $\{O_1, \dots, O_m\}$:

$$J: \{I_1, \dots, I_n\} \rightarrow \{O_1, \dots, O_m\}$$
In a complex environment, the output of one job becomes the input for the next. If job $J_1$ produces dataset $D_1$, and job $J_2$ consumes $D_1$ to produce $D_2$, we establish a lineage chain: $J_1 \rightarrow D_1 \rightarrow J_2 \rightarrow D_2$.
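To make this concrete, the sketch below models such a chain with the networkx library (an assumed dependency; any graph structure would do) and answers a typical impact question: what sits downstream of a Bronze table? The job and dataset names are hypothetical.

# A minimal sketch of a lineage DAG. Nodes are datasets and the jobs that
# transform them; edges point in the direction of data flow.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edge("bronze.orders", "job:clean_orders")    # dataset -> job (input)
lineage.add_edge("job:clean_orders", "silver.orders")    # job -> dataset (output)
lineage.add_edge("silver.orders", "job:revenue_rollup")
lineage.add_edge("job:revenue_rollup", "gold.daily_revenue")

# Impact analysis: everything reachable downstream of a Bronze table.
downstream = nx.descendants(lineage, "bronze.orders")
print(sorted(downstream))
# ['gold.daily_revenue', 'job:clean_orders', 'job:revenue_rollup', 'silver.orders']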
There are two primary levels of granularity when implementing lineage:
Table-level lineage: silver_orders is derived from bronze_orders and bronze_customers. This is sufficient for most impact analysis tasks.

Column-level lineage: the total_revenue column in the Gold layer is calculated using unit_price and quantity from the Silver layer. This is essential for tracking sensitive information (PII) and auditing complex calculations.

The following diagram illustrates a standard lineage flow within a Medallion architecture, showing how a specific ingestion job links raw files to structured tables.
A directed graph representing the flow of data from raw source files through processing jobs to refined tables.
Implementing lineage requires a mechanism to capture these relationships. You cannot rely on manual diagrams as they become obsolete the moment code changes. There are two primary technical approaches to automated collection.
The first approach, static analysis, involves parsing the SQL or code that defines the transformation before it runs. If you use a tool like dbt (data build tool) or execute pure SQL against Trino or Spark SQL, the system parses the query into an Abstract Syntax Tree (AST).
Consider the following SQL statement:
INSERT INTO silver.sales_summary
SELECT customer_id, SUM(amount)
FROM bronze.transactions
GROUP BY customer_id
A parser analyzes this string and identifies:
Input: bronze.transactions
Output: silver.sales_summary
Columns referenced: customer_id and amount

Static analysis is fast and provides lineage even if the job fails. However, it struggles with imperative code (such as Python or Scala scripts in Spark) where the input and output paths might be determined dynamically at runtime.
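As an illustration of the parse-time approach, the sketch below uses the sqlglot library (an assumption; production parsers are considerably more elaborate) to walk the AST of the query above and recover its input, output, and referenced columns.

# A sketch of static lineage extraction with sqlglot.
import sqlglot
from sqlglot import exp

sql = """
INSERT INTO silver.sales_summary
SELECT customer_id, SUM(amount)
FROM bronze.transactions
GROUP BY customer_id
"""

tree = sqlglot.parse_one(sql)

# The INSERT target is attached directly to the Insert expression.
target = tree.this
# Any table referenced inside the SELECT is a source.
sources = [t.sql() for t in tree.expression.find_all(exp.Table)]
# Columns referenced anywhere in the SELECT.
columns = sorted({c.name for c in tree.expression.find_all(exp.Column)})

print("target :", target.sql())   # silver.sales_summary
print("sources:", sources)        # ['bronze.transactions']
print("columns:", columns)        # ['amount', 'customer_id']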
The second approach, runtime collection, captures lineage metadata while the job is executing. This is the preferred method for data lakes running Spark, Flink, or Airflow.
In a Spark environment, you attach a SparkListener to the application context. As the Spark driver plans the execution graph, it resolves the exact physical paths of the input files and the destination directory. The listener captures this "Logical Plan" and emits a lineage event to a central backend.
This method is highly accurate because it records what actually happened, not just what the code intended to do. It captures the specific partitions read and the exact number of rows written.
To prevent vendor lock-in, the industry has converged on OpenLineage, an open standard for lineage collection. OpenLineage defines a JSON schema that describes the "Run" (the execution), the "Job" (the definition), and the "Datasets" (inputs and outputs).
When you implement lineage in your data lake, you typically configure your compute engines (Spark, Trino) to emit OpenLineage events to a compatible backend (like Marquez or DataHub).
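As a sketch, the configuration below attaches the OpenLineage Spark listener to a PySpark session and points it at an HTTP backend such as Marquez. The package version and endpoint URL are illustrative assumptions and should be matched to your Spark and OpenLineage versions.

# Wiring the OpenLineage Spark integration into a session (illustrative values).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("silver_processing_job")
    # Pull in the OpenLineage Spark agent (version is an assumption).
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.9.1")
    # Attach the listener that captures the logical plan at runtime.
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Emit lineage events over HTTP to the backend.
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://marquez:5000")
    .config("spark.openlineage.namespace", "production_analytics")
    .getOrCreate()
)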
Below is a simplified representation of an OpenLineage event that a Spark job might emit upon completion.
{
"eventType": "COMPLETE",
"eventTime": "2023-10-27T10:00:00.000Z",
"run": {
"runId": "d46d8fba-23a8-4f9d-9b23-...",
"facets": {
"parent": { "job": { "name": "daily_etl_workflow" } }
}
},
"job": {
"namespace": "production_analytics",
"name": "silver_processing_job"
},
"inputs": [{
"namespace": "s3://data-lake-bucket",
"name": "bronze/orders",
"facets": {
"schema": { "fields": [ { "name": "order_id", "type": "STRING" } ] }
}
}],
"outputs": [{
"namespace": "s3://data-lake-bucket",
"name": "silver/orders_cleaned",
"facets": {
"schema": { "fields": [ { "name": "order_id", "type": "STRING" } ] }
}
}]
}
This JSON structure allows the metadata catalog to stitch together independent job runs into a cohesive graph. The runId links the specific execution instance, while the inputs and outputs arrays provide the connecting edges of the graph.
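A backend might reduce each event to graph edges along the following lines. The sketch assumes the event has already been parsed into a Python dictionary with the shape shown above.

# A minimal sketch of turning one OpenLineage event into lineage edges.
def edges_from_event(event: dict) -> list[tuple[str, str]]:
    job = f'{event["job"]["namespace"]}.{event["job"]["name"]}'
    edges = []
    for ds in event.get("inputs", []):
        edges.append((f'{ds["namespace"]}/{ds["name"]}', job))   # dataset -> job
    for ds in event.get("outputs", []):
        edges.append((job, f'{ds["namespace"]}/{ds["name"]}'))   # job -> dataset
    return edges

# For the event above this yields:
# [('s3://data-lake-bucket/bronze/orders', 'production_analytics.silver_processing_job'),
#  ('production_analytics.silver_processing_job', 's3://data-lake-bucket/silver/orders_cleaned')]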
Once lineage events are collected, they must be stored in a way that supports graph traversals. Traditional relational databases are often inefficient for querying deep hierarchical relationships (e.g., finding all upstream ancestors of a node 10 levels deep).
Therefore, lineage backends often utilize Graph Databases (like Neo4j or Amazon Neptune) or specialized relational schemas optimized for recursive queries.
When you view the "Lineage" tab in a catalog tool like AWS Glue Data Catalog or DataHub, the backend executes a traversal query. For a given node $N$, it finds all nodes $M$ where a path $M \rightarrow N$ exists (upstream lineage) or a path $N \rightarrow M$ exists (downstream lineage).
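The sketch below imitates that upstream traversal over a plain edge list; the names are hypothetical, and a real backend would push this work into a graph database or a recursive query rather than application code.

# A minimal sketch of upstream (ancestor) traversal over lineage edges.
edges = [
    ("bronze/orders", "job:silver_processing_job"),
    ("job:silver_processing_job", "silver/orders_cleaned"),
    ("silver/orders_cleaned", "job:revenue_rollup"),
    ("job:revenue_rollup", "gold/daily_revenue"),
]

def upstream(node: str) -> set[str]:
    """Return every ancestor of `node`, however many levels deep."""
    parents = {src for src, dst in edges if dst == node}
    ancestors = set(parents)
    for parent in parents:
        ancestors |= upstream(parent)
    return ancestors

print(sorted(upstream("gold/daily_revenue")))
# ['bronze/orders', 'job:revenue_rollup', 'job:silver_processing_job', 'silver/orders_cleaned']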
Lineage implementation is directly tied to the governance discussions from earlier in this chapter. By integrating lineage with Role-Based Access Control (RBAC), you can enforce propagation policies.
For example, if a Bronze column is tagged as PII:True, lineage allows the governance engine to automatically tag any downstream Silver or Gold column derived from it as PII:True. This ensures that sensitive data does not accidentally "leak" into a widely accessible reporting table simply because a data engineer renamed the column during a transformation.
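A governance engine might apply that propagation rule along the following lines. The column-level edges and the tag store are simplified, hypothetical stand-ins for a real catalog.

# A sketch of PII tag propagation over column-level lineage.
from collections import deque

# column -> columns directly derived from it (hypothetical names)
column_edges = {
    "bronze.customers.email": ["silver.customers.email_clean"],
    "silver.customers.email_clean": ["gold.marketing.contact_email"],
}

def propagate_tag(source_column: str, tag: str, tags: dict[str, set]) -> None:
    """Copy a tag to every column reachable downstream of source_column."""
    queue = deque([source_column])
    while queue:
        col = queue.popleft()
        tags.setdefault(col, set()).add(tag)
        queue.extend(column_edges.get(col, []))

tags: dict[str, set] = {}
propagate_tag("bronze.customers.email", "PII:True", tags)
# gold.marketing.contact_email now carries PII:True even though it was renamed.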