Data pipelines operate in a dynamic environment where upstream application teams frequently modify database structures to support new product features. These modifications range from adding fields to renaming columns or changing data types. If a data ingestion pipeline assumes a static schema, any deviation in the source structure can cause job failures or data corruption. Schema evolution refers to the ability of a data lake system to accommodate changes in data structure over time without requiring a full rewrite of existing data.
In a traditional relational database management system (RDBMS), schema changes are handled via DDL commands like ALTER TABLE. The database engine manages the metadata and physical storage synchronously. In a data lake, storage is decoupled from compute and metadata. You might have terabytes of historical data stored in immutable Parquet files with schema version A, while incoming data arrives with schema version B. The challenge lies in unifying these versions into a single logical table that query engines can read consistently.
Early Hadoop architectures relied heavily on schema-on-read. In this model, data is landed in its raw format without validation. The schema is applied only when the data is queried. While this ensures ingestion never fails due to structure mismatches, it pushes the complexity of handling errors to the query layer. If a column type changes from integer to string, the query engine may return nulls or throw runtime errors.
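To make this trade-off concrete, here is a minimal schema-on-read sketch. The path and column names are hypothetical; the point is that the schema is applied only when the data is read, so a type drift in the source surfaces at query time rather than at ingestion.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Raw JSON was landed without validation; the schema is applied only now,
# at read time. If upstream changed order_id from an integer to a string
# like "A-1001", the mismatched records surface as NULLs (or corrupt-record
# entries) in the query result instead of failing the ingestion job.
expected_schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("status", StringType()),
])

orders = spark.read.schema(expected_schema).json("/lake/raw/orders/")
orders.show()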
Modern data lake architectures typically employ schema-on-write or schema enforcement for the Silver and Gold layers. This approach validates data against a defined schema before writing to storage. If the data does not match, the system must decide whether to reject the record or evolve the target schema.
When designing ingestion pipelines, it is important to categorize changes based on their impact on downstream consumers.
Additive changes introduce a new column in the source. Historical files lack the column, so queries over older data return NULL for it. This requires the file format (like Parquet) and the query engine to support schema merging.
Type changes alter the data type of an existing column. Widening a type, such as int to long, is usually safe (upcasting). Changing from string to int is unsafe as existing data may not fit the new type.
Raw file formats like Parquet store the schema in the file footer. However, they do not inherently know about the schema of other files in the same directory. Open table formats like Apache Iceberg and Delta Lake introduce a metadata layer that manages schema evolution more robustly than raw files.
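As an illustration of schema merging across raw Parquet files, Spark can union the file footers under a directory at read time. The path here is hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# mergeSchema asks Spark to union the footers of every Parquet file under
# the path into one read schema; files written before a column existed
# simply return NULL for it.
events = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("/lake/bronze/events/")
)
events.printSchema()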
A critical distinction in handling schema evolution is how columns are identified.
Name-based resolution relies on the column name. If you rename a column from user_id to customer_id, the system interprets this as dropping user_id and adding a new column customer_id with all null values for historical data. This is common in Spark and raw Parquet workflows.
ID-based resolution assigns a unique integer ID to every column. Apache Iceberg uses this method. If user_id (ID: 1) is renamed to customer_id, the ID remains 1. The metadata simply updates the display name. The underlying data is correctly linked regardless of the name change, allowing for safe renaming without rewriting files.
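As a sketch of this behavior, the rename from the example above is a metadata-only operation in Iceberg. This assumes a Spark session (spark) already configured with an Iceberg catalog; the catalog and table names are hypothetical.

# Rename is metadata-only: column ID 1 keeps pointing at the same data,
# so no files are rewritten and historical rows remain readable.
spark.sql("ALTER TABLE lake.sales.orders RENAME COLUMN user_id TO customer_id")

spark.sql("SELECT customer_id FROM lake.sales.orders LIMIT 5").show()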
The diagram illustrates how a query engine utilizes a metadata catalog to unify different file versions. File A lacks the "region" column, while File B includes it. The catalog presents a unified schema to the user.
When implementing ingestion jobs using frameworks like Apache Spark, you must explicitly configure how schema mismatches are handled.
In batch processing, you can enable schema merging. When the writer detects a new column in the source dataframe that does not exist in the target, it updates the target metadata to include the new column.
For example, in Delta Lake, this is controlled via the mergeSchema option:
# Append new data to a Delta table, allowing additive schema changes.
# source_dataframe is the incoming batch; /path/to/table is the Delta target.
(source_dataframe.write
    .format("delta")
    .option("mergeSchema", "true")   # add any new columns to the table schema
    .mode("append")
    .save("/path/to/table"))
This allows additive changes to propagate automatically. However, it poses a risk: if an upstream system accidentally sends garbage columns, your data lake table will become polluted with hundreds of useless columns.
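One mitigation is to intersect the incoming columns with an approved allowlist before writing, so only known fields ever reach the writer. The allowlist and target path below are illustrative.

# Guard against column pollution: even with mergeSchema enabled, only
# columns on the allowlist are written. New columns must first be added
# to APPROVED_COLUMNS by a human.
APPROVED_COLUMNS = {"order_id", "status", "region", "updated_at"}

def append_approved(source_df, target_path):
    selected = [c for c in source_df.columns if c in APPROVED_COLUMNS]
    (source_df.select(*selected)
        .write
        .format("delta")
        .option("mergeSchema", "true")
        .mode("append")
        .save(target_path))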
Streaming pipelines present a unique challenge because they run continuously. Restarting a stream to handle a schema change can lead to downtime.
Some engines support schema evolution in streaming by checking the schema of each micro-batch. If a change is detected and is compatible (like a new column), the job updates the metadata and continues. If the change is incompatible (like a type mismatch), the stream typically fails to prevent data corruption.
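The sketch below shows one way to tolerate additive changes in a streaming job by delegating each micro-batch to the batch writer, where mergeSchema applies. It assumes an active SparkSession (spark) with Delta Lake configured; the paths are hypothetical.

def evolve_and_append(batch_df, batch_id):
    # Compatible, additive changes widen the target schema; incompatible
    # changes (such as a type conflict) still fail the batch and the stream.
    (batch_df.write
        .format("delta")
        .option("mergeSchema", "true")
        .mode("append")
        .save("/lake/silver/events"))

query = (
    spark.readStream
    .format("delta")
    .load("/lake/bronze/events")
    .writeStream
    .foreachBatch(evolve_and_append)
    .option("checkpointLocation", "/lake/_checkpoints/silver_events")
    .start()
)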
For strict environments where schema changes must be manually approved, the dead letter queue (DLQ) pattern is preferred: records that do not match the expected schema are diverted to a quarantine location for review instead of evolving the target table automatically.
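A minimal DLQ routing sketch follows. The column names and paths are illustrative; the quarantine table is allowed to evolve so it can absorb whatever shape the bad batch arrives in, while the Silver table is not.

from pyspark.sql import functions as F

EXPECTED_COLUMNS = {"order_id", "status", "region"}

def route_batch(incoming_df, silver_path, dlq_path):
    unexpected = set(incoming_df.columns) - EXPECTED_COLUMNS
    if unexpected:
        # Park the batch in the DLQ with a reason so an engineer can review
        # and approve the schema change before it reaches the Silver table.
        (incoming_df
            .withColumn("_dlq_reason", F.lit(f"unexpected columns: {sorted(unexpected)}"))
            .write.format("delta").mode("append")
            .option("mergeSchema", "true")
            .save(dlq_path))
    else:
        (incoming_df.select(*sorted(EXPECTED_COLUMNS))
            .write.format("delta").mode("append")
            .save(silver_path))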
When a query engine reads a dataset spanning multiple partitions with different schemas, it performs a schema union operation.
Let $S_t$ be the schema of the table at time $t$. Let $S_f$ be the schema of a specific file $f$ within the table.
For a query to be valid over a set of files $F$, the query engine constructs a read schema $S_{read}$ such that:
$$S_{read} = \bigcup_{f \in F} S_f$$
When reading a specific file $f$, for every column $c \in S_{read}$:
$$value(c) = \begin{cases} \text{data from } f & \text{if } c \in S_f \\ \text{NULL} & \text{if } c \notin S_f \end{cases}$$
This logic allows the data lake to present a coherent view of the data despite underlying physical fragmentation.
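The same union-with-NULL behavior can be mimicked at the DataFrame level with unionByName, which serves here only as an illustration of the engine's logic. The data values are made up; the allowMissingColumns flag requires Spark 3.1 or later.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

file_a = spark.createDataFrame([(1, "pending")], ["order_id", "status"])                   # schema v1
file_b = spark.createDataFrame([(2, "shipped", "EU")], ["order_id", "status", "region"])   # schema v2

# Columns missing from one side are filled with NULL, just as the read
# schema union fills columns absent from older files.
unified = file_a.unionByName(file_b, allowMissingColumns=True)
unified.show()
# order_id | status  | region
#        1 | pending | null
#        2 | shipped | EU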
A specific edge case involves changes to partition columns. If a table is partitioned by date, and the business logic changes to partition by date and region, this is a significant physical layout change.
Most table formats do not support rewriting existing partitioning layouts on the fly. The standard approach involves creating a new table with the desired partition specification, backfilling it by rewriting the historical data, and then switching readers and writers over to the new table.
Apache Iceberg excels here through a feature called Hidden Partitioning. It allows partition transforms (like bucketing or truncating) to evolve without rewriting the table. The split planning phase of the query execution handles the mapping between the logical query filter and the physical partition layout.
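For instance, Iceberg exposes partition evolution as a metadata operation through its Spark SQL extensions. The catalog and table names below are hypothetical, and the session is assumed to be configured with the Iceberg extensions.

# Add a partition field: existing files keep the old layout, new writes use
# the new one, and split planning maps query filters onto both.
spark.sql("ALTER TABLE lake.sales.orders ADD PARTITION FIELD region")

# Historical data can optionally be compacted into the new layout later,
# for example with Iceberg's rewrite_data_files procedure.
spark.sql("CALL lake.system.rewrite_data_files(table => 'sales.orders')")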