Data pipelines operate in a dynamic environment where upstream application teams frequently modify database structures to support new product features. These modifications range from adding fields to renaming columns or changing data types. If a data ingestion pipeline assumes a static schema, any deviation in the source structure can cause job failures or data corruption. Schema evolution refers to the ability of a data lake system to accommodate changes in data structure over time without requiring a full rewrite of existing data.
In a traditional relational database management system (RDBMS), schema changes are handled via DDL commands like ALTER TABLE. The database engine manages the metadata and physical storage synchronously. In a data lake, storage is decoupled from compute and metadata. You might have terabytes of historical data stored in immutable Parquet files with schema version A, while incoming data arrives with schema version B. The challenge lies in unifying these versions into a single logical table that query engines can read consistently.
Early Hadoop architectures relied heavily on schema-on-read. In this model, data is landed in its raw format without validation. The schema is applied only when the data is queried. While this ensures ingestion never fails due to structure mismatches, it pushes the complexity of handling errors to the query layer. If a column type changes from integer to string, the query engine may return nulls or throw runtime errors.
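To make this trade-off concrete, here is a minimal schema-on-read sketch. The path and column names are hypothetical; the point is that the schema is applied only when the data is read, so a type drift in the source surfaces at query time rather than at ingestion.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Raw JSON was landed without validation; the schema is applied only now,
# at read time. If upstream changed order_id from an integer to a string
# like "A-1001", the mismatched records surface as NULLs (or corrupt-record
# entries) in the query result instead of failing the ingestion job.
expected_schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("status", StringType()),
])

orders = spark.read.schema(expected_schema).json("/lake/raw/orders/")
orders.show()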
Modern data lake architectures typically employ schema-on-write or schema enforcement for the Silver and Gold layers. This approach validates data against a defined schema before writing to storage. If the data does not match, the system must decide whether to reject the record or evolve the target schema.
When designing ingestion pipelines, it is important to categorize changes based on their impact on downstream consumers.
Additive changes introduce a new column in the source. Historical files lack the column, so queries over older data return NULL for it. This requires the file format (like Parquet) and the query engine to support schema merging.
Type changes alter the data type of an existing column. Widening a type, such as int to long, is usually safe (upcasting). Changing from string to int is unsafe as existing data may not fit the new type.
Raw file formats like Parquet store the schema in the file footer. However, they do not inherently know about the schema of other files in the same directory. Open table formats like Apache Iceberg and Delta Lake introduce a metadata layer that manages schema evolution more robustly than raw files.
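As an illustration of schema merging across raw Parquet files, Spark can union the file footers under a directory at read time. The path here is hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# mergeSchema asks Spark to union the footers of every Parquet file under
# the path into one read schema; files written before a column existed
# simply return NULL for it.
events = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("/lake/bronze/events/")
)
events.printSchema()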
A critical distinction in handling schema evolution is how columns are identified.
Name-based resolution relies on the column name. If you rename a column from user_id to customer_id, the system interprets this as dropping user_id and adding a new column customer_id with all null values for historical data. This is common in Spark and raw Parquet workflows.
ID-based resolution assigns a unique integer ID to every column. Apache Iceberg uses this method. If user_id (ID: 1) is renamed to customer_id, the ID remains 1. The metadata simply updates the display name. The underlying data is correctly linked regardless of the name change, allowing for safe renaming without rewriting files.
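As a sketch of this behavior, the rename from the example above is a metadata-only operation in Iceberg. This assumes a Spark session (spark) already configured with an Iceberg catalog; the catalog and table names are hypothetical.

# Rename is metadata-only: column ID 1 keeps pointing at the same data,
# so no files are rewritten and historical rows remain readable.
spark.sql("ALTER TABLE lake.sales.orders RENAME COLUMN user_id TO customer_id")

spark.sql("SELECT customer_id FROM lake.sales.orders LIMIT 5").show()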
The diagram illustrates how a query engine utilizes a metadata catalog to unify different file versions. File A lacks the "region" column, while File B includes it. The catalog presents a unified schema to the user.
When implementing ingestion jobs using frameworks like Apache Spark, you must explicitly configure how schema mismatches are handled.
In batch processing, you can enable schema merging. When the writer detects a new column in the source dataframe that does not exist in the target, it updates the target metadata to include the new column.
For example, in Delta Lake, this is controlled via the mergeSchema option:
# Append new data to a Delta table, allowing additive schema changes.
# source_dataframe is the incoming batch; /path/to/table is the Delta target.
(source_dataframe.write
    .format("delta")
    .option("mergeSchema", "true")   # add any new columns to the table schema
    .mode("append")
    .save("/path/to/table"))
This allows additive changes to propagate automatically. However, it poses a risk: if an upstream system accidentally sends garbage columns, your data lake table will become polluted with hundreds of useless columns.
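One mitigation is to intersect the incoming columns with an approved allowlist before writing, so only known fields ever reach the writer. The allowlist and target path below are illustrative.

# Guard against column pollution: even with mergeSchema enabled, only
# columns on the allowlist are written. New columns must first be added
# to APPROVED_COLUMNS by a human.
APPROVED_COLUMNS = {"order_id", "status", "region", "updated_at"}

def append_approved(source_df, target_path):
    selected = [c for c in source_df.columns if c in APPROVED_COLUMNS]
    (source_df.select(*selected)
        .write
        .format("delta")
        .option("mergeSchema", "true")
        .mode("append")
        .save(target_path))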
Streaming pipelines present a unique challenge because they run continuously. Restarting a stream to handle a schema change can lead to downtime.
Some engines support schema evolution in streaming by checking the schema of each micro-batch. If a change is detected and is compatible (like a new column), the job updates the metadata and continues. If the change is incompatible (like a type mismatch), the stream typically fails to prevent data corruption.
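The sketch below shows one way to tolerate additive changes in a streaming job by delegating each micro-batch to the batch writer, where mergeSchema applies. It assumes an active SparkSession (spark) with Delta Lake configured; the paths are hypothetical.

def evolve_and_append(batch_df, batch_id):
    # Compatible, additive changes widen the target schema; incompatible
    # changes (such as a type conflict) still fail the batch and the stream.
    (batch_df.write
        .format("delta")
        .option("mergeSchema", "true")
        .mode("append")
        .save("/lake/silver/events"))

query = (
    spark.readStream
    .format("delta")
    .load("/lake/bronze/events")
    .writeStream
    .foreachBatch(evolve_and_append)
    .option("checkpointLocation", "/lake/_checkpoints/silver_events")
    .start()
)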
For strict environments where schema changes must be manually approved, the dead letter queue (DLQ) pattern is preferred: records that do not match the expected schema are diverted to a quarantine location for review instead of evolving the target table automatically.
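A minimal DLQ routing sketch follows. The column names and paths are illustrative; the quarantine table is allowed to evolve so it can absorb whatever shape the bad batch arrives in, while the Silver table is not.

from pyspark.sql import functions as F

EXPECTED_COLUMNS = {"order_id", "status", "region"}

def route_batch(incoming_df, silver_path, dlq_path):
    unexpected = set(incoming_df.columns) - EXPECTED_COLUMNS
    if unexpected:
        # Park the batch in the DLQ with a reason so an engineer can review
        # and approve the schema change before it reaches the Silver table.
        (incoming_df
            .withColumn("_dlq_reason", F.lit(f"unexpected columns: {sorted(unexpected)}"))
            .write.format("delta").mode("append")
            .option("mergeSchema", "true")
            .save(dlq_path))
    else:
        (incoming_df.select(*sorted(EXPECTED_COLUMNS))
            .write.format("delta").mode("append")
            .save(silver_path))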
When a query engine reads a dataset spanning multiple partitions with different schemas, it performs a schema union operation.
Let $S_t$ be the schema of the table at time $t$. Let $S_f$ be the schema of a specific file $f$ within the table.
For a query to be valid over a set of files $F$, the query engine constructs a read schema $S_{read}$ such that:
$$S_{read} = \bigcup_{f \in F} S_f$$
When reading a specific file $f$, for every column $c \in S_{read}$:
$$value(c) = \begin{cases} \text{data from } f & \text{if } c \in S_f \\ \text{NULL} & \text{if } c \notin S_f \end{cases}$$
This logic allows the data lake to present a coherent view of the data despite underlying physical fragmentation.
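The same union-with-NULL behavior can be mimicked at the DataFrame level with unionByName, which serves here only as an illustration of the engine's logic. The data values are made up; the allowMissingColumns flag requires Spark 3.1 or later.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

file_a = spark.createDataFrame([(1, "pending")], ["order_id", "status"])                   # schema v1
file_b = spark.createDataFrame([(2, "shipped", "EU")], ["order_id", "status", "region"])   # schema v2

# Columns missing from one side are filled with NULL, just as the read
# schema union fills columns absent from older files.
unified = file_a.unionByName(file_b, allowMissingColumns=True)
unified.show()
# order_id | status  | region
#        1 | pending | null
#        2 | shipped | EU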
A specific edge case involves changes to partition columns. If a table is partitioned by date, and the business logic changes to partition by date and region, this is a significant physical layout change.
Most table formats do not support rewriting existing partitioning layouts on the fly. The standard approach involves creating a new table with the desired partition specification, backfilling it by rewriting the historical data, and then switching readers and writers over to the new table.
Apache Iceberg excels here through a feature called Hidden Partitioning. It allows partition transforms (like bucketing or truncating) to evolve without rewriting the table. The split planning phase of the query execution handles the mapping between the logical query filter and the physical partition layout.
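For instance, Iceberg exposes partition evolution as a metadata operation through its Spark SQL extensions. The catalog and table names below are hypothetical, and the session is assumed to be configured with the Iceberg extensions.

# Add a partition field: existing files keep the old layout, new writes use
# the new one, and split planning maps query filters onto both.
spark.sql("ALTER TABLE lake.sales.orders ADD PARTITION FIELD region")

# Historical data can optionally be compacted into the new layout later,
# for example with Iceberg's rewrite_data_files procedure.
spark.sql("CALL lake.system.rewrite_data_files(table => 'sales.orders')")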