Long-running streaming applications inevitably face a challenge that batch processes rarely encounter: business logic and data structures change while the application continues to run. A batch process can simply re-process the raw input data with new code, but in a stateful streaming architecture the 'truth' is often held in the intermediate state stored in RocksDB or on the heap. When deploying a new version of a Flink application, it must be able to read the binary state written by the previous version. If the class definitions of state objects have been modified, the binary data may no longer match the deserialization logic, leading to critical failures.
This interoperability between past checkpoints and present code is managed through state schema evolution. Flink creates a bridge between the schema used to write the data and the schema used to read it, but this bridge relies on specific serialization frameworks and strict compatibility rules.
Flink does not merely store raw data bytes in a checkpoint. Alongside the state values, Flink persists the configuration of the serializer used to write that data. This metadata is encapsulated in a TypeSerializerSnapshot. When a job restarts from a savepoint, Flink checks the compatibility between the TypeSerializerSnapshot stored in the savepoint and the TypeSerializer configured in the new application code.
The compatibility check results in one of several outcomes:

- Compatible as is: the new serializer can read the old bytes directly, and the job resumes without extra work.
- Compatible after migration: the old bytes must first be read with the previous serializer and rewritten with the new one.
- Compatible with a reconfigured serializer: the new serializer must be reconfigured using information from the snapshot before it can read the state.
- Incompatible: the new serializer cannot interpret the state, and the restore fails.
The following diagram illustrates the decision flow Flink executes during a restart operation to ensure state integrity.
Compatibility decision tree executed during a Flink job restart.
Flink's native POJO serializer supports a moderate degree of schema evolution. Unlike Java native serialization, which is brittle and heavily reliant on serialVersionUID, the PojoSerializer inspects the structure of the class. It allows for:

- Removing fields: the values of removed fields are dropped when old state is read and will not appear in future checkpoints.
- Adding fields: new fields are initialized with default values (0 for integers, null for objects) when reading old state.

For this evolution to work, the POJO rules must be strictly followed. The class must be public, have a public no-argument constructor, and all fields must be either public or accessible via standard getters and setters. If Flink falls back to the generic Kryo serializer because the class does not adhere to POJO specifications, schema evolution capabilities are significantly reduced, often requiring a complete state reset.
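To make the rules concrete, here is a minimal sketch of a state class that the PojoSerializer can evolve. The class UserSession and its fields are hypothetical examples, not part of any Flink API.

```java
// Hypothetical state class that follows Flink's POJO rules, so the PojoSerializer
// (rather than the Kryo fallback) is used and limited schema evolution stays available.
public class UserSession {

    // Fields are public; private fields with standard getters/setters would also qualify.
    public String userId;
    public long clickCount;

    // Field added in a later version of the application. When restoring state written
    // before this field existed, the PojoSerializer initializes it to null.
    public String lastLogin;

    // A public no-argument constructor is mandatory for Flink to treat this as a POJO.
    public UserSession() {}
}
```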
For production environments managing terabytes of state, relying on implicit POJO evolution is risky. Explicit schema management using Avro or Protobuf is the standard for state evolution. These formats decouple the data definition from the Java class implementation.
When using Avro with Flink, the AvroSerializer can handle schema changes by leveraging Avro's built-in resolution logic. This requires that the new schema is backward compatible with the old schema.
To ensure backward compatibility when modifying an Avro schema:

- Only add new fields that declare a default value. When the new reader encounters a record written by the old writer (which lacks the field), it populates the field with the default.
- Do not change the type of an existing field unless Avro defines a promotion for it (for example, int to long).
- Avoid renaming fields; if a rename is unavoidable, declare the old name as an alias.

Consider a user profile state object. We can define the relationship between the old schema $S_{old}$ and the new schema $S_{new}$. If we add a field email_verified to $S_{new}$, the deserialization process for an old record $r_{old}$ can be expressed as:

$$r_{new} = \mathrm{resolve}(S_{old}, S_{new})(r_{old}), \qquad r_{new}.\mathrm{email\_verified} = d_{default}$$

If the default value $d_{default}$ is missing in the definition of $S_{new}$, the compatibility check fails because $r_{new}$ cannot be constructed deterministically.
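The following sketch demonstrates this rule with Avro's own compatibility checker, outside of Flink. The UserProfile schema and the email_verified default are illustrative, not taken from the text above.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class AvroCompatibilityCheck {
    public static void main(String[] args) {
        // Old writer schema: the version used when the state was checkpointed.
        Schema oldSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"UserProfile\",\"fields\":["
          + "{\"name\":\"user_id\",\"type\":\"string\"}]}");

        // New reader schema: adds email_verified WITH a default value.
        Schema newSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"UserProfile\",\"fields\":["
          + "{\"name\":\"user_id\",\"type\":\"string\"},"
          + "{\"name\":\"email_verified\",\"type\":\"boolean\",\"default\":false}]}");

        // Reader/writer resolution: reader = new schema, writer = old schema.
        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(newSchema, oldSchema);

        // Prints COMPATIBLE; without the default, old records could not be resolved.
        System.out.println(result.getType());
    }
}
```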
For complex custom types or when high-performance manual serialization is used, you must implement the TypeSerializerSnapshot interface. This interface acts as the source of truth for schema versioning.
When implementing a custom serializer, you define a snapshot class that writes the configuration parameters of your serializer (not the data) to the checkpoint. Upon restore, resolveSchemaCompatibility is called.
Here is the logic flow you must implement within resolveSchemaCompatibility:

1. Read the configuration that the old snapshot persisted (field layout, format version, and similar parameters).
2. Compare it against the configuration of the serializer registered in the new application code.
3. Return TypeSerializerSchemaCompatibility.compatibleAsIs() if nothing changed, or TypeSerializerSchemaCompatibility.compatibleAfterMigration() if the format is different but convertible.

If you return compatibleAfterMigration(), Flink rewrites the state during the restore: it reads every key-value pair using the old serializer and writes it back using the new serializer. While this ensures correctness, it lengthens the restore time in proportion to the size of the state.
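A minimal sketch of this pattern is shown below. It assumes the pre-1.19 resolveSchemaCompatibility(TypeSerializer) signature, and the CompactEvent type, the CompactEventSerializer class, and its getFormatVersion() method are hypothetical stand-ins for your own custom serializer.

```java
import java.io.IOException;

import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.api.common.typeutils.TypeSerializerSchemaCompatibility;
import org.apache.flink.api.common.typeutils.TypeSerializerSnapshot;
import org.apache.flink.core.memory.DataInputView;
import org.apache.flink.core.memory.DataOutputView;

/**
 * Snapshot for a hypothetical CompactEventSerializer whose wire format is identified
 * by a single int. Only the serializer configuration is written here, never the data.
 */
public class CompactEventSerializerSnapshot implements TypeSerializerSnapshot<CompactEvent> {

    private int writtenFormatVersion;

    public CompactEventSerializerSnapshot() {}                  // required for restore

    public CompactEventSerializerSnapshot(int formatVersion) {  // used when checkpointing
        this.writtenFormatVersion = formatVersion;
    }

    @Override
    public int getCurrentVersion() {
        return 1;  // version of this snapshot class itself, not of the data format
    }

    @Override
    public void writeSnapshot(DataOutputView out) throws IOException {
        out.writeInt(writtenFormatVersion);  // persist the serializer configuration
    }

    @Override
    public void readSnapshot(int readVersion, DataInputView in, ClassLoader classLoader)
            throws IOException {
        this.writtenFormatVersion = in.readInt();
    }

    @Override
    public TypeSerializer<CompactEvent> restoreSerializer() {
        // A serializer able to read the bytes exactly as they were written.
        return new CompactEventSerializer(writtenFormatVersion);
    }

    @Override
    public TypeSerializerSchemaCompatibility<CompactEvent> resolveSchemaCompatibility(
            TypeSerializer<CompactEvent> newSerializer) {
        if (!(newSerializer instanceof CompactEventSerializer)) {
            return TypeSerializerSchemaCompatibility.incompatible();
        }
        int newFormat = ((CompactEventSerializer) newSerializer).getFormatVersion();
        if (newFormat == writtenFormatVersion) {
            return TypeSerializerSchemaCompatibility.compatibleAsIs();
        }
        if (newFormat > writtenFormatVersion) {
            // Old bytes can be decoded and re-encoded in the new format during restore.
            return TypeSerializerSchemaCompatibility.compatibleAfterMigration();
        }
        return TypeSerializerSchemaCompatibility.incompatible();
    }
}
```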
There are scenarios where schemas diverge so significantly that standard evolution rules cannot resolve the difference, for example changing a state variable from a List&lt;String&gt; to a Map&lt;String, Integer&gt;. In these cases, the TypeSerializer will report incompatibility, and the job will refuse to start.
To handle this, you use the State Processor API. This is an offline tool that allows you to read a savepoint as a dataset, transform it using standard Flink batch operators (Map, FlatMap), and write a new savepoint.
The workflow involves:

1. Loading the existing savepoint with SavepointReader.
2. Reading the operator state you need to change into a DataSet (or a bounded DataStream in newer Flink releases).
3. Transforming the records with standard operators to match the new schema.
4. Writing a new savepoint with SavepointWriter with the updated state.

This approach effectively performs an "ETL on State," allowing you to perform arbitrary complex migrations, cleanups, or schema refactoring without losing the historical context required by the streaming application.
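As a sketch of that workflow, the job below migrates a hypothetical ListState&lt;String&gt; named "tags" into a MapState&lt;String, Integer&gt; of tag counts. The paths, the operator uid "profile-state", and the state names are placeholders, and the SavepointReader/SavepointWriter factory signatures vary between Flink releases; this roughly follows the DataStream-based State Processor API (flink-state-processor-api, around Flink 1.17+).

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.SavepointReader;
import org.apache.flink.state.api.SavepointWriter;
import org.apache.flink.state.api.StateBootstrapTransformation;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;
import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

public class ListToMapMigration {

    /** Old state carried together with its key so it can be re-keyed when bootstrapping. */
    public static class OldEntry {
        public String key;
        public List<String> tags = new ArrayList<>();
        public OldEntry() {}
    }

    /** Step 1-2: read the legacy ListState for every key in the savepoint. */
    public static class LegacyReader extends KeyedStateReaderFunction<String, OldEntry> {
        private transient ListState<String> tags;

        @Override
        public void open(Configuration parameters) {
            tags = getRuntimeContext().getListState(
                    new ListStateDescriptor<>("tags", String.class));
        }

        @Override
        public void readKey(String key, Context ctx, Collector<OldEntry> out) throws Exception {
            OldEntry entry = new OldEntry();
            entry.key = key;
            for (String t : tags.get()) {
                entry.tags.add(t);
            }
            out.collect(entry);
        }
    }

    /** Step 3-4: write the migrated MapState<String, Integer> (tag -> occurrence count). */
    public static class MapBootstrapper extends KeyedStateBootstrapFunction<String, OldEntry> {
        private transient MapState<String, Integer> tagCounts;

        @Override
        public void open(Configuration parameters) {
            tagCounts = getRuntimeContext().getMapState(
                    new MapStateDescriptor<>("tagCounts", String.class, Integer.class));
        }

        @Override
        public void processElement(OldEntry entry, Context ctx) throws Exception {
            for (String tag : entry.tags) {
                Integer current = tagCounts.get(tag);
                tagCounts.put(tag, current == null ? 1 : current + 1);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        SavepointReader reader =
                SavepointReader.read(env, "s3://savepoints/old", new HashMapStateBackend());
        DataStream<OldEntry> oldState =
                reader.readKeyedState("profile-state", new LegacyReader());

        StateBootstrapTransformation<OldEntry> migrated = OperatorTransformation
                .bootstrapWith(oldState)
                .keyBy(new KeySelector<OldEntry, String>() {
                    @Override
                    public String getKey(OldEntry e) {
                        return e.key;
                    }
                })
                .transform(new MapBootstrapper());

        SavepointWriter
                .fromExistingSavepoint(env, "s3://savepoints/old", new HashMapStateBackend())
                .removeOperator("profile-state")
                .withOperator("profile-state", migrated)
                .write("s3://savepoints/new");

        env.execute("list-to-map state migration");
    }
}
```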
While schema evolution provides flexibility, it introduces overhead. Using the Avro serializer is generally slower than using a specialized POJO serializer or a custom binary serializer because of the cost of schema resolution and object instantiation.
The chart below compares the relative throughput impact when utilizing different serialization strategies during a state-heavy windowing operation. Note that while Avro incurs a serialization penalty, it offers the highest safety for long-term schema evolution.
Comparison of serialization throughput (bars) against safety for schema evolution (red diamonds).
Choosing the right strategy requires balancing the raw performance needs of millisecond-latency loops against the operational requirement to update applications over months or years. For most enterprise pipelines, the overhead of Avro is a necessary cost to prevent data loss during application upgrades.