Long-running streaming applications inevitably face a challenge that batch processes rarely encounter: business logic and data structures change while the application continues to run. A batch process can simply re-process the raw input data with new code, but in a stateful streaming architecture the 'truth' is often held in the intermediate state stored in RocksDB or on the heap. When deploying a new version of a Flink application, it must be able to read the binary state written by the previous version. If the class definitions of state objects have been modified, the binary data may no longer match the deserialization logic, leading to critical failures.
This interoperability between past checkpoints and present code is managed through state schema evolution. Flink creates a bridge between the schema used to write the data and the schema used to read it, but this bridge relies on specific serialization frameworks and strict compatibility rules.
Flink does not merely store raw data bytes in a checkpoint. Alongside the state values, Flink persists the configuration of the serializer used to write that data. This metadata is encapsulated in a TypeSerializerSnapshot. When a job restarts from a savepoint, Flink checks the compatibility between the TypeSerializerSnapshot stored in the savepoint and the TypeSerializer configured in the new application code.
The compatibility check results in one of several outcomes:

- Compatible as is: the new serializer can read the old bytes directly, and the job resumes without extra work.
- Compatible after migration: the old bytes must first be read with the previous serializer and rewritten with the new one.
- Compatible with a reconfigured serializer: the new serializer must be reconfigured using information from the snapshot before it can read the state.
- Incompatible: the new serializer cannot interpret the state, and the restore fails.
The following diagram illustrates the decision flow Flink executes during a restart operation to ensure state integrity.
Compatibility decision tree executed during a Flink job restart.
Flink's native POJO serializer supports a moderate degree of schema evolution. Unlike Java native serialization, which is brittle and heavily reliant on serialVersionUID, the PojoSerializer inspects the structure of the class. It allows for:

- Removing fields: the values of removed fields are dropped when old state is read and will not appear in future checkpoints.
- Adding fields: new fields are initialized with default values (0 for integers, null for objects) when reading old state.

For this evolution to work, the POJO rules must be strictly followed. The class must be public, have a public no-argument constructor, and all fields must be either public or accessible via standard getters and setters. If Flink falls back to the generic Kryo serializer because the class does not adhere to POJO specifications, schema evolution capabilities are significantly reduced, often requiring a complete state reset.
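To make the rules concrete, here is a minimal sketch of a state class that the PojoSerializer can evolve. The class UserSession and its fields are hypothetical examples, not part of any Flink API.

```java
// Hypothetical state class that follows Flink's POJO rules, so the PojoSerializer
// (rather than the Kryo fallback) is used and limited schema evolution stays available.
public class UserSession {

    // Fields are public; private fields with standard getters/setters would also qualify.
    public String userId;
    public long clickCount;

    // Field added in a later version of the application. When restoring state written
    // before this field existed, the PojoSerializer initializes it to null.
    public String lastLogin;

    // A public no-argument constructor is mandatory for Flink to treat this as a POJO.
    public UserSession() {}
}
```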
For production environments managing terabytes of state, relying on implicit POJO evolution is risky. Explicit schema management using Avro or Protobuf is the standard for state evolution. These formats decouple the data definition from the Java class implementation.
When using Avro with Flink, the AvroSerializer can handle schema changes by leveraging Avro's built-in resolution logic. This requires that the new schema is backward compatible with the old schema.
To ensure backward compatibility when modifying an Avro schema:

- Only add new fields that declare a default value. When the new reader encounters a record written by the old writer (which lacks the field), it populates the field with the default.
- Do not change the type of an existing field unless Avro defines a promotion for it (for example, int to long).
- Avoid renaming fields; if a rename is unavoidable, declare the old name as an alias.

Consider a user profile state object. We can define the relationship between the old schema $S_{old}$ and the new schema $S_{new}$. If we add a field email_verified to $S_{new}$, the deserialization process for an old record $r_{old}$ can be expressed as:

$$r_{new} = \mathrm{resolve}(S_{old}, S_{new})(r_{old}), \qquad r_{new}.\mathrm{email\_verified} = d_{default}$$

If the default value $d_{default}$ is missing in the definition of $S_{new}$, the compatibility check fails because $r_{new}$ cannot be constructed deterministically.
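The following sketch demonstrates this rule with Avro's own compatibility checker, outside of Flink. The UserProfile schema and the email_verified default are illustrative, not taken from the text above.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class AvroCompatibilityCheck {
    public static void main(String[] args) {
        // Old writer schema: the version used when the state was checkpointed.
        Schema oldSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"UserProfile\",\"fields\":["
          + "{\"name\":\"user_id\",\"type\":\"string\"}]}");

        // New reader schema: adds email_verified WITH a default value.
        Schema newSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"UserProfile\",\"fields\":["
          + "{\"name\":\"user_id\",\"type\":\"string\"},"
          + "{\"name\":\"email_verified\",\"type\":\"boolean\",\"default\":false}]}");

        // Reader/writer resolution: reader = new schema, writer = old schema.
        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(newSchema, oldSchema);

        // Prints COMPATIBLE; without the default, old records could not be resolved.
        System.out.println(result.getType());
    }
}
```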
For complex custom types or when high-performance manual serialization is used, you must implement the TypeSerializerSnapshot interface. This interface acts as the source of truth for schema versioning.
When implementing a custom serializer, you define a snapshot class that writes the configuration parameters of your serializer (not the data) to the checkpoint. Upon restore, resolveSchemaCompatibility is called.
Here is the logic flow you must implement within resolveSchemaCompatibility:

1. Read the configuration that the old snapshot persisted (field layout, format version, and similar parameters).
2. Compare it against the configuration of the serializer registered in the new application code.
3. Return TypeSerializerSchemaCompatibility.compatibleAsIs() if nothing changed, or TypeSerializerSchemaCompatibility.compatibleAfterMigration() if the format is different but convertible.

If you return compatibleAfterMigration(), Flink rewrites the state during the restore: it reads every key-value pair using the old serializer and writes it back using the new serializer. While this ensures correctness, it lengthens the restore time in proportion to the size of the state.
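A minimal sketch of this pattern is shown below. It assumes the pre-1.19 resolveSchemaCompatibility(TypeSerializer) signature, and the CompactEvent type, the CompactEventSerializer class, and its getFormatVersion() method are hypothetical stand-ins for your own custom serializer.

```java
import java.io.IOException;

import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.api.common.typeutils.TypeSerializerSchemaCompatibility;
import org.apache.flink.api.common.typeutils.TypeSerializerSnapshot;
import org.apache.flink.core.memory.DataInputView;
import org.apache.flink.core.memory.DataOutputView;

/**
 * Snapshot for a hypothetical CompactEventSerializer whose wire format is identified
 * by a single int. Only the serializer configuration is written here, never the data.
 */
public class CompactEventSerializerSnapshot implements TypeSerializerSnapshot<CompactEvent> {

    private int writtenFormatVersion;

    public CompactEventSerializerSnapshot() {}                  // required for restore

    public CompactEventSerializerSnapshot(int formatVersion) {  // used when checkpointing
        this.writtenFormatVersion = formatVersion;
    }

    @Override
    public int getCurrentVersion() {
        return 1;  // version of this snapshot class itself, not of the data format
    }

    @Override
    public void writeSnapshot(DataOutputView out) throws IOException {
        out.writeInt(writtenFormatVersion);  // persist the serializer configuration
    }

    @Override
    public void readSnapshot(int readVersion, DataInputView in, ClassLoader classLoader)
            throws IOException {
        this.writtenFormatVersion = in.readInt();
    }

    @Override
    public TypeSerializer<CompactEvent> restoreSerializer() {
        // A serializer able to read the bytes exactly as they were written.
        return new CompactEventSerializer(writtenFormatVersion);
    }

    @Override
    public TypeSerializerSchemaCompatibility<CompactEvent> resolveSchemaCompatibility(
            TypeSerializer<CompactEvent> newSerializer) {
        if (!(newSerializer instanceof CompactEventSerializer)) {
            return TypeSerializerSchemaCompatibility.incompatible();
        }
        int newFormat = ((CompactEventSerializer) newSerializer).getFormatVersion();
        if (newFormat == writtenFormatVersion) {
            return TypeSerializerSchemaCompatibility.compatibleAsIs();
        }
        if (newFormat > writtenFormatVersion) {
            // Old bytes can be decoded and re-encoded in the new format during restore.
            return TypeSerializerSchemaCompatibility.compatibleAfterMigration();
        }
        return TypeSerializerSchemaCompatibility.incompatible();
    }
}
```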
There are scenarios where schemas diverge so significantly that standard evolution rules cannot resolve the difference, for example changing a state variable from a List&lt;String&gt; to a Map&lt;String, Integer&gt;. In these cases, the TypeSerializer will report incompatibility, and the job will refuse to start.
To handle this, you use the State Processor API. This is an offline tool that allows you to read a savepoint as a dataset, transform it using standard Flink batch operators (Map, FlatMap), and write a new savepoint.
The workflow involves:

1. Loading the existing savepoint with SavepointReader.
2. Reading the operator state you need to change into a DataSet (or a bounded DataStream in newer Flink releases).
3. Transforming the records with standard operators to match the new schema.
4. Writing a new savepoint with SavepointWriter with the updated state.

This approach effectively performs an "ETL on State," allowing you to perform arbitrary complex migrations, cleanups, or schema refactoring without losing the historical context required by the streaming application.
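As a sketch of that workflow, the job below migrates a hypothetical ListState&lt;String&gt; named "tags" into a MapState&lt;String, Integer&gt; of tag counts. The paths, the operator uid "profile-state", and the state names are placeholders, and the SavepointReader/SavepointWriter factory signatures vary between Flink releases; this roughly follows the DataStream-based State Processor API (flink-state-processor-api, around Flink 1.17+).

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.SavepointReader;
import org.apache.flink.state.api.SavepointWriter;
import org.apache.flink.state.api.StateBootstrapTransformation;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;
import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

public class ListToMapMigration {

    /** Old state carried together with its key so it can be re-keyed when bootstrapping. */
    public static class OldEntry {
        public String key;
        public List<String> tags = new ArrayList<>();
        public OldEntry() {}
    }

    /** Step 1-2: read the legacy ListState for every key in the savepoint. */
    public static class LegacyReader extends KeyedStateReaderFunction<String, OldEntry> {
        private transient ListState<String> tags;

        @Override
        public void open(Configuration parameters) {
            tags = getRuntimeContext().getListState(
                    new ListStateDescriptor<>("tags", String.class));
        }

        @Override
        public void readKey(String key, Context ctx, Collector<OldEntry> out) throws Exception {
            OldEntry entry = new OldEntry();
            entry.key = key;
            for (String t : tags.get()) {
                entry.tags.add(t);
            }
            out.collect(entry);
        }
    }

    /** Step 3-4: write the migrated MapState<String, Integer> (tag -> occurrence count). */
    public static class MapBootstrapper extends KeyedStateBootstrapFunction<String, OldEntry> {
        private transient MapState<String, Integer> tagCounts;

        @Override
        public void open(Configuration parameters) {
            tagCounts = getRuntimeContext().getMapState(
                    new MapStateDescriptor<>("tagCounts", String.class, Integer.class));
        }

        @Override
        public void processElement(OldEntry entry, Context ctx) throws Exception {
            for (String tag : entry.tags) {
                Integer current = tagCounts.get(tag);
                tagCounts.put(tag, current == null ? 1 : current + 1);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        SavepointReader reader =
                SavepointReader.read(env, "s3://savepoints/old", new HashMapStateBackend());
        DataStream<OldEntry> oldState =
                reader.readKeyedState("profile-state", new LegacyReader());

        StateBootstrapTransformation<OldEntry> migrated = OperatorTransformation
                .bootstrapWith(oldState)
                .keyBy(new KeySelector<OldEntry, String>() {
                    @Override
                    public String getKey(OldEntry e) {
                        return e.key;
                    }
                })
                .transform(new MapBootstrapper());

        SavepointWriter
                .fromExistingSavepoint(env, "s3://savepoints/old", new HashMapStateBackend())
                .removeOperator("profile-state")
                .withOperator("profile-state", migrated)
                .write("s3://savepoints/new");

        env.execute("list-to-map state migration");
    }
}
```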
While schema evolution provides flexibility, it introduces overhead. Using the Avro serializer is generally slower than using a specialized POJO serializer or a custom binary serializer because of the cost of schema resolution and object instantiation.
The chart below compares the relative throughput impact when utilizing different serialization strategies during a state-heavy windowing operation. Note that while Avro incurs a serialization penalty, it offers the highest safety for long-term schema evolution.
Comparison of serialization throughput (bars) against safety for schema evolution (red diamonds).
Choosing the right strategy requires balancing the raw performance needs of millisecond-latency loops against the operational requirement to update applications over months or years. For most enterprise pipelines, the overhead of Avro is a necessary cost to prevent data loss during application upgrades.