As state size grows into the terabytes, the cost of persisting data becomes the primary bottleneck in stream processing. Flink captures consistent state without halting processing by using asynchronous barrier snapshots. However, naively copying the entire state to remote storage for every checkpoint, known as a full checkpoint, is unsustainable for large-scale applications. It consumes excessive network bandwidth and extends the time required to complete a checkpoint, increasing the risk of SLA violations.

To address this, Flink offers incremental checkpointing. Instead of persisting the full state $S$ at every interval, the system persists only the changes $\Delta S$ that have occurred since the last successful checkpoint. This approach dramatically reduces the duration of the checkpointing phase and the load on the distributed file system, enabling more frequent checkpoints and a tighter recovery point objective (RPO).

## The Role of RocksDB and LSM Trees

Incremental checkpointing in Flink is tightly coupled to the storage format of the state backend. Currently, this feature is primarily supported by the RocksDB state backend. To understand how Flink isolates state deltas, one must understand the underlying Log-Structured Merge-tree (LSM tree) architecture used by RocksDB.

RocksDB buffers writes in memory in a MemTable. When this buffer fills, it is flushed to disk as a Sorted String Table (SSTable). SSTables are immutable: once written, a file is never modified, and updates or deletes are written to new SSTables. Background compaction processes eventually merge these files to reclaim space and remove overwritten data.

Flink exploits this immutability. Since existing SSTables never change, a checkpoint does not need to re-upload files that were already persisted by a previous checkpoint.
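To make this concrete, the following sketch reduces delta detection to a set difference over SSTable file names. It is a toy illustration, not Flink's implementation, and all class and helper names are hypothetical; Flink's actual backend tracks previously uploaded files through state handles and the shared state registry described below.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy sketch: because SSTables are immutable, the upload set is a simple set difference. */
public class DeltaSelector {

    /** Returns the SSTable files that must be physically uploaded for the current checkpoint. */
    static Set<String> filesToUpload(List<String> localSstables, Set<String> alreadyPersisted) {
        Set<String> toUpload = new HashSet<>(localSstables);
        // Immutable files never change after being written, so anything already persisted
        // in a previous checkpoint can be referenced instead of re-uploaded.
        toUpload.removeAll(alreadyPersisted);
        return toUpload;
    }

    public static void main(String[] args) {
        List<String> local = List.of("000001.sst", "000002.sst", "000003.sst");
        Set<String> persistedByCp1 = Set.of("000001.sst", "000002.sst");
        System.out.println(filesToUpload(local, persistedByCp1)); // prints [000003.sst]
    }
}
```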
An incremental checkpoint consists only of:

- New SSTables generated by RocksDB flushes since the last checkpoint.
- A handle registry (manifest) that references both the new files and the still-valid files from previous checkpoints.

The following diagram illustrates the relationship between RocksDB's local file generation and Flink's shared state registry on remote storage (such as S3 or HDFS).

```dot
digraph G {
  rankdir=TB;
  node [shape=box, style=filled, fontname="Helvetica", fontsize=10, margin=0.2];
  edge [fontname="Helvetica", fontsize=9, color="#868e96"];

  subgraph cluster_local {
    label="Local RocksDB Instance";
    style=dashed; color="#adb5bd"; fontcolor="#495057";
    mem  [label="Active MemTable", fillcolor="#eebefa", color="#ae3ec9"];
    sst1 [label="SSTable_1.sst\n(Immutable)", fillcolor="#d0bfff", color="#7048e8"];
    sst2 [label="SSTable_2.sst\n(Immutable)", fillcolor="#d0bfff", color="#7048e8"];
    sst3 [label="SSTable_3.sst\n(New)", fillcolor="#bac8ff", color="#4263eb"];
    mem -> sst3 [label="Flush", style=dotted];
  }

  subgraph cluster_remote {
    label="Distributed Storage (e.g., S3)";
    style=dashed; color="#adb5bd"; fontcolor="#495057";
    shared_sst1 [label="SSTable_1.sst\n(Persisted CP1)", fillcolor="#ced4da", color="#868e96"];
    shared_sst2 [label="SSTable_2.sst\n(Persisted CP1)", fillcolor="#ced4da", color="#868e96"];
    shared_sst3 [label="SSTable_3.sst\n(Persisted CP2)", fillcolor="#a5d8ff", color="#1c7ed6"];
  }

  sst1 -> shared_sst1 [label="Pointer Only", style=dashed, color="#fa5252"];
  sst2 -> shared_sst2 [label="Pointer Only", style=dashed, color="#fa5252"];
  sst3 -> shared_sst3 [label="Physical Upload", color="#228be6", penwidth=2];
}
```

The diagram shows that unchanged files (SSTable 1 and 2) are referenced rather than re-uploaded, while only the new file (SSTable 3) is physically transferred during the second checkpoint.

## Checkpoint Lifecycle and Reference Counting

When incremental checkpointing is enabled, the lifecycle of a checkpointed file differs significantly from full checkpointing. Flink no longer owns each file outright; instead, it maintains a reference count for every SSTable uploaded to the distributed file system (a toy sketch of this bookkeeping appears at the end of this section).

1. Materialization: When a checkpoint triggers, RocksDB flushes its MemTable to disk, and Flink identifies all SSTables currently used by the RocksDB instance.
2. Deduplication: Flink compares the list of local SSTables against the list of known shared files. If a file handle already exists in the shared registry, Flink registers a new reference to it rather than uploading the bytes again.
3. Registration: New SSTables are uploaded. The checkpoint metadata (the manifest) records the complete list of file handles required to reconstruct the state.
4. Garbage Collection: When an old checkpoint is discarded (for example, because it falls outside the retained checkpoint limit), Flink decrements the reference counts of the files associated with that checkpoint. A file is physically deleted from the distributed storage only when its reference count drops to zero.

This mechanism implies that the checkpoint metadata grows slightly, since it tracks more files, but the data transfer volume drops to match the churn rate of the application state:

$$ \text{Size}_{\text{transfer}} \approx \sum_{f \in \text{NewFiles}} \text{Size}(f) $$
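The garbage-collection rule above can be pictured with a small sketch. The class below is a hypothetical stand-in for Flink's shared state registry, reduced to pure reference counting; real state handles also carry information such as file paths and sizes.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy shared-state registry: a file may be deleted only when no retained checkpoint references it. */
public class SharedStateRegistrySketch {

    private final Map<String, Integer> refCounts = new HashMap<>();

    /** A checkpoint registers a file handle, either after uploading it or by reusing an existing one. */
    void register(String fileName) {
        refCounts.merge(fileName, 1, Integer::sum);
    }

    /** A checkpoint is discarded; returns true if the file can now be physically deleted. */
    boolean unregister(String fileName) {
        int remaining = refCounts.merge(fileName, -1, Integer::sum);
        if (remaining <= 0) {
            refCounts.remove(fileName);
            return true; // no retained checkpoint needs this SSTable anymore
        }
        return false;
    }

    public static void main(String[] args) {
        SharedStateRegistrySketch registry = new SharedStateRegistrySketch();
        registry.register("000001.sst");                       // referenced by checkpoint 1
        registry.register("000001.sst");                       // referenced again by checkpoint 2
        System.out.println(registry.unregister("000001.sst")); // false: checkpoint 2 still needs it
        System.out.println(registry.unregister("000001.sst")); // true: safe to delete from remote storage
    }
}
```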
## Compaction and Write Amplification

A critical performance consideration in this architecture is the interaction between RocksDB's compaction and Flink's checkpoints. RocksDB merges multiple smaller SSTables into larger ones to optimize read performance and remove deleted keys.

When compaction runs, the old input SSTables are deleted locally and replaced by a new, merged SSTable. Flink consequently treats the merged file as "new data," even though it contains mostly pre-existing records. As a result, the checkpoint transfer size may spike after a major compaction event, regardless of the actual ingress data rate.

For high-throughput systems, it is often necessary to tune RocksDB's compaction style (e.g., level versus universal compaction) to balance read amplification against the network bandwidth spikes caused by these checkpoint uploads.

## Trade-offs: Recovery Time versus Snapshot Time

While incremental checkpoints optimize runtime performance (snapshotting), they introduce a trade-off during the recovery phase (restoring).

- Snapshot phase: fast. Only deltas are written.
- Restore phase: potentially slower. To rebuild the local RocksDB instance, Flink may need to download data from multiple historical checkpoints, because the state is fragmented across many small SSTables in the distributed file system.

However, once the files are downloaded, RocksDB starts quickly because it simply opens the file handles. The overhead is primarily network latency and throughput while gathering the fragmented files. In practice, the impact on recovery time is often negligible compared to the gains in checkpoint stability, but operators should be aware of this characteristic when running on slow distributed storage.

## Implementation and Configuration

Enabling incremental checkpointing is a straightforward configuration change, but it must be applied explicitly, as the default behavior depends on the Flink version and distribution.

In `flink-conf.yaml`:

```yaml
state.backend: rocksdb
state.backend.incremental: true
```

Or programmatically within the application code:

```java
// EmbeddedRocksDBStateBackend is provided by the flink-statebackend-rocksdb dependency.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
EmbeddedRocksDBStateBackend backend = new EmbeddedRocksDBStateBackend(true); // true = incremental checkpoints
env.setStateBackend(backend);
```

When monitoring a pipeline with incremental checkpoints enabled, the standard metrics can be misleading. The metric `lastCheckpointSize` reports the size of the data physically transferred, which will typically be small. To understand the total size of the state being managed, you must monitor `lastCheckpointFullSize`, which reports the logical size the checkpoint would have if it were a full checkpoint.

A growing divergence between `lastCheckpointSize` and `lastCheckpointFullSize` confirms that the incremental mechanism is working as intended, saving network bandwidth and storage I/O.
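As a rough illustration of how these two metrics relate, the snippet below computes the fraction of the logical state that was physically transferred for the last checkpoint. The numbers are invented for the example; in practice you would read both values from your metrics system.

```java
/** Toy illustration: relate physically transferred bytes to the logical full state size. */
public class CheckpointSavings {

    /** Fraction of the full state that was actually uploaded for the last checkpoint. */
    static double transferredShare(long lastCheckpointSize, long lastCheckpointFullSize) {
        if (lastCheckpointFullSize <= 0) {
            return 1.0; // no known full size yet; assume everything was transferred
        }
        return (double) lastCheckpointSize / (double) lastCheckpointFullSize;
    }

    public static void main(String[] args) {
        long delta = 2L * 1024 * 1024 * 1024;   // 2 GiB physically transferred (lastCheckpointSize)
        long full = 500L * 1024 * 1024 * 1024;  // 500 GiB logical state (lastCheckpointFullSize)
        // Prints roughly "Transferred 0.40% of the full state".
        System.out.printf("Transferred %.2f%% of the full state%n", 100 * transferredShare(delta, full));
    }
}
```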