Distributed systems lack a global clock reliable enough to coordinate consistent snapshots across hundreds of operators simultaneously. A naive approach to fault tolerance might involve a "stop-the-world" mechanism where all processing halts, state is flushed to disk, and processing resumes. In high-throughput streaming environments, this introduces unacceptable latency spikes and significantly degrades performance.

Apache Flink addresses this challenge using Asynchronous Barrier Snapshots (ABS). This mechanism allows the system to create a consistent image of the global state without halting continuous data processing. ABS is a variation of the Chandy-Lamport distributed snapshot algorithm, tailored specifically for the directed acyclic graph (DAG) topology of a streaming job.

The Role of Checkpoint Barriers

The core component of the ABS algorithm is the checkpoint barrier. Unlike standard data records, barriers are control messages injected into the data stream by the source operators. These barriers divide the unbounded stream into logical epochs: all records preceding barrier $n$ belong to snapshot $n$, while records following the barrier belong to snapshot $n+1$.

Barriers flow downstream alongside data records, strictly preserving the order of events (FIFO) within each channel. Because barriers occupy a specific position in the stream, they trigger state persistence exactly when all modifications related to the pre-barrier data are complete.

Figure: Checkpoint barrier propagation through a simple operator topology (Source → Map Function → Window Operator → Sink). The barrier separates data belonging to the current checkpoint epoch from the next.
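As a minimal sketch of how this surfaces in the DataStream API, the snippet below asks the sources to inject barriers at a fixed interval and states the delivery guarantee explicitly. The class name, the 10-second interval, and the job name are illustrative; the methods are standard StreamExecutionEnvironment calls in recent Flink releases, though package locations can shift between versions.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Ask the source operators to inject a checkpoint barrier every 10 s
        // (illustrative interval) and keep the default exactly-once guarantee.
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

        // ... define sources, transformations, and sinks here ...

        env.execute("checkpointed-job");
    }
}
```

The interval directly controls how often a new epoch begins: shorter intervals shrink the amount of reprocessing after a failure but increase the frequency of barrier alignment and state uploads.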
Barrier Alignment and Exactly-Once Semantics

When an operator receives data from multiple input channels (for example, a CoProcessFunction or a KeyedWindow), it must ensure that the snapshot reflects a consistent view across all inputs. This necessitates a process known as barrier alignment.

For an operator with inputs $I_1$ and $I_2$, the alignment process proceeds as follows:

1. Arrival: The operator receives barrier $n$ from input channel $I_1$.
2. Blocking: The operator stops processing data from $I_1$. It cannot process any data from epoch $n+1$ on this channel until the snapshot is taken; otherwise, the state would mix data from two different epochs.
3. Buffering: While $I_1$ is blocked, the operator continues to process data from input $I_2$.
4. Completion: Once barrier $n$ arrives from $I_2$, both inputs are aligned.
5. Snapshotting: The operator emits the barrier to its downstream outputs and triggers the asynchronous snapshot of its local state.
6. Resumption: The operator unblocks $I_1$ and resumes processing data from both channels for epoch $n+1$.

The alignment phase guarantees exactly-once processing semantics: if the system fails and restores from this snapshot, every record before the barrier has been processed exactly once, and no record after the barrier has been processed prematurely.

Mathematically, let $R_i$ be the sequence of records received on channel $i$. The state $S$ at checkpoint $C_n$ satisfies:

$$ S(C_n) = f\bigl(\{\, r \in R_i \mid \text{index}(r) < \text{index}(\text{Barrier}_n) \;\; \forall i \,\}\bigr) $$

where $f$ represents the state transition function. This implies that the state reflects strictly the consumption of the prefix of each stream up to the barrier.

The Trade-off: Latency and Backpressure

Barrier alignment introduces a latency trade-off. During the alignment phase, the operator effectively pauses consumption of the faster stream ($I_1$ in the previous example) to wait for the slower stream ($I_2$). In topologies with high data skew or significant network jitter, this waiting period manifests as backpressure that propagates upstream.

If alignment takes $t_{\text{align}}$, the total end-to-end latency for records in the blocked channel increases by at least $t_{\text{align}}$. For applications requiring ultra-low latency, Flink allows configuring unaligned checkpoints: barriers overtake data records in the buffers, and the in-flight data is persisted as part of the checkpoint state, bypassing the alignment delay at the cost of larger checkpoint sizes while retaining exactly-once guarantees. (Applications that can tolerate occasional duplicate processing upon recovery can instead skip alignment entirely by checkpointing in at-least-once mode.)
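A hedged configuration sketch for this mode is shown below, assuming Flink 1.13 or later: enableUnalignedCheckpoints() lets barriers overtake buffered records, while the aligned-checkpoint timeout is an optional hybrid setting (start aligned, fall back to unaligned) whose setter name has varied across releases. The class name and values are illustrative.

```java
import java.time.Duration;

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnalignedCheckpointSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000L);

        CheckpointConfig config = env.getCheckpointConfig();

        // Barriers may overtake records queued in the network buffers; the
        // overtaken in-flight data is written out as part of the checkpoint.
        config.enableUnalignedCheckpoints();

        // Optional hybrid: begin with normal alignment and switch to the
        // unaligned path only if alignment exceeds 30 s (setter name and
        // availability depend on the Flink version).
        config.setAlignedCheckpointTimeout(Duration.ofSeconds(30));
    }
}
```

The larger checkpoints are a direct consequence of persisting the overtaken buffers, so the timeout-based hybrid trades a bounded alignment delay for smaller checkpoints whenever alignment is fast.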
Asynchronous State Snapshotting

Once barriers are aligned (or immediately upon arrival for single-input operators), the operator performs the state snapshot. To prevent blocking the main processing thread, Flink separates the snapshot process into two phases:

1. Synchronous Phase: The operator performs a lightweight, local copy of the state. For heap-based backends, this might be a copy-on-write operation on the hash table; for RocksDB, it involves forcing a flush of the MemTable to disk and creating hard links to the SSTables. This phase stops processing, but typically completes in milliseconds.
2. Asynchronous Phase: A background thread transfers the state files to durable remote storage (such as S3, HDFS, or GCS). The main stream-processing thread resumes immediately after the synchronous phase is complete.

This separation ensures that the cost of transferring large state sets over the network does not penalize the throughput of the streaming job.

Figure: Snapshot timing (sync vs. async phases): a short synchronous pause is followed by a much longer background upload. Efficient snapshotting minimizes the "Stop-Time" (synchronous phase).

Distributed Coordination

The checkpoint coordinator, running on the JobManager, orchestrates the entire lifecycle. It triggers the sources to inject barriers and collects acknowledgments from all tasks. A checkpoint is considered complete only when all operators in the DAG have successfully acknowledged their state handle for that specific barrier ID.

If an operator fails to acknowledge the snapshot (due to network failure or disk I/O errors), the coordinator marks checkpoint $n$ as failed and garbage-collects any partial data. The stream processing continues uninterrupted, relying on the next successful checkpoint interval to establish a valid recovery point. This optimistic strategy ensures that transient infrastructure failures do not derail the real-time pipeline, provided that the interval between successful checkpoints remains within the limits defined by the service level agreement (SLA) for data-loss recovery.
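How aggressively the coordinator reacts to such failures is configurable. The sketch below is illustrative (assumed values, hypothetical class name) and uses CheckpointConfig settings that bound how long a checkpoint may run, how many consecutive failures are tolerated before the job itself fails, and how much pause the pipeline is guaranteed between attempts.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTolerancePolicy {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000L);

        CheckpointConfig config = env.getCheckpointConfig();

        // Abort a checkpoint that is not fully acknowledged within 2 minutes.
        config.setCheckpointTimeout(120_000L);

        // Keep the job running through transient failures; only fail it after
        // the tolerated number of consecutive checkpoint failures is exceeded.
        config.setTolerableCheckpointFailureNumber(3);

        // Give the pipeline breathing room between consecutive checkpoints so
        // alignment and uploads do not run back to back.
        config.setMinPauseBetweenCheckpoints(5_000L);
    }
}
```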
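Where the two snapshot phases do their work is determined by the configured state backend and checkpoint storage. The sketch below assumes the RocksDB state backend artifact (flink-statebackend-rocksdb) and an S3 filesystem plugin are on the classpath; the bucket path is a placeholder, not a real location.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SnapshotStorageSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000L);

        // Local operator state lives in RocksDB; the synchronous phase flushes
        // the MemTable and hard-links SSTables, keeping the stop-time short.
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // The asynchronous phase then uploads those files to durable remote
        // storage in the background (placeholder path below).
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink/checkpoints");
    }
}
```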