Reliability in distributed streaming systems is often misunderstood as a networking guarantee. For systems like Kafka and Flink, processing semantics do not define how reliable the network transport is, but rather how the system manages state mutation in the presence of failures. When a stream processor crashes and restarts, the correctness of the application depends on how it accounts for the events processed immediately prior to the failure.

We model a stateful stream processor as a function $\mathcal{F}$ that takes the current state $S_t$ and an input event $E_t$ to produce a new state $S_{t+1}$ and potentially an output $O_t$:

$$ (S_{t+1}, O_t) = \mathcal{F}(S_t, E_t) $$

The challenge arises when the system fails after updating $S_{t+1}$ but before acknowledging $E_t$ to the source, or after emitting $O_t$ but before snapshotting $S_{t+1}$. The recovery mechanism determines whether $E_t$ is replayed and how that replay affects $S$.

## At-Least-Once and Idempotency

At-least-once processing is the default reliability mode for many high-throughput pipelines. It guarantees that every event $E_t$ is processed by $\mathcal{F}$ at least once: potentially multiple times, but never zero times.

This is achieved through offset management. The consumer reads an event, processes it, and only then commits the offset to the broker. If the worker fails after processing $E_t$ but before the commit, the consumer group rebalance assigns the partition to a new worker. The new worker resumes from the last committed offset, re-reading $E_t$.

While this ensures no data loss, it introduces the duplicate update problem. If $\mathcal{F}$ involves a non-idempotent operation, such as incrementing a counter ($S = S + 1$), replaying $E_t$ results in a corrupted state: the counter is incremented twice, yielding $S + 2$.

For at-least-once to be sufficient, the application logic must be idempotent. Mathematically, an operation is idempotent if applying it multiple times yields the same result as applying it once:

$$ \mathcal{F}(S, E) = \mathcal{F}(\mathcal{F}(S, E), E) $$

This is trivial for set-like operations (e.g., adding a unique ID to a HyperLogLog) but difficult for running aggregations or transactional bank transfers.
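The read-process-commit loop described above maps directly onto the plain Kafka consumer API. Below is a minimal sketch, assuming a hypothetical `events` topic and an application-defined `process` handler; the essential points are that auto-commit is disabled and the offset is committed only after processing succeeds, so a crash in between replays the batch and the handler must be idempotent.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "counter-app");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        // Disable auto-commit: the offset advances only after processing succeeds.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // A crash before commitSync() means these records are re-read after the
                    // rebalance, so this step must be idempotent (e.g., an upsert keyed by
                    // record.key(), not a blind counter increment).
                    process(record);
                }
                consumer.commitSync(); // commit only after the whole batch is processed
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // Application-specific, idempotent state update goes here.
    }
}
```

Committing the offset before processing would flip the guarantee to at-most-once: a crash after the commit but before the state update silently drops $E_t$.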
## Exactly-Once Semantics (EOS)

Exactly-once semantics (EOS) does not imply that an event is physically processed exactly once. In a distributed environment with network partitions, physical retries are unavoidable. Instead, EOS guarantees that the effect of processing the event on the state and output is applied exactly once.

Flink achieves this internally using a variant of the Chandy-Lamport algorithm for distributed snapshots. The system injects "barrier" markers into the data stream. These barriers flow with the records, logically dividing the stream into a pre-snapshot part and a post-snapshot part.

When an operator receives a barrier, it performs the following steps:

1. Aligns inputs (if reading from multiple partitions), waiting for the barrier to arrive on all channels.
2. Snapshots its local state (e.g., RocksDB SST files) to durable storage.
3. Forwards the barrier to downstream operators.

This alignment ensures that the global state snapshot represents a consistent cut of the stream execution at a specific logical time.

*Figure: A Kafka source emits the stream [E1, E2, Barrier, E3] to a stateful Map operator; on receiving the barrier, the operator snapshots its state to checkpoint storage while buffering E3 during alignment. Checkpoint barriers flow through the pipeline, triggering state persistence at each operator to ensure a globally consistent view without halting the stream.*

## End-to-End Consistency and Two-Phase Commit

While internal state consistency is handled by checkpoints, ensuring exactly-once delivery to external systems (such as a Kafka sink) requires coordination between the stream processor and the output system. This is implemented using a Two-Phase Commit (2PC) protocol.

In a Flink-to-Kafka pipeline, the Flink KafkaSink functions as a transactional producer:

- **Phase 1: Pre-Commit.** As the Flink application processes events between checkpoints, it writes data to a Kafka topic within a transaction. These messages are sent to the broker but are marked as "uncommitted." Consumers configured with `isolation.level=read_committed` will not see these messages yet.
- **Phase 2: Commit.** When the Flink JobManager completes a global checkpoint, it notifies the KafkaSink, which then commits the transaction to Kafka.

This mechanism ties the visibility of the output to the durability of the internal state. If the Flink job fails before the checkpoint completes, the Kafka transaction times out or is aborted, and the messages are never effectively "seen" by downstream consumers. Upon restart, Flink rewinds to the last checkpoint and replays the events, generating a new transaction.
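In application code, this pre-commit/commit cycle is driven almost entirely by configuration: the sink participates in checkpointing once it is built with the EXACTLY_ONCE delivery guarantee. The following is a minimal sketch using Flink's Kafka connector (`flink-connector-kafka`); the bootstrap servers, topic name, transactional-id prefix, and checkpoint interval are illustrative placeholders, and the `fromElements` source stands in for a real upstream pipeline.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TransactionalSinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Transactions are committed when a checkpoint completes, so the checkpoint
        // interval bounds how long output remains invisible to downstream consumers.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("enriched-events")              // illustrative topic name
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                // Enables the transactional (pre-commit/commit) write path.
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                // Each sink subtask derives a stable transactional.id from this prefix.
                .setTransactionalIdPrefix("enrichment-job")
                .build();

        env.fromElements("a", "b", "c")   // stand-in for the real upstream pipeline
           .sinkTo(sink);

        env.execute("exactly-once-sink");
    }
}
```

Two operational details follow from this design: downstream consumers only benefit if they read with `isolation.level=read_committed`, and the producer-side transaction timeout must comfortably exceed the checkpoint interval plus expected recovery time, otherwise the broker may abort in-flight transactions before Flink gets the chance to commit them.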
## Latency and Performance Implications

Adopting exactly-once semantics introduces a trade-off between correctness and latency. In the transactional model, data written to Kafka is not visible to downstream consumers until the checkpoint completes. If the checkpoint interval is set to 10 seconds, an event's output can remain hidden from downstream consumers for up to 10 seconds after it is processed, regardless of how fast the network is: the consumer must wait for the transaction's commit marker before it can read the buffered messages.

This relationship creates a tuning requirement:

- **Lower checkpoint interval:** reduces end-to-end latency and the amount of work replayed after a failure, but increases CPU overhead and I/O pressure on the state backend.
- **Higher checkpoint interval:** improves throughput by batching larger transactions, but increases latency and recovery time.

*Figure: Latency vs. Throughput in Transactional Processing, plotting checkpoint interval (ms) against relative end-to-end latency and throughput overhead. Reducing the checkpoint interval minimizes latency but drastically increases the overhead on the system, potentially degrading maximum throughput.*

## Event Time versus Processing Time

Correctness in distributed streams also depends heavily on the definition of time. When we discuss guarantees, we must distinguish between:

- **Processing time ($t_p$):** the system clock time of the machine executing the operation.
- **Event time ($t_e$):** the timestamp attached to the record itself, representing when the event actually occurred.

If a pipeline uses processing-time windowing (e.g., "count events received between 10:00 and 10:05"), the results are non-deterministic. Replaying the log after a failure happens at a different $t_p$, causing events to fall into different windows than they did originally.

To achieve true exactly-once correctness in analytical applications, you must use event-time semantics. This decouples the results from the speed of the processor: even if the system goes down for an hour and then processes the backlog at 10x speed, the aggregation windows based on $t_e$ remain identical.

However, using $t_e$ introduces the problem of late data. Because network delays and distributed clock skew vary, $t_e$ does not increase monotonically within the stream. To handle this, Flink uses Watermarks: control records, similar to checkpoint barriers, that flow through the stream and assert that "no events with timestamp $t < T$ will arrive henceforth."

The interaction between Watermarks, Checkpoints (state consistency), and Transactions (output consistency) forms the foundation of modern stream processing.
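As a closing sketch, here is how these pieces appear together in a Flink job: exactly-once checkpointing for state consistency, a bounded-out-of-orderness watermark strategy for event time, and a tumbling window defined on $t_e$ so that results are identical under replay. The `Click` event type, the five-second lateness bound, the window size, and the inline test source are illustrative assumptions rather than a specific production pipeline.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeCountJob {

    // Illustrative event type: a user action carrying its own event-time timestamp.
    public static class Click {
        public String userId;
        public long timestampMillis;
        public Click() {}
        public Click(String userId, long timestampMillis) {
            this.userId = userId;
            this.timestampMillis = timestampMillis;
        }
    }

    // Counts the clicks that fall into each event-time window.
    public static class CountAgg implements AggregateFunction<Click, Long, Long> {
        @Override public Long createAccumulator() { return 0L; }
        @Override public Long add(Click value, Long acc) { return acc + 1; }
        @Override public Long getResult(Long acc) { return acc; }
        @Override public Long merge(Long a, Long b) { return a + b; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE); // state consistency

        env.fromElements(                       // stand-in for a real Kafka source
                new Click("alice", 1_000L),
                new Click("bob",   2_000L),
                new Click("alice", 240_000L))
           // Watermarks assert that no event older than (max observed t_e - 5s) is still coming.
           .assignTimestampsAndWatermarks(
               WatermarkStrategy.<Click>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                   .withTimestampAssigner((click, recordTs) -> click.timestampMillis))
           .keyBy(click -> click.userId)
           // Windows are defined on t_e, so replaying the backlog yields identical results.
           .window(TumblingEventTimeWindows.of(Time.minutes(5)))
           .aggregate(new CountAgg())
           .print();

        env.execute("event-time-count");
    }
}
```

With this setup, a restart rewinds to the last checkpoint and replays events from the source, but because the windows are keyed to $t_e$, the recomputed aggregates match the originals; routing the output through a transactional sink then ensures downstream consumers observe each window result exactly once.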