Distributed systems inherently face a difficult trade-off between consistency and availability. Apache Kafka's default configuration allows a producer to retry sending messages when network jitter or temporary broker failures occur. If the original message was successfully persisted but the acknowledgement was lost on the wire, this retry mechanism results in duplicate records in the log. This behavior satisfies "at-least-once" semantics. However, for financial ledgers, inventory management, or any stateful aggregation where correctness is critical, duplicates introduce state corruption.
To bridge the gap between reliable delivery and data correctness, Kafka introduces two distinct but related capabilities: the Idempotent Producer and the Transactional API. These features elevate the system from simple message passing to providing "exactly-once" processing semantics (EOS).
Idempotence ensures that a specific message is written to a specific partition exactly once, regardless of how many times the producer sends it. This is a local guarantee, scoped to a single producer session and a single target partition. It protects against network-induced duplicates but does not handle cross-partition atomicity or application crashes.
When enable.idempotence=true is configured, the producer undergoes an initialization protocol. Upon startup, it requests a Producer ID (PID) from the broker. This PID is transparent to the user and is reset if the application restarts.
Along with the PID, the producer maintains a monotonically increasing sequence number for each partition it writes to. Every batch of messages sent to the broker includes the PID and the base sequence number. The broker maintains a map of the last committed sequence number for every PID-partition pair, held in memory and persisted in the .snapshot files.
The broker validates incoming requests by comparing each batch's sequence number against the last committed value for that PID-partition pair:

- If the incoming sequence number is exactly one greater than the last committed value, the batch is appended normally.
- If it is less than or equal to the last committed value, the batch is a duplicate; the broker acknowledges it without writing it again.
- If it skips ahead, leaving a gap, the broker rejects the batch with an OutOfOrderSequenceException.

This mechanism effectively deduplicates messages at the broker level with negligible performance overhead. It does not require a two-phase commit or coordination with other services.
The sequence number validation logic prevents duplicate persistence when network retries occur.
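The validation rules above can be sketched in a few lines. This is a simplified model, not the broker's actual implementation: it tracks one sequence number per batch rather than base/last sequence ranges, and the class and return values are illustrative.

```python
# Simplified sketch of broker-side sequence validation for idempotent
# producers. Real brokers track this per (PID, partition) in memory and
# in .snapshot files; names here are illustrative.

class PartitionState:
    def __init__(self):
        self.last_seq = {}   # producer ID -> last committed sequence (-1 = none)
        self.log = []        # the partition's persisted batches

    def append(self, pid, seq, batch):
        last = self.last_seq.get(pid, -1)
        if seq <= last:
            return "DUPLICATE"      # already persisted: ack, do not rewrite
        if seq > last + 1:
            return "OUT_OF_ORDER"   # gap: reject (OutOfOrderSequenceException)
        self.log.append(batch)      # seq == last + 1: normal append
        self.last_seq[pid] = seq
        return "OK"

state = PartitionState()
assert state.append(pid=7, seq=0, batch="a") == "OK"
assert state.append(pid=7, seq=1, batch="b") == "OK"
# A network retry resends the same batch: deduplicated, log unchanged.
assert state.append(pid=7, seq=1, batch="b") == "DUPLICATE"
assert state.log == ["a", "b"]
# A gap in the sequence indicates lost data and is rejected.
assert state.append(pid=7, seq=5, batch="z") == "OUT_OF_ORDER"
```

Note how the duplicate case returns success to the producer: from the application's perspective the retry "worked," even though nothing was written twice.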
While idempotence solves duplication within a single stream, complex pipelines often require atomicity across multiple components. A common pattern is "consume-process-produce," where an application consumes from source Topic A, applies logic, and writes results to sink Topic B.
If the application crashes after writing to Topic B but before committing the offset for Topic A, the system enters an inconsistent state. Upon restart, the consumer reads the message from Topic A again (since the offset was not committed) and produces a duplicate to Topic B. Idempotence cannot solve this because the new producer instance has a new PID.
Kafka Transactions allow an application to write to multiple partitions (and commit consumer offsets) atomically. Either all writes succeed and become visible, or all are discarded.
Transactions introduce a server-side module called the Transaction Coordinator. This component manages the state of transactions. The state is persisted in an internal topic named __transaction_state. This internal topic functions similarly to the __consumer_offsets topic but uses a specialized binary format for transaction metadata.
To use transactions, the producer must configure a static transactional.id. Unlike the ephemeral PID, the transactional.id persists across application restarts. This persistence enables the coordinator to identify resurrected instances of the same application.
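A typical transactional producer configuration looks like the following. The key names are standard Kafka producer configs; the values (broker address, ID, timeout) are examples, and the dictionary form matches clients that accept config maps, such as confluent-kafka.

```python
# Illustrative transactional producer settings (values are examples).
conf = {
    "bootstrap.servers": "broker:9092",          # assumed broker address
    "transactional.id": "payments-processor-1",  # stable across restarts
    "enable.idempotence": True,                  # required for (and implied by) transactions
    "acks": "all",                               # required for idempotence
    "transaction.timeout.ms": 60000,             # coordinator aborts the txn after this
}
```

Choosing a stable, per-logical-instance `transactional.id` matters: if two distinct applications share an ID, they will continuously fence each other.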
The write path involves a sophisticated coordination protocol.
1. The producer issues an initialization request to locate and register with the coordinator responsible for its transactional.id (determined by hashing the ID).
2. If a previous producer with the same transactional.id is still active, the coordinator fences it off by incrementing the producer epoch.
3. As the producer writes to partitions, those partitions are registered with the coordinator as participants in the transaction.
4. On commit, the coordinator writes a PREPARE_COMMIT message to the __transaction_state log. Once this message is persisted, the transaction is guaranteed to complete.
5. The coordinator writes commit markers to every participating data partition and then records the final state, COMMITTED, to the transaction log.

The Transaction Coordinator manages the lifecycle, writing commit markers to data partitions only after the transaction state is persisted.
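The commit path described above is essentially a two-phase protocol. The toy model below captures its ordering guarantee: the durable PREPARE_COMMIT record comes first, markers second, the final state last. Persistence to __transaction_state is modeled as an in-memory list; all names are illustrative.

```python
# Toy model of the Transaction Coordinator's commit path.

class Coordinator:
    def __init__(self):
        self.txn_log = []   # stands in for the __transaction_state topic
        self.markers = {}   # partition -> marker written to its data log

    def commit(self, txn_id, partitions):
        # Phase 1: persist the intent. Once this record is durable, the
        # transaction must complete even if the coordinator crashes and
        # a replacement resumes from the log.
        self.txn_log.append((txn_id, "PREPARE_COMMIT"))
        # Phase 2: write COMMIT markers to every participating partition.
        for p in partitions:
            self.markers[p] = "COMMIT"
        # The final state is recorded only after all markers are out.
        self.txn_log.append((txn_id, "COMMITTED"))

c = Coordinator()
c.commit("app-1", ["topic-B-0", "topic-B-1"])
assert c.txn_log == [("app-1", "PREPARE_COMMIT"), ("app-1", "COMMITTED")]
assert c.markers == {"topic-B-0": "COMMIT", "topic-B-1": "COMMIT"}
```

The ordering is the point: a coordinator that crashes between the two log writes can replay PREPARE_COMMIT and finish writing markers, which is why persistence of that record guarantees completion.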
The implementation of transactions in Kafka relies on the fact that producers write data to the log immediately, regardless of the transaction status. An uncommitted message resides on the partition disk just like a committed one. The distinction is enforced at the consumer level via the isolation.level configuration.
In the default read_uncommitted mode, consumers read all messages in offset order. They will see messages from open transactions and even messages from aborted transactions. This mode offers lower latency but no consistency guarantees for transactional workflows.
When a consumer is configured with isolation.level=read_committed, it processes messages only up to the Last Stable Offset (LSO). The LSO is defined as the first offset of the earliest open transaction.
Messages appearing after the LSO are buffered in the consumer until the transaction is decided.
- When a COMMIT marker is encountered, the buffered messages are released to the application.
- When an ABORT marker is encountered, the buffered messages are discarded, and the consumer advances past the marker.

This buffering mechanism implies that read_committed consumers may experience higher latency when concurrent transactions are long-running. A stalled transaction holds back the LSO, preventing consumers from reading subsequent committed messages from other producers on the same partition.
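The buffer-and-decide behavior can be sketched as a fold over the partition log. This is a deliberate simplification: it models a single transactional producer per partition and represents records as tuples rather than real Kafka record batches.

```python
# Sketch of read_committed consumption: data records are buffered until
# a COMMIT or ABORT marker decides the transaction they belong to.

def read_committed(records):
    """records: list of ('DATA', value), ('COMMIT',), or ('ABORT',)."""
    delivered, buffered = [], []
    for rec in records:
        if rec[0] == "DATA":
            buffered.append(rec[1])     # held back: transaction still open
        elif rec[0] == "COMMIT":
            delivered.extend(buffered)  # release the buffered messages
            buffered = []
        elif rec[0] == "ABORT":
            buffered = []               # discard, advance past the marker
    return delivered

log = [("DATA", "x"), ("DATA", "y"), ("COMMIT",),
       ("DATA", "z"), ("ABORT",)]
assert read_committed(log) == ["x", "y"]   # aborted "z" is never delivered
```

A read_uncommitted consumer over the same log would deliver x, y, and z, which is exactly the difference the isolation level exists to control.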
In distributed environments, a common failure mode is the "Zombie Producer." This occurs when a producer process hangs (e.g., during a long garbage collection pause) or loses network connectivity. The cluster assumes the producer is dead, and a new instance starts. If the old producer wakes up and attempts to write, it could corrupt the data.
Kafka solves this using Epoch Fencing. When the new producer instance initializes with the transactional.id, the coordinator assigns it a higher epoch number. The coordinator (and all brokers involved) will subsequently reject any write requests carrying the old epoch. This fencing is handled automatically by the protocol, ensuring that only one valid writer exists for a transactional ID at any given moment.
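Epoch fencing reduces to a monotonic counter check, as the sketch below shows. The class and return values are illustrative; in the real protocol the client surfaces the rejection as a ProducerFencedException.

```python
# Sketch of epoch fencing: re-initializing a producer with the same
# transactional.id bumps the epoch, and stale-epoch writes are rejected.

class FencingCoordinator:
    def __init__(self):
        self.epochs = {}   # transactional.id -> current epoch

    def init_producer(self, txn_id):
        # Each initialization bumps the epoch, invalidating older holders.
        self.epochs[txn_id] = self.epochs.get(txn_id, -1) + 1
        return self.epochs[txn_id]

    def write(self, txn_id, epoch):
        if epoch < self.epochs[txn_id]:
            return "FENCED"   # zombie with a stale epoch: reject the write
        return "OK"

coord = FencingCoordinator()
old = coord.init_producer("app-1")   # original instance gets epoch 0
new = coord.init_producer("app-1")   # restarted instance gets epoch 1
assert coord.write("app-1", new) == "OK"
# The old instance wakes up from its GC pause and tries to write:
assert coord.write("app-1", old) == "FENCED"
```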
Implementing transactions incurs a performance cost. The overhead comes from:
- Additional buffering and marker processing for consumers in read_committed mode.
- Coordinator round-trips, writes to the __transaction_state log, and marker replication.

To optimize, producers should aim for larger transaction batches. Starting and committing a transaction for every single message severely degrades throughput due to the constant round-trips to the coordinator. Conversely, extremely large transactions increase the risk of blocking consumers due to LSO lag. A balanced approach involves tuning the transaction duration and size based on the specific latency requirements of the downstream consumers.
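A back-of-the-envelope model makes the batching argument concrete. The per-message and per-transaction costs below are assumed numbers for illustration, not measurements; the point is only that a fixed commit overhead amortizes over the batch.

```python
# Illustrative throughput model: each transaction pays a fixed overhead
# (coordinator round-trips, marker writes) plus a per-message cost.
# The millisecond figures are assumptions, not benchmarks.

def effective_throughput(msgs_per_txn, per_msg_ms=0.1, txn_overhead_ms=5.0):
    total_ms = msgs_per_txn * per_msg_ms + txn_overhead_ms
    return msgs_per_txn / total_ms * 1000   # messages per second

per_message = effective_throughput(1)      # commit after every message
batched = effective_throughput(1000)       # commit every 1000 messages
assert batched > 10 * per_message          # batching amortizes the fixed cost
```

Under these assumed costs the per-message strategy spends most of its time in commit overhead; the batched strategy pushes the fixed cost below 5% of the total. The countervailing cost, as noted above, is LSO lag for read_committed consumers.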