Constructing resilient streaming applications requires a clear understanding of how data moves and persists across a distributed environment. Unlike batch processing, where the dataset is finite, stream processing addresses unbounded datasets that arrive continuously and often out of order. This section establishes the architectural principles necessary for designing production-grade pipelines with Apache Kafka and Flink.
We begin by examining the distributed log. This append-only data structure forms the basis of Kafka's storage engine and decouples producers from consumers. You will analyze how the log supports sequential access patterns and data replay to recover from system failures.
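To make the replay idea concrete, the following sketch uses the Kafka consumer API to rewind one partition to the start of the log and re-read it in order. It is a minimal illustration, not production code: the broker address, the topic name "events", and the choice of partition 0 are assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LogReplaySketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed local broker
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign partition 0 of the (assumed) "events" topic directly,
            // bypassing the consumer group protocol so we control the offset.
            TopicPartition partition = new TopicPartition("events", 0);
            consumer.assign(Collections.singletonList(partition));

            // Replay: rewind to the beginning of the log and re-read every record.
            // Because the log is append-only, records come back in the exact
            // order they were originally written.
            consumer.seekToBeginning(Collections.singletonList(partition));

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```

Because the consumer only moves an offset pointer, replaying data after a failure costs nothing on the producer side; this is the property that makes the log a suitable foundation for the architectures discussed next.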
Following the storage layer, we compare two architectural patterns: Lambda and Kappa. While Lambda architectures maintain separate layers for batch and stream processing, Kappa architectures simplify the stack by treating all data as a stream. We will discuss how Flink's consistency mechanisms allow the Kappa architecture to replace complex Lambda deployments.
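As a preview of that discussion, the sketch below shows the skeleton of a Kappa-style Flink job: the Kafka topic is the single source of truth, and exactly-once checkpointing takes over the reconciliation role of a batch layer. The broker address, topic name, checkpoint interval, and the use of the KafkaSource connector are illustrative assumptions, not a prescribed setup.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KappaPipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Exactly-once checkpoints every 60 seconds let the job recover to a
        // consistent state after a failure, without a separate batch layer.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // The Kafka topic is the single source of truth: reprocessing is simply
        // restarting the job from the earliest retained offsets.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")   // assumed local broker
                .setTopics("events")                     // assumed topic name
                .setGroupId("kappa-demo")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "events-source")
           .print();

        env.execute("Kappa pipeline sketch");
    }
}
```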
The chapter concludes with precise definitions of processing guarantees and time. You will distinguish between at-most-once, at-least-once, and exactly-once delivery semantics. Additionally, we will address the problem of time skew by differentiating event time from processing time. Understanding these definitions is required to correctly handle late data and ensure deterministic results in your applications.
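The distinction shows up directly in code. The sketch below assigns timestamps and watermarks from a hypothetical Reading event so that tumbling windows are evaluated in event time rather than processing time, tolerating up to five seconds of out-of-order arrival; the event type, field names, and bound are assumptions for illustration.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {

    // Hypothetical event type: a sensor reading carrying its own timestamp.
    public static class Reading {
        public String sensorId = "";
        public long eventTimeMillis;   // when the reading occurred at the source
        public double value;

        public Reading() {}
        public Reading(String sensorId, long eventTimeMillis, double value) {
            this.sensorId = sensorId;
            this.eventTimeMillis = eventTimeMillis;
            this.value = value;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder in-memory source; in practice this would be a Kafka source.
        DataStream<Reading> readings = env.fromElements(
                new Reading("sensor-1", 1_000L, 0.5),
                new Reading("sensor-1", 2_000L, 0.7));

        // Event time: windows are driven by the timestamp inside each record,
        // with a watermark that tolerates up to 5 seconds of out-of-order arrival.
        readings
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Reading>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((reading, ts) -> reading.eventTimeMillis))
            .keyBy(r -> r.sensorId)
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .sum("value")
            .print();

        env.execute("Event time sketch");
    }
}
```

Had the job used processing-time windows instead, the results would depend on when records happen to arrive at the operator, so replaying the same input could produce different aggregates.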
1.1 Evolution of Distributed Logs
1.2 Lambda versus Kappa Architecture
1.3 Processing Guarantees and Semantics
1.4 Event Time versus Processing Time
1.5 Hands-on Practical: Designing a Kappa Pipeline