Designing a Kappa architecture requires a fundamental shift in how you view data retention and processing logic. A traditional Lambda architecture maintains two separate codebases: one for the low-latency speed layer (streaming) and one for the complete, high-accuracy batch layer. Kappa architecture instead establishes the log as the single source of truth, with the stream processing engine handling both real-time ingestion and historical reprocessing.
This practical section outlines the design of a clickstream analysis pipeline. The objective is to calculate user session duration and cart abandonment rates in real-time while retaining the ability to recompute historical metrics when business definitions change. We will construct this using Apache Kafka as the immutable log and Apache Flink for unified computation.
The foundation of this architecture is the retention policy of the Kafka topics. In a Kappa design, the raw input topic must serve as the canonical dataset. Unlike transient message queues where data is deleted after consumption, this topic effectively replaces the Data Lake for the purpose of raw event storage.
For our clickstream topic raw-clickstream-events, standard retention policies based on size or time are insufficient if they expire data before it might need reprocessing. You must leverage Kafka's Tiered Storage or configure long retention periods backed by sufficient disk space.
Configuration requirements for the input topic:
retention.ms: -1 (infinite), or a duration covering your compliance audit window (e.g., 7 years).

Figure: Data flow in a Kappa architecture showing parallel execution of real-time and reprocessing jobs against the same immutable log.
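To make the retention requirement concrete, the sketch below creates the topic with infinite retention using Kafka's AdminClient. The broker address, partition count, and replication factor are placeholders; the same settings can also be applied with kafka-topics.sh or your infrastructure tooling, and Tiered Storage (if enabled on the cluster) can offload old segments to cheaper object storage.

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateClickstreamTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("raw-clickstream-events", 12, (short) 3)
                    .configs(Map.of(
                            TopicConfig.RETENTION_MS_CONFIG, "-1",      // keep raw events forever
                            TopicConfig.CLEANUP_POLICY_CONFIG, "delete" // no compaction of raw events
                    ));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}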
Since the log is permanent, the data structure will inevitably evolve. A schema registry is required to manage compatibility. For high-throughput pipelines, Protocol Buffers (Protobuf) or Avro are preferred over JSON due to compact serialization and strict typing.
We define our event schema to include an event_timestamp. This field drives the event time processing in Flink, distinct from the Kafka ingestion timestamp.
syntax = "proto3";

message ClickEvent {
  string user_id = 1;
  string session_id = 2;
  string url = 3;
  string event_type = 4;      // e.g., "view", "add_to_cart", "checkout"
  int64 event_timestamp = 5;  // Epoch milliseconds
  map<string, string> properties = 6;
}
When evolving this schema, you must ensure backward compatibility. New fields must be optional or have default values so that the Flink job can read historical data (written with Schema V1) using the new logic (compiled with Schema V2).
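As a small illustration of that guarantee, the sketch below parses bytes written with the original schema using a hypothetical ClickEventV2 class, generated from an evolved schema that adds an optional string referrer = 7; field. Both the V2 class and the referrer field are assumptions for the example; the point is that the new field simply takes its proto3 default when reading old data.

import com.google.protobuf.InvalidProtocolBufferException;

public class SchemaCompatCheck {
    public static void main(String[] args) throws InvalidProtocolBufferException {
        // Bytes produced by a writer still compiled against the original schema (V1).
        byte[] v1Bytes = ClickEvent.newBuilder()
                .setUserId("u-42")
                .setSessionId("s-1001")
                .setEventType("view")
                .setEventTimestamp(1_700_000_000_000L)
                .build()
                .toByteArray();

        // Reading with the evolved, V2-generated class succeeds; the field that
        // did not exist when the bytes were written falls back to its default ("").
        ClickEventV2 upgraded = ClickEventV2.parseFrom(v1Bytes);
        System.out.println(upgraded.getReferrer().isEmpty()); // true
    }
}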
The Flink job processes the unbounded stream. To support the "batch" capability of Kappa (reprocessing), the job must be deterministic. This relies heavily on correct Watermark strategies.
When reprocessing one year of data, processing time progresses extremely fast, while event time tracks the historical timestamps. If your windowing logic relies on processing time (e.g., ProcessingTimeSessionWindows), your historical results will differ from your real-time results. You must use EventTimeSessionWindows.
Consider the logic for detecting a "session." A session closes after a gap of inactivity.
In the code, we define the source and explicitly assign timestamps and watermarks to handle out-of-order data, which is common in distributed collection.
// Tolerate up to one minute of out-of-order events and derive event time
// from the payload, not the Kafka ingestion timestamp.
DataStream<ClickEvent> stream = env
    .fromSource(kafkaSource,
        WatermarkStrategy.<ClickEvent>forBoundedOutOfOrderness(Duration.ofMinutes(1))
            .withTimestampAssigner((event, timestamp) -> event.getEventTimestamp()),
        "Kafka Source");

// Sessionize per user: a session closes after 30 minutes of inactivity.
stream
    .keyBy(ClickEvent::getUserId)
    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
    .process(new SessionAnalyticsFunction())
    .sinkTo(databaseSink);
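The SessionAnalyticsFunction referenced above is not defined in this section. The sketch below shows one possible implementation, assuming a hypothetical SessionSummary result type that carries the metrics from the section objective (session duration and cart abandonment); it is written as a Java record for brevity (use a plain POJO on older Flink or Java versions), and the event type strings match the schema comments above.

import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Hypothetical result type; adapt the fields to whatever your sink expects.
record SessionSummary(String userId, long sessionStart, long durationMs, boolean abandonedCart) {}

public class SessionAnalyticsFunction
        extends ProcessWindowFunction<ClickEvent, SessionSummary, String, TimeWindow> {

    @Override
    public void process(String userId,
                        Context ctx,
                        Iterable<ClickEvent> events,
                        Collector<SessionSummary> out) {
        long first = Long.MAX_VALUE;
        long last = Long.MIN_VALUE;
        boolean addedToCart = false;
        boolean checkedOut = false;

        for (ClickEvent e : events) {
            first = Math.min(first, e.getEventTimestamp());
            last = Math.max(last, e.getEventTimestamp());
            addedToCart |= "add_to_cart".equals(e.getEventType());
            checkedOut |= "checkout".equals(e.getEventType());
        }

        // Cart abandonment: the session added items but never reached checkout.
        out.collect(new SessionSummary(userId, ctx.window().getStart(),
                last - first, addedToCart && !checkedOut));
    }
}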
The defining feature of Kappa is the ability to improve logic and re-run over history. Suppose you deploy the pipeline with a 30-minute session gap. Later, data science analysis suggests a 15-minute gap provides better features for the prediction model.
To implement this change without a separate batch system:
1. Update the EventTimeSessionWindows.withGap parameter to 15 minutes.
2. Deploy the updated job under a new consumer group or job ID (e.g., analytics-v2).
3. Configure the Kafka source to start reading from the Earliest offset (see the source sketch below).

When the new job starts, it reads from the beginning of the topic. This creates a potential issue: the downstream database will receive updates for old records.
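A sketch of how the reprocessing source might be configured with Flink's KafkaSource builder is shown below. The broker address is a placeholder, and ClickEventDeserializer is a hypothetical DeserializationSchema<ClickEvent> that wraps ClickEvent.parseFrom.

import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

KafkaSource<ClickEvent> replaySource = KafkaSource.<ClickEvent>builder()
        .setBootstrapServers("kafka:9092")                 // placeholder broker address
        .setTopics("raw-clickstream-events")
        .setGroupId("analytics-v2")                        // new consumer group for the v2 job
        .setStartingOffsets(OffsetsInitializer.earliest()) // replay the full history
        .setValueOnlyDeserializer(new ClickEventDeserializer())
        .build();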
Because the replay recomputes results for historical sessions, the Sink must be idempotent.
If the sink is a key-value store or a database supporting upserts (like PostgreSQL INSERT ON CONFLICT or Elasticsearch _update), the replay will overwrite previous values correctly. If the system requires an append-only sink (like writing to S3), you must direct the reprocessed output to a new location (e.g., s3://bucket/analytics/v2/) to avoid corrupting the existing dataset.
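As one possible shape for an idempotent sink, the sketch below uses Flink's JDBC connector to upsert into a hypothetical PostgreSQL session_summary table keyed by (user_id, session_start), reusing the SessionSummary type from the earlier sketch; the connection details are placeholders.

import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

// Upsert keyed on (user_id, session_start): replaying history overwrites the
// old rows instead of appending duplicates.
SinkFunction<SessionSummary> databaseSink = JdbcSink.sink(
        "INSERT INTO session_summary (user_id, session_start, duration_ms, abandoned_cart) "
                + "VALUES (?, ?, ?, ?) "
                + "ON CONFLICT (user_id, session_start) DO UPDATE SET "
                + "duration_ms = EXCLUDED.duration_ms, abandoned_cart = EXCLUDED.abandoned_cart",
        (stmt, s) -> {
            stmt.setString(1, s.userId());
            stmt.setLong(2, s.sessionStart());
            stmt.setLong(3, s.durationMs());
            stmt.setBoolean(4, s.abandonedCart());
        },
        JdbcExecutionOptions.builder().withBatchSize(500).build(),
        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                .withUrl("jdbc:postgresql://db:5432/analytics") // placeholder connection
                .withDriverName("org.postgresql.Driver")
                .build());

Note that JdbcSink.sink returns a legacy SinkFunction, so a sink built this way is attached with addSink rather than the sinkTo call shown earlier; adjust to whichever sink API your connector version exposes.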
A common challenge in Kappa architectures is the "catch-up" phase. When replaying months of data, the ingestion rate into Flink will be orders of magnitude higher than real-time traffic.
You may observe backpressure if the Flink operators or the Sink cannot handle this surge, so account for the replay surge during the design phase rather than sizing only for steady-state traffic.
The following chart illustrates the resource utilization profile during a replay event versus steady-state processing.
Chart: Throughput comparison showing the massive spike required for the replay job to process historical data, eventually dropping to zero once it catches up or is terminated.
By configuring infinite retention on the log, enforcing strict schema evolution, and designing deterministic event-time logic in Flink, you eliminate the need for a separate batch layer. The stream processor becomes the unified engine for both past and future data.