Designing a Kappa architecture requires a fundamental shift in how you view data retention and processing logic. A traditional Lambda architecture maintains two separate codebases: one for low-latency speed (stream) and one for high-accuracy completeness (batch). Kappa architecture establishes the log as the single source of truth; the stream processing engine handles both real-time ingestion and historical reprocessing.

This practical section outlines the design of a clickstream analysis pipeline. The objective is to calculate user session duration and cart abandonment rates in real time while retaining the ability to recompute historical metrics when business definitions change. We will construct this using Apache Kafka as the immutable log and Apache Flink for unified computation.

## The Immutable Log Configuration

The foundation of this architecture is the retention policy of the Kafka topics. In a Kappa design, the raw input topic must serve as the canonical dataset. Unlike transient message queues where data is deleted after consumption, this topic effectively replaces the data lake for the purpose of raw event storage.

For our clickstream topic `raw-clickstream-events`, standard retention policies based on size or time are insufficient if they expire data that may later need to be reprocessed. You must leverage Kafka's Tiered Storage or configure long retention periods backed by sufficient disk space.

Configuration requirements for the input topic (a provisioning sketch follows the list):

- `retention.ms`: Set to `-1` (infinite) or to a duration covering your compliance audit window (e.g., 7 years).
- `replication.factor`: Minimum of 3 to ensure durability.
- `min.insync.replicas`: Set to 2 so that, with `acks=all` producers, writes are acknowledged only after durable replication.
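These settings can be applied with whatever tooling manages your cluster. As one concrete option, the sketch below uses Kafka's Java `AdminClient`; the bootstrap address and the partition count of 64 are assumptions for illustration (the partition count is revisited in the replay discussion later in this section).

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateClickstreamTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed bootstrap address; replace with your cluster's brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 64 partitions (an assumption) leave headroom for high-parallelism replay jobs;
            // replication factor 3 for durability.
            NewTopic topic = new NewTopic("raw-clickstream-events", 64, (short) 3)
                    .configs(Map.of(
                            // Infinite retention: the topic is the canonical dataset.
                            TopicConfig.RETENTION_MS_CONFIG, "-1",
                            // Require two in-sync replicas for acks=all writes.
                            TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2"));

            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```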
```dot
digraph KappaFlow {
    rankdir=LR;
    node [shape=box, style="filled,rounded", fontname="Helvetica", penwidth=0];
    edge [fontname="Helvetica", color="#868e96"];

    subgraph cluster_0 {
        label = "Ingestion Layer";
        style = filled;
        color = "#f8f9fa";
        Producer [label="Clickstream API", fillcolor="#4dabf7", fontcolor="white"];
    }
    subgraph cluster_1 {
        label = "Storage Layer (The Log)";
        style = filled;
        color = "#f8f9fa";
        Kafka [label="Kafka Topic:\nraw-clickstream-events\n(Infinite Retention)", fillcolor="#fab005", fontcolor="black"];
    }
    subgraph cluster_2 {
        label = "Compute Layer (Flink)";
        style = filled;
        color = "#f8f9fa";
        FlinkRT [label="Job v1:\nReal-time Aggregation", fillcolor="#69db7c", fontcolor="white"];
        FlinkReplay [label="Job v2:\nHistorical Reprocess", fillcolor="#20c997", fontcolor="white", style="dashed,filled"];
    }
    subgraph cluster_3 {
        label = "Serving Layer";
        style = filled;
        color = "#f8f9fa";
        Sink [label="Analytical DB\n(e.g., ClickHouse)", fillcolor="#e64980", fontcolor="white"];
    }

    Producer -> Kafka [label="Protobuf Events"];
    Kafka -> FlinkRT [label="Group ID: live-v1"];
    Kafka -> FlinkReplay [label="Group ID: replay-v2"];
    FlinkRT -> Sink [label="Upsert"];
    FlinkReplay -> Sink [label="Upsert (Correction)"];
}
```

*Data flow in a Kappa architecture, showing parallel execution of real-time and reprocessing jobs against the same immutable log.*

## Schema Definition and Evolution

Since the log is permanent, the data structure will inevitably evolve, so a schema registry is required to manage compatibility. For high-throughput pipelines, Protocol Buffers (Protobuf) or Avro are preferred over JSON due to their compact serialization and strict typing.

We define our event schema to include an `event_timestamp`. This field drives event-time processing in Flink and is distinct from the Kafka ingestion timestamp.

```protobuf
syntax = "proto3";

message ClickEvent {
  string user_id = 1;
  string session_id = 2;
  string url = 3;
  string event_type = 4;               // e.g., "view", "add_to_cart", "checkout"
  int64 event_timestamp = 5;           // Epoch milliseconds
  map<string, string> properties = 6;
}
```

When evolving this schema, you must ensure backward compatibility. New fields must be optional or have default values so that the Flink job can read historical data (written with Schema V1) using the new logic (compiled with Schema V2).

## Flink Job Structure and Time Semantics

The Flink job processes the unbounded stream. To support the "batch" capability of Kappa (reprocessing), the job must be deterministic, which depends heavily on a correct watermark strategy.

When reprocessing one year of data, the processing time $t_p$ progresses extremely fast, while the event time $t_e$ tracks the historical timestamps. If your windowing logic relies on $t_p$ (e.g., `ProcessingTimeSessionWindows`), your historical results will differ from your real-time results. You must use `EventTimeSessionWindows`.

Consider the logic for detecting a "session": a session closes after a gap of inactivity.

$$ \text{SessionGap} > 30 \text{ minutes} $$

In the code, we define the source and explicitly assign timestamps and watermarks to handle out-of-order data, which is common in distributed collection.

```java
// Read from the log using event time, tolerating up to one minute of out-of-order events.
DataStream<ClickEvent> stream = env.fromSource(
        kafkaSource,
        WatermarkStrategy
                .<ClickEvent>forBoundedOutOfOrderness(Duration.ofMinutes(1))
                .withTimestampAssigner((event, timestamp) -> event.getEventTimestamp()),
        "Kafka Source");

// Sessionize per user with a 30-minute event-time gap and write results to the serving layer.
stream
        .keyBy(ClickEvent::getUserId)
        .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
        .process(new SessionAnalyticsFunction())
        .sinkTo(databaseSink);
```
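`SessionAnalyticsFunction` is application code rather than a Flink built-in, so its shape is up to you. The following is a minimal sketch of one possible implementation; the `SessionMetrics` output type and its fields are assumptions chosen to match the stated goals (session duration and cart abandonment), not part of the original pipeline.

```java
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// ClickEvent is the Protobuf-generated class from the schema above (import omitted).

/** Hypothetical output record consumed by the analytical sink. */
record SessionMetrics(String userId, long sessionStart, long durationMs, boolean abandonedCart) {}

public class SessionAnalyticsFunction
        extends ProcessWindowFunction<ClickEvent, SessionMetrics, String, TimeWindow> {

    @Override
    public void process(String userId,
                        Context context,
                        Iterable<ClickEvent> events,
                        Collector<SessionMetrics> out) {
        long firstEvent = Long.MAX_VALUE;
        long lastEvent = Long.MIN_VALUE;
        boolean addedToCart = false;
        boolean checkedOut = false;

        // The window contains every event of one user's session (30-minute event-time gap).
        for (ClickEvent event : events) {
            firstEvent = Math.min(firstEvent, event.getEventTimestamp());
            lastEvent = Math.max(lastEvent, event.getEventTimestamp());
            if ("add_to_cart".equals(event.getEventType())) {
                addedToCart = true;
            } else if ("checkout".equals(event.getEventType())) {
                checkedOut = true;
            }
        }

        // Session duration plus a simple abandonment flag:
        // items were added to the cart but no checkout event followed.
        out.collect(new SessionMetrics(
                userId,
                context.window().getStart(),
                lastEvent - firstEvent,
                addedToCart && !checkedOut));
    }
}
```

Because the function derives everything from the events inside a single event-time window, it yields the same output during live processing and during a historical replay, which is what keeps the reprocessing path deterministic.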
## Implementing Idempotency for Reprocessing

The defining feature of Kappa is the ability to improve the logic and re-run it over history. Suppose you deploy the pipeline with a 30-minute session gap. Later, data science analysis suggests a 15-minute gap provides better features for the prediction model.

To implement this change without a separate batch system:

1. Update the logic: Modify the `EventTimeSessionWindows.withGap` parameter to 15 minutes.
2. New consumer group: Configure the Flink job with a new Kafka consumer group ID (e.g., `analytics-v2`).
3. Start from earliest: Set the starting offset strategy to earliest.

When the new job starts, it reads from the beginning of the topic. This creates a potential issue: the downstream database will receive updates for old records. To handle this, the sink must be idempotent.

If the sink is a key-value store or a database supporting upserts (such as PostgreSQL `INSERT ... ON CONFLICT` or Elasticsearch `_update`), the replay will overwrite previous values correctly. If the system requires an append-only sink (like writing to S3), you must direct the reprocessed output to a new location (e.g., `s3://bucket/analytics/v2/`) to avoid corrupting the existing dataset.

## Handling Throughput During Replay

A common challenge in Kappa architectures is the "catch-up" phase. When replaying months of data, the ingestion rate into Flink will be orders of magnitude higher than real-time traffic, and you may observe backpressure if the Flink operators or the sink cannot handle the surge. To mitigate this during the design phase:

- Parallelism tuning: Ensure the Kafka topic has enough partitions (e.g., 64 or 128) to allow high parallelism during replay, even if real-time traffic only requires a parallelism of 4.
- State backend: Use RocksDB as the state backend. During replay, massive amounts of state (open windows) are generated, and a heap-based backend will likely run into OutOfMemory errors.
- Sink buffering: Increase batch sizes for database commits in the sink configuration to optimize for high-throughput writes rather than low-latency availability.

The following chart illustrates the resource utilization profile during a replay event versus steady-state processing.

```json
{
  "layout": {
    "title": "Resource Utilization: Steady State vs. Historical Replay",
    "xaxis": { "title": "Timeline (Hours)", "showgrid": true, "gridcolor": "#e9ecef" },
    "yaxis": { "title": "Throughput (Records/Sec)", "showgrid": true, "gridcolor": "#e9ecef" },
    "plot_bgcolor": "white",
    "showlegend": true,
    "legend": { "x": 0.7, "y": 1 }
  },
  "data": [
    {
      "x": [0, 1, 2, 3, 4, 5, 6, 7, 8],
      "y": [5000, 5200, 4800, 5100, 5000, 5300, 5100, 5000, 5200],
      "type": "scatter",
      "mode": "lines",
      "name": "Steady State (Live)",
      "line": { "color": "#228be6", "width": 3 }
    },
    {
      "x": [0, 1, 2, 3, 4],
      "y": [150000, 145000, 152000, 148000, 0],
      "type": "scatter",
      "mode": "lines",
      "name": "Replay Job (Catch-up)",
      "line": { "color": "#fa5252", "width": 3, "dash": "dot" }
    }
  ]
}
```

*Throughput comparison showing the massive spike required for the replay job to process historical data, dropping to zero once it catches up or is terminated.*

By configuring infinite retention on the log, enforcing strict schema evolution, and designing deterministic event-time logic in Flink, you eliminate the need for a separate batch layer. The stream processor becomes the unified engine for both past and future data.