Designing a data lake often involves balancing two conflicting requirements: the need for low-latency updates and the need for comprehensive, accurate historical analysis. While the Medallion architecture provides a logical organization for data quality (Bronze, Silver, Gold), it does not explicitly dictate the timing or mechanics of how data moves through these stages. This is where processing architectures come into play.
The two primary patterns for structuring data processing pipelines are the Lambda architecture and the Kappa architecture. These patterns define how batch processing and stream processing interact to provide views of your data.
The Lambda architecture was designed to handle massive quantities of data by taking advantage of both batch-processing and stream-processing methods. It attempts to balance latency, throughput, and fault tolerance by using a hybrid approach.
The core philosophy of the Lambda architecture is based on the equation:

query = function(all data)
However, computing a function over "All Data" in real time is computationally expensive and often infeasible. To solve this, Lambda splits the workload into three distinct layers:

- Batch layer: stores the immutable master dataset and periodically recomputes comprehensive views over the complete history.
- Speed layer: processes only the most recent data as it arrives, producing low-latency but provisional views.
- Serving layer: merges the batch and real-time views to answer queries.
Data enters the system and splits into two paths. The hot path (Speed Layer) provides immediate results, while the cold path (Batch Layer) ensures long-term accuracy and correction.
The primary benefit of the Lambda architecture is fault tolerance. If the speed layer produces erroneous results due to a bug or late-arriving data, the batch layer eventually overwrites them with correct, reconciled data during the next cycle. This provides a "self-healing" capability.
However, the operational cost is high. You effectively maintain two separate codebases: one for the streaming system (e.g., Apache Flink or Spark Streaming) and one for the batch system (e.g., standard Apache Spark or dbt). Keeping the business logic synchronized between these two distinct paradigms is a common source of engineering errors.
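To make the duplication concrete, here is a minimal sketch in PySpark that implements the same hourly count metric twice, once for each path. The paths, topic name, and schema are illustrative assumptions, and real deployments often use entirely different engines for each path, which makes the drift worse.

```python
# A minimal sketch of Lambda's two code paths computing one metric:
# events per user per hour. Paths and names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Cold path (batch layer): recompute the metric over the full history.
batch_events = spark.read.parquet("s3://lake/bronze/events/")
batch_counts = (
    batch_events
    .groupBy("user_id", F.window("event_time", "1 hour"))
    .count()
)
batch_counts.write.mode("overwrite").parquet("s3://lake/gold/hourly_counts/")

# Hot path (speed layer): the same aggregation, rewritten for streaming.
raw_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)
parsed = raw_stream.select(
    F.get_json_object(F.col("value").cast("string"), "$.user_id").alias("user_id"),
    F.col("timestamp").alias("event_time"),
)
stream_counts = (
    parsed
    .withWatermark("event_time", "10 minutes")             # duplicated business
    .groupBy("user_id", F.window("event_time", "1 hour"))  # logic lives here too
    .count()
)
query = (
    stream_counts.writeStream
    .outputMode("update")
    .format("console")  # placeholder sink for the sketch
    .start()
)
```

Any change to the metric definition must now be made, tested, and released twice.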
The Kappa architecture arose as a reaction to the complexity of maintaining two parallel pipelines in Lambda. It posits that you do not need a distinct batch layer if your stream processing system is sufficiently capable.
In Kappa, everything is a stream. Batch processing is simply treated as a stream processing job with a bounded dataset (start and end points), whereas real-time processing is a stream job with an unbounded dataset.
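The unification is easiest to see in code. The sketch below, assuming a Spark Structured Streaming runtime with an illustrative landing path and schema, defines the transformation once and applies it to both a bounded and an unbounded read.

```python
# One transformation, two execution modes: in the Kappa view, "batch"
# is just a stream with a beginning and an end. The path and schema
# are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.dataframe import DataFrame

spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()
schema = "event_time TIMESTAMP, status STRING"

def enrich(events: DataFrame) -> DataFrame:
    """Business logic written once; valid for bounded and unbounded input."""
    return (
        events
        .filter(F.col("status") == "ok")
        .withColumn("day", F.to_date("event_time"))
    )

# Unbounded: continuous processing of newly arriving files.
live = enrich(spark.readStream.schema(schema).json("s3://lake/landing/"))

# Bounded: the same function over a finite slice of the same data.
history = enrich(spark.read.schema(schema).json("s3://lake/landing/"))
```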
The architecture consists of two main components:

- An immutable, distributed log (typically Apache Kafka) that records every incoming event in order and retains it for replay.
- A stream processing engine (such as Apache Flink or Spark Structured Streaming) that consumes the log, applies the transformation logic, and writes results to a serving store.
If you need to recompute data (a requirement handled by the Batch layer in Lambda), you simply replay the stream from the beginning of the log using the same code logic, effectively reprocessing history.
The unified pipeline handles both real-time ingestion and historical reprocessing. Replays occur by resetting the offset in the distributed log.
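As a sketch of the replay mechanic, the snippet below uses the kafka-python client to rewind a consumer to the start of the retained log; the topic, group, and process function are illustrative assumptions.

```python
# Replaying history in Kappa: rewind the consumer offsets and run the
# same processing loop used for live data. Names are illustrative.
from kafka import KafkaConsumer, TopicPartition

def process(payload: bytes) -> None:
    """Placeholder for the same transformation used on the live path."""
    ...

consumer = KafkaConsumer(
    bootstrap_servers="broker:9092",
    group_id="silver-builder",
    enable_auto_commit=False,
)

# Assign partitions manually so we control the starting position.
partitions = [
    TopicPartition("events", p)
    for p in consumer.partitions_for_topic("events")
]
consumer.assign(partitions)

# Reset to the beginning of the log: this is the "batch" recompute.
consumer.seek_to_beginning(*partitions)

for record in consumer:
    process(record.value)
```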
Kappa significantly simplifies the infrastructure by unifying the codebase. Developers write transformation logic once. However, it introduces specific challenges regarding data retention. Because the "Batch" capability relies on replaying the stream, the underlying log storage must be capable of retaining potentially petabytes of history, or you must implement a tiering strategy where older log segments are offloaded to object storage (the data lake) but remain replayable.
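As an illustration of those retention knobs, the sketch below creates a topic with indefinite logical retention while keeping only one day of data on local broker disks. The remote.storage.enable setting assumes a broker with Kafka tiered storage (KIP-405) enabled; without it, only the standard retention.ms setting applies.

```python
# Sketch: topic configuration for long-horizon replay. Assumes a broker
# with tiered storage support; names and sizing are illustrative.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="broker:9092")

admin.create_topics([
    NewTopic(
        name="events",
        num_partitions=12,
        replication_factor=3,
        topic_configs={
            "retention.ms": "-1",              # retain the log indefinitely
            "remote.storage.enable": "true",   # offload old segments to object storage
            "local.retention.ms": "86400000",  # keep one day on broker disks
        },
    )
])
```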
Selecting the right pattern depends on your latency requirements and the complexity of your transformations.
The Lambda architecture remains relevant in scenarios where:

- Batch and real-time workloads genuinely require different logic, such as training machine learning models over the full history while serving lightweight real-time features.
- Transformations are too complex or resource-intensive to express efficiently in a streaming engine.
- A substantial existing investment in batch infrastructure makes a wholesale migration impractical.
- Correctness requirements demand periodic full recomputation to reconcile late-arriving or corrected data.
The Kappa architecture is increasingly preferred for modern data platforms because:

- A single codebase eliminates the logic drift between batch and streaming paths that plagues Lambda deployments.
- Modern stream processors provide exactly-once guarantees and treat bounded and unbounded datasets uniformly.
- Reprocessing becomes an operational action, replaying the log, rather than a separate engineering effort.
- There is only one pipeline to deploy, monitor, and debug.
Modern open table formats like Delta Lake and Apache Iceberg have enabled a variation often called "Kappa Plus" or a "unified" architecture.
In this model, the data lake itself acts as the streaming sink and source. Because these table formats support ACID transactions and efficient upserts, you can stream data directly into your Bronze tables. Downstream Silver and Gold tables can then process this data in micro-batches or continuous streams. This allows the object store (S3/ADLS) to serve as the "infinite retention log," solving the retention limitations of systems like Kafka while maintaining the single-pipeline simplicity of Kappa.
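A minimal PySpark sketch of this pattern, assuming Delta Lake is available on the cluster and using illustrative paths and topic names, streams Kafka data into a Bronze table and then treats that table itself as the replayable source for Silver:

```python
# Sketch: Delta tables as both streaming sink and streaming source.
# Paths, topic names, and the JSON field are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kappa-plus-sketch").getOrCreate()

# Stream raw events from Kafka directly into the Bronze Delta table.
bronze = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .select(F.col("value").cast("string").alias("raw"), F.col("timestamp"))
)
(
    bronze.writeStream.format("delta")
    .option("checkpointLocation", "s3://lake/_chk/bronze_events")
    .start("s3://lake/bronze/events")
)

# Downstream, Bronze itself is the replayable log: Silver consumes it
# as a stream, and reprocessing means restarting from a fresh checkpoint.
silver = (
    spark.readStream.format("delta")
    .load("s3://lake/bronze/events")
    .withColumn("event_type", F.get_json_object("raw", "$.type"))
)
(
    silver.writeStream.format("delta")
    .option("checkpointLocation", "s3://lake/_chk/silver_events")
    .start("s3://lake/silver/events")
)
```

Because the object store retains the table's full history, the Bronze table plays the role of the infinite log.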
By decoupling compute from storage and using transactional table formats, you can achieve the low latency of streams with the scalability of the batch layer, effectively merging the benefits of both architectures.