While batch ingestion provides a reliable method for moving historical snapshots, it often fails to meet the latency requirements of modern analytical applications. When a business needs to react to inventory changes, fraud alerts, or customer interactions in minutes rather than days, waiting for a nightly bulk load is insufficient. Furthermore, repeated full-table snapshots become prohibitively expensive as data volumes grow. To address these limitations, data engineers utilize Change Data Capture (CDC).
CDC is a design pattern that identifies and tracks changes to data in a source system so that these changes can be applied to a downstream repository. In the context of a data lake, CDC transforms the database integration strategy from "copying the state" to "streaming the events."
There are two primary mechanisms for implementing CDC: query-based polling and log-based extraction. Understanding the difference is critical for designing scalable pipelines.
Query-based CDC relies on the application layer. It requires the source table to have a column that tracks the last modification time (such as updated_at) or an auto-incrementing ID. The ingestion pipeline periodically runs a SQL query to fetch records where the tracking column is greater than the high-water mark from the previous run.
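As a hedged sketch, the polling query might look like the following, assuming a hypothetical orders table with an indexed updated_at column and a high-water mark saved from the previous run:

```sql
-- Query-based CDC poll: fetch rows modified since the last successful run.
-- "orders" and its columns are illustrative; :last_high_water_mark is a
-- bind parameter holding the timestamp recorded by the previous run.
SELECT
    order_id,
    customer_id,
    status,
    updated_at
FROM orders
WHERE updated_at > :last_high_water_mark
ORDER BY updated_at;
```

After loading the batch, the pipeline stores the maximum updated_at it observed as the new high-water mark for the next poll.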
This approach is easy to implement but has significant drawbacks:
- It cannot capture hard deletes: a row removed from the source simply stops appearing in query results, so the lake never learns it is gone.
- It misses intermediate states: if a row changes several times between polls, only the final version is captured.
- It adds query load to the source database and depends on a reliably maintained, indexed tracking column.
Log-based CDC, widely considered the standard for production data lakes, interacts directly with the database transaction log (e.g., Write-Ahead Log in PostgreSQL, Binlog in MySQL, or Redo Log in Oracle). Every database commits transactions to a log file before acknowledging the write. Log-based CDC tools act as a client reading this log stream.
This method captures all events (INSERT, UPDATE, DELETE) in the exact order they occurred. It places minimal load on the source database because it consumes the transaction log rather than executing queries against production tables.
Log-based architectures decouple the extraction process from the database query engine, allowing for real-time event streaming without performance degradation.
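To make the log-reading model concrete, the sketch below peeks at pending change events in PostgreSQL using logical decoding. It assumes a database running with wal_level = logical and uses the built-in test_decoding output plugin; production tools such as Debezium consume the log continuously over the replication protocol rather than through ad-hoc SQL like this.

```sql
-- Create a logical replication slot that exposes WAL changes as text
-- via the test_decoding plugin (illustration only).
SELECT * FROM pg_create_logical_replication_slot('lake_cdc_demo', 'test_decoding');

-- Peek at up to 10 pending change events without consuming them.
-- Each row (lsn, xid, data) describes an INSERT, UPDATE, or DELETE in commit order.
SELECT * FROM pg_logical_slot_peek_changes('lake_cdc_demo', NULL, 10);

-- Drop the slot when finished so the database can recycle old WAL segments.
SELECT pg_drop_replication_slot('lake_cdc_demo');
```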
In a log-based pipeline, the data moving through your system is no longer a simple row; it is an envelope containing the data and the operation metadata. A standard CDC message typically contains:
- The operation type: insert, update, or delete.
- The "before" image of the row (populated for updates and deletes).
- The "after" image of the row (populated for inserts and updates).
- Source metadata, such as the originating table, the transaction timestamp, and the position in the log (for example, an LSN or binlog offset).
This structure allows the data lake to reconstruct the database state at any point in time. In the Medallion architecture, these raw CDC events land directly in the Bronze layer. We do not attempt to merge or deduplicate them immediately. The Bronze layer serves as an immutable history log of every change received from the source.
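As a rough sketch, a Bronze table for these raw envelopes might be defined as follows. The table name, column names, and the Spark SQL/Iceberg syntax are assumptions for illustration; the point is that the envelope fields are appended exactly as received, with no merging or deduplication.

```sql
-- Hypothetical Bronze table holding raw CDC envelopes for an "orders" source.
CREATE TABLE bronze.orders_cdc (
    op            STRING,     -- operation type: 'c' (insert), 'u' (update), 'd' (delete)
    before        STRING,     -- row image before the change, raw JSON (null for inserts)
    after         STRING,     -- row image after the change, raw JSON (null for deletes)
    source_table  STRING,     -- originating table in the source database
    log_position  BIGINT,     -- position in the transaction log (LSN or binlog offset)
    committed_at  TIMESTAMP,  -- transaction commit time reported by the source
    ingested_at   TIMESTAMP   -- time the event landed in the lake
)
USING iceberg;
```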
Once raw events are secured in the Bronze layer (often stored as JSON or Avro files), the engineering challenge shifts to applying these changes to the Silver layer tables. This is where Open Table Formats like Apache Iceberg or Delta Lake become essential.
With raw Parquet files on object storage (such as S3), you cannot update a specific row; you would have to rewrite the entire file. Table formats bring ACID transactions to the data lake, enabling row-level MERGE operations.
The standard pattern for processing CDC feeds into a Silver table involves the following logic (a merge sketch follows the list):
1. Read the new batch of raw change events from the Bronze layer.
2. Deduplicate the batch so that only the latest event per primary key remains, ordered by log position or commit timestamp.
3. Apply the result with a MERGE INTO SQL command. This command joins the incoming batch of events with the target table on the primary key, inserting new rows, updating changed ones, and removing (or flagging) deleted ones.
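A minimal sketch of steps 2 and 3, assuming a hypothetical staging table stg.orders_changes that holds the parsed envelope fields for the current batch, a silver.orders target, and Spark-style MERGE syntax as supported by Delta Lake and Iceberg:

```sql
-- Deduplicate to the newest event per primary key, then merge into Silver.
MERGE INTO silver.orders AS t
USING (
    SELECT order_id, customer_id, status, updated_at, op
    FROM (
        SELECT c.*,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id
                   ORDER BY log_position DESC
               ) AS rn
        FROM stg.orders_changes c       -- parsed events for the current batch
    ) ranked
    WHERE rn = 1                        -- keep only the latest change per key
) AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'd' THEN
    DELETE                              -- the row was removed at the source
WHEN MATCHED THEN
    UPDATE SET
        customer_id = s.customer_id,
        status      = s.status,
        updated_at  = s.updated_at
WHEN NOT MATCHED AND s.op <> 'd' THEN
    INSERT (order_id, customer_id, status, updated_at)
    VALUES (s.order_id, s.customer_id, s.status, s.updated_at);
```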
This process effectively synchronizes the data lake with the source database.
Handling deletions is the most significant advantage of log-based CDC over batch snapshots. When a row is deleted in the source, the log emits a delete event containing the primary key of the removed record.
In the data lake, there are two strategies for handling this event:
- Hard delete: the pipeline physically removes the matching row from the Silver table during the merge.
- Soft delete: the pipeline adds a flag column such as is_deleted to the Silver table. Instead of removing the record, the pipeline updates this flag to true.

Soft deletes are generally preferred in data engineering because they preserve history. Analysts can filter out deleted records for current-state reporting (for example, with a WHERE is_deleted = false predicate) while retaining the ability to audit when and why records were removed. A soft-delete variant of the merge is sketched below.
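This variant assumes the Silver table carries an is_deleted flag and that updates stands for the deduplicated batch from the previous sketch:

```sql
-- Soft delete: flag removed rows instead of deleting them.
MERGE INTO silver.orders AS t
USING updates AS s
    ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'd' THEN
    UPDATE SET is_deleted = true        -- keep the row, mark it as removed
WHEN MATCHED THEN
    UPDATE SET
        customer_id = s.customer_id,
        status      = s.status,
        updated_at  = s.updated_at,
        is_deleted  = false
WHEN NOT MATCHED AND s.op <> 'd' THEN
    INSERT (order_id, customer_id, status, updated_at, is_deleted)
    VALUES (s.order_id, s.customer_id, s.status, s.updated_at, false);

-- Current-state reporting simply filters the flag.
SELECT *
FROM silver.orders
WHERE is_deleted = false;
```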
While CDC effectively captures incremental changes, a log-based pipeline only sees changes made after it starts reading, so the destination tables begin empty. To initialize the data lake, a "historical load" or "snapshot" is required.
A production pattern involves a hybrid approach:
1. Record the current position in the transaction log (for example, the LSN or binlog offset).
2. Take a consistent snapshot of the source tables and load it into the lake as the initial state.
3. Begin streaming changes from the recorded log position onward.
This ensures zero data loss and prevents gaps between the historical dump and the ongoing stream. Tools like Debezium handle this coordination automatically, transitioning from snapshot mode to streaming mode without manual intervention.
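If you were to coordinate this by hand against PostgreSQL, the sequence would look roughly like the following; the slot and table names are illustrative, and Debezium performs the equivalent steps for you:

```sql
-- 1. Create the replication slot first. Its returned LSN marks the point
--    from which streaming will begin, so nothing after it can be missed.
SELECT * FROM pg_create_logical_replication_slot('orders_cdc', 'test_decoding');

-- 2. Take a consistent read of the table for the historical load
--    (simplified here; real tools align the snapshot transaction with
--     the slot's consistent point).
SELECT order_id, customer_id, status, updated_at FROM orders;

-- 3. Consume changes from the slot from that point onward.
SELECT * FROM pg_logical_slot_get_changes('orders_cdc', NULL, NULL);
```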
The choice between Batch and CDC ultimately depends on the volatility of the data and the freshness requirements of the consumer. However, as data lakes increasingly serve as the backend for operational ML models and near-real-time dashboards, the architectural shift toward log-based CDC is becoming the standard implementation.