While batch ingestion provides a reliable method for moving historical snapshots, it often fails to meet the latency requirements of modern analytical applications. When a business needs to react to inventory changes, fraud alerts, or customer interactions in minutes rather than days, waiting for a nightly bulk load is insufficient. Furthermore, repeated full-table snapshots become prohibitively expensive as data volumes grow. To address these limitations, data engineers turn to Change Data Capture (CDC).

CDC is a design pattern that identifies and tracks changes to data in a source system so that those changes can be applied to a downstream repository. In the context of a data lake, CDC shifts the database integration strategy from "copying the state" to "streaming the events."

## Approaches to Change Detection

There are two primary mechanisms for implementing CDC: query-based polling and log-based extraction. Understanding the difference is critical for designing scalable pipelines.

Query-based CDC works through the database's query layer. It requires the source table to have a column that tracks the last modification time (such as `updated_at`) or an auto-incrementing ID. The ingestion pipeline periodically runs a SQL query to fetch records where the tracking column is greater than the high-water mark recorded from the previous run:

```sql
SELECT *
FROM orders
WHERE updated_at > '2023-10-27 10:00:00';
```

This approach is easy to implement (a minimal polling sketch follows the list below), but it has significant drawbacks:

- **Missed Deletes:** If a record is hard-deleted from the source, the SQL query simply does not return it. The target data lake retains the stale record indefinitely.
- **Performance Impact:** Frequent polling places load on the source database's query engine.
- **Data Consistency:** It cannot capture intermediate states. If a record is updated three times between polling intervals, the pipeline captures only the final state, losing the history of transitions.
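The sketch below shows what this polling loop might look like in Python, assuming a PostgreSQL source reached through `psycopg2` and a high-water mark persisted between runs. The table, column names, and connection string are illustrative, not part of any specific system.

```python
from datetime import datetime

import psycopg2  # assumed driver for a PostgreSQL source; any DB-API client works


def poll_changes(conn, high_water_mark: datetime):
    """Query-based CDC: fetch rows modified since the previous successful run."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT order_id, status, amount, updated_at
            FROM orders
            WHERE updated_at > %s
            ORDER BY updated_at
            """,
            (high_water_mark,),
        )
        rows = cur.fetchall()

    # Advance the high-water mark to the newest timestamp seen in this batch.
    # Note the blind spots: hard deletes never appear in the result set, and
    # intermediate updates between polls are lost.
    new_hwm = rows[-1][3] if rows else high_water_mark
    return rows, new_hwm


if __name__ == "__main__":
    conn = psycopg2.connect("dbname=shop user=etl")  # placeholder DSN
    changes, hwm = poll_changes(conn, datetime(2023, 10, 27, 10, 0, 0))
```

In practice the returned high-water mark would be persisted (for example, in a pipeline state table) so the next run resumes where this one stopped.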
Log-based CDC, widely considered the standard for production data lakes, interacts directly with the database transaction log (e.g., the Write-Ahead Log in PostgreSQL, the binlog in MySQL, or the redo log in Oracle). Every database commits transactions to a log before acknowledging the write, and log-based CDC tools act as a client reading this log stream. This method captures every event (INSERT, UPDATE, DELETE) in the exact order it occurred, and it places minimal load on the source database because it reads the transaction log rather than executing SQL queries.

```dot
digraph G {
  rankdir=LR;
  node [style=filled, fontname="Helvetica", shape=box, color="#dee2e6"];
  edge [color="#adb5bd"];
  subgraph cluster_0 {
    style=filled;
    color="#f8f9fa";
    label="Source Database";
    DB [label="Database Engine", fillcolor="#4dabf7", fontcolor="white"];
    Log [label="Transaction Log\n(WAL/Binlog)", fillcolor="#e64980", fontcolor="white"];
    DB -> Log [label="Writes"];
  }
  subgraph cluster_1 {
    style=filled;
    color="#f8f9fa";
    label="CDC Connector";
    Debezium [label="Log Reader\n(e.g., Debezium)", fillcolor="#7950f2", fontcolor="white"];
  }
  Log -> Debezium [label="Streams Events"];
  Stream [label="Message Bus\n(Kafka/Kinesis)", fillcolor="#fab005"];
  Lake [label="Data Lake\n(Bronze Layer)", fillcolor="#12b886", fontcolor="white"];
  Debezium -> Stream;
  Stream -> Lake;
}
```

*Log-based architectures decouple the extraction process from the database query engine, allowing for real-time event streaming without performance degradation.*

## The Anatomy of a CDC Event

In a log-based pipeline, the data moving through your system is no longer a simple row; it is an envelope containing the data and the operation metadata. A standard CDC message typically contains:

- **Operation Type:** Indicates whether the change was a create (`c`), update (`u`), or delete (`d`).
- **Timestamp:** When the transaction occurred.
- **Before Image:** The state of the row before the change (important for updates).
- **After Image:** The state of the row after the change.

This structure allows the data lake to reconstruct the database state at any point in time. In the Medallion architecture, these raw CDC events land directly in the Bronze layer. We do not attempt to merge or deduplicate them immediately; the Bronze layer serves as an immutable history log of every change received from the source.

## Propagating Changes to the Silver Layer

Once raw events are secured in the Bronze layer (often stored as JSON or Avro files), the engineering challenge shifts to applying those changes to the Silver layer tables. This is where open table formats like Apache Iceberg or Delta Lake become essential.

In a plain file layout (like raw Parquet on S3), you cannot update a specific row; you would have to rewrite the entire file. Table formats bring ACID transactions to the data lake, making targeted MERGE operations possible.

The standard pattern for processing a CDC feed into a Silver table involves the following logic (a condensed sketch follows the list):

1. **Deduplication:** In a distributed system, events may arrive out of order or be duplicated. We must order events by the transaction timestamp and keep only the most recent change for each primary key within the current micro-batch.
2. **Merge Operation:** We use the `MERGE INTO` SQL command, which joins the incoming batch of events with the target table on the primary key:
   - When matched and the operation is DELETE: delete the row from the target.
   - When matched and the operation is UPDATE: update the target columns.
   - When not matched and the operation is INSERT: insert the new row.

This process effectively synchronizes the data lake with the source database.
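The PySpark sketch below condenses this pattern. It assumes a Spark session configured with the Delta Lake extensions, an existing `silver.orders` table keyed by `order_id`, and a flattened Bronze schema in which every event carries an `op` code, a `ts_ms` transaction timestamp, and the payload columns; all names and paths are illustrative.

```python
from pyspark.sql import SparkSession, Window, functions as F

# Assumes the Delta Lake extensions are configured on this session.
spark = SparkSession.builder.appName("cdc-silver-merge").getOrCreate()

# Micro-batch of raw CDC envelopes from the Bronze layer. Assumed flattened
# schema: op ('c'/'u'/'d'), ts_ms, plus order_id, status, amount, updated_at
# (the primary key is present even for delete events).
batch = spark.read.format("delta").load("s3://lake/bronze/orders_cdc/")

# 1. Deduplicate: keep only the latest event per primary key in this batch.
latest_first = Window.partitionBy("order_id").orderBy(F.col("ts_ms").desc())
deduped = (
    batch.withColumn("rn", F.row_number().over(latest_first))
         .filter("rn = 1")
         .drop("rn")
)
deduped.createOrReplaceTempView("updates")

# 2. Merge: apply inserts, updates, and deletes to the Silver table.
spark.sql("""
    MERGE INTO silver.orders AS t
    USING updates AS s
      ON t.order_id = s.order_id
    WHEN MATCHED AND s.op = 'd' THEN DELETE
    WHEN MATCHED THEN UPDATE SET
      t.status = s.status, t.amount = s.amount, t.updated_at = s.updated_at
    WHEN NOT MATCHED AND s.op <> 'd' THEN INSERT
      (order_id, status, amount, updated_at)
      VALUES (s.order_id, s.status, s.amount, s.updated_at)
""")
```

In a streaming deployment the same logic typically runs inside a `foreachBatch` handler, so each micro-batch is deduplicated and merged as it arrives.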
## Handling Hard Deletes

Handling deletions is the most significant advantage of log-based CDC over batch snapshots. When a row is deleted in the source, the log emits a delete event containing the primary key of the removed record.

In the data lake, there are two strategies for handling this event:

- **Hard Delete:** Physically remove the row from the Silver table. This maintains an exact mirror of the source but prevents historical analysis of deleted items.
- **Soft Delete:** Add a metadata column such as `is_deleted` to the Silver table. Instead of removing the record, the pipeline updates this flag to `true`.

Soft deletes are generally preferred in data engineering because they preserve history. Analysts can filter out deleted records for current-state reporting (`WHERE is_deleted = false`) while retaining the ability to audit when and why records were removed.

## The Re-snapshotting Challenge

While CDC effectively captures incremental changes, a pipeline based purely on logs typically starts empty. To initialize the data lake, a historical load, or "snapshot," is required.

A common production pattern is a hybrid approach:

1. **Snapshot Phase:** A one-time batch export of the entire source table is executed.
2. **Log Phase:** The CDC connector is configured to start reading the log from the specific position (log sequence number or offset) that corresponds to the moment the snapshot finished.

This ensures zero data loss and prevents gaps between the historical dump and the ongoing stream. Tools like Debezium handle this coordination automatically, transitioning from snapshot mode to streaming mode without manual intervention (a configuration sketch closes this section).

The choice between batch and CDC ultimately depends on the volatility of the data and the freshness requirements of the consumer. However, as data lakes increasingly serve as the backend for operational ML models and near-real-time dashboards, the architectural shift toward log-based CDC is becoming the standard implementation.
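As a closing illustration of the snapshot-to-streaming handoff, here is a sketch of registering a Debezium PostgreSQL connector through the Kafka Connect REST API. Hostnames, credentials, and table names are placeholders, and property names should be checked against the Debezium version you deploy (older releases, for example, use `database.server.name` instead of `topic.prefix`).

```python
import requests  # assumed HTTP client for the Kafka Connect REST API

# Connector definition. snapshot.mode="initial" performs the one-time table
# snapshot, records the WAL position, and then switches to streaming from
# that exact offset, the hybrid pattern described above.
connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "source-db.internal",  # placeholder host
        "database.port": "5432",
        "database.user": "cdc_reader",              # placeholder credentials
        "database.password": "********",
        "database.dbname": "shop",
        "topic.prefix": "shop",                     # namespace for Kafka topics
        "table.include.list": "public.orders",
        "snapshot.mode": "initial",
    },
}

resp = requests.post("http://kafka-connect:8083/connectors", json=connector)
resp.raise_for_status()
```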