Batch processing remains the primary mechanism for moving high volumes of historical and operational data into a data lake. While streaming architectures receive significant attention for their low-latency capabilities, batch workflows provide the throughput and reliability necessary for initializing datasets, performing nightly synchronizations, and ingesting data from sources that do not emit change events. A batch ingestion workflow extracts a discrete chunk of data from a source system, transports it over the network, and persists it into the storage layer (typically the Bronze or Raw layer) at scheduled intervals.

## Ingestion Strategies

The design of a batch pipeline depends heavily on the nature of the source system and the business requirements for data freshness. We generally categorize these workflows into two primary patterns: full snapshots and incremental loads.

### Full Snapshots

In a full snapshot approach, the pipeline extracts the entire dataset from the source table during every execution cycle and overwrites the target location in the data lake. This method ensures that the data in the lake matches the source exactly at the point of extraction, effectively handling hard deletes (records removed from the source) without complex logic.

This strategy is effective for dimension tables or reference datasets with low volume (e.g., fewer than 10 million rows), where the operational overhead of tracking changes outweighs the cost of simply reloading the data. However, as data scales, this approach becomes computationally expensive and network-intensive:

$$ \text{Cost}_{\text{snapshot}} \propto N $$

where $N$ is the total size of the source dataset. As $N$ grows, the time to ingest grows linearly, eventually exceeding the available batch window.

### Incremental Loading

To handle large fact tables or transaction logs, engineers employ incremental loading. This method extracts only the records that have been created or modified since the last successful execution. It requires a reliable, monotonic tracking column in the source system, often referred to as a "high-water mark" or cursor. Common candidates include auto-incrementing primary keys or `updated_at` timestamps.

The logic for an incremental extraction can be expressed as:

$$ R_{\Delta} = \{\, r \in R_{\text{source}} \mid r.\text{watermark} > t_{\text{last\_run}} \,\} $$

Implementing this requires state management: the orchestration engine (such as Airflow or Dagster) must persistently store the timestamp of the last successful extraction, $t_{\text{last\_run}}$, and pass it as a parameter to the subsequent job.
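As a concrete illustration, here is a minimal sketch of that loop in Python. It uses `sqlite3` purely as a stand-in for both the source database and the orchestrator's state store, and the `orders` and `job_state` tables and their columns are hypothetical; swap in your real connections and state backend.

```python
import sqlite3

# Assumed (hypothetical) schema:
#   job_state(job_name TEXT PRIMARY KEY, watermark TEXT)
#   orders(order_id INTEGER, payload TEXT, updated_at TEXT)

def get_last_watermark(state: sqlite3.Connection, job: str) -> str:
    """Return the stored t_last_run for this job, or a floor value on the first run."""
    row = state.execute(
        "SELECT watermark FROM job_state WHERE job_name = ?", (job,)
    ).fetchone()
    return row[0] if row else "1970-01-01T00:00:00"

def extract_increment(source: sqlite3.Connection, watermark: str) -> list:
    """R_delta = { r in R_source | r.updated_at > t_last_run }."""
    return source.execute(
        "SELECT order_id, payload, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()

def run_incremental_job(source: sqlite3.Connection,
                        state: sqlite3.Connection,
                        job: str = "orders_ingest") -> None:
    watermark = get_last_watermark(state, job)
    rows = extract_increment(source, watermark)
    # ... write `rows` to the Bronze layer as Parquet here ...
    # Advance the cursor only after the write succeeds, so a failed run
    # is retried from the same position instead of silently skipping data.
    if rows:
        state.execute(
            "INSERT INTO job_state (job_name, watermark) VALUES (?, ?) "
            "ON CONFLICT(job_name) DO UPDATE SET watermark = excluded.watermark",
            (job, rows[-1][2]),
        )
        state.commit()
```

The ordering matters: because the watermark is only persisted after a successful write, a crashed run leaves the cursor untouched and the next run simply re-extracts the same window.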
*Workflow logic for an incremental batch job using a high-water mark strategy.*

```dot
digraph G {
    rankdir=LR;
    node [shape=box, style=filled, fontname="Helvetica", fontsize=10];

    subgraph cluster_0 {
        label = "Orchestration";
        style=dashed;
        color="#adb5bd";
        Scheduler [label="Job Scheduler", fillcolor="#e9ecef", color="#ced4da"];
        StateStore [label="State Store\n(Last Watermark)", fillcolor="#fff5f5", color="#ffc9c9"];
    }

    subgraph cluster_1 {
        label = "Execution";
        style=dashed;
        color="#adb5bd";
        Extractor [label="Extraction Process", fillcolor="#e7f5ff", color="#74c0fc"];
        Writer [label="Parquet Writer", fillcolor="#e7f5ff", color="#74c0fc"];
    }

    Source [label="Source Database", shape=cylinder, fillcolor="#f8f9fa", color="#adb5bd"];
    DataLake [label="Data Lake\n(Bronze Layer)", shape=folder, fillcolor="#b2f2bb", color="#40c057"];

    Scheduler -> StateStore [label="Get Last Run"];
    StateStore -> Extractor [label="Pass Watermark"];
    Source -> Extractor [label="SELECT * WHERE\nts > watermark"];
    Extractor -> Writer [label="Memory Stream"];
    Writer -> DataLake [label="Write Files"];
    Writer -> Scheduler [label="On Success"];
    Scheduler -> StateStore [label="Update Watermark"];
}
```

## Partitioning During Ingestion

A critical mistake in batch ingestion is writing data without a predefined directory structure. If a pipeline dumps thousands of files into a single S3 prefix or Azure Blob container, downstream query engines must scan the entire file list to find relevant data.

To optimize for future retrieval, batch jobs should write data using Hive-style partitioning. This involves structuring the file path to include column names and values, typically based on the ingestion date or the event date.

For example, instead of writing to:

`s3://my-lake/sales/batch_job_123.parquet`

the pipeline should write to:

`s3://my-lake/sales/ingest_date=2023-10-27/part-001.parquet`

This structure allows query engines to perform "partition pruning," ignoring folders that do not match the query predicates. When designing the batch workflow, the writer must be configured to dynamically determine the partition path based on the data content or the execution date.
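In practice, most Parquet writers handle this path construction for you. The snippet below shows a minimal sketch using PyArrow's `write_to_dataset`; the sample rows and the local `./lake/sales` path are illustrative stand-ins for your extracted batch and object-store URI.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical sales rows; in a real job these come from the extraction step.
table = pa.table({
    "order_id": [101, 102, 103],
    "amount": [19.99, 5.00, 42.50],
    "ingest_date": ["2023-10-27", "2023-10-27", "2023-10-28"],
})

# write_to_dataset lays files out as <root>/ingest_date=YYYY-MM-DD/<part>.parquet,
# which is exactly the Hive-style structure query engines can prune on.
# "./lake/sales" stands in for s3://my-lake/sales.
pq.write_to_dataset(
    table,
    root_path="./lake/sales",
    partition_cols=["ingest_date"],
)
```

Note that the partition column is encoded in the directory name rather than stored inside each file, so readers recover it from the path at query time.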
## Managing Load Frequency and File Size

There is a direct trade-off between data freshness (how often the batch runs) and storage efficiency. Running batch jobs too frequently results in the "small file problem," where the overhead of opening and closing many small files degrades query performance. Running them too infrequently results in stale data.

For a standard batch workflow, the objective is to produce files that are large enough to be efficient for columnar readers (ideally between 128 MB and 1 GB) but frequent enough to meet Service Level Agreements (SLAs).

The following chart illustrates the efficiency divergence between full snapshots and incremental loads as the dataset grows over time. While snapshots are simpler to implement, their resource consumption makes them unsustainable for main transaction tables.

*Comparison of processing time required for Full Snapshot versus Incremental Load strategies as data volume increases.*

```json
{
  "layout": {
    "title": "Processing Time vs. Data Volume",
    "xaxis": {"title": "Days Since Inception (Data Growth)"},
    "yaxis": {"title": "Processing Time (Minutes)"},
    "showlegend": true,
    "plot_bgcolor": "#f8f9fa",
    "paper_bgcolor": "#f8f9fa"
  },
  "data": [
    {"x": [1, 10, 20, 30, 40, 50, 60], "y": [5, 50, 100, 150, 200, 250, 300],
     "type": "scatter", "mode": "lines+markers", "name": "Full Snapshot",
     "line": {"color": "#fa5252", "width": 3}},
    {"x": [1, 10, 20, 30, 40, 50, 60], "y": [5, 6, 5, 7, 6, 8, 7],
     "type": "scatter", "mode": "lines+markers", "name": "Incremental Load",
     "line": {"color": "#228be6", "width": 3}}
  ]
}
```

## Consistency and Isolation

When ingesting data into the Bronze layer, it is important to define the isolation level. In modern data lake architectures using object storage, individual file writes are atomic: a file is either visible in its entirety or not at all. However, a batch job often consists of multiple files (written via multipart uploads), so if the job fails halfway through, you may end up with a partially written partition. To mitigate this, standard engineering patterns include:

- **Staging directories:** Write data to a temporary prefix (e.g., `_temporary/`) and perform a purely metadata-based move (rename) to the final location upon completion. Note that on S3, rename operations are not atomic and are implemented as a copy followed by a delete, so this step can be slow.
- **Manifest files:** Use open table formats such as Iceberg or Delta Lake, which write data files first and then atomically commit a metadata file (manifest) that points to the valid files. This ensures that consumers never see partial data.
- **Write-Audit-Publish (WAP):** Write data to a hidden partition or branch, audit it for quality (e.g., checking for null primary keys or row-count anomalies), and only then publish it to the main table.

## Handling Late Arriving Data

In incremental batch workflows, relying solely on a `created_at` timestamp can lead to data loss if records are inserted into the source database with a past timestamp after the batch window has closed.

For example, if the batch job runs at 02:00 covering data up to 01:59, and a record is inserted at 02:01 with a transaction timestamp of 01:30 (due to system lag), a standard watermark filter on the transaction timestamp will miss this record in the next run. To address this, batch pipelines should use a `system_inserted_at` timestamp if available, or apply an overlap window (e.g., looking back 2-3 hours) and use deduplication logic in the subsequent processing layer to handle the re-ingested records, as sketched below.
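The following sketch shows one way to combine the overlap window with downstream deduplication, using pandas. The three-hour lookback, the `order_id` key, and the column names are illustrative assumptions, not a prescribed interface.

```python
from datetime import datetime, timedelta

import pandas as pd

# Illustrative lookback; tune it to the worst source lag you actually observe.
OVERLAP = timedelta(hours=3)

def effective_watermark(last_watermark: datetime) -> datetime:
    """Re-read a safety window behind the stored cursor to catch late rows."""
    return last_watermark - OVERLAP

def deduplicate(batch: pd.DataFrame) -> pd.DataFrame:
    """Collapse re-ingested records, keeping the newest version of each key."""
    return (
        batch.sort_values("updated_at")
             .drop_duplicates(subset=["order_id"], keep="last")
    )

# Example: the same order re-read because of the overlap collapses to one row.
batch = pd.DataFrame({
    "order_id": [7, 7],
    "amount": [10.0, 12.5],
    "updated_at": [datetime(2023, 10, 27, 1, 30), datetime(2023, 10, 27, 2, 1)],
})
print(deduplicate(batch))  # one row for order_id 7, with amount 12.5
```

The overlap deliberately trades a small amount of redundant extraction for correctness; the deduplication step in the next processing layer makes the re-read idempotent.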