Batch processing remains the primary mechanism for moving high volumes of historical and operational data into a data lake. While streaming architectures receive significant attention for their low-latency capabilities, batch workflows provide the throughput and reliability necessary for initializing datasets, performing nightly synchronizations, and ingesting data from sources that do not emit change events. A batch ingestion workflow involves extracting a discrete chunk of data from a source system, transporting it over the network, and persisting it into the storage layer (typically the Bronze or Raw layer) at scheduled intervals.
The design of a batch pipeline depends heavily on the nature of the source system and the business requirements for data freshness. We generally categorize these workflows into two primary patterns: Full Snapshots and Incremental Loads.
In a full snapshot approach, the pipeline extracts the entire dataset from the source table during every execution cycle and overwrites the target location in the data lake. This method ensures that the data in the lake matches the source exactly at the point of extraction, effectively handling hard deletes (records removed from the source) without complex logic.
This strategy is effective for dimension tables or reference datasets with low volume (e.g., fewer than 10 million rows) where the operational overhead of tracking changes outweighs the cost of simply reloading the data. However, as data scales, this approach becomes computationally expensive and network-intensive.
The extraction time scales with the size of the source, $T_{\text{ingest}} \propto S$, where $S$ is the total size of the source dataset. As $S$ grows, the time to ingest grows linearly, eventually exceeding the available batch window.
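Despite this limitation, the snapshot pattern is attractive because it is trivial to implement. The following is a minimal sketch, assuming a PySpark session and a JDBC-reachable source; the connection details, table name, and lake path are illustrative, not part of any specific system described above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("full_snapshot_customer_dim").getOrCreate()

# Read the entire source table on every run (hypothetical connection details).
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/erp")
    .option("dbtable", "public.customer_dim")
    .option("user", "ingest_user")
    .option("password", "***")
    .load()
)

# Overwrite the target so the lake copy matches the source at extraction time;
# this also propagates hard deletes without any change-tracking logic.
source_df.write.mode("overwrite").parquet("s3://my-lake/bronze/customer_dim/")
```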
To handle large fact tables or transaction logs, engineers employ incremental loading. This method extracts only the records that have been created or modified since the last successful execution. It requires a reliable, monotonically increasing tracking column in the source system, whose last extracted value is often referred to as a "high-water mark" or cursor. Common candidates include auto-incrementing primary keys or updated_at timestamps.
The logic for an incremental extraction can be expressed as:

$$\Delta D = \{\, r \in D \mid t_{\text{updated}}(r) > t_{\text{last}} \,\}$$

where $D$ is the source table, $\Delta D$ is the batch to extract, and $t_{\text{last}}$ is the high-water mark recorded after the previous successful run.
Implementing this requires state management. The orchestration engine (such as Airflow or Dagster) must persistently store the timestamp of the last successful extraction ($t_{\text{last}}$) and pass it as a parameter to the subsequent job.
Workflow logic for an incremental batch job using a high-water mark strategy.
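As a concrete illustration of this workflow, the sketch below persists the watermark in a local JSON file and extracts with a parameterized query. The file-based state store, the sales table, and the updated_at column are assumptions made for the example; in practice the orchestrator's own variable or metadata store usually plays this role.

```python
import datetime as dt
import json
import pathlib

import pandas as pd
import sqlalchemy

STATE_FILE = pathlib.Path("state/sales_watermark.json")

def read_watermark() -> str:
    # High-water mark left by the last successful run; the first run loads everything.
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_extracted_at"]
    return "1970-01-01 00:00:00"

def write_watermark(value: str) -> None:
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps({"last_extracted_at": value}))

def run_incremental_load(engine: sqlalchemy.engine.Engine) -> None:
    low = read_watermark()
    # Pin the upper bound before extracting so the window is reproducible.
    high = dt.datetime.utcnow().isoformat(sep=" ", timespec="seconds")

    query = sqlalchemy.text(
        "SELECT * FROM sales WHERE updated_at > :low AND updated_at <= :high"
    )
    df = pd.read_sql(query, engine, params={"low": low, "high": high})

    # Persist the extracted slice to the Bronze layer (local path for brevity).
    out_dir = pathlib.Path("bronze/sales")
    out_dir.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out_dir / f"batch_{high.replace(' ', '_').replace(':', '-')}.parquet")

    # Advance the high-water mark only after the write succeeds.
    write_watermark(high)
```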
A critical mistake in batch ingestion is writing data without a predefined directory structure. If a pipeline dumps thousands of files into a single S3 prefix or Azure Blob container, downstream query engines must scan the entire file list to find relevant data.
To optimize for future retrieval, batch jobs should write data using Hive-style partitioning. This involves structuring the file path to include column names and values, typically based on the ingestion date or the event date.
For example, instead of writing to:
s3://my-lake/sales/batch_job_123.parquet
The pipeline should write to:
s3://my-lake/sales/ingest_date=2023-10-27/part-001.parquet
This structure allows query engines to perform "partition pruning," ignoring folders that do not match the query predicates. When designing the batch workflow, the writer must be configured to dynamically determine the partition path based on the data content or the execution date.
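The snippet below shows one way to produce that layout with a PySpark writer. The execution date, the sample rows, and the local base path are placeholders for this sketch; in production the base path would be the lake prefix (e.g., s3://my-lake/sales/) and the date would come from the scheduler.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partitioned_bronze_write").getOrCreate()

# Stand-in for the extracted batch; a real job would get this from the source read.
extracted_df = spark.createDataFrame(
    [(1, 19.99), (2, 5.00)],
    ["order_id", "amount"],
)

# Stamp each row with the ingestion date (typically the scheduler's execution date).
partitioned_df = extracted_df.withColumn("ingest_date", F.lit("2023-10-27"))

# partitionBy produces Hive-style paths such as
#   .../sales/ingest_date=2023-10-27/part-00000-....parquet
(
    partitioned_df.write
    .mode("append")
    .partitionBy("ingest_date")
    .parquet("/tmp/my-lake/sales/")
)
```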
There is a direct trade-off between data freshness (how often the batch runs) and storage efficiency. Running batch jobs too frequently results in the "small file problem," where the overhead of opening and closing many small files degrades query performance. Running them too infrequently results in stale data.
For a standard batch workflow, the objective is to produce files that are large enough to be efficient for columnar readers (ideally between 128MB and 1GB) but frequent enough to meet Service Level Agreements (SLAs).
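One way to steer output toward that range is to estimate rows per file from an assumed compressed row size and repartition before writing. The sketch below reuses partitioned_df from the previous example; the 200-byte row estimate and 256 MB target are illustrative numbers, not measured values.

```python
import math

AVG_ROW_BYTES = 200                      # assumed size per row after compression
TARGET_FILE_BYTES = 256 * 1024 * 1024    # aim for roughly 256 MB files

row_count = partitioned_df.count()
rows_per_file = TARGET_FILE_BYTES // AVG_ROW_BYTES
num_files = max(1, math.ceil(row_count / rows_per_file))

# Repartition so each output file lands near the target size before writing.
(
    partitioned_df.repartition(num_files)
    .write.mode("append")
    .partitionBy("ingest_date")
    .parquet("/tmp/my-lake/sales/")
)
```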
The following chart illustrates the efficiency divergence between full snapshots and incremental loads as the dataset grows over time. While snapshots are simpler to implement, their resource consumption makes them unsustainable for primary transaction tables.
Comparison of processing time required for Full Snapshot versus Incremental Load strategies as data volume increases.
When ingesting data into the Bronze layer, it is important to define the isolation level. In modern data lake architectures using object storage, file writes are atomic. A file is either visible in its entirety or not at all. However, a batch job often consists of multiple files (multipart uploads).
If a job fails halfway through, you may end up with a partially written partition. A standard engineering pattern to mitigate this is to write the output files to a temporary staging prefix (e.g., _temporary/) and then perform a purely metadata-based move (rename) operation to the final location upon completion. Note that on S3, rename operations are not atomic and mimic copy-then-delete, so this step can be slow.

In incremental batch workflows, relying solely on a created_at timestamp can lead to data loss if records are inserted into the source database with a past timestamp after the batch window has closed.
For example, if the batch job runs at 02:00 covering data up to 01:59, and a record is inserted at 02:01 with a transaction timestamp of 01:30 (due to system lag), a standard watermark filter on the transaction timestamp will miss this record in the next run. To address this, batch pipelines should use a system_inserted_at timestamp if available, or apply an overlap window (e.g., looking back 2-3 hours) and use deduplication logic in the subsequent processing layer to handle the re-ingested records.
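A minimal sketch of the overlap-window approach is shown below. The three-hour lookback, the order_id business key, and the bronze_df input are assumptions made for illustration; the deduplication step would normally run in the Silver layer.

```python
import datetime as dt

from pyspark.sql import DataFrame, Window, functions as F

# Illustrative overlap: re-read the last 3 hours on every run so records stamped
# with a lagging transaction timestamp are still captured.
OVERLAP = dt.timedelta(hours=3)

def extraction_window(last_watermark: dt.datetime) -> tuple[dt.datetime, dt.datetime]:
    low = last_watermark - OVERLAP     # look back past the stored watermark
    high = dt.datetime.utcnow()        # pin the upper bound for this run
    return low, high

def deduplicate(bronze_df: DataFrame) -> DataFrame:
    # The overlap re-ingests some records, so keep only the latest version
    # of each business key based on its modification timestamp.
    w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
    return (
        bronze_df.withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )
```

The width of the overlap window should be sized to the maximum commit lag expected from the source system: too narrow and late records are still missed, too wide and the deduplication step reprocesses more data than necessary.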