Ingesting data at the speed of generation is often an architectural aspiration rather than a strict business requirement. Change data capture (CDC) reliably records modifications to source data, but the mechanism chosen to deliver those captured changes to the data warehouse determines the efficiency of the entire platform. The fundamental tension in massively parallel processing (MPP) systems lies between data freshness (latency) and storage efficiency (throughput).
In traditional online transaction processing (OLTP) systems, inserting rows individually is standard practice. However, modern data warehouses use columnar storage formats. These systems are optimized for read-heavy analytical workloads, relying on heavy compression and metadata pruning. Writing small files or single rows creates a phenomenon known as the "small file problem," where the overhead of managing metadata for millions of tiny files exceeds the processing time of the data itself.
The choice between micro-batch and streaming is fundamentally an economic decision regarding compute resources and storage health. As you reduce the time between data generation and data availability, the cost of ingestion rises non-linearly.
Streaming ingestion architectures attempt to make data available immediately, typically within seconds or milliseconds. Micro-batch architectures accumulate records into a buffer and load them in bulk at set intervals, typically ranging from 5 to 60 minutes.
We can model the ingestion cost function relative to latency roughly as an inverse relationship:

$$\text{Cost}(L) \propto \frac{V}{L} + F$$

where $V$ is data volume, $L$ is the target latency, and $F$ represents fixed infrastructure overhead. As $L$ approaches zero, the system must keep compute resources constantly active to listen for incoming events, preventing the auto-suspension or scaling down of resources that makes cloud warehousing cost-effective.
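Purely as an illustration, assume $V = 100$ volume units and $F = 10$ overhead units, with $L$ measured in minutes:

$$\frac{100}{60} + 10 \approx 11.7, \qquad \frac{100}{5} + 10 = 30, \qquad \frac{100}{1} + 10 = 110$$

Tightening latency from one hour to one minute multiplies the notional cost by almost ten in this toy model, even though the data volume is unchanged.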
Cost rises sharply as latency requirements tighten below the 5-minute mark. The dotted line in the accompanying chart represents the negative impact on storage optimization, such as reduced compression ratios and increased partition counts.
Micro-batching is the default and often most effective pattern for high-throughput data warehousing. It aligns with the nature of MPP systems, which favor bulk operations. In this model, an orchestration layer (like Airflow or Dagster) or an ingestion tool accumulates data in an object store (S3, GCS, Azure Blob) before issuing a bulk load command, such as COPY INTO.
The primary advantages of micro-batching are idempotency and observability. If a batch fails, the entire file can be reprocessed. Furthermore, creating larger files allows the data warehouse to compress columnar blocks more effectively.
To implement efficient micro-batches, you must tune two parameters: the maximum buffer size $B_{\max}$ (the accumulated volume that forces a load) and the maximum wait time $T_{\max}$ (the longest the buffer may sit before it is flushed regardless of size).

The trigger for a load becomes:

$$\text{Load when } (B \geq B_{\max}) \ \lor \ (T_{\text{elapsed}} \geq T_{\max})$$

where $B$ is the current buffer volume and $T_{\text{elapsed}}$ is the time since the last load. This approach prevents small batches during low-traffic periods while ensuring buffers do not overflow during demand spikes.
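A minimal sketch of this trigger logic follows, assuming a placeholder `upload_to_object_store` helper and an illustrative bucket name; the actual bulk load against the landed files is assumed to run as a separate step.

```python
import json
import time
import uuid


class MicroBatchBuffer:
    """Accumulates records and flushes when the size or time threshold is hit."""

    def __init__(self, max_bytes=128 * 1024 * 1024, max_wait_seconds=300):
        self.max_bytes = max_bytes                  # B_max: force a load at this volume
        self.max_wait_seconds = max_wait_seconds    # T_max: force a load after this delay
        self.records = []
        self.current_bytes = 0
        self.last_flush = time.monotonic()

    def add(self, record: dict) -> bool:
        """Buffer one record; return True if a flush was triggered."""
        line = json.dumps(record)
        self.records.append(line)
        self.current_bytes += len(line)
        return self.maybe_flush()

    def should_flush(self) -> bool:
        elapsed = time.monotonic() - self.last_flush
        return (self.current_bytes >= self.max_bytes) or (
            bool(self.records) and elapsed >= self.max_wait_seconds
        )

    def maybe_flush(self) -> bool:
        if not self.should_flush():
            return False
        # The unique file name doubles as an idempotency key: reloading the same
        # file is safe when the warehouse tracks already-loaded file names.
        file_name = f"events/{int(time.time())}_{uuid.uuid4().hex}.ndjson"
        payload = "\n".join(self.records).encode("utf-8")
        upload_to_object_store("my-ingest-bucket", file_name, payload)  # placeholder
        self.records.clear()
        self.current_bytes = 0
        self.last_flush = time.monotonic()
        return True


def upload_to_object_store(bucket: str, key: str, data: bytes) -> None:
    """Placeholder: swap in boto3 or google-cloud-storage in a real pipeline."""
    print(f"would upload {len(data)} bytes to {bucket}/{key}")
```

Once a file lands in the object store, a separate step, whether an orchestrated COPY INTO statement or an auto-ingest feature such as Snowpipe, performs the bulk load. Because each flush produces a uniquely named file, a failed load can simply be retried for that file name, preserving the idempotency described above.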
Even with micro-batching, you may eventually end up with suboptimal file sizes if your batches are too frequent. A common pattern is to perform "compaction" or "vacuuming" in the background. However, modern platforms like Snowflake and BigQuery now handle this largely automatically, provided the ingestion files are not drastically small (e.g., avoiding 1KB files).
True streaming in a data warehouse does not mean executing INSERT INTO table VALUES (...) for every event. That approach locks table metadata and creates significant contention. Instead, modern platforms offer specialized Streaming APIs (e.g., Snowflake Snowpipe Streaming, BigQuery Storage Write API).
These APIs differ from standard SQL insertion. They typically write to a row-oriented write-ahead log (WAL) or a temporary buffer optimized for high-concurrency writes. A background process managed by the vendor then asynchronously migrates this data from the buffer into the optimized columnar storage.
Data flows from event buses into a specialized row-store buffer within the warehouse. The background merger asynchronously converts these rows into optimized columnar micro-partitions to maintain read performance.
Achieving idempotency in streaming is significantly harder than in micro-batching. In a file-based micro-batch, the filename acts as a natural deduplication identifier. In streaming, you often rely on offset tracking.
When using streaming APIs, the application or connector must track the offset of the last successfully committed record. If the stream disconnects, the producer effectively "rewinds" to the last committed offset. However, if the commit acknowledgment was lost due to network failure, the producer might resend data that was already written, leading to duplication.
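The sketch below mimics this failure mode with a hypothetical `StreamChannel` class; it is not any specific vendor SDK, only an illustration of the offset-commit pattern those SDKs follow.

```python
from typing import List


class StreamChannel:
    """Hypothetical streaming channel; real vendor APIs differ in detail but
    follow the same offset-commit pattern."""

    def __init__(self) -> None:
        self.committed_offset = -1          # highest offset the producer knows was durably written
        self.stored_rows: List[dict] = []   # stands in for the warehouse-side row buffer

    def append_row(self, row: dict) -> None:
        self.stored_rows.append(row)        # a resent row is simply stored again

    def commit(self, offset: int) -> None:
        self.committed_offset = offset


def run_producer(channel: StreamChannel, events: List[dict], ack_lost: bool = False) -> None:
    """Send every event after the last committed offset, then commit the new offset."""
    for offset in range(channel.committed_offset + 1, len(events)):
        channel.append_row(events[offset])
    if not ack_lost:
        channel.commit(len(events) - 1)
    # If the acknowledgment is lost, committed_offset stays stale even though the
    # rows were written; the next run rewinds and resends them, creating duplicates.


events = [{"event_id": i, "value": i * 10} for i in range(5)]
channel = StreamChannel()
run_producer(channel, events, ack_lost=True)  # first attempt: commit ack never arrives
run_producer(channel, events)                 # retry: offsets 0 through 4 are sent again
print(len(channel.stored_rows))               # 10 stored rows for 5 logical events
```

Running the example leaves ten stored rows for five logical events, which is exactly the duplication a deduplication strategy must absorb.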
To mitigate this, advanced ingestion designs utilize a "deduplication window" in the destination table or rely on deterministic primary keys to merge updates, though this adds compute overhead to the read side.
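One read-side sketch of that deduplication, assuming invented table and column names (`raw_events`, `event_id`, `ingested_at`) and Snowflake-style `QUALIFY` syntax; other engines express the same logic with a subquery over `ROW_NUMBER()`:

```python
# Deduplicating view over a streaming landing table. Names are illustrative.
DEDUP_VIEW_SQL = """
CREATE OR REPLACE VIEW events_deduplicated AS
SELECT *
FROM raw_events
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id          -- deterministic key assigned by the producer
    ORDER BY ingested_at DESC      -- keep the most recently ingested copy
) = 1
"""


def create_dedup_view(cursor) -> None:
    """Run against any DB-API cursor connected to the warehouse."""
    cursor.execute(DEDUP_VIEW_SQL)
```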
When architecting your pipeline, use the following criteria to distinguish between the necessity of streaming and the sufficiency of micro-batching.
Choose Micro-batching (15-60 mins) if:
- The data feeds scheduled reports or daily dashboards, and consumers act on it at most hourly.
- Cost control, operational simplicity, and healthy file sizes are the primary concerns.

Choose Micro-batching (1-5 mins) if:
- Operational dashboards or monitoring need near-real-time freshness, but no automated system reacts to individual events.
- The warehouse can absorb frequent bulk loads without drifting into the small file problem.

Choose Streaming (< 1 min) if:
- The data drives automated or in-session decisions, such as fraud detection, alerting, or personalization.
- The business can attach a concrete value to each additional second of freshness and accepts the always-on compute cost.
In the context of this course, we emphasize that streaming should not be the default. It is an optimization for specific high-value, low-latency datasets. For the majority of analytical workloads, micro-batch pipelines provide the best balance of stability, cost, and performance.
A critical aspect often overlooked in streaming ingestion is schema evolution. In a batch process, if a column is added to the source, the batch load might fail, alerting an engineer to update the schema. The impact is contained to that batch.
In a continuous stream, a schema mismatch can poison the pipeline or cause the consumer to drop messages entirely. To handle this, high-throughput pipelines often ingest data into a VARIANT or JSON column type first (Semi-structured Data). This allows the pipeline to succeed regardless of schema changes. The structuring and typing of data are then deferred to a downstream view or transformation process, a pattern commonly referred to as "Schema-on-Read". This technique decouples the stability of the ingestion infrastructure from the volatility of the application data model.
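A compact sketch of that landing pattern follows, using invented table and column names and Snowflake-style VARIANT syntax; BigQuery would use a JSON column and functions such as JSON_VALUE.

```python
# Land raw payloads untyped, then expose a typed view downstream (schema-on-read).
LANDING_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS raw_orders (
    ingested_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    payload     VARIANT        -- whole event stored as-is; new fields never break ingestion
)
"""

TYPED_VIEW_SQL = """
CREATE OR REPLACE VIEW orders AS
SELECT
    payload:order_id::STRING          AS order_id,
    payload:customer_id::STRING       AS customer_id,
    payload:amount::NUMBER(12, 2)     AS amount,
    payload:created_at::TIMESTAMP_NTZ AS created_at
FROM raw_orders
"""


def apply_schema_on_read(cursor) -> None:
    """Run both statements through any DB-API cursor connected to the warehouse."""
    cursor.execute(LANDING_TABLE_SQL)
    cursor.execute(TYPED_VIEW_SQL)
```

If the application adds a field, ingestion keeps succeeding; only the view changes when analysts actually need the new column.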