Ingesting data at the speed of generation is often an architectural aspiration rather than a strict business requirement. Change data capture (CDC) reliably records modifications to source data, but the mechanism chosen to deliver those captured changes to the data warehouse determines the efficiency of the entire platform. The fundamental tension in massively parallel processing (MPP) systems lies between data freshness (latency) and storage efficiency (throughput).
In traditional online transaction processing (OLTP) systems, inserting rows individually is standard practice. However, modern data warehouses use columnar storage formats. These systems are optimized for read-heavy analytical workloads, relying on heavy compression and metadata pruning. Writing small files or single rows creates a phenomenon known as the "small file problem," where the overhead of managing metadata for millions of tiny files exceeds the processing time of the data itself.
The choice between micro-batch and streaming is fundamentally an economic decision regarding compute resources and storage health. As you reduce the time between data generation and data availability, the cost of ingestion rises non-linearly.
Streaming ingestion architectures attempt to make data available immediately, typically within seconds or milliseconds. Micro-batch architectures accumulate records into a buffer and load them in bulk at set intervals, typically ranging from 5 to 60 minutes.
We can model the ingestion cost function relative to latency roughly as an inverse relationship:

$$\text{Cost}(L) \propto \frac{V}{L} + F$$

where $V$ is data volume, $L$ is the target latency, and $F$ represents fixed infrastructure overhead. As $L$ approaches zero, the system must keep compute resources constantly active to listen for incoming events, preventing the auto-suspension or scaling down of resources that makes cloud warehousing cost-effective.
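Purely as an illustration, assume $V = 100$ volume units and $F = 10$ overhead units, with $L$ measured in minutes:

$$\frac{100}{60} + 10 \approx 11.7, \qquad \frac{100}{5} + 10 = 30, \qquad \frac{100}{1} + 10 = 110$$

Tightening latency from one hour to one minute multiplies the notional cost by almost ten in this toy model, even though the data volume is unchanged.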
Cost rises sharply as latency requirements tighten below the 5-minute mark. The dotted line in the accompanying chart represents the negative impact on storage optimization, such as reduced compression ratios and increased partition counts.
Micro-batching is the default and often most effective pattern for high-throughput data warehousing. It aligns with the nature of MPP systems, which favor bulk operations. In this model, an orchestration layer (like Airflow or Dagster) or an ingestion tool accumulates data in an object store (S3, GCS, Azure Blob) before issuing a bulk load command, such as COPY INTO.
The primary advantages of micro-batching are idempotency and observability. If a batch fails, the entire file can be reprocessed. Furthermore, creating larger files allows the data warehouse to compress columnar blocks more effectively.
To implement efficient micro-batches, you must tune two parameters: the maximum buffer size $B_{\max}$ (the accumulated volume that forces a load) and the maximum wait time $T_{\max}$ (the longest the buffer may sit before it is flushed regardless of size).

The trigger for a load becomes:

$$\text{Load when } (B \geq B_{\max}) \ \lor \ (T_{\text{elapsed}} \geq T_{\max})$$

where $B$ is the current buffer volume and $T_{\text{elapsed}}$ is the time since the last load. This approach prevents small batches during low-traffic periods while ensuring buffers do not overflow during demand spikes.
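A minimal sketch of this trigger logic follows, assuming a placeholder `upload_to_object_store` helper and an illustrative bucket name; the actual bulk load against the landed files is assumed to run as a separate step.

```python
import json
import time
import uuid


class MicroBatchBuffer:
    """Accumulates records and flushes when the size or time threshold is hit."""

    def __init__(self, max_bytes=128 * 1024 * 1024, max_wait_seconds=300):
        self.max_bytes = max_bytes                  # B_max: force a load at this volume
        self.max_wait_seconds = max_wait_seconds    # T_max: force a load after this delay
        self.records = []
        self.current_bytes = 0
        self.last_flush = time.monotonic()

    def add(self, record: dict) -> bool:
        """Buffer one record; return True if a flush was triggered."""
        line = json.dumps(record)
        self.records.append(line)
        self.current_bytes += len(line)
        return self.maybe_flush()

    def should_flush(self) -> bool:
        elapsed = time.monotonic() - self.last_flush
        return (self.current_bytes >= self.max_bytes) or (
            bool(self.records) and elapsed >= self.max_wait_seconds
        )

    def maybe_flush(self) -> bool:
        if not self.should_flush():
            return False
        # The unique file name doubles as an idempotency key: reloading the same
        # file is safe when the warehouse tracks already-loaded file names.
        file_name = f"events/{int(time.time())}_{uuid.uuid4().hex}.ndjson"
        payload = "\n".join(self.records).encode("utf-8")
        upload_to_object_store("my-ingest-bucket", file_name, payload)  # placeholder
        self.records.clear()
        self.current_bytes = 0
        self.last_flush = time.monotonic()
        return True


def upload_to_object_store(bucket: str, key: str, data: bytes) -> None:
    """Placeholder: swap in boto3 or google-cloud-storage in a real pipeline."""
    print(f"would upload {len(data)} bytes to {bucket}/{key}")
```

Once a file lands in the object store, a separate step, whether an orchestrated COPY INTO statement or an auto-ingest feature such as Snowpipe, performs the bulk load. Because each flush produces a uniquely named file, a failed load can simply be retried for that file name, preserving the idempotency described above.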
Even with micro-batching, you may eventually end up with suboptimal file sizes if your batches are too frequent. A common pattern is to perform "compaction" or "vacuuming" in the background. However, modern platforms like Snowflake and BigQuery now handle this largely automatically, provided the ingestion files are not drastically small (e.g., avoiding 1KB files).
True streaming in a data warehouse does not mean executing INSERT INTO table VALUES (...) for every event. That approach locks table metadata and creates significant contention. Instead, modern platforms offer specialized Streaming APIs (e.g., Snowflake Snowpipe Streaming, BigQuery Storage Write API).
These APIs differ from standard SQL insertion. They typically write to a row-oriented write-ahead log (WAL) or a temporary buffer optimized for high-concurrency writes. A background process managed by the vendor then asynchronously migrates this data from the buffer into the optimized columnar storage.
Data flows from event buses into a specialized row-store buffer within the warehouse. The background merger asynchronously converts these rows into optimized columnar micro-partitions to maintain read performance.
Achieving idempotency in streaming is significantly harder than in micro-batching. In a file-based micro-batch, the filename acts as a natural deduplication identifier. In streaming, you often rely on offset tracking.
When using streaming APIs, the application or connector must track the offset of the last successfully committed record. If the stream disconnects, the producer effectively "rewinds" to the last committed offset. However, if the commit acknowledgment was lost due to network failure, the producer might resend data that was already written, leading to duplication.
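The sketch below mimics this failure mode with a hypothetical `StreamChannel` class; it is not any specific vendor SDK, only an illustration of the offset-commit pattern those SDKs follow.

```python
from typing import List


class StreamChannel:
    """Hypothetical streaming channel; real vendor APIs differ in detail but
    follow the same offset-commit pattern."""

    def __init__(self) -> None:
        self.committed_offset = -1          # highest offset the producer knows was durably written
        self.stored_rows: List[dict] = []   # stands in for the warehouse-side row buffer

    def append_row(self, row: dict) -> None:
        self.stored_rows.append(row)        # a resent row is simply stored again

    def commit(self, offset: int) -> None:
        self.committed_offset = offset


def run_producer(channel: StreamChannel, events: List[dict], ack_lost: bool = False) -> None:
    """Send every event after the last committed offset, then commit the new offset."""
    for offset in range(channel.committed_offset + 1, len(events)):
        channel.append_row(events[offset])
    if not ack_lost:
        channel.commit(len(events) - 1)
    # If the acknowledgment is lost, committed_offset stays stale even though the
    # rows were written; the next run rewinds and resends them, creating duplicates.


events = [{"event_id": i, "value": i * 10} for i in range(5)]
channel = StreamChannel()
run_producer(channel, events, ack_lost=True)  # first attempt: commit ack never arrives
run_producer(channel, events)                 # retry: offsets 0 through 4 are sent again
print(len(channel.stored_rows))               # 10 stored rows for 5 logical events
```

Running the example leaves ten stored rows for five logical events, which is exactly the duplication a deduplication strategy must absorb.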
To mitigate this, advanced ingestion designs utilize a "deduplication window" in the destination table or rely on deterministic primary keys to merge updates, though this adds compute overhead to the read side.
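One read-side sketch of that deduplication, assuming invented table and column names (`raw_events`, `event_id`, `ingested_at`) and Snowflake-style `QUALIFY` syntax; other engines express the same logic with a subquery over `ROW_NUMBER()`:

```python
# Deduplicating view over a streaming landing table. Names are illustrative.
DEDUP_VIEW_SQL = """
CREATE OR REPLACE VIEW events_deduplicated AS
SELECT *
FROM raw_events
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id          -- deterministic key assigned by the producer
    ORDER BY ingested_at DESC      -- keep the most recently ingested copy
) = 1
"""


def create_dedup_view(cursor) -> None:
    """Run against any DB-API cursor connected to the warehouse."""
    cursor.execute(DEDUP_VIEW_SQL)
```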
When architecting your pipeline, use the following criteria to distinguish between the necessity of streaming and the sufficiency of micro-batching.
Choose Micro-batching (15-60 mins) if:
- The data feeds scheduled reports or daily dashboards, and consumers act on it at most hourly.
- Cost control, operational simplicity, and healthy file sizes are the primary concerns.

Choose Micro-batching (1-5 mins) if:
- Operational dashboards or monitoring need near-real-time freshness, but no automated system reacts to individual events.
- The warehouse can absorb frequent bulk loads without drifting into the small file problem.

Choose Streaming (< 1 min) if:
- The data drives automated or in-session decisions, such as fraud detection, alerting, or personalization.
- The business can attach a concrete value to each additional second of freshness and accepts the always-on compute cost.
In the context of this course, we emphasize that streaming should not be the default. It is an optimization for specific high-value, low-latency datasets. For the majority of analytical workloads, micro-batch pipelines provide the best balance of stability, cost, and performance.
A critical aspect often overlooked in streaming ingestion is schema evolution. In a batch process, if a column is added to the source, the batch load might fail, alerting an engineer to update the schema. The impact is contained to that batch.
In a continuous stream, a schema mismatch can poison the pipeline or cause the consumer to drop messages entirely. To handle this, high-throughput pipelines often ingest data into a VARIANT or JSON column type first (Semi-structured Data). This allows the pipeline to succeed regardless of schema changes. The structuring and typing of data are then deferred to a downstream view or transformation process, a pattern commonly referred to as "Schema-on-Read". This technique decouples the stability of the ingestion infrastructure from the volatility of the application data model.
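A compact sketch of that landing pattern follows, using invented table and column names and Snowflake-style VARIANT syntax; BigQuery would use a JSON column and functions such as JSON_VALUE.

```python
# Land raw payloads untyped, then expose a typed view downstream (schema-on-read).
LANDING_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS raw_orders (
    ingested_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    payload     VARIANT        -- whole event stored as-is; new fields never break ingestion
)
"""

TYPED_VIEW_SQL = """
CREATE OR REPLACE VIEW orders AS
SELECT
    payload:order_id::STRING          AS order_id,
    payload:customer_id::STRING       AS customer_id,
    payload:amount::NUMBER(12, 2)     AS amount,
    payload:created_at::TIMESTAMP_NTZ AS created_at
FROM raw_orders
"""


def apply_schema_on_read(cursor) -> None:
    """Run both statements through any DB-API cursor connected to the warehouse."""
    cursor.execute(LANDING_TABLE_SQL)
    cursor.execute(TYPED_VIEW_SQL)
```

If the application adds a field, ingestion keeps succeeding; only the view changes when analysts actually need the new column.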